Vice President of SRE EMEA

Mesh-AI Limited

United Kingdom

Hybrid

GBP 120,000 - 160,000

Full time

30+ days ago

Job summary

A forward-thinking tech company in the United Kingdom is looking for a VP of Site Reliability Engineering to lead its global SRE function. You'll define and execute the SRE strategy, build and mentor distributed teams, and drive automation initiatives to ensure top-tier reliability and performance across its GPU cloud infrastructure. Ideal candidates will have substantial experience in SRE, strong leadership skills, and expertise in Kubernetes and cloud technologies.

Benefits

Encouragement of diverse applications
A culture of openness and transparency

Qualifications

  • 10+ years of experience in SRE or Infrastructure roles.
  • Experience building and leading distributed SRE teams.
  • Deep expertise with Linux, Kubernetes, cloud-native platforms.
  • Proven experience with SLOs, SLIs, and error budgets.

Responsibilities

  • Lead global Site Reliability Engineering strategy.
  • Build and scale SRE teams, including mentoring talent.
  • Drive automation and operational excellence across infrastructure.
  • Define incident management practices to minimize MTTR.

Skills

SRE experience
Leadership
Kubernetes expertise
Linux systems
Automation and Infrastructure-as-Code
Communication and influence

Tools

Prometheus
Grafana
Terraform
Ansible

Job description

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.

At Nscale, our Engineering team plays a critical role in driving the deployment and subsequent management of our infrastructure and software platforms.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.

About the Role

We are seeking a VP of Site Reliability Engineering (SRE) to lead Nscale’s global reliability function. You will own the strategy, execution, and leadership of our SRE organisation, ensuring our GPU-accelerated cloud operates with world-class reliability, observability, and operational excellence.

You’ll be responsible for building and scaling SRE teams, defining reliability practices, and driving automation and resilience across infrastructure. In this high-impact role, you will partner closely with Product, Engineering, Infrastructure, and Operations leadership to deliver a secure, performant, and reliable platform at hyperscale.

What you'll be doing

  • Define and execute Nscale’s global SRE strategy, aligning reliability goals with business outcomes.
  • Build, scale, and lead a world-class SRE organisation, including hiring, mentoring, and developing talent across multiple regions.
  • Own service reliability frameworks, including SLOs, SLIs, and error budgets, embedding them into engineering culture.
  • Drive the design, automation, and operation of infrastructure platforms across bare-metal, OpenStack, Kubernetes, and Slurm environments.
  • Establish best-in-class incident management practices, minimising MTTR and maximising learning from post-mortems.
  • Partner with Observability, Infrastructure, and Product teams to deliver 360° visibility across GPU clusters, fabrics, and services.
  • Guide capacity planning and scaling strategies, ensuring platform resilience as Nscale expands globally.
  • Champion automation-first principles across provisioning, monitoring, CI/CD, and operational workflows.
  • Provide executive-level reporting on reliability, operational performance, and capacity to senior leadership.
  • Stay ahead of industry trends in SRE, automation, and AIOps, applying them to Nscale’s infrastructure at scale.

About you

  • 10+ years of experience in SRE, Infrastructure, or Reliability Engineering, including 3+ years in a leadership role.
  • Proven track record building and leading distributed SRE or infrastructure operations teams.
  • Deep expertise with Linux systems, Kubernetes, and cloud-native platforms.
  • Strong background in bare-metal and datacentre operations, including provisioning (PXE, IPMI), networking, and hardware lifecycle.
  • Demonstrated experience in defining and enforcing SLOs/SLIs and error budgets.
  • Strong knowledge of automation and Infrastructure-as-Code (Terraform, Ansible, Crossplane).
  • Experience driving observability best practices using Prometheus, Grafana, and related tools.
  • Skilled communicator with the ability to influence cross-functional teams and report at executive level.

Preferred Qualifications

  • Prior experience with OpenStack (OVN networking, KVM virtualization) or HPC environments (Slurm, RDMA, InfiniBand).
  • Contributions to open-source communities in SRE, infrastructure, or cloud-native spaces.
  • Experience embedding secure and compliant operational practices (SOC 2, ISO 27001, GDPR).
  • Background scaling infrastructure for AI, GPU workloads, or HPC environments.

In all we do, our core values guide us.

Relentless Innovation

Ownership and Accountability

Openness and Transparency

Customer-Centric Focus

Sustainability

We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds.

If there’s anything we can do to accommodate your specific situation, please let us know.

The responsibilities outlined in this job description are not exhaustive and are intended to provide a general overview of the position. The employee may be required to perform additional duties, tasks, and responsibilities as assigned by management, consistent with the skills and qualifications required for the role.