Site Reliability Engineer

Customair

Manchester

On-site

GBP 100,000 - 125,000

Full time

30+ days ago

Job summary

A technology firm in Greater Manchester is seeking a Site Reliability Engineer to ensure the reliability and performance of its digital ecosystem. The candidate will monitor systems, lead SRE strategy, and collaborate with various teams to drive improvements. Strong experience with observability tools and cloud environments is essential. The role offers a competitive salary and company benefits including a discretionary bonus and a pension scheme.

Benefits

Competitive Basic Salary

Discretionary Bonus Scheme

Company Shares Option Plan

Contributory pension scheme

Life insurance (4× basic salary)

Simply Health Cash Plan

Holiday entitlement (33 days inclusive of bank holidays)

Study Support and opportunity for progression and development

24/7 365 employee assistance helpline

Collaborative office environment with free parking

Qualifications

Deep understanding of system reliability and performance optimisation.
Strong hands-on experience with modern observability tools.
Experience in cloud environments, ideally AWS.

Responsibilities

Monitor live production systems and optimise system performance.
Lead development of SRE strategy and best practices.
Collaborate with teams to ensure smooth deployment and monitoring.

Skills

System reliability

Performance optimisation

Cloud-native architectures

Observability tools (Grafana, Prometheus)

Infrastructure as code (Terraform)

Container orchestration (Kubernetes)

Programming (Go, .NET)

CustomAir – Greater Manchester, England, United Kingdom

What You’ll Be Doing

In this role, you’ll be at the forefront of ensuring the reliability, performance, and scalability of Tote’s digital ecosystem. You’ll monitor live production systems, using observability tools to detect potential issues before they impact users, and take proactive steps to optimise system performance and stability. You’ll analyse telemetry data, identify bottlenecks, and drive improvements across our infrastructure and applications.

You’ll lead the development of our SRE strategy, defining standards, best practices, and ways of working that embed reliability into everything we build. Working closely with engineering, operations, and product teams, you’ll help shape our SLAs, SLOs, and error budgets to align with business priorities. Performance and resilience will be at the heart of what you do. You’ll design and implement performance testing strategies to simulate peak traffic and ensure our systems remain stable during major racing events. You’ll build intuitive dashboards, refine alerting systems, and create tools that provide clear visibility into system health, enabling the wider business to make confident, data‑driven decisions.

Collaboration is key in this role. You’ll work alongside software engineers to design scalable solutions, with compliance teams to meet internal and regulatory standards, and with operations to ensure smooth deployment and monitoring. You’ll also play a crucial role in incident management, from leading real‑time response efforts to conducting thorough post‑incident reviews that identify root causes and long‑term improvements.

What We Are Looking For

We’re looking for an engineer who thrives on solving complex challenges and improving how systems perform at scale. You’ll have a deep understanding of system reliability, performance optimisation, and cloud‑native architectures. You’ll bring strong hands‑on experience with modern observability tools such as Grafana, Prometheus, and OpenTelemetry, as well as a solid grasp of distributed systems and networking fundamentals.

You’ll be confident working with infrastructure‑as‑code tools (like Terraform) and container orchestration platforms such as Kubernetes. Experience in cloud environments, ideally AWS, will be highly beneficial. You’ll also be comfortable coding in at least one modern programming language such as Go or .NET, using your technical expertise to automate processes, build internal tools, and debug complex issues.

Beyond your technical skills, you’ll bring a calm, analytical mindset to high‑pressure situations. You’ll be the kind of person who thrives during live incidents: focused, clear‑headed, and methodical, so that our teams can respond quickly and effectively. You’ll also be an advocate for modern engineering practices, championing DevOps culture, CI/CD pipelines, and automation wherever possible.

Finally, communication is key. You’ll work closely with engineers, operations specialists, and product managers, translating technical insights into clear, actionable information for all stakeholders. You’ll value transparency, collaboration, and the shared goal of keeping our systems reliable and our customers delighted.

What’s in it for you?

At the Tote you can expect a friendly working environment with a strong sense of teamwork and pride in what we do. Within this role you’ll develop a broad range of skills and experiences that can enhance your career at the Tote. Additionally, our company benefits package includes:

Competitive Basic Salary
Discretionary Bonus Scheme
Company Shares Option Plan
Contributory pension scheme
Life insurance (4× basic salary)
Simply Health Cash Plan
Holiday entitlement (33 days inclusive of bank holidays)
Study Support and opportunity for progression and development
Confidential 24/7 365 employee assistance helpline
Agile and collaborative office environment with free parking, fruit, biscuits, and drinks
Regular social events, charity events and volunteering opportunities
