Site Reliability Engineer

Xceptor

Greater London

On-site

GBP 55,000 - GBP 75,000

Full time

4 days ago

Job summary

A data solutions company located in Greater London seeks a Site Reliability Engineer to enhance service reliability and performance through automation and incident management. This role involves collaboration with engineering teams, implementing observability standards, and participating in incident response. Candidates should have a background in cloud operations, scripting, and DevOps practices. The company values diversity and is committed to creating an inclusive workplace.

Benefits

Diversity and Inclusion Programs

Background check compliance support

Flexible working arrangements

Qualifications

Experience as an SRE / DevOps / Production Engineer (typically 2-5 years).
Strong academic background in STEM or equivalent experience.
Experience supporting cloud services and operational automation.

Responsibilities

Ensure reliability, performance, and security of services.
Participate in incident response and recovery efforts.
Automate operational tasks to reduce toil and improve efficiency.

Skills

Incident Management

Automation

Cloud Operations

Observability

Scripting

Education

Bachelor's degree in STEM

Tools

Azure

PowerShell

CI/CD

Overview

Find your next career at Xceptor. Join us in transforming data into insightful solutions. London.

Site Reliability Engineer at Xceptor

ABOUT XCEPTOR: Data is at the heart of everything we do: Xceptor has been designed around data manipulation in its broadest sense. We source data from wherever it flows. We curate, normalise, validate, repair, and enrich that data so it reaches its destination in a reliable and consistent format. Data coming out of Xceptor is data our clients can trust. We are recognised as an expert in the Financial Services vertical, which strongly aligns with Business Users in Middle and Back-Office teams. We enable these users to solve their data challenges by themselves, rather than through a technology-led project.

Our mission is to empower business users within financial institutions to build automated processes that deliver trusted data.

Our values are: Client Centricity, One Team, Impactful.

Responsibilities

What You’ll Be Doing:

Site Reliability Engineering (SRE) is a cross-cutting function that partners with tribes across Xceptor to make our services reliable, performant, secure, and operable in production. We set and evolve standards for SLOs/SLIs, observability, incident response, and operational controls, and we build automation that reduces toil and enables teams to ship safely at pace across cloud and on-prem deployments.
Xceptor operates with an AI-first PDLC. AI agents act as digital delivery partners and members of the team, accelerating how we design, build, test, document, deploy, and operate our services. Reliability is engineered in through standards, automation, and measurable signals, with humans providing intent, constraints, verification, and accountability.
As an SRE, you contribute at tribe level to reliability, performance, and operability. You help build and run the reliability system: observability standards, incident response practices, runbooks, and automation that reduces toil and improves service health over time.
You partner closely with Software Engineering, QA, Platform Engineering, and Senior/Lead SREs to embed reliability into delivery without becoming a bottleneck. You will own well-scoped operational improvements end-to-end (design, implement, test, roll out, measure) and steadily increase your scope and independence.
This is an AI-first SRE role. You use AI routinely to accelerate investigation, diagnostics, runbook creation, infrastructure automation, and operational reporting, while staying accountable for verification and safe operation. This role exists to make reliability measurable and repeatable, reduce operational toil through automation, and enable fast delivery without compromising safety, control, or customer trust.

Who we’re looking for

Reliability Engineering (Build reliability into the system)

Contributes to defining and improving SLIs/SLOs and service health signals, aligned to customer outcomes.
Implements reliability improvements within established patterns (timeouts, retries, graceful degradation, safe failure modes).
Supports capacity and performance work: basic baselining, load investigation, and scaling hygiene.
Helps maintain operational quality across production and staging, and improves environment consistency where possible.

Incident Management & Operational Excellence

Participates in incident response and on-call (as applicable), contributing to triage, mitigation, and recovery.
Produces clear post-incident notes and supports root cause analysis, focusing on actions that prevent recurrence.
Creates and improves runbooks/playbooks so incidents are faster and more consistent to resolve.
Helps improve change safety through practical release/readiness checks and operational guardrails.

Observability & Production Signals

Implements and improves observability for services: logs, metrics, traces, dashboards, and alerting aligned to standards.
Tunes alerts to reduce noise and improve actionability; helps manage flakiness and false positives.
Builds and maintains service health dashboards that support quick diagnosis and release confidence.
Works with QA and Engineering to align operational signals with end-to-end journey health.

Automation & Tooling (Make the right thing easy)

Automates repetitive operational tasks and reduces toil through scripts, tooling, and pipeline improvements.
Contributes to deployment automation and reliability guardrails in CI/CD, working with Platform Engineering.
Implements and maintains IaC changes under guidance, ensuring changes are safe, reviewed, and measured.
Improves diagnostics and “day 2” operations to make support and troubleshooting easier.

AI-First Operations (How you run SRE)

Uses AI routinely to accelerate operational tasks (investigation, diagnostics, runbooks, automation drafts) with explicit verification.
Works effectively in an “agents draft, humans verify” model for operational artefacts (scripts, dashboards, alerts, incident summaries).
Applies safe operational controls when using AI (no unsafe remediation; careful handling of sensitive data).
Learns from production outcomes and improves automation and guardrails based on real incidents and trends.

Collaboration & Enablement

Partners effectively with engineering teams to embed reliability into delivery without becoming a bottleneck.
Communicates reliability risks and operational impacts clearly, escalating early when needed.
Contributes to shared platform practices and standards across tribes (templates, runbooks, alerting patterns).
Builds strong working relationships with stakeholders to support customer outcomes.

Key Competencies

Technical

Experience supporting and improving production services with reliability and performance expectations.
Working knowledge of cloud and cloud-native operations (Azure preferred), and the fundamentals of running services safely.
Experience with IaC and automation (tooling/framework aligned to your stack), with good review and change discipline.
Familiarity with CI/CD and deployment practices; able to improve pipelines and release safety under guidance.
Practical observability skills: logs/metrics/traces, dashboards, and alert tuning.
Comfortable scripting and automation (e.g., PowerShell, CLI tooling).

AI-First SRE (Must Have)

Uses AI to accelerate investigation, automation drafts, and runbook creation, and verifies outputs before use.
Can follow and contribute to repeatable operational workflows and templates that improve reliability over time.
Understands and mitigates AI risks in operations (unsafe actions, false confidence, confidentiality).

Non Technical

Calm, pragmatic, and reliable; communicates clearly during incidents and operational issues.
Outcome-focused with a bias for automation and systemic fixes over manual effort.
Collaborative and receptive to feedback; grows quickly in a high-tempo environment.
Customer-aware mindset suitable for regulated, mission-critical environments.

Required Education & Experience

Experience as an SRE / DevOps / Production Engineer (Typically 2–5 years).
Experience supporting cloud services and operational automation in production environments; Azure experience beneficial.
Experience contributing to CI/CD, IaC, and observability practices in a delivery team.
Strong academic background, including a degree in a STEM discipline, or equivalent experience.

How Success Will Be Measured

This role is measured on outcomes and how they’re achieved: improving reliability and operational signal quality, reducing toil through automation, and supporting controlled change in an AI-first operating model.

Reliability: SLO attainment, availability/performance trends, incident frequency/severity trend, and MTTR improvements
Change safety: change failure rate and rollback rate improve; releases become safer and more predictable
Observability: alert signal-to-noise improves (flake/noise down), coverage of key services/journeys increases, faster diagnosis from logs/metrics/traces
Toil reduction: automation increases, manual operational overhead reduces, runbooks/playbooks drive consistent response
Cost & capacity: capacity planning maturity improves; cost optimisation without risking SLOs
Behaviours: AI-first by default (agents draft, humans verify); strong verification discipline; reliable incident participation; automation mindset; control-aware and security-conscious decisions

Diversity & Inclusion at Xceptor

We believe great ideas come from everywhere — and that the best teams are made up of people with different backgrounds, experiences, and perspectives. At Xceptor, we’re committed to building a workplace where everyone feels welcome, valued, and empowered to be themselves.

We know that not everyone ticks every single box in a job description — and that’s okay. If you’re excited about this role and think you could make a difference, we’d love to hear from you. Your unique skills and experiences might be just what we need, even if you don’t meet every requirement.

We celebrate diversity and are dedicated to creating an inclusive environment for all employees — regardless of race, gender identity, sexual orientation, age, disability, religion, or background.

Application & Compliance

Please note:

Xceptor works with clients in financial services, and our offers of employment are subject to the satisfactory completion of background checks, which include criminal record checks and credit reference checks.
If you have any employment gaps exceeding three months within the last six years, we will request additional information and evidence to clarify those periods.
