Overview
The Site Reliability Engineer will transform the SDLC environment in an engineering-focused role that emphasizes the reliability, automation, and performance of non-production systems.
Role: Site Reliability Engineer (SRE)
Location: London
Work Mode: Hybrid
Contract Role
Responsibilities
- Automate environment lifecycle: Develop Infrastructure as Code (IaC) to automate provisioning, teardown, and configuration of test environments, integrating them with the CI/CD pipeline.
- Establish service level objectives (SLOs): Define and measure service level indicators (SLIs) for test environments, and set SLOs that meet the needs of development and testing teams.
- Monitor environment health and performance: Use observability tools (e.g., Prometheus, Grafana) to track health, identify bottlenecks, and resolve issues proactively.
- Manage incident response: Lead incident management for test environment issues, conduct blameless post-mortems, and implement lasting fixes.
- Minimize toil: Automate manual, repetitive tasks related to test environments to free up engineering time.
- Drive continuous improvement: Analyze environment performance data, incident reports, and post-mortems to identify opportunities for improvement and innovation.
- Balance reliability and speed: Use an error budget approach for test environments to guide trade-offs between reliability work and feature development.
- Instil a reliability culture: Promote a blameless culture and shared ownership across development, QA, and SRE teams.
- Plan capacity: Anticipate future resource needs and ensure infrastructure can scale to meet demand.
- Advance test data management: Ensure test data is readily available, consistent, compliant, and provisioned with environments.
Technical Skills
- Tooling: Proficiency with monitoring and logging tools (Prometheus, Splunk, Grafana), CI/CD platforms (e.g., Jenkins, GitLab CI), and configuration management and IaC tools (e.g., Ansible, Terraform).
- Cloud infrastructure: Deep understanding of AWS, containerization (Docker, Kubernetes), and serverless computing.
- Scripting and programming: Strong scripting skills in Python or Bash.
- Systems and networking: Solid knowledge of Linux, networking, and database management.
Soft Skills
- Leadership and influence: Ability to champion SRE practices and influence stakeholders across teams.
- Problem-solving: Strong analytical and debugging skills for resolving complex issues under pressure.
- Communication: Excellent collaboration skills across development, QA, and operations.
- Adaptability: Proactive mindset and the ability to adapt to evolving technologies and methodologies.
Job function
- Engineering and Information Technology
Industries
- IT Services and IT Consulting