Senior Site Reliability Engineer

In Cork

Portsmouth

Hybrid

GBP 75,000 - 95,000

Full time

Today

Job summary

A technology firm is seeking a Senior Site Reliability Engineer to join its team in Portsmouth, UK, with significant remote flexibility. The role involves ensuring the high availability and reliability of production systems, automating operational tasks, and collaborating with development teams. Ideal candidates have 5+ years of experience in SRE, DevOps, or Systems Engineering, strong cloud platform skills (AWS, Azure, or GCP), and proficiency in Infrastructure as Code tools such as Terraform. This is an excellent opportunity to drive improvements within a challenging environment.

Qualifications

  • 5+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering.
  • Strong experience with cloud platforms (AWS, Azure, or GCP).
  • Expertise in Infrastructure as Code tools (e.g., Terraform, Ansible).

Responsibilities

  • Ensure high availability, performance, and scalability of production systems.
  • Develop and maintain monitoring, alerting, and logging systems.
  • Collaborate with development teams to improve service reliability.

Skills

Cloud platform expertise
Proficiency in scripting languages
Problem-solving skills

Education

Bachelor's or Master's degree in Computer Science or related field

Tools

Terraform
Kubernetes
Prometheus

Job description

Job Summary

Our client is seeking an experienced Senior Site Reliability Engineer (SRE) to join its engineering team, based in Portsmouth, Hampshire, UK, with significant remote flexibility. This role is crucial for ensuring the availability, performance, scalability, and reliability of the client's production systems and services. You will be responsible for automating operational tasks, developing monitoring and alerting systems, responding to incidents, and driving improvements in system stability and efficiency. The ideal candidate will have a strong background in systems administration and software development, along with a deep understanding of cloud infrastructure and distributed systems.

You will build and maintain robust infrastructure, implement Infrastructure as Code (IaC) using tools like Terraform or Ansible, and manage CI/CD pipelines. A key aspect of this role is collaborating with development teams to ensure services are designed for reliability and operability from the outset. You will also be involved in capacity planning, performance tuning, and disaster recovery strategies. Experience with container orchestration platforms such as Kubernetes is highly desirable. The position requires a proactive approach to identifying potential issues, a strong understanding of networking, and the ability to troubleshoot complex system problems under pressure. You will play a key part in fostering a culture of reliability and operational excellence within the engineering organization. This is an excellent opportunity for a seasoned SRE looking to make a significant impact in a challenging and rewarding environment.

Responsibilities

  • Ensure the high availability, performance, and scalability of production systems.
  • Automate infrastructure provisioning, configuration, and deployment using IaC tools.
  • Develop and maintain robust monitoring, alerting, and logging systems.
  • Respond to and resolve production incidents, performing root cause analysis.
  • Collaborate with development teams to improve service reliability and operability.
  • Implement and manage CI/CD pipelines for efficient software delivery.
  • Conduct capacity planning and performance tuning.
  • Develop and test disaster recovery and business continuity plans.
  • Manage and optimize containerized environments (e.g., Kubernetes).
  • Contribute to architectural decisions related to system design and reliability.

Qualifications

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent experience.
  • Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or Systems Engineering.
  • Strong experience with cloud platforms (AWS, Azure, or GCP).
  • Proficiency in at least one scripting or programming language (e.g., Python, Bash, Go).
  • Experience with Infrastructure as Code tools (e.g., Terraform, Ansible, Chef, Puppet).
  • Solid understanding of Linux/Unix operating systems.
  • Experience with containerization and orchestration technologies (Docker, Kubernetes).
  • Knowledge of networking concepts and protocols.
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Excellent problem-solving and troubleshooting skills.