DevOps Engineer, London, UK

Entrepreneur First

City of London

On-site

GBP 80,000 - 100,000

Full time

30+ days ago

Job summary

A tech-driven company in the City of London is looking for a talented DevOps Engineer to help develop a web-based software platform for reinforcement learning. The role involves designing robust cloud infrastructure, maintaining CI/CD pipelines, and collaborating with cross-functional teams. Competitive salary, significant stock options, and flexible working options are offered.

Benefits

Competitive salary + significant stock options
30 days of holiday plus bank holidays
Flexible working from home
Enhanced parental leave
Learning budget of £500 per year
Company pension scheme
Regular team socials
Bike2Work scheme

Qualifications

  • Strong experience with managing compute-intensive workloads in cloud environments.
  • Proficiency in using containerisation technologies for ML workloads.
  • Solid understanding of CI/CD principles with experience in ML pipeline automation.

Responsibilities

  • Design and maintain cloud infrastructure for reinforcement learning workloads.
  • Build and optimise CI/CD pipelines for reliable deployments.
  • Implement containerisation strategies using Docker and Kubernetes.

Skills

Cloud platforms (AWS, GCP, Azure)
Containerisation (Docker, Kubernetes)
Scripting (Python, Bash)
Infrastructure as Code (Terraform, CloudFormation, Pulumi)
CI/CD principles and tools (GitHub Actions, GitLab CI, Jenkins)
Monitoring tools (Prometheus, Grafana)
Networking and security
MLOps practices
High-performance computing
Problem-solving skills

Education

Bachelor's degree or higher in Computer Science or related field
3+ years of relevant DevOps/infrastructure experience

Tools

Docker
Kubernetes
Terraform
CloudFormation
Prometheus
Grafana

Job description

We are seeking a talented and experienced DevOps Engineer to join our team. This engineer will contribute to the further development of Arena, a web-based software platform for reinforcement learning training and RLOps. As a DevOps Engineer, you will be responsible for designing, implementing, and maintaining the cloud infrastructure, CI/CD pipelines, and deployment systems that enable businesses to build and deploy reinforcement learning models at scale.

Responsibilities

  • Design and maintain robust, scalable cloud infrastructure to support high-performance reinforcement learning workloads and distributed training environments
  • Build and optimise CI/CD pipelines for both our open-source framework and Arena enterprise platform, ensuring reliable deployments and automated testing
  • Implement and manage containerisation strategies using Docker and Kubernetes for ML model training, deployment, and orchestration
  • Develop infrastructure as code (IaC) solutions using tools like Terraform, CloudFormation, or Pulumi to ensure reproducible and version-controlled infrastructure
  • Monitor system performance, implement alerting and logging solutions, and troubleshoot production issues across distributed ML training environments
  • Collaborate with ML engineers to optimise resource allocation and cost efficiency for compute-intensive RL training workloads
  • Implement security best practices, manage access controls, and ensure compliance with enterprise security requirements
  • Automate operational tasks including backup strategies, disaster recovery procedures, and system maintenance
  • Support the deployment and scaling of GPU clusters and distributed computing resources for reinforcement learning applications
  • Maintain high availability and performance of production systems serving ML models to external customers

Requirements

  • Bachelor's degree or higher in Computer Science, Engineering, or a related field, or 3+ years of relevant DevOps/infrastructure experience
  • Strong experience with cloud platforms (AWS, GCP, Azure) and their ML/AI services, with expertise in managing compute-intensive workloads
  • Proficiency in containerisation technologies (Docker, Kubernetes) and container orchestration for ML workloads
  • Experience with Infrastructure as Code tools (Terraform, CloudFormation, Pulumi) and configuration management
  • Solid understanding of CI/CD principles and tools (GitHub Actions, GitLab CI, Jenkins) with experience in ML pipeline automation
  • Knowledge of monitoring and observability tools (Prometheus, Grafana, OpenObserve) and their application to ML systems
  • Experience with GPU infrastructure management and distributed computing frameworks for machine learning
  • Familiarity with MLOps practices and tools for model deployment, versioning, and lifecycle management
  • Strong scripting skills in Python, Bash, or similar languages for automation tasks
  • Understanding of networking, security, and database management in cloud environments
  • Experience with high-performance computing environments and job scheduling systems is a plus
  • Knowledge of machine learning workflows and the unique infrastructure requirements of ML training and inference
  • Strong problem-solving skills and ability to work in a fast-paced, collaborative environment
  • Excellent communication skills and experience working with cross-functional teams

Compensation

  • Competitive salary + significant stock options.
  • 30 days of holiday, plus bank holidays, per year.
  • Flexible working from home and 6-month remote working policies.
  • Enhanced parental leave.
  • Learning budget of £500 per calendar year for books, training courses and conferences.
  • Company pension scheme.
  • Regular team socials and quarterly all-company parties.
  • Bike2Work scheme.

Join the fast-growing AgileRL team and play a key role in the development of cutting-edge reinforcement learning tooling and infrastructure.
