We are seeking a talented and experienced DevOps Engineer to join our team and contribute to the further development of Arena, a web-based software platform for reinforcement learning training and RLOps. As a DevOps Engineer, you will be responsible for designing, implementing, and maintaining the cloud infrastructure, CI/CD pipelines, and deployment systems that enable businesses to build and deploy reinforcement learning models at scale.
Responsibilities
- Design and maintain robust, scalable cloud infrastructure to support high-performance reinforcement learning workloads and distributed training environments
- Build and optimise CI/CD pipelines for both our open-source framework and Arena enterprise platform, ensuring reliable deployments and automated testing
- Implement and manage containerisation strategies using Docker and Kubernetes for ML model training, deployment, and orchestration
- Develop infrastructure as code (IaC) solutions using tools like Terraform, CloudFormation, or Pulumi to ensure reproducible and version-controlled infrastructure
- Monitor system performance, implement alerting and logging solutions, and troubleshoot production issues across distributed ML training environments
- Collaborate with ML engineers to optimise resource allocation and cost efficiency for compute-intensive RL training workloads
- Implement security best practices, manage access controls, and ensure compliance with enterprise security requirements
- Automate operational tasks including backup strategies, disaster recovery procedures, and system maintenance
- Support the deployment and scaling of GPU clusters and distributed computing resources for reinforcement learning applications
- Maintain high availability and performance of production systems serving ML models to external customers
Requirements
- Bachelor's degree or higher in Computer Science, Engineering, or a related field, or 3+ years of relevant DevOps/infrastructure experience
- Strong experience with cloud platforms (AWS, GCP, Azure) and their ML/AI services, with expertise in managing compute-intensive workloads
- Proficiency in containerisation technologies (Docker, Kubernetes) and container orchestration for ML workloads
- Experience with Infrastructure as Code tools (Terraform, CloudFormation, Pulumi) and configuration management
- Solid understanding of CI/CD principles and tools (GitHub Actions, GitLab CI, Jenkins) with experience in ML pipeline automation
- Knowledge of monitoring and observability tools (Prometheus, Grafana, OpenObserve) and their application to ML systems
- Experience with GPU infrastructure management and distributed computing frameworks for machine learning
- Familiarity with MLOps practices and tools for model deployment, versioning, and lifecycle management
- Strong scripting skills in Python, Bash, or similar languages for automation tasks
- Understanding of networking, security, and database management in cloud environments
- Experience with high-performance computing environments and job scheduling systems is a plus
- Knowledge of machine learning workflows and the unique infrastructure requirements of ML training and inference
- Strong problem-solving skills and ability to work in a fast-paced, collaborative environment
- Excellent communication skills and experience working with cross-functional teams
Compensation
- Competitive salary + significant stock options.
- 30 days of holiday, plus bank holidays, per year.
- Flexible working-from-home and 6-month remote working policies.
- Enhanced parental leave.
- Learning budget of £500 per calendar year for books, training courses and conferences.
- Company pension scheme.
- Regular team socials and quarterly all-company parties.
- Bike2Work scheme.
Join the fast-growing AgileRL team and play a key role in the development of cutting-edge reinforcement learning tooling and infrastructure.