We are seeking a talented and experienced DevOps Engineer to join our team and contribute to the further development of Arena, a web-based software platform for reinforcement learning training and RLOps. As a DevOps Engineer, you will be responsible for designing, implementing, and maintaining the cloud infrastructure, CI/CD pipelines, and deployment systems that enable businesses to build and deploy reinforcement learning models at scale.
Responsibilities
- Design and maintain robust, scalable cloud infrastructure to support high-performance reinforcement learning workloads and distributed training environments
- Build and optimise CI/CD pipelines for both our open-source framework and Arena enterprise platform, ensuring reliable deployments and automated testing
- Implement and manage containerisation strategies using Docker and Kubernetes for ML model training, deployment, and orchestration
- Develop infrastructure as code (IaC) solutions using tools like Terraform, CloudFormation, or Pulumi to ensure reproducible and version-controlled infrastructure
- Monitor system performance, implement alerting and logging solutions, and troubleshoot production issues across distributed ML training environments
- Collaborate with ML engineers to optimise resource allocation and cost efficiency for compute-intensive RL training workloads
- Implement security best practices, manage access controls, and ensure compliance with enterprise security requirements
- Automate operational tasks including backup strategies, disaster recovery procedures, and system maintenance
- Support the deployment and scaling of GPU clusters and distributed computing resources for reinforcement learning applications
- Maintain high availability and performance of production systems serving ML models to external customers
Requirements
- Bachelor's degree or higher in Computer Science, Engineering, or a related field, or 3+ years of relevant DevOps/infrastructure experience
- Strong experience with cloud platforms (AWS, GCP, Azure) and their ML/AI services, with expertise in managing compute-intensive workloads
- Proficiency in containerisation technologies (Docker, Kubernetes) and container orchestration for ML workloads
- Experience with Infrastructure as Code tools (Terraform, CloudFormation, Pulumi) and configuration management
- Solid understanding of CI/CD principles and tools (GitHub Actions, GitLab CI, Jenkins) with experience in ML pipeline automation
- Knowledge of monitoring and observability tools (Prometheus, Grafana, OpenObserve) and their application to ML systems
- Experience with GPU infrastructure management and distributed computing frameworks for machine learning
- Familiarity with MLOps practices and tools for model deployment, versioning, and lifecycle management
- Strong scripting skills in Python, Bash, or similar languages for automation tasks
- Understanding of networking, security, and database management in cloud environments
- Experience with high-performance computing environments and job scheduling systems is a plus
- Knowledge of machine learning workflows and the unique infrastructure requirements of ML training and inference
- Strong problem-solving skills and ability to work in a fast-paced, collaborative environment
- Excellent communication skills and experience working with cross-functional teams
Compensation
- Competitive salary + significant stock options.
- 30 days of holiday, plus bank holidays, per year.
- Flexible working-from-home and 6-month remote working policies.
- Enhanced parental leave.
- Learning budget of £500 per calendar year for books, training courses and conferences.
- Company pension scheme.
- Regular team socials and quarterly all-company parties.
- Bike2Work scheme.
Join the fast-growing AgileRL team and play a key role in the development of cutting-edge reinforcement learning tooling and infrastructure.