HPC Engineer

Hlx Life Sciences

Oxford

On-site

GBP 125,000 - 150,000

Full time

27 days ago

Job summary

A leading life sciences company in Oxford is seeking a Senior HPC Engineer. This role involves designing and optimising high-performance GPU clusters for machine learning workflows. Ideal candidates will have proven experience in ML infrastructure, a proactive systems design approach, and expertise in high-throughput storage systems. Applicants should have the right to work permanently in the UK.

Qualifications

  • Proven experience leading design of high-performance ML compute clusters.
  • Proactive approach to systems design, able to implement optimal solutions.
  • Experience with transitioning ML infrastructure to modern systems.

Responsibilities

  • Build and optimise high-performance GPU training and inference clusters.
  • Drive implementation for high-throughput data paths.
  • Benchmark and resolve performance bottlenecks.
  • Establish observability and security in research environments.
  • Forecast capacity and cost for GPU needs.

Skills

Design and operation of high-performance ML compute clusters
Proactive systems design
Exposure to containerised systems
Expertise with high-throughput storage systems
Understanding of GPU architecture
Knowledge of IaC and CI/CD practices

Tools

Terraform
Argo CD

Job description

We are seeking a Senior HPC Engineer to design, implement, and scale the infrastructure that supports high-performance machine learning and AI‑driven research workflows. You will play a critical role in bridging data science, bioinformatics, and engineering, ensuring the seamless, secure, and reproducible deployment of ML models in production and research environments.

You’ll collaborate closely with AI Scientists, Data Engineers, and DevSecOps teams, building automation pipelines that accelerate model development and deployment across distributed, cloud‑native systems.

Key Responsibilities

  • Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high‑availability scheduling, isolation, and automated lifecycle management.
  • Drive systems design and implementation for high‑throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
  • Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
  • Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
  • Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.

Essential Skills and Experience

  • Proven experience leading the design, build, and operation of high‑performance ML compute clusters at scale
  • A proactive, autonomous approach to systems design, with the proven ability and desire to ideate, co‑create, and implement optimal solutions
  • Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
  • Expertise with high‑throughput storage systems for ML/HPC workloads
  • Expert‑level understanding of GPU architecture, high‑speed networking for distributed training, and performance profiling to resolve bottlenecks
  • A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)

Terms of Appointment

Applicants must have the right to work permanently in the UK and be within commuting distance of Oxford.

Seniority level

  • Mid‑Senior level

Employment type

  • Full‑time

Job function

  • Business Consulting and Services and Biotechnology Research