Machine Learning Engineer London Office

Oriole Networks Ltd

City of London

On-site

GBP 60,000 - 80,000

Full time

28 days ago

Job summary

A technology company in the UK is seeking a talented Machine Learning Engineer to improve the performance of its AI/ML software stack. The role involves designing custom GPU communication kernels and collaborating on large-scale deep learning models, with a strong emphasis on GPU programming and optimization. Applicants should be proficient in C++ and Python and have expertise in CUDA programming and deep learning frameworks.

Qualifications

  • Expertise in high-performance computing or machine learning projects.
  • Strong understanding of GPU memory hierarchies and kernel optimization.
  • Solid experience in deploying and optimizing distributed deep learning workloads.

Responsibilities

  • Design and optimize custom GPU communication kernels.
  • Develop distributed communication frameworks for deep learning models.
  • Collaborate with hardware teams for integration of optimized kernels.

Skills

C++
Python
CUDA programming
GPU debugging
Communication libraries knowledge
Distributed deep learning frameworks

Tools

CUDA-GDB
CUDA-MEMCHECK
Nsight Systems
Docker
Kubernetes
SLURM

Job description

Oriole is seeking talented Machine Learning Engineers to help co‑optimize our AI/ML software stack with cutting‑edge network hardware. You’ll be a key contributor to a high‑impact, agile team focused on integrating middleware communication libraries and modelling the performance of large‑scale AI/ML workloads.

Key Responsibilities:
  • Design and optimize custom GPU communication kernels to enhance performance and scalability across multi‑node environments.
  • Develop and maintain distributed communication frameworks for large‑scale deep learning models, ensuring efficient parallelization and optimal resource utilization.
  • Profile, benchmark, and debug GPU applications to identify and resolve bottlenecks in communication and computation pipelines.
  • Collaborate closely with hardware and software teams to integrate optimized kernels with Oriole’s next‑generation network hardware and software stack.
  • Contribute to system‑level architecture decisions for large‑scale GPU clusters, with a focus on communication efficiency, fault tolerance, and novel architectures for advanced optical network infrastructure.

Required Skills & Experience:
  • Proficient in C++ and Python, with a strong track record in high‑performance computing or machine learning projects.
  • Expertise in GPU programming with CUDA, including deep knowledge of GPU memory hierarchies and kernel optimization.
  • Hands‑on experience debugging GPU kernels with tools such as CUDA‑GDB, CUDA‑MEMCHECK, and Nsight Systems, and by inspecting PTX and SASS.
  • Strong understanding of communication libraries and protocols, including NCCL, NVSHMEM, Open MPI, UCX, or custom collective communication implementations.
  • Familiarity with HPC networking protocols/libraries such as RoCE, InfiniBand, libibverbs, and libfabric.
  • Experienced with distributed deep learning/MoE frameworks, including PyTorch Distributed, vLLM, or DeepEP.
  • Solid understanding of deploying and optimizing large‑scale distributed deep learning workloads in production environments, including Linux, Kubernetes, SLURM, Open MPI, GPU drivers, Docker, and CI/CD automation.