Sr. MLOps Engineer, GenAI (Based in Dubai)
As the leading delivery platform in the region, talabat is on a mission to scale and evolve its machine learning capabilities, including Generative AI (GenAI) initiatives. This requires robust, efficient, and scalable ML platforms that empower teams to rapidly develop, deploy, and operate intelligent systems.
Overview
As an ML Platform Engineer, your mission is to design, build, and enhance the infrastructure and tooling that accelerates the development, deployment, and monitoring of traditional ML and GenAI models at scale. You will collaborate with data scientists, ML engineers, GenAI specialists, and product teams to deliver seamless ML workflows—from experimentation to production serving—with a focus on operational excellence across ML and GenAI systems.
Responsibilities
- Design, build, and maintain scalable, reusable, and reliable ML platforms and tooling that support the entire ML lifecycle for both traditional and GenAI models (data ingestion, model training, evaluation, deployment, monitoring).
- Develop standardized ML workflows and templates using MLflow and other platforms to enable rapid experimentation and deployment cycles.
- Implement robust CI/CD pipelines, Docker containerization, model registries, and experiment tracking to support reproducibility, scalability, and governance.
- Collaborate with GenAI experts to integrate and optimize GenAI technologies, including transformers, embeddings, vector databases (e.g., Pinecone, Redis, Weaviate), and retrieval-augmented generation (RAG).
- Automate and streamline ML and GenAI model training, inference, deployment, and versioning workflows for consistency and reliability.
- Ensure reliability, observability, and scalability of production ML and GenAI workloads with monitoring, alerting, and continuous performance evaluation.
- Integrate infrastructure components such as real-time model serving frameworks (TensorFlow Serving, NVIDIA Triton, Seldon), Kubernetes orchestration, and cloud solutions (AWS/GCP).
- Drive infrastructure optimization for GenAI use cases, including inference optimization, fine-tuning, prompt management, and regular model updates at scale.
- Partner with data engineering, product, infrastructure, and GenAI teams to align ML platform initiatives with company goals, infrastructure strategy, and roadmap.
- Contribute to internal documentation, onboarding, and training to promote platform adoption and continuous improvement.
Requirements
- Strong software engineering background with experience in building distributed systems or platforms for ML and AI workloads.
- Expert-level Python; familiarity with ML frameworks (TensorFlow, PyTorch), infrastructure tooling (MLflow, Kubeflow, Ray), and GenAI libraries and APIs (Hugging Face, OpenAI, LangChain).
- Experience with modern MLOps practices: model lifecycle management, CI/CD, Docker, Kubernetes, model registries, and IaC tools (Terraform, Helm).
- Experience with cloud infrastructure (AWS/GCP), Kubernetes clusters (GKE/EKS), serverless architectures, and managed ML services (e.g., Vertex AI, SageMaker).
- Proven experience with GenAI technologies: transformers, embeddings, prompt engineering, fine-tuning vs. prompt-tuning, vector databases, and RAG systems.
- Experience designing real-time inference pipelines with feature stores, streaming data platforms (Kafka, Kinesis), and observability platforms.
- Familiarity with SQL and data warehousing; ability to manage complex queries and transformations.
- Strong understanding of ML monitoring, including drift, latency, cost management, and scalable API-based GenAI applications.