Infrastructure & DevOps
Kubernetes for ML Workloads: A Practical Playbook
GPU node pools, spot instance strategies, model serving with vLLM, and the autoscaling configuration that cut our inference costs by 65%.
Why ML Workloads Break Standard Kubernetes Patterns
Kubernetes is excellent at running stateless web services. ML workloads are neither stateless nor CPU-homogeneous — they have GPU affinity, burst memory requirements, long-running inference jobs, and wildly asymmetric scale patterns. Running them well requires departing from the patterns you learned running Rails apps.
GPU Node Pools: Separate and Isolated
GPU nodes are expensive (A100 spot instances run $2–4/hr on AWS) and should never be contaminated by CPU-only workloads. Use dedicated node pools with GPU taints, so that only pods carrying a matching toleration and an explicit nvidia.com/gpu resource limit can land there. Without the taint, pods that do not need GPUs will happily schedule onto GPU nodes and waste capacity.
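A minimal sketch of the pattern, assuming the NVIDIA device plugin is installed and a hypothetical node pool labelled `pool: gpu-pool` whose nodes are tainted with `nvidia.com/gpu=present:NoSchedule`:

```yaml
# Pod that is allowed onto the tainted GPU pool: it tolerates the taint
# and declares an explicit GPU limit. CPU-only pods lack the toleration
# and are repelled by the taint.
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    pool: gpu-pool            # hypothetical node pool label
  containers:
    - name: worker
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1   # extended resources need only a limit;
                              # the request is implied to be equal
```

For extended resources like nvidia.com/gpu, Kubernetes requires requests and limits to be equal, which is why only the limit is stated.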
Karpenter (on EKS) or node auto-provisioning (on GKE) is essential for cost efficiency. Configure a fast scale-up trigger (80% utilisation over a 60-second window) and a conservative scale-down (15 minutes idle, to avoid GPU node thrash). GPU node startup takes 3–5 minutes, so plan your buffer accordingly.
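On the Karpenter side, the conservative scale-down maps to consolidation settings on the NodePool. A sketch using the Karpenter v1 API on AWS; the instance families, limits, and names are illustrative, not a recommendation:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "p4d"]        # example GPU instance families
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"] # prefer spot, fall back to on-demand
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu                      # assumed EC2NodeClass, defined elsewhere
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 15m              # the conservative 15-minute idle window
  limits:
    nvidia.com/gpu: 16                 # hard cap on total GPUs in this pool
```

Karpenter itself scales up on pending pods rather than a utilisation percentage; the 80%/60-second trigger belongs to whatever creates those pods (HPA or KEDA, below).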
vLLM for Model Serving
vLLM's continuous batching and PagedAttention make it the highest-throughput open-source LLM serving framework available. On a single A100, vLLM serving Llama-3-70B achieves 3–5x the throughput of naive Hugging Face inference. Deployment pattern: one vLLM Deployment per model, exposed via a ClusterIP Service, fronted by an NGINX layer that handles auth, rate limiting, and model routing.
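A sketch of that per-model Deployment and Service, using the official vllm/vllm-openai image (which serves an OpenAI-compatible API on port 8000); the model name and labels are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-70b
  template:
    metadata:
      labels:
        app: vllm-llama3-70b
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "meta-llama/Meta-Llama-3-70B-Instruct"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-70b
spec:
  type: ClusterIP               # internal only; NGINX sits in front
  selector:
    app: vllm-llama3-70b
  ports:
    - port: 8000
      targetPort: 8000
```

NGINX then routes by model name to the matching Service, keeping auth and rate limiting out of the serving pods entirely.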
The Autoscaling Configuration That Cut Our Costs by 65%
Standard Kubernetes HPA scales on CPU or memory — neither correlates well with LLM inference load. We scale on queue depth: Celery tasks pending per model endpoint, exposed as a custom metric via KEDA (Kubernetes Event-Driven Autoscaler). Scale up at 50 pending tasks, scale down at 5, with a 10-minute stabilisation window.
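A sketch of the equivalent KEDA ScaledObject, assuming Celery's default Redis broker; the deployment name, broker address, and queue key are illustrative:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-queue-scaler
spec:
  scaleTargetRef:
    name: vllm-llama3-70b              # the Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: redis
      metadata:
        address: redis.default.svc:6379  # hypothetical Celery broker
        listName: celery                 # Celery's default queue key
        listLength: "50"                 # target: scale up around 50 pending tasks
        activationListLength: "5"        # roughly the scale-down floor
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 600  # the 10-minute stabilisation window
```

KEDA feeds the queue depth into an HPA under the hood, so the stabilisation window is expressed through standard HPA scale-down behaviour.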
This single change reduced our GPU cluster bill by 65% compared to time-based scaling. Queue-depth scaling keeps nodes alive for actual load. For inference workloads with variable traffic, it is the right signal.
Monitoring That Actually Helps
Deploy the NVIDIA DCGM Exporter for per-GPU utilisation, memory bandwidth, and temperature metrics. Alert on sustained GPU memory usage above 85% (OOM risk) and above 95% (allocation failures from fragmentation become likely). A GPU that looks healthy on CPU dashboards can be silently underperforming due to PCIe bandwidth saturation; you will only see it in DCGM data.
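With the Prometheus Operator (kube-prometheus-stack), the memory alert can be expressed against the exporter's framebuffer metrics. A sketch; the rule name and severity labels are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-memory-alerts
spec:
  groups:
    - name: dcgm
      rules:
        - alert: GpuMemoryHigh
          # Fraction of framebuffer memory in use, from DCGM exporter metrics
          expr: |
            DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.85
          for: 10m                      # "sustained", not a transient spike
          labels:
            severity: warning
          annotations:
            summary: "GPU {{ $labels.gpu }} memory above 85% (OOM risk)"
```

A second rule with a 0.95 threshold and severity: critical covers the fragmentation case.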
Deepak Kushwaha