A Jupyter notebook with 94% accuracy is not a product. The moment you close the laptop, that model serves nobody. MLOps is the discipline that bridges the gap between experimentation and production, and it is where most AI teams struggle the hardest.
Notebooks encourage terrible engineering habits: hidden state from out-of-order cell execution, hardcoded file paths, no dependency management, and zero reproducibility. A model trained in a notebook cannot be retrained automatically when new data arrives, cannot be rolled back when it breaks, and cannot be monitored for degradation.
Production ML requires versioned code, versioned data, versioned models, and automated pipelines. Everything else is a demo.
The MLOps lifecycle is a continuous loop, not a linear pipeline. Each stage feeds back into the others, and the maturity of your organisation determines how automated these transitions are.
Google defines three MLOps maturity levels:

- **Level 0 (manual process):** every step from data analysis to deployment is manual and script- or notebook-driven; releases are infrequent and fragile.
- **Level 1 (ML pipeline automation):** training is automated as a pipeline, enabling continuous training on fresh data.
- **Level 2 (CI/CD pipeline automation):** the pipelines themselves are built, tested, and deployed automatically, so changes to pipeline code roll out rapidly and reliably.
The stages of the loop:

- Data ingestion and validation
- Feature engineering
- Model training and evaluation
- Model registration
- Deployment and serving
- Monitoring and drift detection, which feeds back into data collection and retraining
A model registry is the version control system for your trained models. Tools like MLflow, Weights & Biases, and Neptune let you store model artefacts alongside their training metrics, hyperparameters, dataset versions, and promotion status (staging → production → archived).
The registry answers critical questions: Which model is currently in production? What data was it trained on? Who approved it? What were its evaluation metrics?
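To make those questions concrete, here is a minimal in-memory sketch of what a registry records per model version; it is a toy illustration of the concept, not the API of MLflow or any other tool, and the model and dataset names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ModelVersion:
    version: int
    metrics: dict          # evaluation metrics at registration time
    dataset_version: str   # which data snapshot the model was trained on
    stage: str = "staging" # staging -> production -> archived

class ModelRegistry:
    def __init__(self):
        self._models = {}  # model name -> list of ModelVersion

    def register(self, name, metrics, dataset_version):
        versions = self._models.setdefault(name, [])
        mv = ModelVersion(len(versions) + 1, metrics, dataset_version)
        versions.append(mv)
        return mv

    def promote(self, name, version, stage):
        # Archive the current production model before promoting a new one,
        # so "which model is in production?" always has exactly one answer.
        if stage == "production":
            for mv in self._models[name]:
                if mv.stage == "production":
                    mv.stage = "archived"
        self._models[name][version - 1].stage = stage

    def production_version(self, name):
        return next(
            (mv for mv in self._models[name] if mv.stage == "production"), None
        )

registry = ModelRegistry()
registry.register("churn-classifier", {"auc": 0.91}, "ds-2024-01")
registry.promote("churn-classifier", 1, "production")
registry.register("churn-classifier", {"auc": 0.93}, "ds-2024-02")
registry.promote("churn-classifier", 2, "production")
```

After the second promotion, version 1 is archived and version 2 is the single production model; a real registry adds exactly this lineage plus artefact storage and approval metadata.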
What is the primary purpose of a model registry?
Traditional CI/CD tests code. ML CI/CD must test three things: code, data, and models.
A pull request that changes a feature pipeline should trigger retraining and evaluation automatically. If the new model underperforms the baseline, the PR is blocked.
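A gate like that can be a small script run by CI after the retraining job finishes. The sketch below is hypothetical: the metric names and the 0.01 regression tolerance are illustrative choices, not a standard.

```python
def gate(baseline: dict, candidate: dict, tolerance: float = 0.01) -> bool:
    """Return True if the candidate model may be merged."""
    for metric, base_value in baseline.items():
        # Block the PR if any tracked metric regresses beyond the tolerance.
        # A metric missing from the candidate report counts as a regression.
        if candidate.get(metric, float("-inf")) < base_value - tolerance:
            return False
    return True

# In CI, the script would exit nonzero to block the merge, e.g.:
# sys.exit(0 if gate(baseline_metrics, candidate_metrics) else 1)
```

The one-sided check is deliberate: a candidate that improves some metrics is still blocked if any other tracked metric drops past the tolerance.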
Docker is non-negotiable for production ML. Your training environment and serving environment must be identical; the "it works on my machine" problem is amplified when GPU drivers, CUDA versions, and Python dependencies all need to match.
A typical ML Dockerfile layers: base CUDA image → Python dependencies → model artefacts → serving framework. Multi-stage builds keep the final image lean by separating the build environment from the runtime.
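A sketch of such a multi-stage Dockerfile; the CUDA tags, paths, and serving command are illustrative assumptions, not a canonical recipe.

```dockerfile
# --- Build stage: full toolchain for installing pinned dependencies ---
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04 AS build
RUN apt-get update && apt-get install -y --no-install-recommends python3-pip
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt

# --- Runtime stage: lean image with only what serving needs ---
# In production, pin this base image by digest rather than by tag.
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=build /install /usr/local
COPY model/ /app/model/
WORKDIR /app
CMD ["python3", "-m", "serving.app"]
```

Only the runtime stage ships: compilers and pip caches stay behind in the build stage, which is what keeps the final image lean.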
Key practices for ML containerisation:
- Pin exact dependency versions in requirements.txt, not ranges
- Pin base images by digest: nvidia/cuda@sha256:... instead of :latest

How you serve a model depends on your latency and throughput requirements:
| Tool | Best For | Key Feature |
|------|----------|-------------|
| TorchServe | PyTorch models | Built-in batching, model versioning |
| Triton Inference Server | Multi-framework, GPU | Dynamic batching, concurrent model execution |
| vLLM | Large language models | PagedAttention, continuous batching |
| TensorFlow Serving | TF/Keras models | gRPC + REST, automatic model reload |
| BentoML | Any framework | Pythonic API, built-in containerisation |
For LLM serving specifically, vLLM has become the industry standard because its PagedAttention mechanism reduces GPU memory waste by up to 90% compared to naive serving.
Why has vLLM become the preferred serving solution for large language models?
A model that was accurate at deployment will degrade over time. Data drift occurs when the input distribution changes (e.g., customer behaviour shifts post-pandemic). Concept drift occurs when the relationship between inputs and outputs changes (e.g., fraud patterns evolve).
Monitor these signals:

- **Input distributions:** statistical distance between live feature values and the training data
- **Prediction distributions:** sudden shifts in the model's output profile
- **Performance metrics:** accuracy, precision, and recall once ground-truth labels arrive (often delayed)
- **Operational health:** latency, throughput, and error rates of the serving endpoint
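Input drift can be quantified per feature with a population stability index (PSI), one common choice among many drift metrics. A minimal pure-Python sketch; the widely used 0.2 alert threshold is a rule of thumb, not a universal constant.

```python
import math

def psi(expected, actual, bins=10):
    """Compare a live feature sample against its training distribution."""
    # Bin edges come from the training (expected) distribution only.
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bucket_fracs(values):
        counts = [0] * bins
        for v in values:
            idx = sum(v > e for e in edges)  # index of the bin v falls into
            counts[idx] += 1
        # Smooth empty buckets so the log term stays finite.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = bucket_fracs(expected), bucket_fracs(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Alert when a live window drifts past the rule-of-thumb threshold:
# if psi(training_sample, live_sample) > 0.2: page_the_on_call()
```

Identical distributions score near zero; a strongly shifted input distribution produces a PSI well above the 0.2 threshold.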
A feature store is a centralised repository for feature definitions and values, shared across training and serving. Tools like Feast, Tecton, and Hopsworks solve the critical problem of training-serving skew: when features are computed differently during training versus inference.
Feature stores provide: an offline store (batch features for training), an online store (low-latency features for serving), and a registry (feature definitions, ownership, lineage).
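To make training-serving skew concrete, here is a hypothetical pure-Python sketch (the feature and function names are invented for illustration) of the principle a feature store enforces: one feature definition shared by the offline and online paths.

```python
def days_since_last_order(order_timestamps, now):
    """Single feature definition, used identically offline and online."""
    if not order_timestamps:
        return -1.0  # sentinel for customers with no order history
    return (now - max(order_timestamps)) / 86400.0  # seconds -> days

def build_training_row(history, label_time):
    # Offline path: feature computed as of the label's timestamp.
    return {"days_since_last_order": days_since_last_order(history, label_time)}

def serve_features(history, request_time):
    # Online path calls the same function, so train and serve cannot diverge.
    return {"days_since_last_order": days_since_last_order(history, request_time)}
```

Skew appears the moment the two paths reimplement the logic separately (say, one rounds to whole days and the other does not); a feature store removes that possibility by making the definition itself the shared artefact.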
Not every prediction needs to happen in real time, and the cost difference between batch and real-time serving is enormous: a real-time endpoint must stay provisioned for peak traffic around the clock, while a batch job pays only for the compute it actually uses.
A common pattern is to pre-compute batch predictions for the 95% of cases that are predictable, and route only the remaining 5% to a real-time endpoint.
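The routing logic itself is simple; a minimal sketch, assuming a precomputed prediction table keyed by entity ID and a callable real-time model (both hypothetical):

```python
def predict(entity_id, batch_predictions, realtime_model):
    """Serve from the batch table when possible; fall back to live inference."""
    cached = batch_predictions.get(entity_id)
    if cached is not None:
        return cached                 # cheap path: nightly batch output
    return realtime_model(entity_id)  # expensive path: live inference

batch_table = {"cust-1": 0.72}                  # refreshed by a nightly job
score = predict("cust-1", batch_table, lambda eid: 0.10)
```

The explicit `is not None` check matters: a legitimate cached score of 0.0 must not be mistaken for a cache miss.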
Cost optimisation tip: autoscale inference endpoints based on request queue depth, not CPU utilisation. ML workloads are bursty; scaling on queue depth prevents both over-provisioning and request timeouts during traffic spikes.
When is batch inference preferable to real-time inference?