AI Forest • Advanced • ⏱️ 18 min read

MLOps and Deployment

A Jupyter notebook with 94% accuracy is not a product. The moment you close the laptop, that model serves nobody. MLOps is the discipline that bridges the gap between experimentation and production — and it is where most AI teams struggle the hardest.

Why Notebooks Are Not Production

Notebooks encourage terrible engineering habits: hidden state from out-of-order cell execution, hardcoded file paths, no dependency management, and zero reproducibility. A model trained in a notebook cannot be retrained automatically when new data arrives, cannot be rolled back when it breaks, and cannot be monitored for degradation.

Production ML requires versioned code, versioned data, versioned models, and automated pipelines. Everything else is a demo.

The MLOps Lifecycle

The MLOps lifecycle is a continuous loop, not a linear pipeline. Each stage feeds back into the others, and the maturity of your organisation determines how automated these transitions are.

Google defines three MLOps maturity levels:

  • Level 0 — Manual, script-driven process. Data scientists hand off models to engineers.
  • Level 1 — Automated training pipelines. Retraining triggers automatically on new data.
  • Level 2 — Full CI/CD for ML. Code, data, and model changes all flow through automated pipelines.

The stages of the loop:

  1. Data ingestion and validation — Collect, clean, and validate incoming data against schemas
  2. Feature engineering — Transform raw data into model-ready features via a feature store
  3. Training — Run reproducible training jobs with tracked hyperparameters
  4. Evaluation — Compare candidate models against baselines using held-out test sets
  5. Registry — Store approved models with metadata, lineage, and approval status
  6. Deployment — Serve the model behind an API with canary or blue-green rollout
  7. Monitoring — Track prediction quality, latency, throughput, and data drift
  8. Retraining — Trigger new training runs when performance degrades
[Figure: MLOps lifecycle diagram. The lifecycle is a loop — monitoring feeds back into retraining.]
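The eight stages can be condensed into a runnable skeleton that shows the control flow. Everything here is a toy stand-in (the "model" is just a fitted ratio, the "registry" a list), not a real orchestration framework:

```python
# Toy sketch of the MLOps loop. Each stage is a trivial stub standing in
# for real components (pipeline tasks, a feature store, a registry).

def validate(rows):
    # 1. Schema check: every row needs the 'x' and 'y' fields.
    assert all("x" in r and "y" in r for r in rows), "schema violation"
    return rows

def engineer_features(rows):
    # 2. Toy feature: square the raw input.
    return [(r["x"] ** 2, r["y"]) for r in rows]

def train(features):
    # 3. "Model" = mean ratio y / feature (a stand-in for a real fit).
    return sum(y / x for x, y in features) / len(features)

def evaluate(model, features):
    # 4. Negative mean absolute error, so higher is better.
    return -sum(abs(model * x - y) for x, y in features) / len(features)

def run_pipeline(rows, registry, baseline_metric):
    features = engineer_features(validate(rows))
    model = train(features)
    metric = evaluate(model, features)
    if metric <= baseline_metric:
        return None                    # 5. candidate rejected, keep current model
    registry.append({"model": model, "metric": metric})  # 5-6. register, deploy
    return model
```

Stages 7 and 8 sit outside this function: a monitoring job watches live metrics and calls `run_pipeline` again when they degrade, which is what closes the loop in the diagram.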

Model Registries

A model registry is the version control system for your trained models. Tools like MLflow, Weights & Biases, and Neptune let you store model artefacts alongside their training metrics, hyperparameters, dataset versions, and promotion status (staging → production → archived).

The registry answers critical questions: Which model is currently in production? What data was it trained on? Who approved it? What were its evaluation metrics?
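Those questions map directly onto the metadata a registry stores. Here is a toy in-memory sketch; the API is invented for illustration, and real registries like MLflow have their own richer interfaces:

```python
from dataclasses import dataclass, field

# Toy in-memory model registry. Illustrates the metadata a real registry
# tracks; the class and method names here are invented for illustration.

@dataclass
class ModelVersion:
    name: str
    version: int
    dataset: str              # which data snapshot it was trained on
    metrics: dict             # evaluation metrics at registration time
    approved_by: str = ""
    stage: str = "staging"    # staging -> production -> archived

@dataclass
class Registry:
    versions: list = field(default_factory=list)

    def register(self, name, dataset, metrics):
        v = ModelVersion(name, len(self.versions) + 1, dataset, metrics)
        self.versions.append(v)
        return v

    def promote(self, version, approver):
        # Archive the current production model before promoting the new one.
        for v in self.versions:
            if v.name == version.name and v.stage == "production":
                v.stage = "archived"
        version.stage, version.approved_by = "production", approver

    def production(self, name):
        # "Which model is currently in production?"
        return next(v for v in self.versions
                    if v.name == name and v.stage == "production")
```

Every record carries its lineage (dataset, metrics, approver), so rollback is just promoting an archived version back to production.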

🧠 Quick Check

What is the primary purpose of a model registry?

CI/CD for Machine Learning

Traditional CI/CD tests code. ML CI/CD must test three things: code, data, and models.

  • Code tests — Unit tests for feature engineering logic, data transformations, and API contracts
  • Data tests — Schema validation, distribution checks (is the new data within expected ranges?), and completeness checks (no missing columns)
  • Model tests — Performance thresholds on validation sets, latency benchmarks, bias metrics, and comparison against the current production model

A pull request that changes a feature pipeline should trigger retraining and evaluation automatically. If the new model underperforms the baseline, the PR is blocked.
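That blocking behaviour boils down to a comparison function run in CI. A minimal sketch, with illustrative metric names and thresholds:

```python
# Sketch of a CI model gate: the job evaluates the candidate model and fails
# the pipeline (blocking the PR) if it underperforms the production baseline.
# Metric names and thresholds are illustrative, not a standard.

def gate(candidate_metrics, baseline_metrics,
         min_improvement=0.0, max_latency_ms=100.0):
    failures = []
    if candidate_metrics["f1"] < baseline_metrics["f1"] + min_improvement:
        failures.append("f1 below production baseline")
    if candidate_metrics["latency_ms"] > max_latency_ms:
        failures.append("latency budget exceeded")
    return failures          # empty list means the gate passes and the PR can merge
```

In a real pipeline the CI job would exit non-zero when the returned list is non-empty, which is what marks the check as failed on the pull request.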

Containerisation for ML

Docker is non-negotiable for production ML. Your training environment and serving environment must be identical — the "it works on my machine" problem is amplified when GPU drivers, CUDA versions, and Python dependencies all need to match.

A typical ML Dockerfile layers: base CUDA image → Python dependencies → model artefacts → serving framework. Multi-stage builds keep the final image lean by separating the build environment from the runtime.
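Following that layering, a multi-stage build might look like the sketch below. Image tags, paths, and file names are illustrative placeholders, not a canonical recipe:

```dockerfile
# Illustrative multi-stage sketch only; tags, paths, and file names
# are placeholders.

# Build stage: compilers and dev headers live only here.
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04 AS build
RUN apt-get update && apt-get install -y python3 python3-pip
COPY requirements.txt .
# Dependencies pinned to exact versions in requirements.txt.
RUN pip3 install --no-cache-dir -r requirements.txt

# Runtime stage: only the CUDA runtime plus the installed packages.
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=build /usr/local/lib/python3.10/dist-packages \
    /usr/local/lib/python3.10/dist-packages
# Model artefacts and the serving entrypoint go on top.
COPY model/ /app/model/
COPY serve.py /app/
CMD ["python3", "/app/serve.py"]
```

Because only the runtime stage ships, the compilers and build caches from the first stage never reach the final image.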

Key practices for ML containerisation:

  • Pin every dependency — Use exact versions in requirements.txt, not ranges
  • Use digest-based base images — nvidia/cuda@sha256:... instead of :latest
  • Separate training and serving images — Training images need compilers and dev tools; serving images need only the runtime
  • Scan for vulnerabilities — ML images inherit CVEs from deep dependency trees
🤯
The NVIDIA CUDA base image alone is over 3 GB. Production teams routinely spend days optimising Docker images to reduce cold-start times and container registry costs.

Serving Infrastructure

How you serve a model depends on your latency and throughput requirements:

| Tool | Best For | Key Feature |
|------|----------|-------------|
| TorchServe | PyTorch models | Built-in batching, model versioning |
| Triton Inference Server | Multi-framework, GPU | Dynamic batching, concurrent model execution |
| vLLM | Large language models | PagedAttention, continuous batching |
| TensorFlow Serving | TF/Keras models | gRPC + REST, automatic model reload |
| BentoML | Any framework | Pythonic API, built-in containerisation |

For LLM serving specifically, vLLM has become the industry standard because its PagedAttention mechanism reduces GPU memory waste by up to 90% compared to naive serving.

🧠 Quick Check

Why has vLLM become the preferred serving solution for large language models?

Monitoring Model Drift

A model that was accurate at deployment will degrade over time. Data drift occurs when the input distribution changes (e.g., customer behaviour shifts post-pandemic). Concept drift occurs when the relationship between inputs and outputs changes (e.g., fraud patterns evolve).

Monitor these signals:

  • Prediction distribution — Are outputs shifting? More high-confidence predictions than usual?
  • Feature distributions — Statistical tests (KS test, PSI) comparing current data to training data
  • Ground truth feedback — When labels arrive, compute live accuracy/F1 and compare to baseline
  • Latency and error rates — Infrastructure health affects model quality indirectly
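One of the tests named above, the Population Stability Index (PSI), is simple enough to sketch directly. This is a minimal pure-Python version; the bin count and the commonly quoted alert thresholds (below 0.1 stable, above 0.25 significant shift) are conventions, not hard rules:

```python
import math

# Minimal Population Stability Index (PSI). Bin edges span the combined
# range of both samples; the epsilon floor avoids log(0) on empty bins.
def psi(expected, actual, bins=10, eps=1e-4):
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Run nightly against the training distribution, a PSI crossing the alert threshold on a key feature is a natural trigger for the retraining stage of the lifecycle.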
🤔
Think about it: Your e-commerce recommendation model was trained pre-pandemic. Post-pandemic shopping patterns are completely different. How would you detect this drift automatically, and what would your retraining strategy look like?

Feature Stores

A feature store is a centralised repository for feature definitions and values, shared across training and serving. Tools like Feast, Tecton, and Hopsworks solve the critical problem of training-serving skew — when features are computed differently during training versus inference.

Feature stores provide: an offline store (batch features for training), an online store (low-latency features for serving), and a registry (feature definitions, ownership, lineage).
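A toy sketch makes the skew fix concrete: one shared feature definition feeds both the offline (batch) path and the online (single-entity) path. The API here is invented for illustration; real stores like Feast expose different interfaces:

```python
# Toy feature store. The point is that spend_ratio() is defined once and
# reused verbatim by both paths, so training and serving cannot diverge.

def spend_ratio(row):
    # The single, shared feature definition.
    return row["spend_30d"] / max(row["spend_365d"], 1.0)

class FeatureStore:
    def __init__(self):
        self.online = {}          # entity_id -> latest raw row

    def materialise(self, rows):
        # Offline path: compute features for a training set, and refresh
        # the online store with the latest raw values per entity.
        for row in rows:
            self.online[row["id"]] = row
        return [(row["id"], spend_ratio(row)) for row in rows]

    def get_online_feature(self, entity_id):
        # Online path: same definition, one low-latency lookup at inference.
        return spend_ratio(self.online[entity_id])
```

Without the shared definition, a training job and a serving service typically reimplement the same feature in two codebases, and that duplication is where skew creeps in.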

Batch vs Real-Time Inference

Not every prediction needs to happen in real time. The cost difference is enormous:

  • Batch inference — Run predictions on a schedule (hourly, daily). Cheap, uses spot instances. Ideal for recommendations, risk scoring, report generation.
  • Real-time inference — Sub-second predictions via API. Expensive, requires always-on GPU instances. Required for chatbots, fraud detection, autonomous systems.
  • Near-real-time (streaming) — Process events as they arrive via Kafka or Pub/Sub. Balances latency and cost for use cases like personalisation and anomaly detection.

A common pattern is to pre-compute batch predictions for the 95% of cases that are predictable, and route only the remaining 5% to a real-time endpoint.

Cost optimisation tip: autoscale inference endpoints based on request queue depth, not CPU utilisation. ML workloads are bursty โ€” scaling on queue depth prevents both over-provisioning and request timeouts during traffic spikes.
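The queue-depth rule can be made concrete with a small sizing function. Throughput figures, the latency budget, and the replica bounds are all illustrative:

```python
import math

# Size an inference deployment from queue depth: provision enough replicas
# to drain the current queue within the latency budget, clamped to bounds.
def desired_replicas(queue_depth, per_replica_throughput_rps,
                     latency_budget_s=1.0, min_replicas=1, max_replicas=20):
    needed = math.ceil(
        queue_depth / (per_replica_throughput_rps * latency_budget_s))
    return max(min_replicas, min(max_replicas, needed))
```

Scaling on CPU utilisation would miss the burst entirely: GPU-bound inference can saturate a replica while its CPU sits near idle, so queue depth is the signal that actually tracks user-visible latency.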

🧠 Quick Check

When is batch inference preferable to real-time inference?

🤔
Think about it: You are designing an ML platform for a large retailer. Which components (feature store, model registry, serving layer, monitoring) would you build first, and which would you adopt as managed services? What factors drive that decision?

📚 Further Reading

  • Designing Machine Learning Systems by Chip Huyen — The definitive book on production ML systems design
  • MLOps Maturity Model by Google — Google's framework for assessing MLOps maturity from level 0 to level 2
  • vLLM: Easy, Fast, and Cheap LLM Serving — Official documentation for the leading open-source LLM serving engine