AI Forest • Advanced • ⏱️ 17 min read

Edge AI


Not every prediction should travel to the cloud and back. Edge AI runs models directly on the device — your phone, your browser, a camera, a car. The benefits are compelling: zero network latency, full privacy (data never leaves the device), offline capability, and dramatically lower serving costs.

The challenge? You need a model that fits in megabytes, runs on limited hardware, and still delivers useful accuracy.

Why Edge Over Cloud?

The decision between edge and cloud is not binary — most production systems use a hybrid approach. But understanding the trade-offs is critical:

| Factor | Cloud Inference | Edge Inference |
|--------|-----------------|----------------|
| Latency | 100-500 ms (network round trip) | 5-50 ms (on-device) |
| Privacy | Data sent to server | Data stays on device |
| Cost | Per-request GPU cost | Zero marginal cost |
| Offline | Requires connectivity | Works anywhere |
| Scalability | Server bottleneck | Scales with devices |
| Model Size | Unlimited | Constrained by device memory |

For applications like real-time translation, keyboard prediction, face unlock, or health monitoring, the edge is not just preferable — it is the only viable option.

Diagram: cloud inference with a network round trip versus edge inference running directly on the device.
Edge inference eliminates the network round trip entirely — critical for real-time applications.

Model Compression Techniques

Production models are too large for edge devices. A GPT-class model at FP32 precision would consume gigabytes of memory. Compression techniques make deployment feasible.

Quantisation

Quantisation reduces the numerical precision of model weights from 32-bit floating point (FP32) to smaller representations:

  • FP16 — Half precision. 2× memory reduction, minimal accuracy loss. Standard for GPU inference.
  • INT8 — 4× memory reduction. The sweet spot for most edge deployments.
  • INT4 — 8× memory reduction. Aggressive, but viable for large language models with careful calibration.

Post-training quantisation (PTQ) converts an already-trained model. Quantisation-aware training (QAT) simulates low precision during training for better accuracy retention.
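
The arithmetic behind INT8 post-training quantisation can be sketched in a few lines of NumPy. This is a toy illustration of symmetric, per-tensor quantisation — real toolchains add per-channel scales and calibration data, so treat the function names and numbers here as illustrative:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantisation: map FP32 weights onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).clip(-127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values for computation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes / w.nbytes)   # 0.25 — the 4x memory reduction for INT8
print(np.abs(w - w_hat).max())  # rounding error, bounded by half a quantisation step
```

The whole trick is choosing `scale` well: rounding error is bounded by half a quantisation step, which is why INT8 loses so little accuracy in practice.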

🤯
Apple's Neural Engine on the iPhone 15 Pro processes INT8 operations at 35 TOPS (trillion operations per second) — roughly equivalent to a mid-range data centre GPU from just five years ago.

Pruning

Pruning removes weights (or entire neurons) that contribute least to the model's output. Unstructured pruning zeroes out individual weights; structured pruning removes entire channels or attention heads for actual speedups on hardware.

A well-pruned model can remove 50-90% of parameters with less than 1% accuracy degradation — but the pruned model must be fine-tuned afterwards to recover performance.
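
Magnitude-based unstructured pruning is simple enough to sketch directly. The NumPy version below is illustrative: it zeroes the fraction of weights with the smallest magnitudes, which in a real pipeline would be followed by fine-tuning as noted above:

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float):
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy(), np.ones_like(w, dtype=bool)
    # Threshold = magnitude of the k-th smallest weight.
    threshold = np.partition(np.abs(w).ravel(), k)[k]
    mask = np.abs(w) >= threshold
    return w * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float32)

pruned, mask = magnitude_prune(w, sparsity=0.9)
print(1.0 - mask.mean())  # achieved sparsity, ~0.90
```

Note that this is *unstructured* pruning: the zeros are scattered, so the speedup only materialises with sparse-aware kernels or hardware. Structured pruning instead drops whole rows, channels, or heads.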

Knowledge Distillation

Train a small "student" model to mimic a large "teacher" model's outputs. The student learns not just the correct labels but the teacher's full probability distribution — including which wrong answers are "almost right." This soft information transfers surprisingly well.

DistilBERT, for instance, retains 97% of BERT's language-understanding performance while being 40% smaller and 60% faster.
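
The distillation objective is concise enough to write down. This NumPy sketch follows the standard Hinton-style recipe — temperature-softened cross-entropy against the teacher's distribution, blended with ordinary cross-entropy against the hard labels; the temperature `T` and mixing weight `alpha` are illustrative hyperparameters:

```python
import numpy as np

def softmax(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: the teacher's full distribution, softened by temperature T.
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    p_student = softmax(student_logits)
    hard = -np.log(p_student[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft + (1 - alpha) * hard

rng = np.random.default_rng(0)
teacher_logits = rng.standard_normal((8, 10))
student_logits = rng.standard_normal((8, 10))
labels = rng.integers(0, 10, size=8)

loss = distillation_loss(student_logits, teacher_logits, labels)
print(loss)
```

The high temperature is what exposes the "almost right" wrong answers: it flattens the teacher's distribution so the student can see the relative ordering of all classes, not just the winner.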

🧠 Quick Check

Which compression technique trains a smaller model to replicate a larger model's output distribution?

ONNX: The Universal Format

Open Neural Network Exchange (ONNX) is a framework-agnostic model format. Train in PyTorch, export to ONNX, deploy anywhere — TensorFlow Lite, Core ML, ONNX Runtime, or the browser.

ONNX Runtime provides optimised inference across CPUs, GPUs, and NPUs with a single API. It handles operator fusion, memory planning, and hardware-specific acceleration automatically.

Platform-Specific Runtimes

TensorFlow Lite (Mobile / IoT)

Google's runtime for Android, iOS, and microcontrollers. TFLite models are FlatBuffers — small, fast to load, no dynamic memory allocation. The interpreter is under 1 MB. Delegate APIs offload computation to GPU, DSP, or NPU hardware.

Core ML (Apple Ecosystem)

Apple's framework automatically routes computation across CPU, GPU, and the dedicated Neural Engine. If you ship an iOS app with ML features, Core ML is the path of least resistance. It supports model encryption for IP protection.

WebAssembly + ONNX Runtime Web

Run ML models in the browser with near-native performance. ONNX Runtime Web compiles models to WebAssembly and uses WebGL or WebGPU for GPU acceleration. No server, no installation, no data leaves the browser.

Use cases: real-time background removal in video calls, on-page content moderation, and browser-based document OCR.

🧠 Quick Check

What is the primary advantage of ONNX as a model format?

On-Device Large Language Models

Running LLMs on consumer hardware seemed impossible two years ago. Today, quantised models under 4 billion parameters run comfortably on flagship phones:

  • Gemma 2B/7B (Google) — Optimised for mobile via TFLite and MediaPipe
  • Phi-3 Mini (Microsoft) — 3.8B parameters, strong reasoning for its size
  • Llama 3.2 1B/3B (Meta) — Purpose-built for on-device deployment
  • Gemini Nano (Google) — Integrated into Android 14+ via AICore

These models handle summarisation, translation, smart replies, and code completion without any network call.

🤔
Think about it: On-device LLMs mean your private messages, health data, and browsing history never leave your phone. But they also mean less visibility for safety monitoring. How should device manufacturers balance privacy with content safety?

Hardware Accelerators at the Edge

Modern devices include dedicated silicon for neural network inference:

  • Apple Neural Engine — 16-core NPU in M-series and A-series chips. Tight Core ML integration.
  • Qualcomm Hexagon NPU — Powers Android AI on Snapdragon chips. Supports INT4 natively.
  • Google Tensor TPU — Custom chip in Pixel phones for on-device speech, translation, and photos.
  • Intel Movidius — Low-power VPU for IoT cameras and drones.

The trend is clear: every major chipmaker now treats AI acceleration as a first-class feature, not an afterthought.

Real-World Edge AI Applications

  • Keyboard prediction (Gboard, SwiftKey) — Next-word prediction runs entirely on-device using personalised models
  • Real-time translation (Google Translate offline) — Full neural machine translation without connectivity
  • Face recognition (Face ID) — Apple's TrueDepth system processes 3D face maps on the Neural Engine
  • Smart cameras — Retail analytics, wildlife monitoring, and traffic management at the edge
  • Hearing aids — On-device noise cancellation and speech enhancement with sub-millisecond latency

🧠 Quick Check

Why is edge deployment essential for a hearing aid's noise cancellation feature?

🤔
Think about it: You are building an AI-powered quality inspection system for a factory floor with no reliable internet. What model architecture, compression strategy, and hardware would you choose? How would you handle model updates?

📚 Further Reading

  • ONNX Runtime Documentation — Official guide to the cross-platform inference engine
  • TensorFlow Lite Guide — Google's comprehensive guide to on-device ML deployment
  • A Survey on Model Compression for LLMs — Academic overview of quantisation, pruning, and distillation for large models