Not every prediction should travel to the cloud and back. Edge AI runs models directly on the device: your phone, your browser, a camera, a car. The benefits are compelling: zero network latency, full privacy (data never leaves the device), offline capability, and dramatically lower serving costs.
The challenge? You need a model that fits in megabytes, runs on limited hardware, and still delivers useful accuracy.
The decision between edge and cloud is not binary; most production systems use a hybrid approach. But understanding the trade-offs is critical:
| Factor | Cloud Inference | Edge Inference |
|--------|----------------|----------------|
| Latency | 100-500ms (network round trip) | 5-50ms (on-device) |
| Privacy | Data sent to server | Data stays on device |
| Cost | Per-request GPU cost | Zero marginal cost |
| Offline | Requires connectivity | Works anywhere |
| Scalability | Server bottleneck | Scales with devices |
| Model Size | Unlimited | Constrained by device memory |
For applications like real-time translation, keyboard prediction, face unlock, or health monitoring, the edge is not just preferable; it is the only viable option.
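The hybrid approach mentioned above can be sketched as a simple routing rule. This is an illustrative sketch only; the function name, the token-count threshold, and the backend labels are assumptions for the example, not a production policy:

```python
# Hypothetical hybrid router: prefer on-device inference, fall back to the
# cloud only when the input exceeds what the local model can handle.
def route_request(input_tokens: int, on_device_limit: int = 2048,
                  network_available: bool = True) -> str:
    """Return which backend should serve this request."""
    if input_tokens <= on_device_limit:
        return "edge"          # zero network latency, data stays local
    if network_available:
        return "cloud"         # larger model, at the cost of a round trip
    return "edge-truncated"    # offline: degrade gracefully on-device
```

Real systems layer more signals onto this decision (battery level, thermal state, model availability), but the shape of the policy is the same: try the edge first.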
Production models are too large for edge devices. A GPT-class model at FP32 precision would consume gigabytes of memory. Compression techniques make deployment feasible.
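A quick back-of-the-envelope calculation shows why. The 7-billion-parameter size below is chosen purely for illustration:

```python
def model_memory_gb(params: float, bytes_per_weight: float) -> float:
    """Approximate weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_weight / 1e9

# A 7-billion-parameter model at different precisions:
fp32 = model_memory_gb(7e9, 4)    # 32-bit floats  -> 28.0 GB
int8 = model_memory_gb(7e9, 1)    # 8-bit integers ->  7.0 GB
int4 = model_memory_gb(7e9, 0.5)  # 4-bit weights  ->  3.5 GB
```

At FP32 the weights alone exceed the RAM of any phone; at 4 bits the same model fits alongside the operating system.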
Quantisation reduces the numerical precision of model weights from 32-bit floating point (FP32) to smaller representations such as 16-bit floats (FP16), 8-bit integers (INT8), or even 4-bit integers (INT4).
Post-training quantisation (PTQ) converts an already-trained model. Quantisation-aware training (QAT) simulates low precision during training for better accuracy retention.
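A minimal sketch of symmetric post-training quantisation, assuming a flat list of weights and the common convention of mapping the largest absolute value to 127 (the int8 maximum):

```python
def quantize_int8(weights):
    """Symmetric PTQ: map each FP32 weight to an int8 code in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127  # largest |w| maps to 127
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate FP32 values from int8 codes."""
    return [c * scale for c in codes]

w = [0.42, -1.27, 0.08, 0.9]
q, s = quantize_int8(w)      # 4x smaller: one byte per weight
approx = dequantize(q, s)    # close to w, within half a scale step
```

The maximum reconstruction error per weight is half a scale step, which is why quantisation works well for weights with a bounded dynamic range and struggles with outliers.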
Pruning removes weights (or entire neurons) that contribute least to the model's output. Unstructured pruning zeroes out individual weights; structured pruning removes entire channels or attention heads for actual speedups on hardware.
A well-pruned model can remove 50-90% of parameters with less than 1% accuracy degradation, but the pruned model must be fine-tuned afterwards to recover performance.
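Magnitude-based unstructured pruning can be sketched in a few lines. This toy version operates on a flat weight list; real frameworks prune tensors layer by layer and apply masks during fine-tuning:

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the smallest-magnitude fraction of weights."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    # indices of the k weights with the smallest absolute value
    drop = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
pruned = magnitude_prune(w, 0.5)  # zeroes the 3 smallest-magnitude weights
```

Note that the zeroed weights only save memory and compute if the runtime exploits sparsity, which is why structured pruning (removing whole channels) delivers more reliable speedups.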
Train a small "student" model to mimic a large "teacher" model's outputs. The student learns not just the correct labels but the teacher's full probability distribution, including which wrong answers are "almost right." This soft information transfers surprisingly well.
DistilBERT, for instance, retains 97% of BERT's accuracy at 60% of its size and twice the speed.
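The distillation objective can be illustrated with a temperature-softened softmax and the KL divergence between teacher and student distributions. This is a minimal sketch of the soft-target loss term only; real training combines it with the ordinary hard-label loss, and the temperature value here is an arbitrary choice for the example:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher T flattens the distribution."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# The loss grows as the student's distribution diverges from the teacher's:
close = distill_loss([3.0, 1.0, 0.1], [2.8, 1.1, 0.2])
far = distill_loss([3.0, 1.0, 0.1], [0.1, 1.0, 3.0])
```

The temperature is what exposes the "almost right" answers: at T > 1 the teacher assigns visible probability mass to its runner-up classes, and the student is trained to reproduce that ranking.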
Which compression technique trains a smaller model to replicate a larger model's output distribution?
Open Neural Network Exchange (ONNX) is a framework-agnostic model format. Train in PyTorch, export to ONNX, deploy anywhere: TensorFlow Lite, Core ML, ONNX Runtime, or the browser.
ONNX Runtime provides optimised inference across CPUs, GPUs, and NPUs with a single API. It handles operator fusion, memory planning, and hardware-specific acceleration automatically.
Google's runtime for Android, iOS, and microcontrollers. TFLite models are FlatBuffers: small, fast to load, with no dynamic memory allocation needed at load time. The interpreter is under 1 MB. Delegate APIs offload computation to GPU, DSP, or NPU hardware.
Apple's framework automatically routes computation across CPU, GPU, and the dedicated Neural Engine. If you ship an iOS app with ML features, Core ML is the path of least resistance. It supports model encryption for IP protection.
Run ML models in the browser with near-native performance. ONNX Runtime Web compiles models to WebAssembly and uses WebGL or WebGPU for GPU acceleration. No server, no installation, no data leaves the browser.
Use cases: real-time background removal in video calls, on-page content moderation, and browser-based document OCR.
What is the primary advantage of ONNX as a model format?
Running LLMs on consumer hardware seemed impossible two years ago. Today, quantised models under 4 billion parameters run comfortably on flagship phones.
These models handle summarisation, translation, smart replies, and code completion without any network call.
Modern devices include dedicated silicon for neural network inference: Apple's Neural Engine, Qualcomm's Hexagon NPU, and Google's Tensor chips all accelerate on-device models.
The trend is clear: every major chipmaker now treats AI acceleration as a first-class feature, not an afterthought.
Why is edge deployment essential for a hearing aid's noise cancellation feature?