AI Forest • Advanced • ⏱️ 18 min read

AI Infrastructure


Behind every AI breakthrough is an enormous infrastructure story. Training GPT-4 reportedly cost over $100 million in compute alone. Understanding the hardware, cloud platforms, and optimisation techniques that power modern AI is essential for anyone building at scale.

Why GPUs Beat CPUs for AI

CPUs are optimised for sequential tasks: a few powerful cores handling complex logic. GPUs are optimised for parallelism: thousands of simple cores executing the same operation on different data simultaneously.

Neural network training is fundamentally matrix multiplication at scale. A single training step for a large model involves multiplying matrices with billions of elements. A CPU processes these sequentially. A GPU processes thousands of elements in parallel.
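To make "matrix multiplication at scale" concrete, here is a back-of-envelope FLOP count in Python. The batch size is an illustrative assumption; the hidden dimension matches the published GPT-3 figure.

```python
# Why training is "matrix multiplication at scale": count the floating-point
# operations in a single projection matmul. Shapes are illustrative.

def matmul_flops(m: int, k: int, n: int) -> int:
    """An (m x k) @ (k x n) matmul costs about 2*m*k*n FLOPs:
    one multiply and one add for each accumulated term."""
    return 2 * m * k * n

batch_tokens = 2048   # tokens processed per step (assumed)
d_model = 12288       # GPT-3-scale hidden dimension

flops = matmul_flops(batch_tokens, d_model, d_model)
print(f"{flops / 1e12:.2f} TFLOPs for one projection")  # ~0.62 TFLOPs

# Each of the m*n outputs is an independent dot product, which is why
# thousands of GPU cores can work on the same matmul simultaneously.
```

One projection in one layer already costs over half a teraFLOP; a full training step multiplies this across dozens of layers and several matmuls per layer.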

An NVIDIA H100 GPU delivers roughly 4,000 TFLOPS of FP8 compute. A high-end server CPU manages perhaps 2 TFLOPS. For the embarrassingly parallel workloads of deep learning, GPUs win by three orders of magnitude.

[Figure: CPU with a few powerful cores versus GPU with thousands of parallel cores performing matrix operations]
GPUs process thousands of matrix operations simultaneously – the fundamental advantage for AI workloads.

NVIDIA's Dominance

NVIDIA controls roughly 80-90% of the AI accelerator market. Their moat is not just hardware; it is the CUDA ecosystem. Nearly every ML framework (PyTorch, TensorFlow, JAX) is built on CUDA. Switching to a competitor means rewriting low-level kernels and accepting potential incompatibilities.

The GPU Lineup

| GPU | Memory | Peak Low-Precision Performance | Use Case |
|-----|--------|--------------------------------|----------|
| A100 | 80 GB HBM2e | 624 TFLOPS (FP16 sparse; no FP8 support) | Workhorse of current data centres |
| H100 | 80 GB HBM3 | 3,958 TFLOPS (FP8) | Frontier model training |
| H200 | 141 GB HBM3e | 3,958 TFLOPS (FP8) | Memory-bound LLM inference |
| B200 | 192 GB HBM3e | 9,000 TFLOPS (FP8) | Next-generation training and inference |

The H100 to B200 jump represents a 2.3× performance increase, but the real bottleneck is often memory bandwidth, not raw compute. The H200's 141 GB of HBM3e memory specifically targets LLM inference, where the KV cache dominates memory usage.
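To see why the KV cache dominates, here is a rough sizing calculation. The parameters are loosely Llama-2-70B-like (grouped-query attention); every figure is an assumption, not a measurement.

```python
# Rough KV-cache sizing for a decoder-only LLM serving a batch of requests.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   bytes_per_el=2):
    # Two cached tensors (K and V) per layer,
    # each of shape [batch, n_kv_heads, seq_len, head_dim], at FP16/BF16.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_el

gb = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                    seq_len=4096, batch=32) / 1e9
print(f"KV cache: {gb:.1f} GB")   # ~43 GB before any weights are counted
```

A modest batch at a 4K context already consumes tens of gigabytes on top of the model weights, which is exactly the pressure the H200's extra memory relieves.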

🤯
A single NVIDIA B200 GPU costs approximately $30,000-$40,000. A DGX B200 server with eight of these GPUs costs over $300,000. Meta reportedly ordered 350,000 H100 GPUs for Llama training – a hardware investment exceeding $10 billion.

Google TPUs

Google's Tensor Processing Units are custom ASICs designed specifically for matrix operations. Unlike GPUs, which handle graphics and compute, TPUs are purpose-built for neural networks.

  • TPU v5p – Google's latest training chip, available in pods of up to 8,960 chips connected via high-speed interconnect
  • TPU v5e – Cost-optimised for inference workloads
  • Key advantage: tight integration with JAX and the XLA compiler
  • Key limitation: only available through Google Cloud – no on-premise option

Google trains its own foundation models (Gemini, PaLM) on TPUs, demonstrating that TPUs can compete with NVIDIA hardware at the frontier.

AMD and the Competition

AMD's MI300X is the most credible NVIDIA alternative:

  • 192 GB HBM3 memory (more than the H100)
  • Strong ROCm software stack (improving but still behind CUDA)
  • Significant price advantage – roughly 30-40% cheaper than comparable NVIDIA parts
  • Growing ecosystem of ML frameworks with native ROCm support (PyTorch, JAX)

Microsoft, Meta, and Oracle have all adopted MI300X for inference workloads. The memory advantage is particularly compelling for large language models where the KV cache is the primary bottleneck.
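The memory advantage is easy to quantify. Here is a weights-only estimate at FP16/BF16 (2 bytes per parameter); the KV cache and activations need additional headroom on top of this.

```python
# Why 192 GB of HBM matters: can a 70B-parameter model fit on one accelerator?

def weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights alone, in GB."""
    return params_billions * 1e9 * bytes_per_param / 1e9

model_gb = weights_gb(70)                               # 140.0 GB of weights
print(f"70B model weights: {model_gb:.0f} GB")
print("fits on one MI300X (192 GB):", model_gb < 192)   # True
print("fits on one H100 (80 GB):", model_gb < 80)       # False
```

A 70B model fits on a single MI300X but must be split across at least two H100s, which adds interconnect latency and operational complexity.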

Intel's Gaudi 3 accelerator is also entering the mix, offering competitive training performance with a focus on enterprise price-to-performance ratios. The AI accelerator market is finally becoming a multi-vendor ecosystem.

🧠 Quick Check

Why is NVIDIA's competitive moat not just about hardware performance?

Cloud Providers Comparison

Almost no company outside the hyperscalers can afford to build its own GPU cluster for frontier AI; everyone else rents. The cloud battle is fierce:

| Platform | Key Offering | Strength |
|----------|--------------|----------|
| AWS SageMaker | End-to-end ML platform | Broadest GPU selection, Inferentia custom chips |
| GCP Vertex AI | Managed ML with TPU access | TPU availability, tight BigQuery integration |
| Azure ML | Enterprise ML platform | OpenAI partnership, enterprise compliance |
| Lambda Labs | GPU cloud for AI | Simplicity, competitive H100 pricing |
| CoreWeave | GPU-native cloud | Purpose-built for AI, NVIDIA partnership |

Spot instances can reduce training costs by 60-90%. The trade-off: your training job can be interrupted at any time. Frameworks like PyTorch Lightning and DeepSpeed support checkpointing to resume training seamlessly after preemption.
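The checkpoint-and-resume pattern can be sketched in a few lines. A real job would save model and optimiser state with torch.save or DeepSpeed's checkpoint APIs; here plain pickle and a hypothetical file path stand in.

```python
# Sketch of checkpoint/resume for preemptible (spot) training jobs.
import os
import pickle

CKPT = "train_state.pkl"   # illustrative checkpoint path

def load_state():
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0}

def save_state(state):
    """Write atomically so a preemption mid-write cannot corrupt the file."""
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

state = load_state()
for step in range(state["step"], 10):   # 10 stand-in training steps
    state = {"step": step + 1}          # placeholder for a real optimiser step
    if state["step"] % 5 == 0:
        save_state(state)               # periodic checkpoint: after preemption,
                                        # the job resumes here, not from step 0
print("finished at step", state["step"])
```

The atomic rename is the important detail: a spot instance can be reclaimed at any moment, including mid-write, and a half-written checkpoint is worse than none.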

The GPU Shortage and Geopolitics

AI infrastructure is now a geopolitical issue:

  • TSMC (Taiwan) fabricates nearly all advanced AI chips – for NVIDIA, AMD, Apple, and Google. A disruption to TSMC would halt AI progress globally.
  • US export controls restrict China's access to advanced chips (A100, H100 banned). China is investing heavily in domestic alternatives (Huawei Ascend, Biren).
  • NVIDIA's China-specific chips (H20) offer reduced performance to comply with export rules whilst maintaining market access.
  • The CHIPS Act (US) and similar programmes in the EU and Japan aim to diversify semiconductor manufacturing.
🤔
Think about it: Advanced AI chips are manufactured almost exclusively in Taiwan. If geopolitical tensions disrupted this supply chain, how would it impact AI development globally? Which countries and companies are most vulnerable?

Custom Silicon Startups

Several startups are challenging the GPU paradigm entirely:

  • Groq – LPU (Language Processing Unit) architecture delivers deterministic, ultra-low-latency inference. No batching required.
  • Cerebras – Wafer-scale chip (the size of a dinner plate) eliminates memory bandwidth bottlenecks for training.
  • SambaNova – Reconfigurable dataflow architecture optimised for enterprise AI.
  • Graphcore (acquired by SoftBank) – IPU architecture with massive on-chip SRAM for graph-structured workloads.

These are not NVIDIA replacements today. But they represent genuine architectural innovation that could reshape the market as AI workloads diversify beyond transformers.

🧠 Quick Check

What is the primary bottleneck for LLM inference that newer GPU designs (H200, MI300X) are specifically addressing?

Inference Optimisation

Training gets the headlines, but inference is where the money goes. Techniques that reduce inference cost:

  • Dynamic batching – Group incoming requests to maximise GPU utilisation
  • KV cache management – vLLM's PagedAttention avoids memory fragmentation
  • Speculative decoding – Use a small model to draft tokens, verified in parallel by the large model (2-3× speedup)
  • Continuous batching – Process new requests without waiting for all current requests to complete
  • Model parallelism – Split large models across multiple GPUs for latency-sensitive serving
🧠 Quick Check

How does speculative decoding accelerate LLM inference?

The Total Cost of AI Infrastructure

Infrastructure cost extends far beyond GPU rental. A complete picture includes:

  • Compute – GPU hours for training and inference (typically 60-70% of total cost)
  • Storage – Training data, checkpoints, model artefacts, and logs
  • Networking – Data transfer between GPUs, regions, and to end users
  • Engineering – ML engineers, platform engineers, and DevOps staff to operate the stack
  • Electricity – A single H100 server consumes 10+ kW; data centre power is a growing constraint

At scale, companies like Meta spend over $10 billion annually on AI infrastructure. For startups, cloud costs for a single frontier model training run can exceed $1 million. Understanding and optimising this cost stack is a core competency.
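A back-of-envelope version of this cost stack, where every rate is an illustrative assumption rather than a real quote:

```python
# Back-of-envelope monthly cost stack for a small training cluster.

gpu_hour_usd = 3.00        # assumed on-demand H100 price per GPU-hour
gpus = 64
hours = 24 * 30            # one month of continuous use

compute = gpu_hour_usd * gpus * hours   # the dominant line item
storage = 5_000            # assumed: datasets, checkpoints, logs
networking = 3_000         # assumed: egress and inter-region transfer
engineering = 60_000       # assumed: share of ML/platform staff cost

total = compute + storage + networking + engineering
print(f"compute ${compute:,.0f} of ${total:,.0f} total")
print(f"compute share: {compute / total:.0%}")  # lands in the 60-70% band
```

Even with modest assumed rates, a 64-GPU cluster costs well over $200,000 per month, and compute dominates exactly as the breakdown above suggests.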

🤔
Think about it: Your company needs to serve a 70B parameter model to 10,000 concurrent users with sub-second latency. Map out your infrastructure: how many GPUs, which serving framework, what optimisation techniques, and what would the monthly cloud bill look like?

📚 Further Reading

  • NVIDIA H100 Tensor Core GPU Architecture – Technical deep-dive into the Hopper architecture
  • Google Cloud TPU Documentation – Official guide to TPU architecture and programming
  • The GPU Shortage Explained (SemiAnalysis) – Industry analysis of AI chip supply and demand dynamics