Large language models have fundamentally changed what machines can do with text, code, and reasoning. In this lesson, we'll pull back the curtain on how they work, why they're so powerful, and where they still fall short.
A large language model is a neural network trained on enormous quantities of text to predict the next token in a sequence. "Large" refers to the number of parameters - learnable weights that encode patterns from training data.
These models are trained on hundreds of billions of words drawn from books, websites, code repositories, and academic papers. The sheer scale of data and parameters is what gives LLMs their remarkable versatility.
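To make "predict the next token" concrete, here's a toy sketch in Python. The vocabulary and the scores are invented for illustration - the point is that the network outputs a score for every token it knows, and a softmax turns those scores into probabilities.

```python
# A toy sketch of next-token prediction: the network assigns a score (logit)
# to every token in a tiny made-up vocabulary, and a softmax turns the
# scores into a probability distribution over the next token.
import numpy as np

vocab = ["the", "cat", "sat", "mat", "dog"]   # toy 5-token vocabulary
logits = np.array([1.2, 3.5, 0.3, 0.9, 2.8])  # invented network outputs

probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax

print(dict(zip(vocab, probs.round(3))))
print("greedy next token:", vocab[int(np.argmax(probs))])  # "cat"
```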
Tokens, not words: LLMs don't process whole words - they work with tokens, which are sub-word units. The word "understanding" might be split into "under" + "stand" + "ing". A typical English word averages about 1.3 tokens. This tokenisation scheme allows models to handle rare words, technical jargon, and even multiple languages without an impossibly large vocabulary.
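To see tokenisation in action, here's a short sketch using OpenAI's tiktoken library (an assumption of this example, not something introduced above; install it with pip install tiktoken). The exact splits vary from tokeniser to tokeniser.

```python
# A sketch of sub-word tokenisation using OpenAI's tiktoken library.
# Exact splits depend on the tokeniser in use.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models
text = "Tokenisation lets models handle rare words and technical jargon."
token_ids = enc.encode(text)

print(len(text.split()), "words ->", len(token_ids), "tokens")
print([enc.decode([t]) for t in token_ids])  # the individual sub-word pieces
```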
Training GPT-4 is estimated to have cost over $100 million in compute alone. That's roughly the budget of a mid-range Hollywood film - except the "actor" can speak every programming language.
Every modern LLM is built on the transformer, introduced in the 2017 paper Attention Is All You Need. The key innovation is self-attention - a mechanism that lets the model weigh how important each word is relative to every other word in the input.
How self-attention works (simplified):

1. Each token is projected into three vectors: a query (Q), a key (K), and a value (V).
2. Each token's query is compared with every token's key (via dot products) to produce attention scores.
3. The scores are scaled and passed through a softmax, turning them into weights that sum to 1.
4. Each token's output is the weighted sum of all the value vectors, using those weights.
The mathematical formula at the heart of attention is:
Attention(Q, K, V) = softmax(QK^T / √d) × V

The division by √d (the square root of the key dimension) prevents the dot products from growing too large, which would push the softmax function into regions with extremely small gradients. This simple scaling trick was one of the key insights that made training deep transformers stable.
This is repeated across multiple attention heads in parallel, allowing the model to capture different types of relationships simultaneously - syntax in one head, semantics in another.
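Putting the pieces together, here is a minimal single-head sketch of scaled dot-product attention in NumPy. The shapes and names are illustrative rather than taken from any particular framework; real implementations add learned projection matrices, masking, and many heads.

```python
# A minimal single-head sketch of scaled dot-product attention in NumPy.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # every token vs every token
    scores -= scores.max(axis=-1, keepdims=True)    # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d = 5, 8
Q, K, V = (rng.standard_normal((seq_len, d)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 8)
```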
If a sentence has 500 tokens, self-attention compares every token with every other - that's 250,000 comparisons per layer. How might this quadratic cost affect what LLMs can process?
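A quick back-of-the-envelope calculation shows how fast that cost grows:

```python
# Self-attention compares every token with every other, so the number of
# score computations per layer grows with the square of the sequence length.
for n in (500, 2_000, 100_000):
    print(f"{n:>7} tokens -> {n * n:>18,} comparisons per layer")
```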
| Model | Creator | Notable Feature |
|-------|---------|-----------------|
| GPT-4o | OpenAI | Multimodal (text, image, audio) |
| Claude 4 | Anthropic | Extended thinking, safety-focused |
| Gemini 2.5 | Google DeepMind | Native multimodality, long context |
| Llama 3 | Meta | Open-weight, community-driven |
| Mistral Large | Mistral AI | Efficient European alternative |
The field moves extraordinarily fast - by the time you read this, newer models may already exist.
A crucial distinction in this landscape is between closed-source models (GPT-4o, Claude, Gemini) where only the API is available, and open-weight models (Llama, Mistral) where the model weights are publicly released. Open-weight models allow organisations to run inference on their own infrastructure, fine-tune for specific domains, and inspect the model's behaviour - advantages that matter greatly for privacy-sensitive industries.
Not all "open" models are truly open source. Some release weights but restrict commercial use or don't share training data and code. Always check the licence before deploying an open-weight model in production.
LLMs are built in stages:
1. **Pre-training** - the model reads vast amounts of text and learns to predict the next token. This stage is enormously expensive and produces a base model that can complete text but isn't particularly helpful.
2. **Supervised fine-tuning (SFT)** - human-written examples of ideal responses teach the model to follow instructions and answer questions properly.
3. **RLHF (Reinforcement Learning from Human Feedback)** - human raters rank model outputs from best to worst, a reward model learns these preferences, and the LLM is trained to maximise the reward (a toy sketch of the reward-model loss follows this list). This stage is a large part of what makes models helpful, harmless, and honest.
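To give a flavour of the RLHF stage, here is a toy sketch of the pairwise (Bradley-Terry style) loss commonly used to train the reward model: it is small when the human-preferred response scores above the rejected one, and large when the ranking is violated. The numbers are invented for illustration.

```python
# A toy sketch of a pairwise reward-model loss used in RLHF.
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when the chosen response scores higher."""
    margin = reward_chosen - reward_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

print(preference_loss(2.0, 0.5))  # chosen clearly preferred -> small loss
print(preference_loss(0.5, 2.0))  # ranking violated -> large loss
```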
What is the primary purpose of RLHF in LLM training?
As models scale up, they develop abilities that weren't explicitly trained for, such as:

- In-context (few-shot) learning - picking up a new task from a handful of examples in the prompt
- Chain-of-thought reasoning - working through multi-step problems when prompted to reason step by step
- Translation and code generation without task-specific training
These emergent properties are one of the most fascinating aspects of scaling - abilities that appear almost "for free" once a model crosses certain size thresholds.
GPT-4 reportedly scored in the 90th percentile on the Uniform Bar Exam - better than most human law graduates. Yet it can still struggle with basic arithmetic when the numbers are unusual enough.
LLMs are powerful but far from perfect:

- Hallucinations - confidently stating false information as if it were fact
- Knowledge cut-offs - no awareness of events after the training data was collected
- Unreliable arithmetic and character-level tasks, such as counting the letters in a word
- Biases inherited from the training data
Always verify critical facts from LLM outputs against authoritative sources. Treat LLMs as a brilliant but unreliable research assistant - extraordinarily useful, but never the final word on matters of fact.
Why do LLMs sometimes produce hallucinations?
For years, the dominant belief was that bigger is always better - more parameters, more data, and more compute would steadily improve performance. This "scaling law" held remarkably well through GPT-2, GPT-3, and GPT-4.
However, recent research suggests we may be approaching diminishing returns on raw scale alone. The focus is shifting towards:

- Higher-quality, better-curated training data rather than simply more of it
- Inference-time compute - letting models "think" for longer on hard problems, as in extended-thinking models
- Smaller, more efficient models and specialised architectures
Large language models represent a genuine paradigm shift in computing. Understanding their architecture, training pipeline, and limitations isn't optional for anyone working seriously with AI - it's foundational knowledge that will serve you well as the field continues to evolve.
Which stage of LLM training is the most computationally expensive?
If an LLM can pass the bar exam but can't reliably count the number of letters in a word, what does that tell us about the difference between human intelligence and language model capabilities?