Large language models have fundamentally changed what machines can do with text, code, and reasoning. In this lesson, we'll pull back the curtain on how they work, why they're so powerful, and where they still fall short.
A large language model is a neural network trained on enormous quantities of text to predict the next token in a sequence. "Large" refers to the number of parameters - learnable weights that encode patterns from training data.
These models are trained on hundreds of billions of words drawn from books, websites, code repositories, and academic papers. The sheer scale of data and parameters is what gives LLMs their remarkable versatility.
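To make "predict the next token" concrete, here's a toy sketch in Python. The vocabulary and the scores are invented for illustration - the point is that the network outputs a score for every token it knows, and a softmax turns those scores into probabilities.

```python
# A toy sketch of next-token prediction: the network assigns a score (logit)
# to every token in a tiny made-up vocabulary, and a softmax turns the
# scores into a probability distribution over the next token.
import numpy as np

vocab = ["the", "cat", "sat", "mat", "dog"]   # toy 5-token vocabulary
logits = np.array([1.2, 3.5, 0.3, 0.9, 2.8])  # invented network outputs

probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax

print(dict(zip(vocab, probs.round(3))))
print("greedy next token:", vocab[int(np.argmax(probs))])  # "cat"
```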
Tokens, not words: LLMs don't process whole words - they work with tokens, which are sub-word units. The word "understanding" might be split into "under" + "stand" + "ing". A typical English word averages about 1.3 tokens. This tokenisation scheme allows models to handle rare words, technical jargon, and even multiple languages without an impossibly large vocabulary.
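To see tokenisation in action, here's a short sketch using OpenAI's tiktoken library (an assumption of this example, not something introduced above; install it with pip install tiktoken). The exact splits vary from tokeniser to tokeniser.

```python
# A sketch of sub-word tokenisation using OpenAI's tiktoken library.
# Exact splits depend on the tokeniser in use.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models
text = "Tokenisation lets models handle rare words and technical jargon."
token_ids = enc.encode(text)

print(len(text.split()), "words ->", len(token_ids), "tokens")
print([enc.decode([t]) for t in token_ids])  # the individual sub-word pieces
```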
Training GPT-4 is estimated to have cost over $100 million in compute alone. That's roughly the budget of a mid-range Hollywood film - except the "actor" can speak every programming language.
Every modern LLM is built on the transformer, introduced in the 2017 paper Attention Is All You Need. The key innovation is self-attention - a mechanism that lets the model weigh how important each word is relative to every other word in the input.
How self-attention works (simplified):

1. Each token is projected into three vectors: a query (Q), a key (K), and a value (V).
2. Each token's query is compared with every token's key (via dot products) to produce attention scores.
3. The scores are scaled and passed through a softmax, turning them into weights that sum to 1.
4. Each token's output is the weighted sum of all the value vectors, using those weights.
The mathematical formula at the heart of attention is:
Attention(Q, K, V) = softmax(QK^T / √d) × V

The division by √d (the square root of the key dimension) prevents the dot products from growing too large, which would push the softmax function into regions with extremely small gradients. This simple scaling trick was one of the key insights that made training deep transformers stable.
This is repeated across multiple attention heads in parallel, allowing the model to capture different types of relationships simultaneously - syntax in one head, semantics in another.
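Putting the pieces together, here is a minimal single-head sketch of scaled dot-product attention in NumPy. The shapes and names are illustrative rather than taken from any particular framework; real implementations add learned projection matrices, masking, and many heads.

```python
# A minimal single-head sketch of scaled dot-product attention in NumPy.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # every token vs every token
    scores -= scores.max(axis=-1, keepdims=True)    # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d = 5, 8
Q, K, V = (rng.standard_normal((seq_len, d)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 8)
```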
If a sentence has 500 tokens, self-attention compares every token with every other - that's 250,000 comparisons per layer. How might this quadratic cost affect what LLMs can process?
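A quick back-of-the-envelope calculation shows how fast that cost grows:

```python
# Self-attention compares every token with every other, so the number of
# score computations per layer grows with the square of the sequence length.
for n in (500, 2_000, 100_000):
    print(f"{n:>7} tokens -> {n * n:>18,} comparisons per layer")
```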
| Model | Creator | Notable Feature |
|-------|---------|-----------------|
| GPT-4o | OpenAI | Multimodal (text, image, audio) |
| Claude 4 | Anthropic | Extended thinking, safety-focused |
| Gemini 2.5 | Google DeepMind | Native multimodality, long context |
| Llama 3 | Meta | Open-weight, community-driven |
| Mistral Large | Mistral AI | Efficient European alternative |
The field moves extraordinarily fast - by the time you read this, newer models may already exist.
A crucial distinction in this landscape is between closed-source models (GPT-4o, Claude, Gemini) where only the API is available, and open-weight models (Llama, Mistral) where the model weights are publicly released. Open-weight models allow organisations to run inference on their own infrastructure, fine-tune for specific domains, and inspect the model's behaviour - advantages that matter greatly for privacy-sensitive industries.
Not all "open" models are truly open source. Some release weights but restrict commercial use or don't share training data and code. Always check the licence before deploying an open-weight model in production.
LLMs are built in stages:
1. **Pre-training** - the model reads vast amounts of text and learns to predict the next token. This stage is enormously expensive and produces a base model that can complete text but isn't particularly helpful.
2. **Supervised fine-tuning (SFT)** - human-written examples of ideal responses teach the model to follow instructions and answer questions properly.
3. **RLHF (Reinforcement Learning from Human Feedback)** - human raters rank model outputs from best to worst, a reward model learns these preferences, and the LLM is trained to maximise the reward (a toy sketch of the reward-model loss follows this list). This stage is a large part of what makes models helpful, harmless, and honest.
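To give a flavour of the RLHF stage, here is a toy sketch of the pairwise (Bradley-Terry style) loss commonly used to train the reward model: it is small when the human-preferred response scores above the rejected one, and large when the ranking is violated. The numbers are invented for illustration.

```python
# A toy sketch of a pairwise reward-model loss used in RLHF.
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when the chosen response scores higher."""
    margin = reward_chosen - reward_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

print(preference_loss(2.0, 0.5))  # chosen clearly preferred -> small loss
print(preference_loss(0.5, 2.0))  # ranking violated -> large loss
```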
What is the primary purpose of RLHF in LLM training?
As models scale up, they develop abilities that weren't explicitly trained for, such as:

- In-context (few-shot) learning - picking up a new task from a handful of examples in the prompt
- Chain-of-thought reasoning - working through multi-step problems when prompted to reason step by step
- Translation and code generation without task-specific training
These emergent properties are one of the most fascinating aspects of scaling - abilities that appear almost "for free" once a model crosses certain size thresholds.
GPT-4 reportedly scored in the 90th percentile on the Uniform Bar Exam - better than most human law graduates. Yet it can still struggle with basic arithmetic when the numbers are unusual enough.
LLMs are powerful but far from perfect:

- Hallucinations - confidently stating false information as if it were fact
- Knowledge cut-offs - no awareness of events after the training data was collected
- Unreliable arithmetic and character-level tasks, such as counting the letters in a word
- Biases inherited from the training data
Always verify critical facts from LLM outputs against authoritative sources. Treat LLMs as a brilliant but unreliable research assistant - extraordinarily useful, but never the final word on matters of fact.
Why do LLMs sometimes produce hallucinations?
For years, the dominant belief was that bigger is always better - more parameters, more data, and more compute would steadily improve performance. This "scaling law" held remarkably well through GPT-2, GPT-3, and GPT-4.
However, recent research suggests we may be approaching diminishing returns on raw scale alone. The focus is shifting towards:

- Higher-quality, better-curated training data rather than simply more of it
- Inference-time compute - letting models "think" for longer on hard problems, as in extended-thinking models
- Smaller, more efficient models and specialised architectures
Large language models represent a genuine paradigm shift in computing. Understanding their architecture, training pipeline, and limitations isn't optional for anyone working seriously with AI - it's foundational knowledge that will serve you well as the field continues to evolve.
Which stage of LLM training is the most computationally expensive?
If an LLM can pass the bar exam but can't reliably count the number of letters in a word, what does that tell us about the difference between human intelligence and language model capabilities?