AI Canopy • Intermediate • ⏱️ 45 min read

Large Language Models โ€” The Engines Behind Modern AI

What Is an LLM? 🤖

A Large Language Model (LLM) is a neural network trained on massive amounts of text to understand and generate human language. The word "large" refers to scale on several fronts:

Scale of modern LLMs:
┌───────────────┬──────────────────────────────────────┐
│ Parameters    │ Billions (7B → 400B+)                │
│ Training data │ Trillions of tokens (word pieces)    │
│ Training cost │ Millions of dollars                  │
│ Training time │ Weeks to months on thousands of GPUs │
└───────────────┴──────────────────────────────────────┘

At its core, an LLM does one thing: predict the next token. Given "The cat sat on the", it predicts "mat" (or "roof" or "couch") with different probabilities. This simple objective, at massive scale, produces remarkably intelligent behaviour.
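To make "predicts with different probabilities" concrete, here is a minimal sketch. The scores in `toy_logits` are invented for illustration; a real model produces one score per vocabulary entry, and a softmax turns those scores into probabilities.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical scores a model might assign after "The cat sat on the"
toy_logits = {"mat": 3.0, "roof": 2.0, "couch": 1.5, "banana": -2.0}

next_token_probs = softmax(toy_logits)
for token, prob in sorted(next_token_probs.items(), key=lambda kv: -kv[1]):
    print(f"{token:>7}: {prob:.2f}")

# The highest-probability token is the model's single best guess
most_likely = max(next_token_probs, key=next_token_probs.get)
```

Generation repeats this step: pick (or sample) a token, append it to the text, and predict again.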

🤔
Think about it:

Imagine reading every book, article, and website ever written: billions of pages. After all that reading, you'd be pretty good at predicting what word comes next in any sentence. That's essentially what an LLM does, but with mathematical precision.


The Transformer Architecture 🏗️

Nearly every modern LLM is built on the Transformer architecture (from the 2017 paper "Attention Is All You Need"). The key innovation: self-attention.

Self-Attention: The Core Idea

Earlier recurrent models (RNNs, LSTMs) read text sequentially, word by word, left to right. Transformers process all tokens at once and work out which tokens are relevant to each other.

Sentence: "The bank by the river was steep"

For the word "bank", attention scores might be:
  "bank" ←→ "river"  = 0.45  (high: helps clarify meaning)
  "bank" ←→ "steep"  = 0.30  (medium: supports the "riverbank" meaning)
  "bank" ←→ "The"    = 0.05  (low: not very informative)

This lets the model understand context: "bank" near "river" means a riverbank, not a financial bank.
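The scoring idea can be sketched with scaled dot-product attention, the formulation used in the original Transformer paper. Everything numeric here is made up: the 2-dimensional vectors are hand-picked so that "river" lands closest to "bank", whereas a real model derives queries, keys, and values from learned projections of much larger embeddings.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(query)
    # Similarity of the query to each key, scaled by sqrt(d_k)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # The output is the attention-weighted average of the value vectors
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, output

# Hypothetical toy vectors for three context words
tokens = ["The", "river", "steep"]
keys = [[0.1, 0.0], [1.0, 1.0], [0.8, 0.4]]
values = keys  # reusing keys as values keeps the toy example short
bank_query = [1.0, 1.0]

weights, context = attention(bank_query, keys, values)
for token, weight in zip(tokens, weights):
    print(f"bank -> {token:<6} {weight:.2f}")
```

The weights always sum to 1, so the output vector for "bank" is a blend of its context, dominated by whichever words score highest.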

Multi-Head Attention

Transformers have multiple attention heads running in parallel, each learning different relationships (grammar, meaning, coreference). The largest GPT-3 model has 96 layers with 96 attention heads each. What each head attends to is learned during training, not designed by hand.

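Each Transformer layer follows: Self-Attention → Add + Normalise → Feed-Forward → Add + Normalise. (This is the ordering from the original paper; many recent models normalise before each sub-layer instead.) Stack dozens of these blocks, over 100 in the largest models, and you have a modern LLM.

🤯

The "Add" in "Add + Normalise" is a skip (residual) connection, the same trick used in ResNet. Combined with layer normalisation, it keeps gradients healthy across dozens of layers, making deep Transformers trainable.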
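A minimal sketch of one "Add + Normalise" step, under simplifying assumptions: `toy_sublayer` is a stand-in for self-attention or the feed-forward network, and the layer normalisation omits the learned scale and shift parameters that real implementations include.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Skip connection (input + sub-layer output) followed by layer normalisation."""
    residual = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual)

def toy_sublayer(x):
    # Hypothetical stand-in for self-attention or the feed-forward network
    return [0.5 * v for v in x]

hidden = [1.0, 2.0, 3.0, 4.0]
out = add_and_norm(hidden, toy_sublayer)
print([round(v, 3) for v in out])
```

Because the input is added back unchanged, the gradient always has a direct path around the sub-layer, which is what makes very deep stacks trainable.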


The Training Pipeline 🔄

Training an LLM happens in three distinct phases:

Phase 1: Pretraining (learn language)

The model reads trillions of tokens from books, websites, and code. It learns grammar, facts, reasoning patterns, and even some world knowledge, all from predicting the next token.

Input:  "The capital of France is ___"
Target: "Paris"

Input:  "def fibonacci(n):\n    if n <= 1:\n        return ___"
Target: "n"
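The examples above show one input and one target, but a single passage yields many such pairs: every prefix is an input and the token that follows is its target. A small sketch, using whitespace splitting as a stand-in for a real tokenizer:

```python
def next_token_pairs(tokens):
    """Every prefix of the sequence becomes an input; the following token is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Whitespace splitting stands in for a real tokenizer here
tokens = "The capital of France is Paris".split()
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(f"{' '.join(context):<26} -> {target}")
```

In practice the model processes all of these positions in one pass rather than one pair at a time, but the training signal is the same.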

Cost: Millions of dollars and weeks of GPU time. This is the expensive step.

Phase 2: Fine-Tuning (learn to follow instructions)

The base model is great at completing text but poor at following instructions. Supervised fine-tuning trains it on curated instruction-response pairs:

User: "Summarise this article in three bullet points."
Assistant: "• Point one...\n• Point two...\n• Point three..."
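One way such a pair might be prepared for training is sketched below. The "User:" / "Assistant:" template and the whitespace tokenisation are hypothetical simplifications; each real model defines its own chat format. Masking the prompt so that only the response contributes to the loss is a common choice, not a universal one.

```python
def build_sft_example(user_text, assistant_text):
    """Join one instruction-response pair into training tokens plus a loss mask."""
    # Hypothetical template; real models use their own special tokens
    prompt_tokens = f"User: {user_text}\nAssistant:".split()
    response_tokens = assistant_text.split()
    tokens = prompt_tokens + response_tokens
    # 0 = ignore in the loss (prompt), 1 = train on this token (response)
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

tokens, loss_mask = build_sft_example(
    "Summarise this article in three bullet points.",
    "• Point one • Point two • Point three",
)
for token, flag in zip(tokens, loss_mask):
    print(flag, token)
```

The training objective is still next-token prediction; fine-tuning only changes which text the model sees and which positions count.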

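Phase 3: RLHF (learn human preferences)

Reinforcement Learning from Human Feedback teaches the model what humans consider helpful, harmless, and honest. Human raters rank alternative responses to the same prompt, and the model is tuned towards the responses people prefer.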

Prompt: "How do I pick a lock?"

Response A: "Here are detailed instructions..."   ← Ranked lower
Response B: "I can't help with that because..."   ← Ranked higher

The model learns: safety and helpfulness matter.
💡

RLHF is what makes the difference between a model that just completes text and one that feels like a helpful assistant. It aligns the model with human values, but it's not perfect, which is why AI safety research remains critical.
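Rankings like the one above are commonly turned into a training signal through a separate reward model, which is trained so that the preferred response scores higher than the rejected one. The sketch below shows the widely used pairwise (logistic) loss for that step; the reward values are invented, and the later reinforcement-learning stage that tunes the LLM against the reward model is not shown.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: small when the preferred response scores higher."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical reward-model scores for the two responses above
good_ranking = preference_loss(reward_chosen=2.0, reward_rejected=-1.0)
bad_ranking = preference_loss(reward_chosen=-1.0, reward_rejected=2.0)

print(f"Reward model agrees with raters:    loss = {good_ranking:.3f}")
print(f"Reward model disagrees with raters: loss = {bad_ranking:.3f}")
```

Minimising this loss pushes the reward model to reproduce human rankings, so it can then score new responses without a human in the loop.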


Comparing Major Models 🏆

The LLM landscape evolves rapidly. Here are the major players:

┌──────────────┬─────────────┬────────────────────────────────────┐
│ Model Family │ Creator     │ Key Characteristics                │
├──────────────┼─────────────┼────────────────────────────────────┤
│ GPT-4/4o     │ OpenAI      │ Strong general reasoning,          │
│              │             │ multimodal (text + images)         │
├──────────────┼─────────────┼────────────────────────────────────┤
│ Claude       │ Anthropic   │ Safety-focused, long context,      │
│              │             │ strong at analysis and coding      │
├──────────────┼─────────────┼────────────────────────────────────┤
│ Llama        │ Meta        │ Open-weight, community-driven,     │
│              │             │ can run locally                    │
├──────────────┼─────────────┼────────────────────────────────────┤
│ Gemini       │ Google      │ Multimodal-native, integrated      │
│              │             │ with Google services               │
├──────────────┼─────────────┼────────────────────────────────────┤
│ Mistral      │ Mistral AI  │ Efficient, European-made,          │
│              │             │ strong for its size                │
└──────────────┴─────────────┴────────────────────────────────────┘

No single model is "best" at everything. The right choice depends on your task, budget, privacy requirements, and whether you need to run the model locally.


Capabilities and Limitations ⚖️

What LLMs Can Do Well

  • Generate fluent, coherent text in many languages
  • Summarise, translate, and transform text
  • Write and explain code
  • Answer questions using knowledge from training data
  • Reason through multi-step problems (with the right prompting)

What LLMs Struggle With

  • Hallucinations: confidently stating incorrect facts
  • Maths: unreliable with complex arithmetic without tools
  • Recency: knowledge has a training cutoff date
  • True reasoning: pattern-matching can look like reasoning but fail on novel problems
  • Privacy: they may memorise and regurgitate training data
🤔
Think about it:

LLMs are like brilliant but unreliable interns. They can draft amazing work, but you should always fact-check their output. Trust, but verify, especially for anything critical.


Token Economics and Context Windows 📊

LLMs don't read characters or words; they read tokens. In English text, a token is roughly 3 to 4 characters, or about ¾ of a word.

"Hello, how are you today?" → ["Hello", ",", " how", " are", " you", " today", "?"]
                               = 7 tokens

Rule of thumb:
  100 tokens   ≈  75 words
  1,000 tokens ≈  750 words  ≈  1.5 pages
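The rule of thumb translates directly into a quick estimator. This is only a rough sketch for English text; the helper names are hypothetical, and an exact count requires the specific model's tokenizer.

```python
def estimate_tokens(text):
    """Rough token count from the rule of thumb: 100 tokens per 75 words."""
    words = len(text.split())
    return round(words * 100 / 75)

def estimate_pages(tokens, tokens_per_page=1000 / 1.5):
    """Rough page count from the rule of thumb: 1,000 tokens per 1.5 pages."""
    return tokens / tokens_per_page

sentence = "Hello, how are you today?"
print(f"Estimated tokens: {estimate_tokens(sentence)}")
print(f"Pages for 1,000 tokens: {estimate_pages(1000):.1f}")
```

Estimates like this are useful for budgeting before a request is sent; the API response reports the exact usage afterwards.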

The context window is the maximum number of tokens an LLM can process at once (input + output). GPT-4o supports 128K tokens (roughly 190 pages by the rule of thumb above), and Claude handles 200K (roughly 300 pages). API pricing is per token, so understanding token economics is essential:

# Rough cost estimation
input_tokens = 1000    # Your prompt
output_tokens = 500    # Model's response
price_per_1k = 0.01   # Placeholder; varies by model and provider, and output tokens often cost more than input tokens

cost = ((input_tokens + output_tokens) / 1000) * price_per_1k
print(f"Cost per request: ${cost:.4f}")   # $0.0150
print(f"Cost for 10,000 requests: ${cost * 10000:.2f}")  # $150.00

Hands-On: Using an LLM API 🛠️

Here's how to call an LLM API in Python. This pattern works similarly across providers:

from openai import OpenAI

# Initialise the client (the SDK reads the API key from the OPENAI_API_KEY environment variable)
client = OpenAI()

# Send a request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful science tutor for teenagers."
        },
        {
            "role": "user",
            "content": "Explain photosynthesis in simple terms."
        }
    ],
    temperature=0.7,    # 0 = most predictable; higher = more varied
    max_tokens=300       # Limit response length
)

# Extract the reply
answer = response.choices[0].message.content
print(answer)

# Check token usage
usage = response.usage
print(f"Input tokens:  {usage.prompt_tokens}")
print(f"Output tokens: {usage.completion_tokens}")
print(f"Total tokens:  {usage.total_tokens}")

Key parameters explained:

temperature:  Controls randomness (0 = most predictable, higher = more varied)
max_tokens:   Limits response length (saves cost)
messages:     The conversation history (system + user + assistant turns)
model:        Which LLM to use
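What temperature does can be sketched with the textbook formulation: divide the model's raw scores by the temperature before converting them to probabilities. The scores below are invented, and how a given provider handles edge cases (such as a temperature of exactly 0, usually treated as "always pick the top token") is an assumption, not something this sketch implements.

```python
import math

def apply_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [3.0, 2.0, 1.5]  # hypothetical scores for "mat", "roof", "couch"
focused = apply_temperature(toy_logits, 0.2)
default = apply_temperature(toy_logits, 1.0)
varied = apply_temperature(toy_logits, 1.5)

for name, probs in [("T=0.2", focused), ("T=1.0", default), ("T=1.5", varied)]:
    print(name, [round(p, 2) for p in probs])
```

Low temperatures concentrate probability on the top token, which suits factual tasks; higher temperatures flatten the distribution, which suits brainstorming and creative writing.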
💡

The messages array is your conversation history. The model doesn't "remember" previous conversations; you send the full context every time. This is why context windows matter: longer conversations cost more tokens.
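Because the full history is resent on every call, long conversations eventually need trimming. The sketch below shows one simple strategy: keep the system message and as many of the most recent turns as fit a token budget. The helper names and the 4-characters-per-token estimate are assumptions for illustration; no API call is made.

```python
def estimate_tokens(text):
    # Rough rule of thumb (about 4 characters per token); a real tokenizer is more exact
    return max(1, len(text) // 4)

def trim_history(messages, max_tokens):
    """Keep the system message plus the most recent turns that fit the budget."""
    system, turns = messages[0], messages[1:]
    budget = max_tokens - estimate_tokens(system["content"])
    kept = []
    for message in reversed(turns):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return [system] + list(reversed(kept))

history = [
    {"role": "system", "content": "You are a helpful science tutor for teenagers."},
    {"role": "user", "content": "Explain photosynthesis in simple terms."},
    {"role": "assistant", "content": "Plants use sunlight, water, and carbon dioxide to make sugar and oxygen."},
    {"role": "user", "content": "Why are leaves green?"},
]

trimmed = trim_history(history, max_tokens=35)
print(f"Kept {len(trimmed)} of {len(history)} messages")
```

Dropping the oldest turns is the bluntest option; summarising earlier turns into a single message is a common alternative when older context still matters.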


Quick Recap 🎯

  1. LLMs are neural networks trained on trillions of tokens to predict the next token; at scale, this simple objective produces surprisingly capable behaviour
  2. Transformers use self-attention to understand context, processing all tokens in parallel
  3. Training pipeline: pretraining (learn language) → fine-tuning (follow instructions) → RLHF (align with human values)
  4. Major models (GPT, Claude, Llama, Gemini, Mistral) each have different strengths; there is no single "best" model
  5. LLMs are powerful but not perfect: hallucinations, maths errors, and knowledge cutoffs are real limitations
  6. Tokens are the currency of LLMs; understanding them helps you manage cost and context

What's Next? 🚀

You now understand the engines, but knowing how to drive them is equally important. In the next lesson, we'll master prompt engineering: the art and science of getting the best results from LLMs through carefully crafted instructions. ✨
