AI Canopy • Advanced • ⏱️ 20 min read

Large Language Models - The Engines Behind Modern AI

Large language models have fundamentally changed what machines can do with text, code, and reasoning. In this lesson, we'll pull back the curtain on how they work, why they're so powerful, and where they still fall short.

What Exactly Is an LLM?

A large language model is a neural network trained on enormous quantities of text to predict the next token in a sequence. "Large" refers to the number of parameters - learnable weights that encode patterns from training data.

  • GPT-4 is estimated to have over 1 trillion parameters.
  • Llama 3 comes in sizes from 8 billion to 405 billion parameters.
  • Mistral 7B shows that smaller models can punch well above their weight.

These models are trained on hundreds of billions of words drawn from books, websites, code repositories, and academic papers. The sheer scale of data and parameters is what gives LLMs their remarkable versatility.

Tokens, not words: LLMs don't process whole words - they work with tokens, which are sub-word units. The word "understanding" might be split into "under" + "stand" + "ing". A typical English word averages about 1.3 tokens. This tokenisation scheme allows models to handle rare words, technical jargon, and even multiple languages without an impossibly large vocabulary.
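
To make sub-word tokenisation concrete, here is a toy greedy tokenizer in Python. The vocabulary is invented for illustration - real tokenizers (BPE, SentencePiece) learn their vocabularies from data and behave differently - but the longest-match idea shows how "understanding" can fall apart into familiar pieces.

```python
# Toy sub-word tokenizer: greedily matches the longest known piece.
# The vocabulary below is invented for illustration only -- real
# tokenizers learn theirs from massive corpora.

VOCAB = {"under", "stand", "ing", "token", "s", "un", "der"}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible match first.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own piece.
            pieces.append(word[i])
            i += 1
    return pieces

print(tokenize("understanding"))  # ['under', 'stand', 'ing']
print(tokenize("tokens"))         # ['token', 's']
```

Because every single character can fall back to its own piece, the model never hits an "unknown word" - exactly the property that lets LLMs handle rare words and jargon.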

🤯

Training GPT-4 is estimated to have cost over $100 million in compute alone. That's roughly the budget of a mid-range Hollywood film - except the "actor" can speak every programming language.

The Transformer Architecture

Every modern LLM is built on the transformer, introduced in the 2017 paper Attention Is All You Need. The key innovation is self-attention - a mechanism that lets the model weigh how important each word is relative to every other word in the input.

How self-attention works (simplified):

  1. Each token is converted into three vectors: Query, Key, and Value.
  2. The model computes attention scores by comparing every Query against every Key.
  3. These scores determine how much each token "pays attention" to every other token.
  4. The weighted sum of Values produces a context-aware representation.

The mathematical formula at the heart of attention is:

    Attention(Q, K, V) = softmax(QK^T / √d) × V

The division by √d (the square root of the key dimension) prevents the dot products from growing too large, which would push the softmax function into regions with extremely small gradients. This simple scaling trick was one of the key insights that made training deep transformers stable.

This is repeated across multiple attention heads in parallel, allowing the model to capture different types of relationships simultaneously - syntax in one head, semantics in another.
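
The four steps above can be sketched directly in plain Python. This is a minimal single-head version with hand-picked 2-dimensional vectors; real models use hundreds of dimensions and learn Q, K, V as linear projections of the token embeddings.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, V are lists of vectors, one per token; d is the key dimension.
    """
    d = len(K[0])
    out = []
    for q in Q:
        # Attention scores: this token's Query against every Key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        # Context-aware output: weighted sum of the Value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out

# Three tokens with tiny 2-dimensional example vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(attention(Q, K, V))  # one context vector per input token
```

Note the nested loop over queries and keys: for n tokens it performs n² comparisons, which is exactly the quadratic cost discussed below.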

[Diagram: the transformer self-attention mechanism with Query, Key, and Value vectors.]
Self-attention allows every token to attend to every other token, capturing long-range dependencies.

🤔 Think about it:

If a sentence has 500 tokens, self-attention compares every token with every other - that's 250,000 comparisons per layer. How might this quadratic cost affect what LLMs can process?

Key Models in the Landscape

| Model | Creator | Notable Feature |
|-------|---------|-----------------|
| GPT-4o | OpenAI | Multimodal (text, image, audio) |
| Claude 4 | Anthropic | Extended thinking, safety-focused |
| Gemini 2.5 | Google DeepMind | Native multimodality, long context |
| Llama 3 | Meta | Open-weight, community-driven |
| Mistral Large | Mistral AI | Efficient European alternative |

The field moves extraordinarily fast - by the time you read this, newer models may already exist.

A crucial distinction in this landscape is between closed-source models (GPT-4o, Claude, Gemini), where only the API is available, and open-weight models (Llama, Mistral), where the model weights are publicly released. Open-weight models allow organisations to run inference on their own infrastructure, fine-tune for specific domains, and inspect the model's behaviour - advantages that matter greatly for privacy-sensitive industries.

💡

Not all "open" models are truly open source. Some release weights but restrict commercial use or don't share training data and code. Always check the licence before deploying an open-weight model in production.

Pre-Training, Fine-Tuning, and RLHF

LLMs are built in stages:

  1. Pre-training - The model reads vast amounts of text and learns to predict the next token. This stage is enormously expensive and produces a base model that can complete text but isn't particularly helpful.

  2. Supervised fine-tuning (SFT) - Human-written examples of ideal responses teach the model to follow instructions and answer questions properly.

  3. RLHF (Reinforcement Learning from Human Feedback) - Human raters rank model outputs from best to worst. A reward model learns these preferences, and the LLM is trained to maximise the reward. This is what makes models helpful, harmless, and honest.
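
The pre-training objective in stage 1 - predict the next token - can be made concrete with a deliberately tiny stand-in. The count-based bigram model below is *not* how LLMs are implemented (they use deep neural networks over billions of documents), but it is the same "what usually comes next?" objective in miniature.

```python
from collections import Counter, defaultdict

# Toy "pre-training": count which token follows which in a tiny corpus.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the continuation seen most often in training."""
    return follows[token].most_common(1)[0][0]

print(predict_next("sat"))  # 'on'
print(predict_next("on"))   # 'the'
```

Even this toy exposes the core limitation RLHF later addresses: a pure next-token model completes text plausibly, but nothing in the objective makes it helpful or honest.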

🧠 Quick Check

What is the primary purpose of RLHF in LLM training?

Emergent Capabilities

As models scale up, they develop abilities that weren't explicitly trained:

  • Complex reasoning - Multi-step logical deduction and mathematical problem-solving.
  • Code generation - Writing, debugging, and explaining code across dozens of languages.
  • Multilingual fluency - Translating and generating text in languages with relatively little training data.
  • In-context learning - Adapting behaviour based on examples provided in the prompt, without any weight updates.

These emergent properties are one of the most fascinating aspects of scaling - abilities that appear almost "for free" once a model crosses certain size thresholds.
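
In-context learning is easiest to see in a few-shot prompt. The reviews and labels below are invented, and `call_llm` is a hypothetical stand-in for whichever LLM API you actually use - the point is that the examples steer the model at inference time, with no weight updates.

```python
# Few-shot prompting: the examples in the prompt define the task.
# The reviews/labels are invented; `call_llm` is a hypothetical API.
examples = [
    ("The battery died after two days.", "negative"),
    ("Setup took thirty seconds. Flawless.", "positive"),
]

def build_prompt(text: str) -> str:
    """Assemble labelled examples plus the new input into one prompt."""
    shots = "\n".join(f"Review: {r}\nSentiment: {s}" for r, s in examples)
    return f"{shots}\nReview: {text}\nSentiment:"

prompt = build_prompt("Arrived broken and support never replied.")
print(prompt)
# response = call_llm(prompt)  # hypothetical API call
```

The model has never been fine-tuned on this task; it infers the pattern from the two examples alone.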

🤯

GPT-4 passed the Uniform Bar Exam in the 90th percentile - better than most human law graduates. Yet it can still struggle with basic arithmetic if the numbers are unusual enough.

Limitations You Must Understand

LLMs are powerful but far from perfect:

  • Hallucinations - Models generate confident-sounding text that is factually wrong. They don't "know" facts; they predict likely token sequences.
  • Context window limits - Each model has a maximum input size (e.g., 128K tokens for GPT-4o). Information beyond this window is simply invisible.
  • Cost and latency - Running inference on large models requires expensive GPU clusters. A single GPT-4 query costs significantly more than a GPT-3.5 query.
  • Lack of true understanding - LLMs manipulate statistical patterns in text. Whether this constitutes "understanding" is a deep philosophical debate.
  • Training data cutoffs - Models don't know about events after their training data was collected unless augmented with retrieval systems.

💡

Always verify critical facts from LLM outputs against authoritative sources. Treat LLMs as a brilliant but unreliable research assistant - extraordinarily useful, but never the final word on matters of fact.
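
The context-window limit is one constraint you can guard against in code. Here is a rough sketch: it estimates token counts with the ~1.3 tokens-per-word rule of thumb mentioned earlier, whereas a real application would count with the model's actual tokenizer. The window size and reserve are example numbers.

```python
CONTEXT_WINDOW = 128_000  # example limit, in tokens (e.g. GPT-4o class)

def estimate_tokens(text: str) -> int:
    """Rough estimate using the ~1.3 tokens-per-word rule of thumb.
    A real application should count with the model's own tokenizer."""
    return int(len(text.split()) * 1.3)

def fits_in_window(prompt: str, reserved_for_reply: int = 4_000) -> bool:
    """Check that the prompt leaves room for the model's response."""
    return estimate_tokens(prompt) + reserved_for_reply <= CONTEXT_WINDOW

print(fits_in_window("Summarise this report: ..."))  # True
```

Anything that doesn't fit must be truncated, summarised, or fetched on demand via retrieval - to the model, out-of-window text simply does not exist.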

🧠 Quick Check

Why do LLMs sometimes produce hallucinations?

The Scaling Debate

For years, the dominant belief was that bigger is always better - more parameters, more data, and more compute would steadily improve performance. This "scaling law" held remarkably well through GPT-2, GPT-3, and GPT-4.

However, recent research suggests we may be approaching diminishing returns on raw scale alone. The focus is shifting towards:

  • Better data quality over quantity (curated, deduplicated datasets).
  • Inference-time compute - letting models "think longer" on hard problems rather than making the model itself larger.
  • Specialised architectures - mixture-of-experts models that activate only a fraction of parameters per query, improving efficiency without sacrificing capability.
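
The mixture-of-experts idea in the last bullet boils down to routing: a gate scores every expert, and only the top-k actually run for a given token. The sketch below uses made-up gate scores; in a real MoE layer the router is a small learned network and the "experts" are full feed-forward blocks.

```python
import math

def top_k_route(gate_scores, k=2):
    """Pick the k highest-scoring experts and softmax-normalise
    their scores into mixing weights.

    `gate_scores` here are example numbers; a real mixture-of-experts
    layer produces them with a learned router network.
    """
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)[:k]
    exps = [math.exp(gate_scores[i]) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]

# 8 experts, but only 2 run for this token -- most parameters stay idle.
scores = [0.1, 2.3, -0.5, 1.7, 0.0, -1.2, 0.4, 0.9]
print(top_k_route(scores, k=2))  # experts 1 and 3 carry this token
```

This is why an MoE model can have a huge total parameter count yet a much smaller per-query compute cost: most experts are simply never activated.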

Wrapping Up

Large language models represent a genuine paradigm shift in computing. Understanding their architecture, training pipeline, and limitations isn't optional for anyone working seriously with AI - it's foundational knowledge that will serve you well as the field continues to evolve.

🧠 Quick Check

Which stage of LLM training is the most computationally expensive?

🤔 Think about it:

If an LLM can pass the bar exam but can't reliably count the number of letters in a word, what does that tell us about the difference between human intelligence and language model capabilities?