AI Educademy
AI Sprouts • Intermediate • 14 min read

Tokenisation

Tokenisation - How AI Reads Text

Neural networks work with numbers. They cannot read the word "hello" the way you do. Before any language model can process text, the text must be broken into small pieces called tokens, each of which is mapped to a number. This seemingly simple step has profound consequences for how AI understands - and misunderstands - language.

Why Can't AI Just Read Characters?

The simplest approach: treat each character as a token. "Hello" becomes ['H', 'e', 'l', 'l', 'o'] - five tokens.

The problem? Text becomes an absurdly long sequence. A 500-word essay might become 2,500+ character tokens. Since the cost of self-attention in Transformer models grows quadratically with sequence length, this is computationally brutal. Worse, individual characters carry almost no meaning - the model must learn that 'c', 'a', 't' together mean a furry animal.
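To make the scaling argument concrete, here is a minimal sketch. The function names are illustrative, and the pair count is a simplification of attention cost (it ignores constants and every other part of the model).

```python
def char_tokenise(text: str) -> list[str]:
    """Character-level tokenisation: every character is its own token."""
    return list(text)


def attention_pairs(sequence_length: int) -> int:
    """Number of token-to-token comparisons in full self-attention (n * n)."""
    return sequence_length * sequence_length


print(char_tokenise("Hello"))

# The lesson's example: roughly 500 word tokens versus 2,500 character tokens.
word_level_length = 500
char_level_length = 2_500
ratio = attention_pairs(char_level_length) / attention_pairs(word_level_length)
print(f"Sequence is 5x longer, attention work is {ratio:.0f}x larger")
```

A sequence that is 5 times longer needs 25 times as many comparisons, which is why character-level input gets expensive quickly.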

Word-Level Tokenisation

The opposite extreme: each word is one token. "The cat sat" becomes ['The', 'cat', 'sat'] - compact and meaningful.

But this creates a different problem: the vocabulary explosion. English alone has hundreds of thousands of words. Add misspellings, technical jargon, and code, and the vocabulary becomes unmanageable. Any word not in the vocabulary becomes an unknown [UNK] token - a dead end for understanding.
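The [UNK] problem is easy to reproduce. The sketch below is a toy word-level tokeniser; the two-sentence corpus and the helper names are made up for illustration.

```python
UNK = "[UNK]"


def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an integer id to every distinct word seen in the corpus."""
    vocab = {UNK: 0}
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence: str, vocab: dict[str, int]) -> list[int]:
    """Map each word to its id; unseen words all collapse to the [UNK] id."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]


vocab = build_vocab(["The cat sat", "The dog sat"])
print(vocab)
print(encode("The cat met ChatGPT", vocab))
```

Both "met" and "ChatGPT" map to the same id 0, so the model cannot tell them apart.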

🤔 Think about it:

If a model using word-level tokens encounters "ChatGPT" for the first time and it is not in the vocabulary, it becomes [UNK]. How might this affect the model's ability to discuss new technology?

The Sweet Spot - Subword Tokenisation

Modern language models use subword tokenisation, which sits between characters and words. Common words stay whole ("the", "and"), while rare words are split into meaningful pieces ("un" + "believ" + "able").

This gives us a manageable vocabulary (typically 32,000–100,000 tokens) while handling any text - even words the model has never seen before.

Figure: the word "unbelievable" split into three subword tokens ("un", "believ", "able"), with arrows showing how they recombine. Subword tokenisation splits rare words into reusable pieces while keeping common words whole.
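As a rough illustration of why a subword vocabulary can handle any text, the sketch below segments a word greedily, taking the longest vocabulary entry at each position and falling back to single characters. The tiny vocabulary is hypothetical, and real tokenisers apply learned merge rules rather than this simple lookup.

```python
def segment(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation with single-character fallback."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or is one character.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


toy_vocab = {"the", "and", "un", "believ", "able"}  # hypothetical vocabulary

print(segment("the", toy_vocab))           # common word stays whole
print(segment("unbelievable", toy_vocab))  # rare word splits into known pieces
print(segment("xyz", toy_vocab))           # unseen text falls back to characters
```

Because single characters are always allowed, nothing ever becomes an unknown token.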

Byte Pair Encoding (BPE) - Step by Step

BPE is the algorithm behind GPT models. Here is how it builds a vocabulary:

  1. Start with individual characters: {'h', 'e', 'l', 'o', 'w', 'r', 'd', ' '}.
  2. Count which pairs of adjacent tokens appear most frequently in the training text.
  3. Merge the most frequent pair into a new token. If 'l' + 'o' appears most, create 'lo'.
  4. Repeat steps 2–3 until you reach the desired vocabulary size.

Worked example with the text "low lower lowest":

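| Step | Most frequent pair | New token | Vocabulary grows |
|------|-------------------|-----------|-----------------|
| 1 | l + o | lo | ...lo... |
| 2 | lo + w | low | ...low... |
| 3 | low + e | lowe | ...lowe... |
| 4 | lowe + r | lower | ...lower... |

Pairs are counted within words. Steps 1 and 4 involve ties (l + o and o + w each occur three times; lowe + r, lowe + s, and s + t each occur once), and here a tie goes to the pair seen first. At step 3, low + e wins because it occurs twice, while e + r occurs only once.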

After enough merges, common words and word fragments emerge naturally from the data.
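The merge loop described above can be sketched in a few lines. This is a toy trainer under stated assumptions, not the GPT implementation: it counts pairs within whitespace-separated words, breaks ties by first appearance, and uses illustrative function names.

```python
from collections import Counter


def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    """Return the list of merges learned from `text`, in order."""
    # Each word starts as a list of single-character tokens.
    words = [list(word) for word in text.split()]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in words:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break  # nothing left to merge
        # max() returns the first pair with the highest count (first-seen tie-break).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        words = [merge_pair(symbols, best) for symbols in words]
    return merges


for step, (left, right) in enumerate(train_bpe("low lower lowest", 4), start=1):
    print(f"Step {step}: {left} + {right} -> {left + right}")
```

Run on "low lower lowest", this sketch merges l + o, then lo + w, then low + e, then lowe + r. A tokeniser trained on a large corpus would see very different counts and learn different merges.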

🤯 BPE was originally invented in 1994 as a data compression algorithm. It was repurposed for NLP in 2015 by Sennrich et al. - a beautiful example of ideas crossing disciplines.

Other Tokenisation Methods

WordPiece

Used by BERT and related models. Similar to BPE, but instead of merging the most frequent pair, it merges the pair that maximises the likelihood of the training data. Subword pieces that continue a word are prefixed with ## (e.g., "playing" → ['play', '##ing']).
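The ## convention is easiest to see at tokenisation time. The sketch below follows the greedy longest-match-first strategy that BERT-style WordPiece tokenisers use, but the function name and the tiny vocabulary are made up for illustration, and it does not cover the training step.

```python
def wordpiece_tokenise(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first split; non-initial pieces carry a ## prefix."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"play", "##ing", "##ed"}  # hypothetical vocabulary

print(wordpiece_tokenise("playing", toy_vocab))
print(wordpiece_tokenise("played", toy_vocab))
print(wordpiece_tokenise("xyz", toy_vocab))
```

Real WordPiece vocabularies normally include individual characters, so the [UNK] fallback is rare in practice.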

SentencePiece

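Treats the input as a raw character stream in which whitespace is just another symbol, so no pre-tokenisation by spaces is needed. This is crucial for languages like Japanese and Chinese that do not use spaces between words. LLaMA (versions 1 and 2) and T5 use SentencePiece; GPT-4 instead uses byte-level BPE through the tiktoken library.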

How GPT-4 Tokenises Text

GPT-4 uses a byte-level BPE encoding called cl100k_base with roughly 100,000 tokens in its vocabulary. Some surprising behaviours:

  • "Hello world" → 2 tokens ("Hello" and " world" - note the leading space is attached to the second token).
  • "indivisibility" → several tokens (pieces such as ind, iv, isibility - rare words are split; the exact count depends on the tokeniser).
  • A single emoji 🎉 → often 1–3 tokens.
  • Python code def hello(): → each keyword and symbol is typically its own token.
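One reason an emoji can cost more than one token: a byte-level BPE works on UTF-8 bytes, and an emoji is several bytes long. The sketch below only counts bytes with the standard library. It does not run the GPT-4 tokeniser, and how many of those bytes are merged into one token depends on the learned vocabulary.

```python
def utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 byte values that a byte-level tokeniser starts from."""
    return list(text.encode("utf-8"))


party_popper = "\U0001F389"  # the party popper emoji used in the list above

samples = [("H", "H"), ("e with acute accent", "\u00e9"), ("party popper emoji", party_popper)]
for label, sample in samples:
    print(f"{label}: {len(utf8_bytes(sample))} byte(s)")
```

A plain ASCII letter is one byte, while the emoji is four. Unless the vocabulary has learned merges covering all four bytes, it takes more than one token.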
🧠 Quick Check

Why do language models use subword tokenisation instead of whole words?

The Vocabulary Size Trade-Off

| Vocabulary size | Pros | Cons |
|----------------|------|------|
| Small (8k) | Smaller model, fewer embeddings | Longer sequences, slower processing |
| Large (100k+) | Shorter sequences, richer tokens | Larger embedding table, more memory |

Finding the right balance is an engineering decision that affects model speed, memory, and capability.
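The memory side of the trade-off is simple arithmetic: the embedding table has one row per token. The sketch below assumes a hypothetical embedding width of 4,096 and 2 bytes per parameter (16-bit floats). Both numbers are illustrative and do not describe any particular model.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in the embedding table: one d_model-sized row per token."""
    return vocab_size * d_model


def embedding_megabytes(vocab_size: int, d_model: int, bytes_per_param: int = 2) -> float:
    """Approximate table size in megabytes (10**6 bytes)."""
    return embedding_params(vocab_size, d_model) * bytes_per_param / 1_000_000


D_MODEL = 4_096  # assumed embedding width, for illustration only

for vocab_size in (8_000, 100_000):
    params = embedding_params(vocab_size, D_MODEL)
    size_mb = embedding_megabytes(vocab_size, D_MODEL)
    print(f"vocab {vocab_size:>7,}: {params:>11,} parameters, about {size_mb:,.1f} MB")
```

Under these assumptions the 100k vocabulary needs 12.5 times the embedding memory of the 8k one, in exchange for shorter sequences.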

Multilingual Challenges

Tokenisers trained primarily on English text are biased. The same sentence in Hindi or Arabic may require 3–5× more tokens than its English equivalent, because those scripts were underrepresented in training data. This means:

  • Non-English users hit context limits sooner.
  • API costs are higher per word for non-English text.
  • The model has less "thinking space" for non-English reasoning.
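A back-of-the-envelope sketch of the first point, using the 3–5× range from this section. The 8,000-token context window and the function name are hypothetical, chosen only to keep the numbers readable.

```python
def english_equivalent_budget(context_tokens: int, token_multiplier: float) -> float:
    """How much English-equivalent content fits when a language needs
    `token_multiplier` times as many tokens for the same sentence."""
    return context_tokens / token_multiplier


CONTEXT_TOKENS = 8_000  # hypothetical context window

for multiplier in (1, 3, 5):
    budget = english_equivalent_budget(CONTEXT_TOKENS, multiplier)
    print(f"{multiplier}x tokens -> room for about {budget:,.0f} English-equivalent tokens")
```

At a 5× multiplier, the same window holds only a fifth of the content, and each sentence costs five times as much to send.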
🧠 Quick Check

Why might the same sentence cost more API tokens in Hindi than in English?

Token Counting and Cost Implications

Every API call to GPT-4, Claude, or Gemini is billed per token. Understanding tokenisation helps you:

  • Estimate costs before running large jobs.
  • Optimise prompts - shorter prompts with the same meaning save money.
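  • Respect context windows - GPT-4 Turbo accepts 128k tokens; a request that exceeds the limit is typically rejected with an error, unless your application truncates the input first, which can silently drop content.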

A rough rule of thumb for English: 1 token ≈ ¾ of a word, or about 4 characters.
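The rule of thumb turns into a quick estimator. This is a sketch for rough budgeting only: the function names are illustrative, the price is a made-up placeholder rather than any provider's real rate, and an exact count requires running the actual tokeniser.

```python
import math


def estimate_tokens_from_words(word_count: int) -> int:
    """1 token is about 3/4 of a word, so tokens are about words * 4 / 3."""
    return math.ceil(word_count * 4 / 3)


def estimate_tokens_from_chars(char_count: int) -> int:
    """1 token is about 4 characters of English text."""
    return math.ceil(char_count / 4)


def estimate_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Cost of `tokens` at a given price per one million tokens."""
    return tokens * price_per_million_tokens / 1_000_000


HYPOTHETICAL_PRICE = 10.0  # placeholder price per million tokens, not a real rate

tokens = estimate_tokens_from_words(600)
print(f"600 words is roughly {tokens} tokens")
print(f"2,000 characters is roughly {estimate_tokens_from_chars(2_000)} tokens")
print(f"Estimated cost: {estimate_cost(tokens, HYPOTHETICAL_PRICE):.4f}")
```

The two estimates will not always agree, since average word length varies. Treat either as an order-of-magnitude guide.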

🧠 Quick Check

Approximately how many tokens is a 1,000-word English essay?

🤯 OpenAI's open-source tiktoken library lets you tokenise text locally with the exact same algorithm GPT-4 uses. Try it on your own writing to see how many tokens your messages really cost.

🤔 Think about it:

If you were building a language model for a low-resource language like Welsh, how would you approach tokenisation to ensure fair and efficient encoding?

Key Takeaways

  • Tokenisation converts raw text into numerical tokens that models can process.
  • BPE builds a vocabulary by iteratively merging the most frequent pair of adjacent tokens, starting from single characters.
  • Subword tokenisation balances vocabulary size with the ability to handle any text.
  • Tokeniser bias disadvantages non-English languages in cost and capability.
  • Understanding tokens helps you estimate costs and optimise prompts.

📚 Further Reading

  • Andrej Karpathy - nn-zero-to-hero (Tokenizer lecture) - Build a BPE tokeniser from scratch alongside Karpathy
  • OpenAI Tokenizer Tool - Interactive tool to see how GPT models tokenise your text
  • Hugging Face - Summary of Tokenizers - Clear comparison of BPE, WordPiece, and SentencePiece