AI Sprouts • Intermediate • ⏱️ 16 min read

Embeddings and Vector Databases

Embeddings - How AI Understands Meaning

After tokenisation, each token is just a number - an index in a vocabulary. But index 4,821 tells the model nothing about meaning. How does AI know that "king" and "queen" are related, or that "bank" can mean a riverbank or a financial institution? The answer is embeddings.

The Problem with One-Hot Encoding

The naive approach represents each word as a vector with one 1 and thousands of 0s. "Cat" might be [0, 0, 1, 0, ..., 0] and "dog" [0, 0, 0, 1, ..., 0].

This has two fatal flaws:

  • No similarity: "cat" and "dog" are exactly as far apart as "cat" and "democracy." The encoding captures zero semantic information.
  • Massive size: With a 50,000-word vocabulary, every word needs a 50,000-dimensional vector. Wildly inefficient.
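Both flaws are easy to see in code. This is a minimal sketch with a toy five-word vocabulary: every pair of distinct one-hot vectors is exactly the same distance apart, so no similarity structure survives.

```python
import numpy as np

# Toy vocabulary of 5 words; each gets a one-hot vector
# (a single 1 in its own position, 0s everywhere else).
vocab = ["cat", "dog", "fish", "democracy", "table"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

# Every pair of distinct words is exactly √2 apart -
# "cat" is no closer to "dog" than to "democracy".
print(euclidean(one_hot["cat"], one_hot["dog"]))
print(euclidean(one_hot["cat"], one_hot["democracy"]))
```

With a real 50,000-word vocabulary, `np.eye` would allocate a 50,000 × 50,000 matrix, which illustrates the size problem too.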

Word Embeddings - Meaning as Geometry

An embedding maps each token to a dense vector of, say, 256 or 768 dimensions. Unlike one-hot vectors, these dimensions are learned during training and encode meaning.

Words used in similar contexts end up close together in this space. "Puppy" lands near "kitten." "London" lands near "Paris." The geometry of the space is the meaning.

[Figure: a 2D projection of word embeddings showing clusters: animals (cat, dog, fish) grouped together, cities (London, Paris, Tokyo) grouped together, and the famous king-queen analogy as vector arithmetic.]
In embedding space, meaning becomes geometry. Similar concepts cluster together.

Word2Vec - King โˆ’ Man + Woman = Queen

The 2013 Word2Vec paper showed something remarkable. Trained on large text corpora, the learned vectors exhibit arithmetic relationships:

vector("king") โˆ’ vector("man") + vector("woman") โ‰ˆ vector("queen")

The direction from "man" to "woman" captures the concept of gender. Adding it to "king" moves to "queen." This is not programmed - it emerges from patterns in language.
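You can reproduce the arithmetic with hand-built toy vectors. These 2D embeddings are constructed for illustration (real Word2Vec vectors are learned from text, with hundreds of dimensions): axis 0 stands in for gender, axis 1 for royalty.

```python
import numpy as np

# Hand-constructed 2D toy embeddings, NOT learned vectors:
# axis 0 encodes gender, axis 1 encodes royalty.
vecs = {
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
}

result = vecs["king"] - vecs["man"] + vecs["woman"]

# Find the vocabulary word closest to the result vector.
nearest = min(vecs, key=lambda w: float(np.linalg.norm(vecs[w] - result)))
print(nearest)  # queen
```

Subtracting "man" removes the male direction, adding "woman" adds the female one, and the royalty component carries over untouched.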

Other examples:

  • Paris − France + Italy ≈ Rome
  • bigger − big + small ≈ smaller
๐Ÿคฏ

Word2Vec was created by Tomรกลก Mikolov at Google in 2013. The paper has over 40,000 citations and is considered one of the most influential NLP papers ever published. It demonstrated that simple neural networks trained on raw text could learn astonishing semantic relationships.

Embedding Dimensions

Modern models use different embedding sizes:

| Model | Embedding dimensions |
|-------|---------------------|
| Word2Vec | 100–300 |
| BERT | 768 |
| GPT-3 | 12,288 |
| OpenAI text-embedding-3-large | 3,072 |

More dimensions capture finer distinctions but require more memory and compute. Think of it like describing a person: 3 dimensions (height, weight, age) give a rough sketch; 768 dimensions paint a detailed portrait.

๐Ÿง Quick Check

What does the famous equation 'king โˆ’ man + woman โ‰ˆ queen' demonstrate?

From Words to Sentences

Word embeddings represent individual words, but we often need to compare entire sentences or documents. Sentence embeddings (from models like Sentence-BERT or OpenAI's embedding API) compress a whole passage into a single vector.

"How do I reset my password?" and "I forgot my login credentials" would have very similar sentence embeddings, even though they share almost no words. The embedding captures intent, not just vocabulary.

Measuring Similarity - Cosine Similarity

To compare two embeddings, we use cosine similarity - the cosine of the angle between two vectors. It ranges from โˆ’1 (opposite) to +1 (identical direction).

  • "Happy" and "joyful": cosine โ‰ˆ 0.85 (very similar).
  • "Happy" and "table": cosine โ‰ˆ 0.10 (unrelated).
  • "Love" and "hate": cosine might be โ‰ˆ 0.40 (related but opposite).

Cosine similarity ignores vector magnitude, focusing purely on direction - which is where meaning lives.
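The formula is just the dot product divided by the vector magnitudes. A minimal implementation, using made-up vectors to show the magnitude-invariance property:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: a·b / (|a| |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))  # 1.0 - same direction, magnitude ignored
print(cosine_similarity(a, -a))     # -1.0 - opposite direction
```

Doubling a vector's length leaves the similarity at exactly 1.0, which is why cosine similarity is the standard choice for comparing embeddings.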

๐Ÿค”
Think about it:

"Love" and "hate" are opposites in meaning but might have moderate cosine similarity because they appear in similar contexts (emotions, relationships). What does this tell us about the limitations of embeddings trained purely on word co-occurrence?

Vector Databases - Search by Meaning

A vector database stores millions of embeddings and retrieves the most similar ones blazingly fast. Instead of keyword matching ("find documents containing 'machine learning'"), you search by meaning ("find documents about AI education").

Popular vector databases include:

  • Pinecone - fully managed, scales effortlessly.
  • Weaviate - open-source with hybrid search (vectors + keywords).
  • ChromaDB - lightweight, great for prototyping.
  • pgvector - adds vector search to PostgreSQL.

These databases use algorithms like HNSW (Hierarchical Navigable Small World) to search billions of vectors in milliseconds.
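Conceptually, what a vector database does can be sketched as a brute-force nearest-neighbour scan. The sketch below uses made-up 2D "document" vectors; a real vector database replaces this O(n) loop with an approximate index such as HNSW, but the query interface is the same idea.

```python
import numpy as np

def top_k(query, index, k=2):
    """Brute-force nearest-neighbour search by cosine similarity.
    Returns the indices of the k most similar rows of `index`."""
    norms = np.linalg.norm(index, axis=1) * np.linalg.norm(query)
    sims = index @ query / norms
    return np.argsort(-sims)[:k]  # sort descending, take top k

# Pretend these rows are document embeddings.
index = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.0, 1.0]])
query = np.array([1.0, 0.05])

print(top_k(query, index))  # rows 0 and 1 point in nearly the same direction
```

At billions of vectors this scan becomes far too slow, which is exactly the problem HNSW-style indexes solve by trading a little accuracy for sub-millisecond lookups.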

๐Ÿง Quick Check

What advantage does vector search have over traditional keyword search?

RAG - Retrieval-Augmented Generation

RAG is one of the most important patterns in modern AI. It combines vector search with language models:

  1. Embed your documents and store them in a vector database.
  2. When a user asks a question, embed the query.
  3. Retrieve the most similar document chunks via vector search.
  4. Feed those chunks to the language model as context.
  5. The model generates an answer grounded in your data.

RAG lets language models answer questions about your specific data - company documents, product catalogues, research papers - without retraining. It dramatically reduces hallucination because the model has real sources to reference.
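The five steps above can be sketched end-to-end. Everything here is a stand-in for illustration: `embed` is a toy bag-of-words function (a real system would call an embedding model), a NumPy array plays the vector database, and the final generation step is left as a prompt string rather than an actual model call.

```python
import re
import numpy as np

DOCS = [
    "To reset your password, open Settings and choose Security.",
    "Our refund policy allows returns within 30 days.",
    "The office is closed on public holidays.",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

VOCAB = sorted({w for d in DOCS for w in tokenize(d)})

def embed(text):
    """Toy bag-of-words embedding - a stand-in for a real
    embedding model such as an embeddings API."""
    words = tokenize(text)
    return np.array([words.count(w) for w in VOCAB], dtype=float)

# Step 1: embed the documents; `index` plays the vector database.
index = np.array([embed(d) for d in DOCS])

# Steps 2-3: embed the query and retrieve the most similar chunk.
query = "How do I reset my password?"
q = embed(query)
sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q) + 1e-9)
best = DOCS[int(np.argmax(sims))]

# Steps 4-5: feed the retrieved chunk to the language model as context.
prompt = f"Context: {best}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```

The retrieval correctly surfaces the password document, and the model would then answer from that context instead of from its training data alone.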

๐Ÿง Quick Check

In a RAG system, what role does the vector database play?

Practical Applications

Embeddings power countless real-world systems:

  • Semantic search - find relevant results regardless of exact wording.
  • Recommendations - "users who liked this also liked..." via embedding similarity.
  • Clustering - group similar support tickets, reviews, or documents automatically.
  • Anomaly detection - spot outliers that are far from any cluster.
  • Duplicate detection - find near-identical content across large corpora.
๐Ÿคฏ

Spotify uses audio embeddings to recommend songs. Each track is embedded based on its acoustic features, and recommendations come from finding nearby vectors - songs that "sound similar" in embedding space.

๐Ÿค”
Think about it:

If you embedded every product in an online shop, how could you build a recommendation system that says "customers who viewed this item might also like..." without relying on purchase history?

Key Takeaways

  • Embeddings are dense vector representations where meaning becomes geometry.
  • Similar concepts cluster together; relationships appear as directions.
  • Cosine similarity measures how close two meanings are.
  • Vector databases enable search by meaning at massive scale.
  • RAG combines vector search with language models to answer questions from your own data.

๐Ÿ“š Further Reading

  • Jay Alammar - The Illustrated Word2Vec - Visual, intuitive walkthrough of how word embeddings work
  • Pinecone Learning Centre - What Are Embeddings? - Practical guide to embeddings and vector search
  • OpenAI Embeddings Guide - How to generate and use embeddings with the OpenAI API