You already know that AI can recognise faces, translate languages, and recommend songs. But what actually powers all of that? Data. Without data, AI is like a car without fuel - it simply cannot go anywhere.
In this lesson, we will explore what datasets look like, the different types of data AI uses, and why the quality of that data matters enormously.
A dataset is an organised collection of information that an AI system learns from. Think of it as a giant spreadsheet: each row is one example, and each column is a piece of information about that example.
The ImageNet dataset contains over 14 million hand-labelled images across more than 20,000 categories. It took researchers years and thousands of human annotators to build it.
AI works with two broad categories of data:
This is data that fits neatly into tables with rows and columns - like a bank transaction log or a hospital patient record. Each field has a clear type (number, date, category).
Examples: sales figures, sensor readings, survey responses.
This is data that does not follow a fixed format. It includes images, videos, audio recordings, emails, and social media posts. Over 80% of the world's data is unstructured.
Examples: photos on your phone, voice messages, news articles.
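The contrast is easy to see in code. Here is a minimal sketch (the field names and records are invented for illustration): structured data maps cleanly to rows with typed fields that a program can query directly, while unstructured data is just a raw blob whose meaning has to be extracted.

```python
# Structured: every record has the same named, typed fields.
transactions = [
    {"date": "2024-03-01", "amount": 42.50, "category": "groceries"},
    {"date": "2024-03-02", "amount": 9.99, "category": "transport"},
]

# Because the structure is known, we can query it directly.
total = sum(t["amount"] for t in transactions)
print(f"Total spent: {total:.2f}")  # Total spent: 52.49

# Unstructured: a free-form string with no fixed fields.
voice_note = "Hey, I spent about forty quid on groceries yesterday..."
# There is no voice_note["amount"] to read - extracting meaning from
# text like this is exactly the kind of work AI models are trained to do.
```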
Which of the following is an example of unstructured data?
Humans can learn to recognise a dog after seeing just a few pictures. AI typically needs thousands - sometimes millions - of examples before it can do the same. That is because AI has no built-in understanding of the world. Every piece of knowledge must come from the data.
The more varied and representative the data, the better the AI generalises to new situations. A model trained only on photos of golden retrievers might fail to recognise a poodle. Diversity in data leads to robustness in AI.
This is why companies invest heavily in collecting, cleaning, and curating large datasets - it is often the most time-consuming and expensive part of building an AI system.
The good news is that once a high-quality dataset exists, it can be reused and shared, accelerating research worldwide.
AI is only as good as the data it learns from. If the data is messy, incomplete, or incorrect, the AI will produce unreliable results. This principle is known as garbage in, garbage out (GIGO).
Common data quality problems include:
- Missing values - fields left blank or lost during collection.
- Incorrect labels - examples tagged with the wrong answer.
- Duplicates - the same record appearing more than once, skewing what the AI sees.
- Outdated information - data that no longer reflects the real world.
Imagine you are studying for an exam, but half your textbook pages are missing and some answers in the back are wrong. How well would you perform? That is exactly what happens when AI trains on poor-quality data.
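A toy illustration of "garbage in, garbage out" (the records and cleaning rules here are invented for this sketch): before training, obviously broken rows - missing fields, impossible values - are filtered out, because a model would otherwise learn from them as if they were true.

```python
# Invented loan-application records; some are clearly broken.
raw_data = [
    {"age": 34, "label": "approved"},
    {"age": None, "label": "approved"},  # missing value
    {"age": -5, "label": "rejected"},    # impossible value
    {"age": 29, "label": None},          # missing label
    {"age": 51, "label": "rejected"},
]

def is_clean(record):
    """Keep only records with a plausible age and a label."""
    return (
        record["age"] is not None
        and 0 <= record["age"] <= 120
        and record["label"] is not None
    )

clean_data = [r for r in raw_data if is_clean(r)]
print(f"Kept {len(clean_data)} of {len(raw_data)} records")  # Kept 2 of 5 records
```

Real-world data cleaning is far more involved than this, but the principle is the same: the model only ever sees what survives this step.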
Some datasets have become famous in the AI community:
| Dataset | What It Contains | Size |
|---------|-----------------|------|
| ImageNet | Labelled photographs | 14 million+ images |
| Common Crawl | Web pages from across the internet | Petabytes of text |
| Wikipedia | Encyclopaedia articles | 60 million+ articles |
| MNIST | Handwritten digits (0–9) | 70,000 images |
These datasets are freely available and have been used to train some of the most influential AI models in history.
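To make this concrete: each MNIST example is a tiny 28×28 grayscale image paired with the digit it shows. Here is a sketch of how one such example could be represented in plain Python (the pixel values are invented for illustration):

```python
# One MNIST-style example: a 28x28 grid of pixel brightness values
# (0 = black, 255 = white) plus the digit it represents.
image = [[0] * 28 for _ in range(28)]  # start with an all-black image
image[10][14] = 255                    # light up one pixel for illustration
label = 7                              # the digit this image shows

print(f"Image is {len(image)}x{len(image[0])}, labelled as digit {label}")
```

A dataset like MNIST is simply 70,000 of these image-and-label pairs.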
What does the MNIST dataset contain?
Before AI can learn from data, someone usually needs to label it. This means telling the system what each example represents.
This process is called annotation, and it is often done by humans - sometimes thousands of them working together.
Many AI companies use crowd-sourcing platforms where workers around the world label data for just a few pence per item. It is a massive global effort that most people never see.
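Annotation produces pairs of raw input plus a human-assigned label. A minimal sketch (the filenames and labels are invented) of what labelled data might look like, including one common strategy - majority vote - for when several annotators label the same item and disagree:

```python
from collections import Counter

# Each image was labelled by three hypothetical annotators.
annotations = {
    "img_001.jpg": ["cat", "cat", "dog"],
    "img_002.jpg": ["dog", "dog", "dog"],
    "img_003.jpg": ["cat", "cat", "cat"],
}

# Take the most common vote as the final label for each image.
final_labels = {
    name: Counter(votes).most_common(1)[0][0]
    for name, votes in annotations.items()
}
print(final_labels)
# {'img_001.jpg': 'cat', 'img_002.jpg': 'dog', 'img_003.jpg': 'cat'}
```

Disagreement between annotators is itself useful information: items that humans cannot agree on are often genuinely ambiguous.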
Data reflects the world it comes from - including the world's prejudices. When a dataset over-represents one group or under-represents another, the AI trained on it will inherit those imbalances.
Real examples of data bias:
- Facial recognition systems trained mostly on lighter-skinned faces have been found to misidentify darker-skinned faces far more often.
- A recruiting tool trained on a decade of mostly male hiring decisions learned to downgrade CVs associated with women.
Bias is not just a technical problem - it is a social one. Every dataset carries the assumptions and blind spots of the people who created it.
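One simple way to spot imbalance is to count how often each group appears in the data before training. A sketch with made-up labels, echoing the golden retriever example from earlier:

```python
from collections import Counter

# Hypothetical training labels for a dog-recognition dataset.
labels = ["golden_retriever"] * 950 + ["poodle"] * 30 + ["bulldog"] * 20

counts = Counter(labels)
total = len(labels)
for breed, n in counts.most_common():
    print(f"{breed}: {n} ({n / total:.0%})")

# golden_retriever dominates at 95%: a model trained on this data will
# likely do well on golden retrievers and poorly on everything else.
```

Counting is not a full bias audit, but it catches the most obvious problem: a group the model has barely seen is a group it will barely recognise.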
If you trained an AI to recommend restaurants but only gave it data from London, would it give good recommendations for someone in Manchester? What might it get wrong?
What is the most likely consequence of training an AI on biased data?
Next up, we will look at the algorithms that actually process all this data and turn it into intelligence.