You glance at a photo and instantly know it shows a dog on a beach. For a computer, that same image is nothing more than a giant grid of numbers. Computer vision is the branch of AI that teaches machines to extract meaning from those numbers - and it is already reshaping industries around you.
When you look at a photograph, your brain instantly recognises shapes, colours, and depth. A computer has none of that intuition. Instead, it works with raw numbers.
A digital image is a grid of pixels. Each pixel stores colour values - typically three channels: red, green, and blue (RGB). A 1920 × 1080 HD image contains over two million pixels, each with three values ranging from 0 to 255. Multiply those together and even a single frame contains over six million numbers.
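The arithmetic is easy to check. A minimal sketch using NumPy (the array name and all-black contents are just illustrative):

```python
import numpy as np

# An all-black HD frame: height x width x 3 colour channels,
# each value an unsigned 8-bit integer from 0 to 255
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)

print(frame.shape[0] * frame.shape[1])  # 2073600 pixels - over two million
print(frame.size)                       # 6220800 individual numbers
```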
Resolution determines how much detail the grid captures. Higher resolution means more pixels and richer detail - but also far more data for the AI to process. A 4K image has four times the pixels of HD, which means four times the computational cost.
Grayscale images have just one channel (brightness), while some specialised formats - like satellite imagery or medical scans - may have dozens of channels capturing wavelengths invisible to the human eye.
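The channel count is simply the last dimension of the pixel grid. As a sketch, here is the common way to collapse three RGB channels into one grayscale channel, using the standard ITU-R BT.601 luminance weights (the tiny 4 × 4 image is made up for illustration):

```python
import numpy as np

rgb = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Weighted sum of the three channels; green gets the largest weight
# because the human eye is most sensitive to it (BT.601 luma weights)
gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

print(rgb.shape)   # (4, 4, 3) - three channels
print(gray.shape)  # (4, 4)    - one channel
```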
The human eye can distinguish roughly 10 million colours. A standard 8-bit RGB image can represent over 16.7 million unique colour combinations - more than we can actually perceive!
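The 16.7 million figure is just 256 levels per channel, cubed:

```python
levels = 2 ** 8      # 8 bits per channel gives 256 brightness levels
print(levels ** 3)   # 16777216 unique RGB combinations
```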
Early attempts at computer vision relied on hand-crafted rules - "look for edges here, match this template there." These brittle approaches failed whenever the scene changed. Modern systems use Convolutional Neural Networks (CNNs), which learn their own rules from thousands of labelled examples.
Think of a CNN as an assembly line of pattern detectors, each layer building on the one before it: early layers pick out simple edges and colour blobs, middle layers combine those into textures and shapes, and the deepest layers assemble whole objects such as faces or wheels.
The beauty is that nobody programs these filters by hand. The network learns them during training, starting from random noise and gradually sharpening into useful detectors.
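To make "filter" concrete, here is a hand-written vertical-edge detector applied with a minimal 2-D convolution in pure NumPy (no CV library; CNN "convolution" is technically cross-correlation, so no kernel flip). First-layer filters learned by real CNNs often end up resembling kernels like this one:

```python
import numpy as np

def convolve2d(image, kernel):
    """Minimal 'valid' 2-D convolution: no padding, stride 1."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply the kernel against each image patch and sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A Sobel-style vertical-edge kernel: negative weights on the left,
# positive on the right, so it responds to dark-to-bright transitions
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

# A toy image that is dark on the left and bright on the right
image = np.zeros((5, 5))
image[:, 3:] = 255.0

response = convolve2d(image, kernel)
print(response)  # nonzero only where the window overlaps the edge
```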
When you learn to recognise a friend's face, you do not memorise every pixel - you pick up on key features like eye shape, hairstyle, and expression. CNNs do something remarkably similar. What features do you think a CNN would learn first?
Computer vision tackles three progressively harder tasks:
| Task | Question it answers | Example |
|------|---------------------|---------|
| Image classification | What is in this image? | "This X-ray shows pneumonia." |
| Object detection | What is in this image and where? | Drawing boxes around every pedestrian in a street scene. |
| Semantic segmentation | Which pixels belong to which object? | Colouring every pixel of a road, pavement, car, and sky differently. |
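One way to see the difference is in the shape of each task's output. A sketch (all labels, boxes, and sizes below are made up for illustration):

```python
import numpy as np

# Classification: one label for the whole image
classification = "pneumonia"

# Detection: a label plus a bounding box for every object found
detection = [
    {"label": "pedestrian", "box": (34, 50, 80, 120)},   # (x, y, width, height)
    {"label": "pedestrian", "box": (210, 48, 75, 118)},
]

# Segmentation: a class id for every single pixel of an HD frame
segmentation = np.zeros((1080, 1920), dtype=np.uint8)

print(len(detection))     # 2 objects located
print(segmentation.size)  # 2073600 pixel-level decisions
```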
Self-driving cars need all three simultaneously - classifying objects, locating them precisely, and understanding the full scene pixel by pixel.
Each task requires progressively more computational power and training data. Classification was largely solved by 2015; real-time segmentation on video remains an active area of research today.
Which computer vision task assigns a label to every individual pixel in an image?
Computer vision is already embedded in industries you might not expect:
Google's DeepMind developed an AI that can detect over 50 eye diseases from retinal scans as accurately as world-leading ophthalmologists - in seconds rather than weeks.
Computer vision is powerful, but it raises serious questions that society is still grappling with:
Imagine a school installs facial recognition cameras to take attendance automatically. What are the benefits? What could go wrong? Would you be comfortable with this system?
Why do some facial recognition systems perform worse on certain demographic groups?
In a CNN, what is the purpose of pooling layers?