You glance at a photo and instantly recognise a dog, a car, or your best friend's face. Your brain does this effortlessly, but for a computer, "seeing" is an incredibly complex task.
Computer vision is the field of AI that gives machines the ability to interpret and understand visual information from the world: images, videos, and live camera feeds.
To a computer, an image is just a grid of numbers.
# Loading an image and viewing its pixel values
from PIL import Image
import numpy as np
img = Image.open("dog.jpg")
pixels = np.array(img)
print(f"Image shape: {pixels.shape}")
# Output example: (480, 640, 3), i.e. 480 rows, 640 columns, 3 colour channels
print(f"Total pixel values: {pixels.size:,}")
# Output example: 921,600
# A single pixel in the top-left corner
print(f"Top-left pixel (R, G, B): {pixels[0, 0]}")
# Output example: [142 178 225], a light blue sky pixel
Raw pixel values are meaningless on their own. The AI needs to detect features: meaningful patterns in the numbers, such as edges, textures, and shapes.
Think of it like reading: first you learn letters (low-level), then words (mid-level), then sentences and meaning (high-level).
The human visual cortex processes images in layers too! Neurons early in the visual pathway detect simple edges and colours, while deeper brain regions recognise complex objects and faces. CNNs were directly inspired by this biological architecture.
The Convolutional Neural Network (CNN) is the workhorse of computer vision. Let's build an intuition for how it works, with no heavy maths required.
Imagine placing a small magnifying glass (say, 3×3 pixels) on an image and sliding it across every position. At each position, this filter looks for a specific pattern: a vertical edge, a horizontal line, a curve.
Each filter produces a new image called a feature map that highlights where that pattern was found.
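The sliding-filter idea fits in a few lines of NumPy. This is a hand-rolled sketch for intuition, not how a deep learning library actually implements convolution: a 3×3 vertical-edge filter slid across a tiny image produces a feature map that lights up exactly where the dark-to-bright boundary sits.

```python
import numpy as np

# A 6x6 "image": dark on the left, bright on the right
image = np.array([[0, 0, 0, 9, 9, 9]] * 6, dtype=float)

# A 3x3 filter that responds to vertical edges (dark-left, bright-right)
vertical_edge = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

# Slide the filter over every 3x3 patch and sum the element-wise products
h, w = image.shape
fh, fw = vertical_edge.shape
feature_map = np.zeros((h - fh + 1, w - fw + 1))
for i in range(feature_map.shape[0]):
    for j in range(feature_map.shape[1]):
        patch = image[i:i + fh, j:j + fw]
        feature_map[i, j] = np.sum(patch * vertical_edge)

print(feature_map)
# High responses (27) appear only in the columns where dark meets bright;
# everywhere else the filter outputs 0
```

Each position where the filter "fires" marks a detected vertical edge, which is exactly what the feature map records.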
| Layer | What It Does | Analogy |
|-------|-------------|---------|
| Convolutional | Slides filters across the image to detect features | A detective examining every inch of a crime scene with a magnifying glass |
| Activation (ReLU) | Keeps only the strong signals, removes noise | Highlighting key evidence and ignoring distractions |
| Pooling | Shrinks the feature maps while keeping important info | Summarising a chapter into bullet points |
| Fully Connected | Combines all features to make a final decision | The jury deliberates and reaches a verdict |
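The pooling row is easy to see concretely. Here is a minimal NumPy sketch of 2×2 max pooling (the same kind of operation `MaxPooling2D` performs): each 2×2 block of a feature map is summarised by its largest value, halving the width and height while keeping the strongest responses.

```python
import numpy as np

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 6, 1, 2],
    [7, 0, 5, 8],
    [2, 1, 3, 4],
], dtype=float)

# 2x2 max pooling: keep the strongest response in each 2x2 block
h, w = feature_map.shape
pooled = feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

print(pooled)
# [[6. 2.]
#  [7. 8.]]
```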
# Building a simple CNN with TensorFlow/Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
model = Sequential([
    # First convolutional layer: 32 filters, 3x3 each
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    MaxPooling2D(pool_size=(2, 2)),
    # Second convolutional layer: 64 filters
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    # Flatten and classify
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),  # 10 classes
])
model.summary()
You don't need to design filters by hand! During training, the CNN learns what filters are most useful. Early layers might learn to detect edges, while deeper layers learn to detect ears, eyes, or entire faces, all automatically from data.
Question: "What is in this image?"
The model assigns a single label to the entire image: "cat", "car", "pizza".
This is what you learned about in AI Sprouts: give the model an image, get a label back.
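Under the hood, the final softmax layer (like the one in the CNN above) turns raw scores into class probabilities, and the predicted label is simply the class with the highest probability. A minimal sketch with made-up scores and labels:

```python
import numpy as np

labels = ["cat", "car", "pizza"]      # hypothetical classes
logits = np.array([2.0, 0.5, -1.0])   # hypothetical raw scores from the network

# Softmax: exponentiate and normalise so the scores sum to 1
probs = np.exp(logits) / np.sum(np.exp(logits))

# The single label assigned to the image is the most probable class
prediction = labels[int(np.argmax(probs))]
print(prediction)  # cat
print(round(float(probs.max()), 3))
```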
Question: "What objects are here, and where are they?"
The model draws bounding boxes around each object and labels them. This is what self-driving cars use to locate pedestrians, traffic signs, and other vehicles.
Question: "Which exact pixels belong to each object?"
The model colours every pixel with its object class. This gives pixel-perfect outlines, critical for medical imaging (tumour boundaries) and AR filters (separating your face from the background).
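A segmentation output is just a 2-D array in which each entry is the class ID of that pixel, so measuring an object's exact pixel area becomes a simple counting operation. A toy sketch with made-up class IDs (0 = background, 1 = tumour):

```python
import numpy as np

# Toy segmentation mask: 0 = background, 1 = tumour (hypothetical classes)
mask = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
])

# Pixel-perfect area of each class
ids, counts = np.unique(mask, return_counts=True)
for class_id, count in zip(ids, counts):
    print(f"class {class_id}: {count} pixels")
# class 0: 12 pixels
# class 1: 4 pixels
```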
# Using a pre-trained object detection model
# This uses YOLO (You Only Look Once), one of the fastest detectors
# Pseudocode for clarity
def detect_objects(image_path):
    """Detect and label objects in an image."""
    model = load_pretrained_model("yolov8")
    image = load_image(image_path)
    results = model.predict(image)
    for detection in results:
        print(f"Object: {detection.label}")
        print(f"  Confidence: {detection.confidence:.1%}")
        print(f"  Location: {detection.bounding_box}")
        print()
# Example output:
# Object: person
# Confidence: 97.2%
# Location: (120, 50, 340, 480)
# Object: bicycle
# Confidence: 89.5%
# Location: (200, 300, 450, 520)
Self-driving cars need to detect objects in real time, processing around 30 frames per second. That means the AI has roughly 33 milliseconds to analyse each frame. How do you think engineers balance accuracy with speed? What happens if the model is 99.9% accurate but processes 1 million frames per day?
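The numbers in that question are worth working out. A back-of-the-envelope sketch using the figures quoted above:

```python
# 30 frames per second leaves about 33 ms of compute time per frame
frame_budget_ms = 1000 / 30
print(round(frame_budget_ms, 1))   # 33.3

# 99.9% accuracy still means 0.1% of frames are handled wrongly
frames_per_day = 1_000_000
errors_per_day = frames_per_day * (1 - 0.999)
print(int(round(errors_per_day)))  # 1000
```

A thousand mistakes a day is why safety-critical systems layer multiple sensors and checks on top of any single model.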
The YOLO (You Only Look Once) object detection model can process over 45 images per second on a modern GPU. Its creator, Joseph Redmon, eventually stopped his research over ethical concerns about how the technology might be used for surveillance and military applications.
You don't need to train a model from scratch to experiment with computer vision. Pre-trained models like ResNet, MobileNet, or EfficientNet have already learned from millions of images.
# Using a pre-trained model to classify an image
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import (
preprocess_input, decode_predictions
)
from tensorflow.keras.preprocessing import image
import numpy as np
# Load pre-trained MobileNetV2 (trained on ImageNet, 1000 classes)
model = MobileNetV2(weights='imagenet')
# Load and preprocess an image
img = image.load_img("photo.jpg", target_size=(224, 224))
img_array = preprocess_input(
np.expand_dims(image.img_to_array(img), axis=0)
)
# Get predictions
predictions = model.predict(img_array)
top_3 = decode_predictions(predictions, top=3)[0]
print("Top 3 Predictions:")
for rank, (class_id, label, confidence) in enumerate(top_3, 1):
    bar = "█" * round(confidence * 40)
    print(f"  {rank}. {label}: {confidence:.1%} {bar}")
# Example output:
# 1. golden_retriever: 92.3% █████████████████████████████████████
# 2. Labrador_retriever: 4.1% ██
# 3. cocker_spaniel: 1.8% █
MobileNetV2 is specifically designed to run on mobile phones. It has only 3.4 million parameters, compared to ResNet-152's 60 million, while maintaining impressive accuracy. This is achieved through "depthwise separable convolutions," a clever architectural trick that reduces computation.
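The saving from depthwise separable convolutions is easy to quantify. A sketch with example layer sizes (not MobileNetV2's exact configuration): a standard 3×3 convolution mapping 32 channels to 64 needs 3·3·32·64 weights, while the separable version needs only one 3×3 filter per input channel plus a 1×1 pointwise convolution.

```python
k, c_in, c_out = 3, 32, 64   # example kernel size and channel counts

# Standard convolution: every output channel looks at every input channel
standard = k * k * c_in * c_out              # 18,432 weights

# Depthwise (k*k per input channel) + pointwise (1x1 mixing channels)
separable = k * k * c_in + c_in * c_out      # 288 + 2,048 = 2,336 weights

print(standard, separable)                   # 18432 2336
print(f"Reduction: {standard / separable:.1f}x")  # Reduction: 7.9x
```

Roughly an 8x reduction per layer, which is what makes the architecture practical on a phone.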
With great power comes great responsibility: the same models that outline tumours can also power mass surveillance, and biased training data can make systems misidentify some groups more often than others.
The technology itself is neutral; it's how we choose to deploy it that matters.
You've now explored three powerful branches of AI: healthcare applications, natural language processing, and computer vision. These are the foundations that real AI engineers build on every day. In upcoming lessons, we'll dive into AI in creative arts: how machines generate music, art, and stories!