AI Branches • Intermediate • ⏱️ 35 min read

Computer Vision Basics — Teaching Machines to See

How Do Machines See? 👀

You glance at a photo and instantly recognise a dog, a car, or your best friend's face. Your brain does this effortlessly — but for a computer, "seeing" is an incredibly complex task.

Computer vision is the field of AI that gives machines the ability to interpret and understand visual information from the world — images, videos, and live camera feeds.

[Image: a photo of a dog being broken down into pixels, features, and a label]
Computer vision transforms raw pixels into meaningful understanding

From Pixels to Features 🔍

To a computer, an image is just a grid of numbers.

What a Pixel Is

  • A grayscale image is a 2D grid where each cell holds a value from 0 (black) to 255 (white)
  • A colour image has three layers (channels): Red, Green, and Blue — stacked on top of each other
  • A typical smartphone photo (12 MP) contains 12 million pixels — that's 36 million numbers for a colour image!
# Loading an image and viewing its pixel values
from PIL import Image
import numpy as np

img = Image.open("dog.jpg")
pixels = np.array(img)

print(f"Image shape: {pixels.shape}")
# Output example: (480, 640, 3) → 480 rows, 640 columns, 3 colour channels

print(f"Total pixel values: {pixels.size:,}")
# Output example: 921,600

# A single pixel in the top-left corner
print(f"Top-left pixel (R, G, B): {pixels[0, 0]}")
# Output example: [142, 178, 225] — a light blue sky pixel

Detecting Features

Raw pixel values are meaningless on their own. The AI needs to detect features โ€” meaningful patterns:

  • Low-level features: edges, corners, colour gradients
  • Mid-level features: textures, shapes, parts (eyes, wheels, windows)
  • High-level features: entire objects (face, car, building)

Think of it like reading: first you learn letters (low-level), then words (mid-level), then sentences and meaning (high-level).

🤯

The human visual cortex processes images in layers too! Neurons near your eyes detect simple edges and colours, while deeper brain regions recognise complex objects and faces. CNNs were literally inspired by this biological architecture.


Convolutional Neural Networks (CNNs) 🧠

The Convolutional Neural Network is the workhorse of computer vision. Let's build an intuition for how it works — no heavy maths required.

The Key Idea: Sliding Filters

Imagine placing a small magnifying glass (say, 3×3 pixels) on an image and sliding it across every position. At each position, the filter looks for a specific pattern — a vertical edge, a horizontal line, a curve.

Each filter produces a new image called a feature map that highlights where that pattern was found.
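The sliding-filter idea fits in a few lines of NumPy. The tiny image, the vertical-edge filter, and the `convolve2d` helper below are all invented for illustration (real libraries provide optimised versions of this loop):

```python
import numpy as np

# A toy 6x6 grayscale "image": dark left half, bright right half
image = np.array([
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
], dtype=float)

# A 3x3 filter that responds where brightness rises left-to-right
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

def convolve2d(img, k):
    """Slide the filter over every position and record its response."""
    kh, kw = k.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

feature_map = convolve2d(image, kernel)
print(feature_map)
```

The resulting 4×4 feature map is strongly positive only in the middle columns, exactly where the dark-to-bright edge sits, and zero over the flat regions: the filter has "found" the vertical edge.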

CNN Layers (Visual Intuition)

| Layer | What It Does | Analogy |
|-------|--------------|---------|
| Convolutional | Slides filters across the image to detect features | A detective examining every inch of a crime scene with a magnifying glass |
| Activation (ReLU) | Keeps only the strong signals, removes noise | Highlighting key evidence and ignoring distractions |
| Pooling | Shrinks the feature maps while keeping important info | Summarising a chapter into bullet points |
| Fully Connected | Combines all features to make a final decision | The jury deliberates and reaches a verdict |

# Building a simple CNN with TensorFlow/Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # First convolutional layer: 32 filters, 3x3 each
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    MaxPooling2D(pool_size=(2, 2)),

    # Second convolutional layer: 64 filters
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),

    # Flatten and classify
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),  # 10 classes
])

model.summary()
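As a sanity check, the spatial sizes that `model.summary()` reports can be reproduced by hand. With Keras's default `padding='valid'`, each 3×3 convolution trims the height and width by 2, and each 2×2 pooling halves them (rounding down):

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution (no padding)."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after max pooling (floor division)."""
    return size // pool

size = 64                        # input is 64x64
size = pool_out(conv_out(size))  # 64 -> 62 -> 31
size = pool_out(conv_out(size))  # 31 -> 29 -> 14
flattened = size * size * 64     # 64 channels after the second conv layer
print(flattened)                 # 12544 features entering the Dense layers
```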
💡

You don't need to design filters by hand! During training, the CNN learns what filters are most useful. Early layers might learn to detect edges, while deeper layers learn to detect ears, eyes, or entire faces — all automatically from data.


The Three Tasks of Computer Vision 🎯

1. Image Classification

Question: "What is in this image?"

The model assigns a single label to the entire image: "cat", "car", "pizza".

This is what you learned about in AI Sprouts — give the model an image, get a label back.

2. Object Detection

Question: "What objects are here, and where are they?"

The model draws bounding boxes around each object and labels them. This is what self-driving cars use to locate pedestrians, traffic signs, and other vehicles.

3. Image Segmentation

Question: "Which exact pixels belong to each object?"

The model colours every pixel with its object class. This gives pixel-perfect outlines — critical for medical imaging (tumour boundaries) and AR filters (separating your face from the background).
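A segmentation mask is easy to picture as data: a 2D array the same shape as the image, where each cell holds a class id instead of a colour. The toy mask and class numbering below are invented for illustration:

```python
import numpy as np

# Toy 4x5 mask; hypothetical classes: 0 = background, 1 = dog, 2 = ball
mask = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 0],
])

# Pixel-perfect outlines fall out of simple comparisons
dog_region = (mask == 1)
print(dog_region.sum())  # 4 pixels belong to the dog

# Per-class pixel counts
classes, counts = np.unique(mask, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))  # {0: 14, 1: 4, 2: 2}
```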

# Using a pre-trained object detection model
# This uses YOLO (You Only Look Once) — one of the fastest detectors

# Pseudocode for clarity
def detect_objects(image_path):
    """Detect and label objects in an image."""
    model = load_pretrained_model("yolov8")
    image = load_image(image_path)

    results = model.predict(image)

    for detection in results:
        print(f"Object: {detection.label}")
        print(f"  Confidence: {detection.confidence:.1%}")
        print(f"  Location: {detection.bounding_box}")
        print()

# Example output:
# Object: person
#   Confidence: 97.2%
#   Location: (120, 50, 340, 480)
# Object: bicycle
#   Confidence: 89.5%
#   Location: (200, 300, 450, 520)
🤔
Think about it:

Self-driving cars need to detect objects in real time — processing around 30 frames per second. That means the AI has roughly 33 milliseconds to analyse each frame. How do you think engineers balance accuracy with speed? What happens if the model is 99.9% accurate but processes 1 million frames per day?
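To make the numbers in the question concrete (the frame rate and daily volume are the figures from the paragraph above):

```python
# Time budget per frame at 30 frames per second
fps = 30
budget_ms = 1000 / fps
print(f"{budget_ms:.1f} ms per frame")  # about 33.3 ms

# Even 99.9% accuracy leaves many mistakes at scale
frames_per_day = 1_000_000
error_rate = 1 - 0.999
errors_per_day = frames_per_day * error_rate
print(f"{errors_per_day:.0f} wrong frames per day")  # about 1,000
```

A thousand misjudged frames per day is why safety-critical systems combine several sensors and add sanity checks on top of any single model's output.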


Real-World Applications 🌍

🚗 Self-Driving Cars

  • Multiple cameras feed images into CV models simultaneously
  • The AI detects lanes, traffic signs, pedestrians, and other vehicles
  • Combines computer vision with LiDAR (laser sensing) and radar for 360° awareness
  • Companies: Waymo, Tesla, Cruise

🧑 Facial Recognition

  • Maps unique facial features (distance between eyes, jaw shape, nose bridge)
  • Used for phone unlocking (Face ID), airport security, and photo organisation
  • Controversy: surveillance concerns, racial bias in accuracy, consent issues

📱 AR Filters (Snapchat / Instagram)

  • Detects facial landmarks in real-time (68+ points on your face)
  • Tracks your face as it moves and overlays digital content
  • Uses lightweight models optimised to run on mobile processors

๐Ÿญ Manufacturing Quality Control

  • Cameras on assembly lines detect defective products
  • Spots scratches, dents, or misalignments faster than human inspectors
  • Runs 24/7 without fatigue
🤯

The YOLO (You Only Look Once) object detection model can process over 45 images per second on a modern GPU. Its creator, Joseph Redmon, eventually stopped his research over ethical concerns about how the technology might be used for surveillance and military applications.


Hands-On: Understanding a Pre-Trained Model's Predictions 🔬

You don't need to train a model from scratch to experiment with computer vision. Pre-trained models like ResNet, MobileNet, or EfficientNet have already learned from millions of images.

# Using a pre-trained model to classify an image
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import (
    preprocess_input, decode_predictions
)
from tensorflow.keras.preprocessing import image
import numpy as np

# Load pre-trained MobileNetV2 (trained on ImageNet — 1000 classes)
model = MobileNetV2(weights='imagenet')

# Load and preprocess an image
img = image.load_img("photo.jpg", target_size=(224, 224))
img_array = preprocess_input(
    np.expand_dims(image.img_to_array(img), axis=0)
)

# Get predictions
predictions = model.predict(img_array)
top_3 = decode_predictions(predictions, top=3)[0]

print("Top 3 Predictions:")
for rank, (class_id, label, confidence) in enumerate(top_3, 1):
    bar = "█" * int(confidence * 40)
    print(f"  {rank}. {label}: {confidence:.1%} {bar}")

# Example output:
#   1. golden_retriever: 92.3% ████████████████████████████████████
#   2. Labrador_retriever: 4.1% █
#   3. cocker_spaniel: 1.8%

What to Explore

  1. Try different images โ€” how does the model handle unusual angles or lighting?
  2. Test edge cases โ€” what happens with blurry images, drawings, or optical illusions?
  3. Check confidence โ€” does high confidence always mean the prediction is correct?
  4. Look for biases โ€” does the model work equally well on diverse subjects?
💡

MobileNetV2 is specifically designed to run on mobile phones. It has only 3.4 million parameters — compared to ResNet-152's 60 million — while maintaining impressive accuracy. This is achieved through "depthwise separable convolutions," a clever architectural trick that reduces computation.
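The saving from depthwise separable convolutions can be checked with back-of-the-envelope arithmetic. Bias terms are ignored, and the 3×3 layer with 32 input and 64 output channels is an arbitrary example, not a specific MobileNetV2 layer:

```python
k, c_in, c_out = 3, 32, 64  # kernel size, input channels, output channels

# Standard convolution: every one of the c_out filters spans all input channels
standard = k * k * c_in * c_out          # 18,432 weights

# Depthwise separable: one k x k filter per input channel,
# then a 1x1 "pointwise" convolution to mix channels
separable = k * k * c_in + c_in * c_out  # 288 + 2,048 = 2,336 weights

print(f"Reduction: {standard / separable:.1f}x")  # roughly 7.9x fewer weights
```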


The Ethics of Computer Vision ⚖️

With great power comes great responsibility:

  • Surveillance: Facial recognition enables mass surveillance without consent
  • Bias: Studies show many CV systems are less accurate for women and people of colour
  • Deepfakes: AI can generate fake but realistic images and videos of real people
  • Privacy: Cameras are everywhere — street CCTV, doorbells, drones — and AI makes it easy to identify individuals at scale

The technology itself is neutral — it's how we choose to deploy it that matters.


Quick Recap 🎯

  1. Computers see images as grids of numbers — pixels with Red, Green, and Blue values
  2. CNNs use sliding filters to automatically detect features from edges to objects
  3. The three core tasks are classification (what?), detection (what and where?), and segmentation (pixel-level labels)
  4. Real applications include self-driving cars, facial recognition, AR filters, and quality control
  5. Pre-trained models like MobileNetV2 let you experiment without training from scratch
  6. Ethics — bias, surveillance, deepfakes, and privacy — must guide how we deploy computer vision

What's Next? 🚀

You've now explored three powerful branches of AI: healthcare applications, natural language processing, and computer vision. These are the foundations that real AI engineers build on every day. In upcoming lessons, we'll dive into AI in creative arts — how machines generate music, art, and stories!

Lesson 3 of 30
โ†Chatbots and NLP โ€” Teaching Machines to Understand Language๐Ÿ•๏ธ AI Canopyโ†’