Computer Vision


How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.

Computer Vision (CV) is the field of artificial intelligence that enables machines to interpret and understand visual information from the world — images, videos, and other visual data. It is one of the most mature AI disciplines, with applications spanning medical imaging, autonomous driving, surveillance, augmented reality, and industrial quality control. Modern computer vision is dominated by deep learning approaches, particularly convolutional neural networks and vision transformers.

Remembering

  • Pixel — The smallest unit of a digital image, containing color information (RGB channels with values 0–255).
  • Image classification — Assigning a label to an entire image (e.g., "cat" or "dog").
  • Object detection — Identifying and localizing multiple objects within an image using bounding boxes.
  • Semantic segmentation — Classifying every pixel in an image with a class label (e.g., sky, road, person).
  • Instance segmentation — Like semantic segmentation but distinguishing individual object instances.
  • Convolution — A mathematical operation that slides a small filter (kernel) across an image to produce a feature map highlighting learned patterns.
  • Kernel/Filter — A small matrix of learnable weights applied during convolution (e.g., 3×3 or 5×5).
  • Pooling — A downsampling operation that reduces spatial dimensions, retaining important features (max pooling, average pooling).
  • Feature map — The output of a convolution layer representing activations at each spatial position.
  • Bounding box — A rectangle (x, y, width, height) used to localize an object in an image.
  • IoU (Intersection over Union) — A metric measuring overlap between predicted and ground-truth bounding boxes (a minimal sketch follows this list).
  • Anchor boxes — Predefined bounding box shapes used in object detection models like YOLO and Faster R-CNN.
  • Data augmentation — Artificially increasing training set diversity through transformations: flipping, rotation, cropping, color jitter.
  • Transfer learning — Using a model pre-trained on a large dataset (e.g., ImageNet) as a starting point for a new task.
  • ResNet — Residual Network; a CNN architecture with skip connections that enabled training of very deep networks.
  • Vision Transformer (ViT) — A transformer applied directly to patches of an image rather than using convolutions.
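
As a quick check on the IoU definition above, here is a minimal sketch in plain Python, assuming the (x, y, width, height) box format used above:

<syntaxhighlight lang="python">
def iou(box_a, box_b):
    """Intersection over Union for two boxes in (x, y, width, height) format."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width/height of the intersection rectangle (zero if the boxes are disjoint)
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 10, 10)))  # 0.1428... (25 / 175)
</syntaxhighlight>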

Understanding

Human vision processes images in a hierarchical, parallel manner. CNNs were designed to mimic this: early layers detect simple features (edges, colors), middle layers combine these into shapes and textures, and later layers assemble these into object representations.

The key insight behind convolutions is translation equivariance through parameter sharing. A filter that detects a horizontal edge in the top-left corner of an image should also detect it anywhere else. Sharing weights across spatial positions means the network learns this once, not for every possible location, vastly reducing parameters compared to a fully connected network.
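
The saving is easy to verify. A sketch using PyTorch, where the 224×224 RGB input and 64 output channels are illustrative choices:

<syntaxhighlight lang="python">
import torch.nn as nn

# Convolution: 64 filters of size 3x3 over 3 input channels.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))  # 1792 = 64*3*3*3 weights + 64 biases

# A fully connected layer mapping a 224x224 RGB image to the same-sized
# 64-channel output would need (3*224*224) * (64*224*224) weights,
# far too large to allocate, so we just count them:
print(3 * 224 * 224 * 64 * 224 * 224)  # ~4.8e11 parameters
</syntaxhighlight>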

Receptive field: Each neuron in a deeper layer "sees" a larger portion of the original image. Stacking convolution layers increases the receptive field, allowing the network to integrate information over larger regions.
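
The growth is simple to compute. A small helper (a sketch; the kernel and stride values below are illustrative):

<syntaxhighlight lang="python">
def receptive_field(layers):
    """Receptive field of stacked layers, given (kernel_size, stride) per layer.
    Each layer adds (kernel - 1) * jump, where jump is the product of all
    preceding strides."""
    r, jump = 1, 1
    for kernel, stride in layers:
        r += (kernel - 1) * jump
        jump *= stride
    return r

print(receptive_field([(3, 1), (3, 1)]))          # 5: two 3x3 convs see a 5x5 region
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: stride-2 pooling doubles later growth
</syntaxhighlight>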

Residual connections (ResNets) solved the degradation problem: adding identity skip connections (output = F(x) + x) made it possible to train networks of 100+ layers by giving gradients a shortcut path backward, preventing them from vanishing.
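
A minimal sketch of such a block in PyTorch (simplified relative to the real ResNet basic block, which also handles channel and stride changes):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified basic block: output = F(x) + x via an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)  # the "+ x" gives gradients a shortcut path

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
</syntaxhighlight>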

Vision Transformers treat an image as a sequence of patches (e.g., 16×16 pixel patches), apply position embeddings, and process them with multi-head self-attention. This allows global context from the start — every patch attends to every other — unlike CNNs where large receptive fields only emerge deep in the network.
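
The patch-to-sequence step is compact in code. A sketch assuming ViT-Base dimensions (16×16 patches, width 768, 12 heads), with position embeddings stubbed out:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# A stride-16 convolution patchifies and embeds in one step: one output
# position per 16x16 patch.
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

img = torch.randn(1, 3, 224, 224)
patches = patch_embed(img)                   # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): a sequence of patch tokens
pos_embed = torch.zeros(1, 196, 768)         # a learned nn.Parameter in a real ViT
tokens = tokens + pos_embed

# From the very first layer, every patch attends to every other patch.
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
out, weights = attn(tokens, tokens, tokens)
print(out.shape, weights.shape)  # (1, 196, 768) and (1, 196, 196)
</syntaxhighlight>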

Applying

Image classification with a pre-trained ResNet (PyTorch):

<syntaxhighlight lang="python">
import torch
import torchvision.transforms as transforms
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image

# Load pre-trained model
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.eval()

# Preprocessing pipeline (ImageNet normalization statistics)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Load the image and add a batch dimension
img = Image.open("image.jpg").convert("RGB")
input_tensor = preprocess(img).unsqueeze(0)

with torch.no_grad():
    output = model(input_tensor)

probabilities = torch.nn.functional.softmax(output[0], dim=0)
top5 = torch.topk(probabilities, 5)
</syntaxhighlight>
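
To turn the top-5 indices into readable labels, the ImageNet class names ship with the weights metadata:

<syntaxhighlight lang="python">
categories = ResNet50_Weights.IMAGENET1K_V2.meta["categories"]
for prob, idx in zip(top5.values, top5.indices):
    print(f"{categories[idx]}: {prob.item():.3f}")
</syntaxhighlight>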

Object detection model selection:

  • Speed-critical (edge/mobile) → YOLOv8-nano, MobileNet-SSD
  • Accuracy-critical → YOLOv8-x, Faster R-CNN with ResNet-101 backbone
  • Instance segmentation → Mask R-CNN, YOLOv8-seg
  • Medical imaging → U-Net (segmentation), EfficientDet
  • Aerial/satellite imagery → DOTA-trained detection models

Analyzing

Computer Vision Architecture Trade-offs

  Architecture               | Speed     | Accuracy  | Data Efficiency        | Interpretability
  Simple CNN                 | Fast      | Moderate  | Moderate               | Low
  ResNet-50                  | Moderate  | High      | Good                   | Low
  EfficientNet               | Fast      | Very high | Good                   | Low
  Vision Transformer (ViT)   | Slow      | Very high | Low (needs large data) | Moderate (attention maps)
  YOLO (real-time detection) | Very fast | High      | Moderate               | Low

Failure modes and edge cases:

  • Distribution shift — A model trained on daytime images may fail catastrophically at night or in fog. Always evaluate on data matching deployment conditions.
  • Adversarial examples — Imperceptible pixel perturbations can cause confident misclassification. Critical in security contexts.
  • Class imbalance — Rare but important classes (e.g., a specific disease) are underrepresented. Use weighted loss, oversampling, or focal loss (a minimal sketch of focal loss follows this list).
  • Shortcut learning — Models learn spurious correlations (e.g., "cows are in green fields") rather than the actual discriminative features.
  • Label noise — Incorrect annotations in training data systematically bias learning. Perform label quality audits on a random sample.
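
A minimal sketch of binary focal loss, the last mitigation named above (α = 0.25 and γ = 2.0 follow the defaults in Lin et al., 2017):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples so training focuses
    on hard (often rare-class) samples."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # per-class weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

logits = torch.tensor([4.0, -4.0, 0.1])   # two easy examples, one hard
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))        # dominated by the hard example
</syntaxhighlight>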

Evaluating

Expert CV practitioners use a layered evaluation strategy:

mAP (mean Average Precision): The standard metric for object detection, computed as the mean of per-class average precision (AP) at one or more IoU thresholds. mAP@0.5:0.95, the COCO benchmark standard, averages AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
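
For intuition, a simplified sketch of per-class AP without interpolation (COCO's implementation additionally interpolates precision and matches detections to ground truth at each IoU threshold; the is_tp flags below assume that matching has already been done):

<syntaxhighlight lang="python">
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Non-interpolated AP for one class at one IoU threshold."""
    order = np.argsort(-np.asarray(scores))  # rank detections by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    precision = np.cumsum(tp) / (np.arange(len(tp)) + 1)
    # Average the precision observed at each true positive over all ground truths
    return float((precision * tp).sum() / num_gt)

# Three detections, two ground-truth objects; the middle one is a false positive.
print(average_precision([0.9, 0.8, 0.7], [1, 0, 1], num_gt=2))  # 0.833...
</syntaxhighlight>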

Confusion matrix analysis: Which classes are being confused with each other? Is the model confusing "pedestrian" with "cyclist" at long range? This analysis reveals actionable improvements.
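
A sketch with scikit-learn (class names and predictions are illustrative):

<syntaxhighlight lang="python">
from sklearn.metrics import confusion_matrix

y_true = ["pedestrian", "cyclist", "pedestrian", "car", "cyclist"]
y_pred = ["pedestrian", "pedestrian", "pedestrian", "car", "cyclist"]
cm = confusion_matrix(y_true, y_pred, labels=["pedestrian", "cyclist", "car"])
print(cm)  # rows = true class, columns = predicted class
# [[2 0 0]
#  [1 1 0]   <- one cyclist predicted as pedestrian
#  [0 0 1]]
</syntaxhighlight>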

Calibration of confidence scores: Object detectors output confidence scores alongside predictions. A well-calibrated detector's 0.9-confidence detections should be correct ~90% of the time. Temperature scaling is a standard post-hoc fix.
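
Temperature scaling itself is a one-parameter optimization. A sketch for a classifier head on held-out logits (the same idea extends to detector confidence scores):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels):
    """Learn a scalar T minimizing NLL on held-out data. Calibrated
    probabilities are softmax(logits / T); T > 1 softens overconfident
    outputs without changing the argmax prediction."""
    T = torch.ones(1, requires_grad=True)
    opt = torch.optim.LBFGS([T], lr=0.01, max_iter=200)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / T, val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return T.item()

# Synthetic overconfident logits: the fitted T comes out well above 1.
logits, labels = torch.randn(500, 10) * 8, torch.randint(0, 10, (500,))
print(fit_temperature(logits, labels))
</syntaxhighlight>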

Latency profiling: FPS (frames per second) on the target hardware. A 95% accurate model running at 5 FPS on a drone is less useful than a 90% accurate model running at 30 FPS. Use TensorRT, ONNX, or CoreML for hardware-specific optimization.
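
A rough measurement harness (a sketch; the batch-of-1, 640×640 input is an assumption, and only timing on the deployment device gives real numbers):

<syntaxhighlight lang="python">
import time
import torch

def measure_fps(model, input_shape=(1, 3, 640, 640), warmup=10, runs=100):
    """Rough throughput estimate; model and tensor assumed on the same device."""
    model.eval()
    x = torch.randn(*input_shape)
    with torch.no_grad():
        for _ in range(warmup):       # warm up caches and lazy initialization
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # GPU kernels are async; sync before timing
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return runs / (time.perf_counter() - start)
</syntaxhighlight>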

Expert practitioners also perform sliced evaluation — measuring performance separately across subgroups (day/night, near/far objects, occluded/visible) to surface hidden disparities between overall and subgroup accuracy.
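
The bookkeeping for sliced evaluation is simple. A sketch (the tags and values are illustrative):

<syntaxhighlight lang="python">
from collections import defaultdict

def sliced_accuracy(predictions, labels, slices):
    """Accuracy per subgroup; `slices` holds one tag (e.g., 'night') per sample."""
    hits, totals = defaultdict(int), defaultdict(int)
    for pred, label, tag in zip(predictions, labels, slices):
        totals[tag] += 1
        hits[tag] += int(pred == label)
    return {tag: hits[tag] / totals[tag] for tag in totals}

print(sliced_accuracy(
    predictions=[1, 1, 0, 1, 0, 0],
    labels=[1, 1, 1, 1, 0, 1],
    slices=["day", "day", "night", "day", "night", "night"],
))  # {'day': 1.0, 'night': 0.33}: a gap the aggregate accuracy (0.67) hides
</syntaxhighlight>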

Creating

Designing a production computer vision system:

1. Define task taxonomy precisely

  • Classification only? Detection? Segmentation? Pose estimation?
  • Single class or multi-class/multi-label?
  • Real-time constraint? (determines architecture choice)

2. Data pipeline architecture

<syntaxhighlight lang="text">
Raw images
    ↓
[Quality filtering: blur detection, resolution check]
    ↓
[Annotation: LabelStudio / CVAT / Roboflow]
    ↓
[Augmentation: Albumentations pipeline]
    ↓
[Train/Val/Test split: stratified by class]
    ↓
[Model training + experiment tracking]
    ↓
[Evaluation: mAP, confusion matrix, latency]
    ↓
[Deployment: ONNX export → serving]
    ↓
[Monitoring: prediction distribution drift, confidence degradation]
</syntaxhighlight>

3. Active learning loop for continuous improvement (the selection step is sketched after this list):

  • Deploy model, collect low-confidence predictions
  • Route uncertain samples to human labelers
  • Retrain on expanded dataset
  • Evaluate and deploy if metrics improve
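
A minimal sketch of the selection step (the loader format and the least-confidence strategy are assumptions; margin- and entropy-based selection are common alternatives):

<syntaxhighlight lang="python">
import torch

def select_for_labeling(model, unlabeled_loader, budget=100, threshold=0.6):
    """Route the least-confident predictions to human annotators."""
    model.eval()
    candidates = []
    with torch.no_grad():
        for sample_ids, images in unlabeled_loader:   # assumed (ids, batch) pairs
            probs = torch.softmax(model(images), dim=1)
            confidences, _ = probs.max(dim=1)         # top-class confidence
            for sample_id, conf in zip(sample_ids, confidences.tolist()):
                if conf < threshold:
                    candidates.append((conf, sample_id))
    candidates.sort(key=lambda c: c[0])               # least confident first
    return [sample_id for _, sample_id in candidates[:budget]]
</syntaxhighlight>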

4. Key tooling

  • Training: PyTorch + Lightning, Ultralytics (YOLO)
  • Annotation: Roboflow, CVAT, Label Studio
  • Augmentation: Albumentations
  • Serving: Triton Inference Server, TorchServe, ONNX Runtime