Computer Vision
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Computer Vision (CV) is the field of artificial intelligence that enables machines to interpret and understand visual information from the world – images, videos, and other visual data. It is one of the most mature AI disciplines, with applications spanning medical imaging, autonomous driving, surveillance, augmented reality, and industrial quality control. Modern computer vision is dominated by deep learning approaches, particularly convolutional neural networks and vision transformers.
</div>
__TOC__
<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Pixel''' – The smallest unit of a digital image, containing color information (RGB channels with values 0–255).
* '''Image classification''' – Assigning a label to an entire image (e.g., "cat" or "dog").
* '''Object detection''' – Identifying and localizing multiple objects within an image using bounding boxes.
* '''Semantic segmentation''' – Classifying every pixel in an image with a class label (e.g., sky, road, person).
* '''Instance segmentation''' – Like semantic segmentation, but distinguishing individual object instances.
* '''Convolution''' – A mathematical operation that slides a small filter (kernel) across an image to produce a feature map highlighting learned patterns.
* '''Kernel/Filter''' – A small matrix of learnable weights applied during convolution (e.g., 3×3 or 5×5).
* '''Pooling''' – A downsampling operation that reduces spatial dimensions while retaining important features (max pooling, average pooling).
* '''Feature map''' – The output of a convolution layer representing activations at each spatial position.
* '''Bounding box''' – A rectangle (x, y, width, height) used to localize an object in an image.
* '''IoU (Intersection over Union)''' – A metric measuring overlap between predicted and ground-truth bounding boxes.
* '''Anchor boxes''' – Predefined bounding box shapes used in object detection models like YOLO and Faster R-CNN.
* '''Data augmentation''' – Artificially increasing training set diversity through transformations: flipping, rotation, cropping, color jitter.
* '''Transfer learning''' – Using a model pre-trained on a large dataset (e.g., ImageNet) as a starting point for a new task.
* '''ResNet''' – Residual Network; a CNN architecture with skip connections that enabled training of very deep networks.
* '''Vision Transformer (ViT)''' – A transformer applied directly to patches of an image rather than using convolutions.
</div>
<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
Human vision processes images in a hierarchical, parallel manner. CNNs were designed to mimic this: early layers detect simple features (edges, colors), middle layers combine these into shapes and textures, and later layers assemble these into object representations.

The key insight behind convolutions is '''translation invariance and parameter sharing'''. A filter that detects a horizontal edge in the top-left corner of an image should also detect it anywhere else. Sharing weights across spatial positions means the network learns this once, not for every possible location – vastly reducing parameters compared to a fully connected network.

'''Receptive field''': Each neuron in a deeper layer "sees" a larger portion of the original image. Stacking convolution layers increases the receptive field, allowing the network to integrate information over larger regions.
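The parameter-sharing idea can be made concrete with a minimal NumPy sketch (illustrative only; the kernel and toy image are not from any library). One 3×3 edge filter, reused at every spatial position, fires on a horizontal edge wherever it appears – shift the edge and the response shifts with it:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with no padding and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Horizontal-edge kernel: responds when bright rows sit above dark rows
kernel = np.array([[ 1.0,  1.0,  1.0],
                   [ 0.0,  0.0,  0.0],
                   [-1.0, -1.0, -1.0]])

# 8x8 image with a bright-to-dark edge between rows 3 and 4
img = np.zeros((8, 8))
img[:4, :] = 1.0
fmap = conv2d_valid(img, kernel)

# Same edge shifted two rows lower: the same 9 weights fire two rows lower
img2 = np.zeros((8, 8))
img2[:6, :] = 1.0
fmap2 = conv2d_valid(img2, kernel)
```

The whole feature map is produced by nine shared weights, whereas a fully connected layer would need a separate weight per input-output pixel pair for each edge position it must recognize.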
'''Residual connections''' (ResNets) solved the degradation problem: simply adding identity skip connections (output = F(x) + x) allowed training networks of 100+ layers by giving gradients a shortcut path backward, preventing them from vanishing.

'''Vision Transformers''' treat an image as a sequence of patches (e.g., 16×16 pixel patches), apply position embeddings, and process them with multi-head self-attention. This allows global context from the start – every patch attends to every other – unlike CNNs, where large receptive fields only emerge deep in the network.
</div>
<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Image classification with a pre-trained ResNet (PyTorch):'''
<syntaxhighlight lang="python">
import torch
import torchvision.transforms as transforms
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image

# Load pre-trained model
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.eval()

# Preprocessing pipeline (ImageNet normalization statistics)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# convert("RGB") guards against grayscale/RGBA inputs breaking 3-channel normalization
img = Image.open("image.jpg").convert("RGB")
input_tensor = preprocess(img).unsqueeze(0)

with torch.no_grad():
    output = model(input_tensor)
probabilities = torch.nn.functional.softmax(output[0], dim=0)
top5 = torch.topk(probabilities, 5)
</syntaxhighlight>

; Object detection model selection
: '''Speed-critical (edge/mobile)''' – YOLOv8-nano, MobileNet-SSD
: '''Accuracy-critical''' – YOLOv8-x, Faster R-CNN with ResNet-101 backbone
: '''Instance segmentation''' – Mask R-CNN, YOLOv8-seg
: '''Medical imaging''' – U-Net (segmentation), EfficientDet
: '''Aerial/satellite imagery''' – DOTA-trained detection models
</div>
<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Computer Vision Architecture Trade-offs
! Architecture !! Speed !! Accuracy !! Data Efficiency !! Interpretability
|-
| Simple CNN || Fast || Moderate || Moderate || Low
|-
| ResNet-50 || Moderate || High || Good || Low
|-
| EfficientNet || Fast || Very high || Good || Low
|-
| Vision Transformer (ViT) || Slow || Very high || Low (needs large data) || Moderate (attention maps)
|-
| YOLO (real-time detection) || Very fast || High || Moderate || Low
|}

'''Failure modes and edge cases:'''
* '''Distribution shift''' – A model trained on daytime images may fail catastrophically at night or in fog. Always evaluate on data matching deployment conditions.
* '''Adversarial examples''' – Imperceptible pixel perturbations can cause confident misclassification. Critical in security contexts.
* '''Class imbalance''' – Rare but important classes (e.g., a specific disease) are underrepresented. Use weighted loss, oversampling, or focal loss.
* '''Shortcut learning''' – Models learn spurious correlations (e.g., "cows are in green fields") rather than the actual discriminative features.
* '''Label noise''' – Incorrect annotations in training data systematically bias learning. Perform label quality audits on a random sample.
</div>
<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Expert CV practitioners use a layered evaluation strategy:

'''mAP (mean Average Precision)''': The standard metric for object detection, computed as the mean of per-class average precision at one or more IoU thresholds. mAP@0.5:0.95 – averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05 – is the COCO benchmark standard.

'''Confusion matrix analysis''': Which classes are being confused with each other? Is the model confusing "pedestrian" with "cyclist" at long distances? This reveals actionable improvements.
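For reference, the IoU that these thresholds test against can be computed directly from two (x, y, width, height) boxes – the convention used in the Remembering section. This is an illustrative sketch, not the implementation of any particular evaluation library:

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x, y, width, height) boxes."""
    ax1, ay1 = box_a[0], box_a[1]
    ax2, ay2 = box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1 = box_b[0], box_b[1]
    bx2, by2 = box_b[0] + box_b[2], box_b[1] + box_b[3]

    # Overlap rectangle, clamped at zero when the boxes are disjoint
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0
```

A prediction offset from the ground truth by half a box width in both directions scores IoU = 25/175 ≈ 0.14, so it counts as a miss at the common 0.5 threshold even though the boxes visibly overlap.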
'''Calibration of confidence scores''': Object detectors output confidence scores alongside predictions. A well-calibrated detector's 0.9-confidence predictions should be correct ~90% of the time. Temperature scaling is a simple post-hoc calibration method.

'''Latency profiling''': Measure FPS (frames per second) on the target hardware. A 95%-accurate model running at 5 FPS on a drone is less useful than a 90%-accurate model running at 30 FPS. Use TensorRT, ONNX, or CoreML for hardware-specific optimization.

Expert practitioners also perform '''sliced evaluation''' – measuring performance separately across subgroups (day/night, near/far objects, occluded/visible) to surface disparities that the overall accuracy hides.
</div>
<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a production computer vision system:

'''1. Define the task taxonomy precisely'''
* Classification only? Detection? Segmentation? Pose estimation?
* Single class or multi-class/multi-label?
* Real-time constraint? (determines architecture choice)

'''2. Data pipeline architecture'''
<syntaxhighlight lang="text">
Raw images
  → [Quality filtering: blur detection, resolution check]
  → [Annotation: Label Studio / CVAT / Roboflow]
  → [Augmentation: Albumentations pipeline]
  → [Train/Val/Test split: stratified by class]
  → [Model training + experiment tracking]
  → [Evaluation: mAP, confusion matrix, latency]
  → [Deployment: ONNX export → serving]
  → [Monitoring: prediction distribution drift, confidence degradation]
</syntaxhighlight>

'''3. Active learning loop''' for continuous improvement:
* Deploy the model and collect low-confidence predictions
* Route uncertain samples to human labelers
* Retrain on the expanded dataset
* Evaluate, and deploy if metrics improve

'''4. Key tooling'''
* Training: PyTorch + Lightning, Ultralytics (YOLO)
* Annotation: Roboflow, CVAT, Label Studio
* Augmentation: Albumentations
* Serving: Triton Inference Server, TorchServe, ONNX Runtime

[[Category:Artificial Intelligence]]
[[Category:Computer Vision]]
[[Category:Deep Learning]]
</div>
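The active-learning loop above hinges on deciding which predictions to route to human labelers. A minimal sketch of that selection step (the confidence band and record format here are illustrative assumptions, not taken from any specific tool):

```python
def select_for_labeling(predictions, low=0.3, high=0.7):
    """Return ids of samples whose confidence falls in the uncertain band.

    Very low scores are usually background noise and very high scores are
    trusted, so human labeling effort is focused on the ambiguous middle.
    """
    return [p["id"] for p in predictions if low <= p["confidence"] <= high]

preds = [
    {"id": "img_001", "confidence": 0.95},  # confident: keep automated
    {"id": "img_002", "confidence": 0.55},  # ambiguous: send to a labeler
    {"id": "img_003", "confidence": 0.10},  # near-background: skip
]
uncertain = select_for_labeling(preds)  # ["img_002"]
```

In practice the band is tuned to the labeling budget: widening it catches more borderline cases at the cost of more annotation work per retraining cycle.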