<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
AI chips and hardware accelerators are the physical foundation on which all modern artificial intelligence runs. The AI revolution would not exist without specialized hardware capable of performing the massive matrix multiplications and tensor operations that deep learning requires, at speeds impossible on conventional CPUs. From NVIDIA's GPUs to Google's TPUs to Apple's Neural Engine to custom inference ASICs, the hardware landscape is evolving rapidly and has become a strategic axis of competition in the AI industry. Understanding AI hardware is essential for anyone deploying AI systems at scale.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''GPU (Graphics Processing Unit)''' – A processor originally designed for parallel graphics rendering, repurposed for AI because its thousands of parallel cores are ideal for matrix operations.
* '''TPU (Tensor Processing Unit)''' – Google's custom ASIC designed specifically for neural network computation; optimized for the matrix multiply operations in deep learning.
* '''ASIC (Application-Specific Integrated Circuit)''' – A chip designed for a specific application rather than general computation; offers maximum efficiency but no flexibility.
* '''FLOPS (Floating Point Operations Per Second)''' – A measure of computational throughput; modern AI training requires petaFLOPS (10^15) to exaFLOPS (10^18) of compute.
* '''VRAM (Video RAM)''' – The dedicated memory of a GPU; a critical constraint for AI, since a model must fit its active tensors in VRAM. Modern AI GPUs have 24 GB–192 GB.
* '''Memory bandwidth''' – The rate at which data can be transferred between VRAM and the compute cores; often the primary bottleneck in inference.
* '''Tensor Core''' – Specialized compute units within NVIDIA GPUs that perform matrix multiplications in mixed precision at much higher throughput than standard CUDA cores.
* '''HBM (High Bandwidth Memory)''' – Stacked memory used in high-end AI chips (A100, H100); provides much higher bandwidth than GDDR6.
* '''NVLink''' – NVIDIA's high-speed interconnect for multi-GPU systems; provides GPU-to-GPU bandwidth far exceeding PCIe.
* '''Data parallelism''' – Splitting a training batch across multiple GPUs; each GPU computes gradients on its shard and the gradients are averaged.
* '''Model parallelism''' – Splitting a model across multiple GPUs when a single GPU does not have enough memory.
* '''Mixed precision training''' – Using FP16 or BF16 for computation while keeping FP32 master copies for weight updates; reduces memory use and increases throughput.
* '''Quantization''' – Reducing the numerical precision of model weights (INT8, INT4) for more efficient inference.
* '''TFLOPS''' – TeraFLOPS (10^12 FLOPS); a common unit for AI chip performance. An H100 delivers roughly 2,000 TFLOPS of FP16 compute (with structured sparsity).
* '''MFU (Model FLOPs Utilization)''' – The fraction of a chip's theoretical peak FLOPS actually achieved during training; a measure of training efficiency.
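
A quick back-of-the-envelope calculation ties several of these terms together (VRAM, mixed precision, quantization). The sketch below is plain, framework-agnostic Python; the 7-billion-parameter model size is only an illustrative assumption, not a figure taken from this article.

<syntaxhighlight lang="python">
# Back-of-the-envelope sketch: how the precision choices above translate into
# VRAM needed just to hold a model's weights. The 7B model size is illustrative.

BYTES_PER_PARAM = {
    "FP32": 4,        # full precision
    "FP16/BF16": 2,   # half / mixed precision
    "INT8": 1,        # 8-bit quantization
    "INT4": 0.5,      # 4-bit quantization
}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """GB of memory needed to store the weights alone at a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

num_params = 7e9  # assumed 7-billion-parameter model

for precision in BYTES_PER_PARAM:
    print(f"{precision:>9}: {weight_memory_gb(num_params, precision):5.1f} GB of weights")

# FP32      : 28.0 GB  -> does not fit a 24 GB consumer GPU
# FP16/BF16 : 14.0 GB  -> fits, with room left for activations and KV cache
# INT8      :  7.0 GB
# INT4      :  3.5 GB
</syntaxhighlight>

Note that this counts weights only; activations, gradients, and optimizer states add substantially more during training.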
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
Traditional CPUs are designed for sequential, low-latency computation with sophisticated branch prediction, out-of-order execution, and large caches – ideal for general-purpose code with complex control flow. Deep learning has the opposite profile: it is embarrassingly parallel, consisting of millions of independent multiply-add operations of the same simple kind, performed billions of times per second.

'''The matrix multiply insight''': The core operation in a neural network layer is Y = XW, where X is the activation matrix and W is the weight matrix. For a layer with 4,096 inputs and 4,096 outputs processing a batch of 2,048, this is a 2048×4096 matrix multiplied by a 4096×4096 matrix – roughly 34 billion multiply-add operations. A CPU works through this largely sequentially; a GPU with 10,000+ CUDA cores executes it in thousands of parallel streams.

'''Why GPUs became the AI chip''': In 2012, Krizhevsky, Sutskever, and Hinton trained AlexNet on NVIDIA GTX 580 GPUs, achieving a breakthrough result on ImageNet. This demonstrated that GPU training was not just feasible but transformative – a trend that has only accelerated since.

'''The memory wall''': Modern AI chips can compute faster than they can be fed data from memory. The H100 can deliver roughly 2 PFLOPS of FP16 compute (with sparsity), but its memory bandwidth is "only" 3.35 TB/s. For large transformer inference, most time is spent waiting for weights to stream from memory rather than computing. This '''memory bandwidth bottleneck''' drives design decisions in inference chips (large on-chip SRAM, HBM stacking).

'''The interconnect problem''': A single H100 has 80 GB of HBM. GPT-3 (175B parameters) requires ~350 GB just for its weights in FP16, so training requires a minimum of 8–16 GPUs. Connecting them with high-bandwidth NVLink (900 GB/s) rather than standard PCIe (~64 GB/s) changes training throughput dramatically, and multi-node training requires fast InfiniBand networking between GPU servers.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Profiling GPU utilization and identifying bottlenecks:'''
<syntaxhighlight lang="python">
import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

# Profile a model training step
model = MyTransformerModel().cuda()  # MyTransformerModel is a placeholder for your own model
optimizer = torch.optim.Adam(model.parameters())
inputs = torch.randn(32, 512).cuda()  # batch=32, seq_len=512

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for step in range(5):
        optimizer.zero_grad()
        loss = model(inputs).sum()
        loss.backward()
        optimizer.step()
        prof.step()

# Print top CUDA kernels by time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
</syntaxhighlight>

'''Enabling mixed precision training (automatic):'''
<syntaxhighlight lang="python">
from torch.cuda.amp import autocast, GradScaler
import torch

scaler = GradScaler()  # loss scaling is required for FP16; with BF16 it is optional but harmless

for batch, targets in dataloader:
    optimizer.zero_grad()
    with autocast(dtype=torch.bfloat16):  # BF16 on A100/H100 (preferred over FP16)
        output = model(batch)
        loss = criterion(output, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
</syntaxhighlight>
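
'''Estimating whether inference is memory-bandwidth-bound:''' the memory wall described in the Understanding section can be checked with a simple calculation before any profiling. The sketch below is a rough upper bound, not a measurement; the 70B model size is an illustrative assumption, and the 3.35 TB/s figure is the H100 bandwidth quoted elsewhere in this article.

<syntaxhighlight lang="python">
# Rough upper bound on single-stream LLM decode speed when every generated
# token must stream all weights from memory (the memory-wall regime).
# All inputs are illustrative assumptions, not measurements.

def decode_tokens_per_second(num_params: float, bytes_per_param: float,
                             memory_bandwidth_bytes_per_s: float) -> float:
    """Upper bound: each generated token reads every weight from memory once."""
    bytes_read_per_token = num_params * bytes_per_param
    return memory_bandwidth_bytes_per_s / bytes_read_per_token

# Example: a 70B-parameter model in FP16 on a GPU with ~3.35 TB/s of HBM bandwidth
bound = decode_tokens_per_second(num_params=70e9,
                                 bytes_per_param=2,               # FP16
                                 memory_bandwidth_bytes_per_s=3.35e12)
print(f"~{bound:.0f} tokens/s upper bound per stream")  # ~24 tokens/s

# Batching many requests amortizes the weight reads across streams, which is
# why continuous batching raises aggregate throughput far above this bound.
</syntaxhighlight>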

; AI chip selection guide
: '''Research / single GPU''' – NVIDIA RTX 4090 (24 GB, consumer) or RTX A6000 (48 GB, workstation)
: '''Production training''' – NVIDIA H100 (80 GB HBM3, 3.35 TB/s bandwidth), in clusters of 8–1,024
: '''Google Cloud training''' – TPU v4/v5 pods (purpose-built for large-scale training)
: '''Edge inference (mobile)''' – Apple Neural Engine (ANE), Qualcomm Hexagon DSP
: '''Edge inference (industrial)''' – NVIDIA Jetson Orin, Hailo-8, Google Coral Edge TPU
: '''Data center inference''' – NVIDIA H100/L40S, Groq LPU (ultra-low latency), Cerebras
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Key AI Chip Comparison (as of 2024)
! Chip !! Peak FP16 TFLOPS !! Memory !! Bandwidth !! Best For
|-
| NVIDIA H100 SXM || 1,979 (with sparsity) || 80 GB HBM3 || 3.35 TB/s || Large model training, top-end inference
|-
| NVIDIA A100 SXM || 312 || 80 GB HBM2e || 2.0 TB/s || Training workhorse (prior generation)
|-
| NVIDIA RTX 4090 || 165 || 24 GB GDDR6X || 1.0 TB/s || Consumer training, fine-tuning
|-
| Google TPU v4 || 275 || 32 GB HBM || 1.2 TB/s || Large-scale training (TPU pods)
|-
| AMD MI300X || 1,307 || 192 GB HBM3 || 5.3 TB/s || Memory-hungry inference, very large models
|-
| Groq LPU || 750 || 230 MB SRAM (on-chip) || 80 TB/s (on-chip) || Ultra-low-latency inference
|}

'''Failure modes and bottlenecks:'''
* '''VRAM OOM (out of memory)''' – The most common training failure. Activations, gradients, optimizer states, and model weights all compete for VRAM. Fixes: reduce batch size, enable gradient checkpointing, offload activations, or use model parallelism.
* '''Low GPU utilization''' – Compute cores sit idle while waiting for data. Causes: slow data loading (increase DataLoader workers, use DALI), small batch sizes, or communication overhead in multi-GPU setups. Use nvidia-smi or the PyTorch profiler to diagnose.
* '''Communication bottleneck in distributed training''' – In multi-node training, gradient synchronization across nodes over InfiniBand can dominate step time. Fixes: gradient compression, ZeRO-3 with CPU offload, or less frequent gradient synchronization.
* '''Thermal throttling''' – GPUs reduce clock speed when temperature exceeds a threshold; relevant for long training runs without adequate cooling.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Expert hardware evaluation goes beyond peak FLOPS numbers:

'''Roofline model analysis''': For a given model and hardware, determine whether the workload is compute-bound (FLOPS-limited) or memory-bandwidth-bound. The roofline model plots performance against operational intensity (FLOPs per byte moved), revealing the actual bottleneck.

'''MFU (Model FLOPs Utilization)''': The ratio of measured training throughput to the hardware's theoretical peak. For transformer training on H100s, well-tuned runs achieve 40–60% MFU; below 20% indicates significant efficiency problems. MFU is the standard training-efficiency metric in the research community (popularized by the PaLM paper). A minimal way to compute it is sketched below.

'''Total Cost of Ownership (TCO)''': Cloud GPU pricing varies dramatically by provider and contract. For large-scale training, comparing $/PFLOP-hour and factoring in spot/preemptible pricing is essential. The on-premise vs. cloud break-even point depends heavily on sustained utilization.

Expert practitioners also profile at the '''kernel level''' using Nsight Compute (NVIDIA) to understand exactly which operations are bottlenecked, enabling targeted optimization.
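
A minimal MFU sketch, using the common approximation that transformer training costs about 6 FLOPs per parameter per token. The model size, throughput, and peak-FLOPS numbers below are illustrative assumptions, not measurements from this article.

<syntaxhighlight lang="python">
# Minimal MFU (Model FLOPs Utilization) sketch.
# Assumes the widely used ~6 FLOPs / parameter / token estimate for the
# forward + backward pass of transformer training. All inputs are illustrative.

def model_flops_utilization(params: float, tokens_per_second: float,
                            peak_flops_per_second: float) -> float:
    """Fraction of theoretical peak FLOPS actually achieved during training."""
    achieved_flops = 6 * params * tokens_per_second  # ~6 FLOPs per param per token
    return achieved_flops / peak_flops_per_second

# Example: an assumed 7B-parameter model training at 12,000 tokens/s per GPU
# on a chip with an assumed ~989 TFLOPS of dense BF16 peak.
mfu = model_flops_utilization(params=7e9,
                              tokens_per_second=12_000,
                              peak_flops_per_second=989e12)
print(f"MFU = {mfu:.1%}")  # 51.0%
</syntaxhighlight>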
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing an AI compute infrastructure:

'''1. Training cluster architecture'''
<syntaxhighlight lang="text">
[N × GPU servers]
 ├── Each server: 8× H100 GPUs connected with NVLink (900 GB/s per GPU)
 └── 2× InfiniBand 400 Gb/s NICs per server
        ↓
[InfiniBand fat-tree fabric: all-reduce at 400 Gb/s]
        ↓
[Shared storage: Lustre or GPFS parallel filesystem (> 1 TB/s)]
        ↓
[Job scheduler: SLURM or Kubernetes with GPU device plugin]
        ↓
[Training frameworks: PyTorch FSDP, DeepSpeed ZeRO-3, Megatron-LM]
</syntaxhighlight>

'''2. Parallelism strategy for large models'''
* Data parallelism (DP): split the batch across GPUs – works when the model fits on one GPU
* Tensor parallelism (TP): split each weight matrix across GPUs – reduces per-GPU memory
* Pipeline parallelism (PP): split model layers across GPUs – overlaps compute and communication
* ZeRO (Zero Redundancy Optimizer): shard optimizer states, gradients, and parameters across data-parallel ranks
* Typical recipe for 70B+ models: DP × TP × PP (3D parallelism) combined with ZeRO-1

'''3. Inference serving optimization'''
<syntaxhighlight lang="text">
Original FP32 model
        ↓
[Quantization: INT8 with a calibration dataset (bitsandbytes, GPTQ, AWQ)]
        ↓
[ONNX export → TensorRT optimization (layer fusion, kernel auto-tuning)]
        ↓
[vLLM serving: PagedAttention, continuous batching, speculative decoding]
        ↓
[Hardware: H100 for throughput-critical; Groq LPU for latency-critical]
        ↓
[Monitoring: GPU utilization, memory usage, latency p50/p95/p99, tokens/sec]
</syntaxhighlight>
</div>

[[Category:Artificial Intelligence]]
[[Category:AI Infrastructure]]
[[Category:Hardware]]