AI Chips and Hardware Accelerators
== Remembering ==

* '''GPU (Graphics Processing Unit)''' – A processor originally designed for parallel graphics rendering, repurposed for AI because its thousands of parallel cores are well suited to matrix operations.
* '''TPU (Tensor Processing Unit)''' – Google's custom ASIC designed specifically for neural network computation; optimized for the matrix multiplications that dominate deep learning.
* '''ASIC (Application-Specific Integrated Circuit)''' – A chip designed for one specific application rather than general-purpose computation; offers maximum efficiency but no flexibility.
* '''FLOPS (Floating Point Operations Per Second)''' – A measure of computational throughput. Modern AI training requires petaFLOPS (10^15) to exaFLOPS (10^18) of sustained compute (see the compute sketch after this list).
* '''VRAM (Video RAM)''' – The memory on a GPU board; a critical constraint for AI, since a model must fit its active tensors in VRAM. Modern AI GPUs carry 24 GB to 192 GB (see the footprint sketch below).
* '''Memory bandwidth''' – The rate at which data moves between VRAM and the compute cores; often the primary bottleneck in inference (see the bandwidth sketch below).
* '''Tensor Core''' – Specialized compute units in NVIDIA GPUs that perform mixed-precision matrix multiplications at much higher throughput than standard CUDA cores.
* '''HBM (High Bandwidth Memory)''' – Stacked memory used in high-end AI chips such as the A100 and H100; provides much higher bandwidth than GDDR6.
* '''NVLink''' – NVIDIA's high-speed interconnect for multi-GPU systems; provides GPU-to-GPU bandwidth far exceeding PCIe.
* '''Data parallelism''' – Splitting a training batch across multiple GPUs; each GPU computes gradients on its shard and the gradients are averaged (see the averaging sketch below).
* '''Model parallelism''' – Splitting the model itself across multiple GPUs when no single GPU has enough memory to hold it.
* '''Mixed precision training''' – Using FP16 or BF16 for computation while keeping FP32 master weights for accumulation; reduces memory use and increases throughput (see the master-weights sketch below).
* '''Quantization''' – Reducing the numerical precision of model weights (e.g., INT8, INT4) for more efficient inference (see the INT8 sketch below).
* '''TFLOPS''' – TeraFLOPS, i.e. 10^12 FLOPS; a common unit for AI chip performance. An H100 delivers roughly 1,000 TFLOPS of dense FP16 compute (about 2,000 with structured sparsity).
* '''MFU (Model FLOPs Utilization)''' – The fraction of a chip's theoretical peak FLOPS actually achieved during training; a standard measure of training efficiency (computed in the compute sketch below).
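To make the FLOPS and MFU entries concrete, here is a back-of-envelope sketch in Python. It assumes the widely used rule of thumb that training a dense transformer costs roughly 6 × parameters × tokens FLOPs; the model size, token count, cluster size, run length, and peak-FLOPS figure are illustrative assumptions, not values from this article.

<syntaxhighlight lang="python">
# Back-of-envelope training compute and MFU, assuming the common
# ~6 * params * tokens FLOPs rule of thumb for dense transformers.
params = 70e9        # assumed model size: 70B parameters
tokens = 2e12        # assumed training set: 2T tokens
train_flops = 6 * params * tokens             # total FLOPs, ~8.4e23

peak_flops = 1e15    # assumed ~1,000 TFLOPS dense FP16 per GPU
n_gpus = 1024        # assumed cluster size
wall_seconds = 30 * 24 * 3600                 # assumed 30-day run

achieved = train_flops / (n_gpus * wall_seconds)   # FLOPs/s per GPU
mfu = achieved / peak_flops                        # fraction of peak
print(f"total compute: {train_flops:.2e} FLOPs")
print(f"MFU: {mfu:.1%}")
</syntaxhighlight>

With these assumed numbers the run sustains about a third of peak, which is why MFU, not peak TFLOPS alone, is the figure of merit for a training system.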
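A minimal sketch of the VRAM constraint from the glossary: weights alone set the floor for inference, while mixed-precision Adam training commonly needs around 16 bytes per parameter once gradients, FP32 master weights, and optimizer moments are counted. The 7B model size is an assumption, and activation memory is ignored.

<syntaxhighlight lang="python">
# Rough VRAM footprint per GPU, ignoring activations and framework overhead.
params = 7e9                      # assumed 7B-parameter model

bytes_fp16 = 2
inference_gb = params * bytes_fp16 / 1e9      # FP16 weights only
# Mixed-precision Adam training commonly needs ~16 bytes/param:
# 2 (FP16 weights) + 2 (FP16 grads) + 4 (FP32 master) + 8 (Adam m, v)
train_gb = params * 16 / 1e9

print(f"inference weights: {inference_gb:.0f} GB")   # ~14 GB
print(f"training state:    {train_gb:.0f} GB")       # ~112 GB
</syntaxhighlight>

The same model that serves comfortably on a 24 GB card can exceed a single 80 GB GPU in training, which is what pushes teams toward model parallelism.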
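A sketch of why memory bandwidth often bounds inference: at batch size 1, generating each token reads every weight once, so decoding speed cannot exceed bandwidth divided by model size in bytes. The model size and the ~3.35 TB/s figure (H100 SXM HBM3) are assumptions for illustration.

<syntaxhighlight lang="python">
# Upper bound on single-stream decoding speed when inference is
# memory-bandwidth bound: every generated token reads all weights once.
params = 70e9            # assumed 70B model
bytes_per_param = 2      # FP16 weights
bandwidth = 3.35e12      # assumed ~3.35 TB/s (H100 SXM HBM3)

model_bytes = params * bytes_per_param
max_tokens_per_s = bandwidth / model_bytes
print(f"bandwidth-bound ceiling: {max_tokens_per_s:.1f} tokens/s")
</syntaxhighlight>

No amount of extra compute raises this ceiling; only more bandwidth, smaller weights (quantization), or batching does.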
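A toy numpy simulation of the gradient-averaging step in data parallelism. The explicit average below stands in for the all-reduce that real multi-GPU frameworks perform; the linear model, batch, and shard sizes are made up for illustration.

<syntaxhighlight lang="python">
import numpy as np

# Toy data parallelism: each "worker" holds a shard of the batch,
# computes a local gradient, and the gradients are averaged.
rng = np.random.default_rng(0)
w = rng.normal(size=4)                      # shared model weights
X = rng.normal(size=(8, 4))                 # full batch of 8 examples
y = X @ np.array([1.0, -2.0, 0.5, 3.0])     # synthetic targets

def local_grad(w, X_shard, y_shard):
    # Gradient of mean squared error on this worker's shard.
    err = X_shard @ w - y_shard
    return 2 * X_shard.T @ err / len(y_shard)

shards = np.split(np.arange(8), 4)          # 4 workers, 2 examples each
grads = [local_grad(w, X[i], y[i]) for i in shards]
avg_grad = np.mean(grads, axis=0)           # the "all-reduce" step

full_grad = local_grad(w, X, y)             # single-GPU reference
assert np.allclose(avg_grad, full_grad)
</syntaxhighlight>

With equal-sized shards, the average of the per-shard gradients equals the full-batch gradient, which is why data parallelism reproduces single-GPU training up to communication cost.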
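A small numpy illustration of why mixed precision keeps an FP32 master copy of the weights: an update far below FP16's resolution near the weight's magnitude rounds away when applied in FP16, but accumulates correctly in FP32. The step size and loop count are arbitrary.

<syntaxhighlight lang="python">
import numpy as np

# FP16 spacing near 1.0 is ~9.8e-4, so a 1e-4 update rounds away;
# the FP32 master copy accumulates it without loss.
w_fp16 = np.float16(1.0)
w_fp32 = np.float32(1.0)
update = 1e-4                    # a small gradient step

for _ in range(100):
    w_fp16 = np.float16(w_fp16 + np.float16(update))  # rounds back to 1.0
    w_fp32 = np.float32(w_fp32 + np.float32(update))  # accumulates

print(w_fp16)   # still 1.0: every update was lost
print(w_fp32)   # ~1.01: all 100 updates applied
</syntaxhighlight>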
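Finally, a minimal sketch of symmetric per-tensor INT8 quantization, the simplest form of the technique in the quantization entry. Production schemes typically add per-channel scales and calibration data; the random weights here are placeholders.

<syntaxhighlight lang="python">
import numpy as np

# Symmetric per-tensor INT8 quantization: map floats in
# [-max|w|, +max|w|] onto the integers [-127, 127].
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=1024).astype(np.float32)  # fake weights

scale = np.abs(w).max() / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # quantize
w_hat = q.astype(np.float32) * scale                         # dequantize

err = np.abs(w - w_hat).max()
print(f"max reconstruction error: {err:.2e} (scale {scale:.2e})")
</syntaxhighlight>

Each weight now occupies 1 byte instead of 2 or 4, halving or quartering both the VRAM footprint and the bytes moved per token in the bandwidth bound above.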