AI Chips and Hardware Accelerators
== Remembering ==

* '''GPU (Graphics Processing Unit)''' – A processor originally designed for parallel graphics rendering, repurposed for AI because its thousands of parallel cores are well suited to matrix operations.
* '''TPU (Tensor Processing Unit)''' – Google's custom ASIC designed specifically for neural network computation; optimized for the matrix multiplications that dominate deep learning.
* '''ASIC (Application-Specific Integrated Circuit)''' – A chip designed for one specific application rather than general-purpose computation; offers maximum efficiency but no flexibility.
* '''FLOPS (Floating Point Operations Per Second)''' – A measure of computational throughput. Modern AI training requires petaFLOPS (10^15) to exaFLOPS (10^18) of sustained compute (see the compute sketch after this list).
* '''VRAM (Video RAM)''' – The memory on a GPU board; a critical constraint for AI, since a model must fit its active tensors in VRAM. Modern AI GPUs carry 24 GB to 192 GB (see the footprint sketch below).
* '''Memory bandwidth''' – The rate at which data moves between VRAM and the compute cores; often the primary bottleneck in inference (see the bandwidth sketch below).
* '''Tensor Core''' – Specialized compute units in NVIDIA GPUs that perform mixed-precision matrix multiplications at much higher throughput than standard CUDA cores.
* '''HBM (High Bandwidth Memory)''' – Stacked memory used in high-end AI chips such as the A100 and H100; provides much higher bandwidth than GDDR6.
* '''NVLink''' – NVIDIA's high-speed interconnect for multi-GPU systems; provides GPU-to-GPU bandwidth far exceeding PCIe.
* '''Data parallelism''' – Splitting a training batch across multiple GPUs; each GPU computes gradients on its shard and the gradients are averaged (see the averaging sketch below).
* '''Model parallelism''' – Splitting the model itself across multiple GPUs when no single GPU has enough memory to hold it.
* '''Mixed precision training''' – Using FP16 or BF16 for computation while keeping FP32 master weights for accumulation; reduces memory use and increases throughput (see the master-weights sketch below).
* '''Quantization''' – Reducing the numerical precision of model weights (e.g., INT8, INT4) for more efficient inference (see the INT8 sketch below).
* '''TFLOPS''' – TeraFLOPS, i.e. 10^12 FLOPS; a common unit for AI chip performance. An H100 delivers roughly 1,000 TFLOPS of dense FP16 compute (about 2,000 with structured sparsity).
* '''MFU (Model FLOPs Utilization)''' – The fraction of a chip's theoretical peak FLOPS actually achieved during training; a standard measure of training efficiency (computed in the compute sketch below).
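To make the FLOPS and MFU entries concrete, here is a back-of-envelope sketch in Python. It assumes the widely used rule of thumb that training a dense transformer costs roughly 6 × parameters × tokens FLOPs; the model size, token count, cluster size, run length, and peak-FLOPS figure are illustrative assumptions, not values from this article.

<syntaxhighlight lang="python">
# Back-of-envelope training compute and MFU, assuming the common
# ~6 * params * tokens FLOPs rule of thumb for dense transformers.
params = 70e9        # assumed model size: 70B parameters
tokens = 2e12        # assumed training set: 2T tokens
train_flops = 6 * params * tokens             # total FLOPs, ~8.4e23

peak_flops = 1e15    # assumed ~1,000 TFLOPS dense FP16 per GPU
n_gpus = 1024        # assumed cluster size
wall_seconds = 30 * 24 * 3600                 # assumed 30-day run

achieved = train_flops / (n_gpus * wall_seconds)   # FLOPs/s per GPU
mfu = achieved / peak_flops                        # fraction of peak
print(f"total compute: {train_flops:.2e} FLOPs")
print(f"MFU: {mfu:.1%}")
</syntaxhighlight>

With these assumed numbers the run sustains about a third of peak, which is why MFU, not peak TFLOPS alone, is the figure of merit for a training system.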
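A minimal sketch of the VRAM constraint from the glossary: weights alone set the floor for inference, while mixed-precision Adam training commonly needs around 16 bytes per parameter once gradients, FP32 master weights, and optimizer moments are counted. The 7B model size is an assumption, and activation memory is ignored.

<syntaxhighlight lang="python">
# Rough VRAM footprint per GPU, ignoring activations and framework overhead.
params = 7e9                      # assumed 7B-parameter model

bytes_fp16 = 2
inference_gb = params * bytes_fp16 / 1e9      # FP16 weights only
# Mixed-precision Adam training commonly needs ~16 bytes/param:
# 2 (FP16 weights) + 2 (FP16 grads) + 4 (FP32 master) + 8 (Adam m, v)
train_gb = params * 16 / 1e9

print(f"inference weights: {inference_gb:.0f} GB")   # ~14 GB
print(f"training state:    {train_gb:.0f} GB")       # ~112 GB
</syntaxhighlight>

The same model that serves comfortably on a 24 GB card can exceed a single 80 GB GPU in training, which is what pushes teams toward model parallelism.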
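A sketch of why memory bandwidth often bounds inference: at batch size 1, generating each token reads every weight once, so decoding speed cannot exceed bandwidth divided by model size in bytes. The model size and the ~3.35 TB/s figure (H100 SXM HBM3) are assumptions for illustration.

<syntaxhighlight lang="python">
# Upper bound on single-stream decoding speed when inference is
# memory-bandwidth bound: every generated token reads all weights once.
params = 70e9            # assumed 70B model
bytes_per_param = 2      # FP16 weights
bandwidth = 3.35e12      # assumed ~3.35 TB/s (H100 SXM HBM3)

model_bytes = params * bytes_per_param
max_tokens_per_s = bandwidth / model_bytes
print(f"bandwidth-bound ceiling: {max_tokens_per_s:.1f} tokens/s")
</syntaxhighlight>

No amount of extra compute raises this ceiling; only more bandwidth, smaller weights (quantization), or batching does.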
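A toy numpy simulation of the gradient-averaging step in data parallelism. The explicit average below stands in for the all-reduce that real multi-GPU frameworks perform; the linear model, batch, and shard sizes are made up for illustration.

<syntaxhighlight lang="python">
import numpy as np

# Toy data parallelism: each "worker" holds a shard of the batch,
# computes a local gradient, and the gradients are averaged.
rng = np.random.default_rng(0)
w = rng.normal(size=4)                      # shared model weights
X = rng.normal(size=(8, 4))                 # full batch of 8 examples
y = X @ np.array([1.0, -2.0, 0.5, 3.0])     # synthetic targets

def local_grad(w, X_shard, y_shard):
    # Gradient of mean squared error on this worker's shard.
    err = X_shard @ w - y_shard
    return 2 * X_shard.T @ err / len(y_shard)

shards = np.split(np.arange(8), 4)          # 4 workers, 2 examples each
grads = [local_grad(w, X[i], y[i]) for i in shards]
avg_grad = np.mean(grads, axis=0)           # the "all-reduce" step

full_grad = local_grad(w, X, y)             # single-GPU reference
assert np.allclose(avg_grad, full_grad)
</syntaxhighlight>

With equal-sized shards, the average of the per-shard gradients equals the full-batch gradient, which is why data parallelism reproduces single-GPU training up to communication cost.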
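A small numpy illustration of why mixed precision keeps an FP32 master copy of the weights: an update far below FP16's resolution near the weight's magnitude rounds away when applied in FP16, but accumulates correctly in FP32. The step size and loop count are arbitrary.

<syntaxhighlight lang="python">
import numpy as np

# FP16 spacing near 1.0 is ~9.8e-4, so a 1e-4 update rounds away;
# the FP32 master copy accumulates it without loss.
w_fp16 = np.float16(1.0)
w_fp32 = np.float32(1.0)
update = 1e-4                    # a small gradient step

for _ in range(100):
    w_fp16 = np.float16(w_fp16 + np.float16(update))  # rounds back to 1.0
    w_fp32 = np.float32(w_fp32 + np.float32(update))  # accumulates

print(w_fp16)   # still 1.0: every update was lost
print(w_fp32)   # ~1.01: all 100 updates applied
</syntaxhighlight>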
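Finally, a minimal sketch of symmetric per-tensor INT8 quantization, the simplest form of the technique in the quantization entry. Production schemes typically add per-channel scales and calibration data; the random weights here are placeholders.

<syntaxhighlight lang="python">
import numpy as np

# Symmetric per-tensor INT8 quantization: map floats in
# [-max|w|, +max|w|] onto the integers [-127, 127].
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=1024).astype(np.float32)  # fake weights

scale = np.abs(w).max() / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # quantize
w_hat = q.astype(np.float32) * scale                         # dequantize

err = np.abs(w - w_hat).max()
print(f"max reconstruction error: {err:.2e} (scale {scale:.2e})")
</syntaxhighlight>

Each weight now occupies 1 byte instead of 2 or 4, halving or quartering both the VRAM footprint and the bytes moved per token in the bandwidth bound above.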