AI Chips and Hardware Accelerators
== Understanding ==

Traditional CPUs are designed for sequential, low-latency computation with sophisticated branch prediction, out-of-order execution, and large caches, which makes them ideal for general-purpose code with complex control flow. Deep learning has the opposite profile: it is embarrassingly parallel, consisting of millions of independent multiply-accumulate operations, the same simple operation repeated billions of times per second with almost no control flow.

'''The matrix multiply insight''': The core operation in a neural network layer is Y = XW, where X is the activation matrix and W is the weight matrix. For a layer with 4096 inputs and 4096 outputs processing a batch of 2048, this is a 2048×4096 matrix multiplied by a 4096×4096 matrix, roughly 34 billion multiply-add operations (see the worked sketch below). A CPU works through this largely sequentially; a GPU with 10,000+ CUDA cores does it in thousands of parallel streams.

'''Why GPUs became the AI chip''': In 2012, Krizhevsky, Sutskever, and Hinton trained AlexNet on NVIDIA GTX 580 GPUs, achieving a breakthrough on ImageNet. This demonstrated that GPU training was not just feasible but transformative, a trend that has only accelerated.

'''The memory wall''': Modern AI chips can compute faster than they can be fed data from memory. The H100 delivers roughly 2 PFLOPS of FP16 compute, but its memory bandwidth is "only" 3.35 TB/s. For large transformer inference, most of the time is spent waiting for weights to stream in from memory, not computing. This is the '''memory bandwidth bottleneck''', and it drives design decisions in inference chips (large on-chip SRAM, HBM stacking). A rough estimate of how lopsided this is appears in the second sketch below.

'''The interconnect problem''': A single H100 has 80 GB of HBM. GPT-3 (175B parameters) requires ~350 GB in FP16, so training requires 8–16 GPUs at minimum. Connecting them with high-bandwidth NVLink (900 GB/s) versus standard PCIe (64 GB/s) changes training throughput dramatically, and multi-node training requires fast InfiniBand networking between GPU servers.
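A minimal NumPy sketch of the layer computation described above. The batch size and layer dimensions are the illustrative figures from the text, not a benchmark of any particular chip:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative shapes from the example above: batch 2048, 4096 -> 4096 layer
batch, d_in, d_out = 2048, 4096, 4096

X = np.random.randn(batch, d_in).astype(np.float32)   # activations
W = np.random.randn(d_in, d_out).astype(np.float32)   # weights

Y = X @ W  # the core operation: Y = XW

# One multiply-add per (row, column, inner-dimension) triple
macs = batch * d_in * d_out
print(f"multiply-adds: {macs / 1e9:.1f} billion")        # ~34.4 billion
print(f"FLOPs (2 per multiply-add): {2 * macs / 1e12:.1f} TFLOPs")
</syntaxhighlight>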
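To see why large-model inference tends to be memory-bound, a back-of-the-envelope estimate is sketched below. The H100 figures come from the text above; the model size and single-token decoding scenario are assumptions for illustration:

<syntaxhighlight lang="python">
# Rough memory-bound vs compute-bound estimate for single-token decoding.
# Chip figures are from the text; the model size is a hypothetical example.
peak_flops = 2e15        # ~2 PFLOPS FP16 (H100, per the text)
mem_bw = 3.35e12         # 3.35 TB/s HBM bandwidth (H100, per the text)

params = 70e9            # hypothetical 70B-parameter model
weight_bytes = params * 2  # FP16: 2 bytes per parameter

# Generating one token at batch size 1 streams every weight once and
# performs roughly 2 FLOPs per parameter.
time_memory = weight_bytes / mem_bw          # time to read the weights
time_compute = (2 * params) / peak_flops     # time to do the math

print(f"memory-limited time per token:  {time_memory * 1e3:.1f} ms")
print(f"compute-limited time per token: {time_compute * 1e3:.3f} ms")
# Memory time dominates by orders of magnitude: the chip mostly waits on HBM.
</syntaxhighlight>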
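The parameter-memory and interconnect arithmetic can be reproduced the same way. Bandwidth and capacity figures are from the text; treating a per-step data exchange as one full-weight-sized transfer is a deliberate simplification, since real collectives use reduction trees and overlap with compute:

<syntaxhighlight lang="python">
# Parameter-memory and interconnect arithmetic from the text above.
params = 175e9                     # GPT-3 parameter count
weight_bytes = 2 * params          # FP16 weights: ~350 GB
hbm_per_gpu = 80e9                 # H100 HBM capacity

print(f"FP16 weights: {weight_bytes / 1e9:.0f} GB")
print(f"GPUs just to hold the weights: {weight_bytes / hbm_per_gpu:.1f}")

nvlink_bw = 900e9                  # 900 GB/s NVLink (per the text)
pcie_bw = 64e9                     # 64 GB/s PCIe (per the text)

# Simplified: time to move one full-weight-sized payload between devices.
print(f"transfer over NVLink: {weight_bytes / nvlink_bw:.2f} s")
print(f"transfer over PCIe:   {weight_bytes / pcie_bw:.2f} s")
</syntaxhighlight>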