NVIDIA is aiming to dominate the inference stack with its next-gen Feynman chips, which could integrate LPUs into the architecture.
NVIDIA Could Use Hybrid Bonding With SRAM Dies For Inference, But There Are Several Implications
Team Green's IP licensing agreement for Groq's LPUs might sound like a moderate development when you look at the scope of the acquisition and the revenue figures involved, but in reality, NVIDIA intends to take the lead in the inference segment through LPUs as the industry shifts its metrics toward cost-per-million-tokens, something we have already discussed in extensive coverage here. Various proposals have surfaced on how NVIDIA plans to integrate LPUs; according to GPU expert AGF, LPUs might be stacked on next-gen Feynman GPUs through TSMC's hybrid bonding technology.
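For readers unfamiliar with the cost-per-million-tokens metric mentioned above, here is a minimal sketch of how it is typically derived from an accelerator's hourly cost and its token throughput. The function name and every number below are hypothetical and chosen purely for illustration; they are not NVIDIA or Groq figures.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Return the serving cost in USD per one million generated tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs: same hourly cost, two different decode throughputs.
baseline = cost_per_million_tokens(hourly_cost_usd=4.0, tokens_per_second=1000)
faster = cost_per_million_tokens(hourly_cost_usd=4.0, tokens_per_second=2000)

print(f"baseline: ${baseline:.2f} per million tokens")
print(f"faster:   ${faster:.2f} per million tokens")
```

Under these assumed inputs, doubling decode throughput at the same hourly cost halves the cost per million tokens, which is why faster decode hardware matters so much for this metric.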
The expert believes that the implementation could resemble what AMD has done with its X3D CPUs, which use TSMC's SoIC hybrid bonding technology to integrate 3D V-Cache tiles onto the main compute die. AGF argues that integrating SRAM monolithically into the die may not be the right move for Feynman GPUs, considering that SRAM scaling is limited, and building it on advanced nodes would waste high-end silicon and dramatically increase the cost per unit of wafer area used. Instead, AGF believes that NVIDIA will stack LPUs onto the Feynman compute die.
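AGF's cost argument can be illustrated with some back-of-the-envelope arithmetic. The sketch below is only an illustration under stated assumptions: the SRAM area, the shrink factor, and the relative cost per mm² are all hypothetical placeholders, not TSMC or NVIDIA data.

```python
def silicon_cost(area_mm2: float, relative_cost_per_mm2: float) -> float:
    """Relative silicon cost of a block, in arbitrary cost units."""
    return area_mm2 * relative_cost_per_mm2


# Hypothetical assumptions (illustrative only):
SRAM_AREA_MATURE_MM2 = 100.0   # SRAM footprint on a mature node
SRAM_SHRINK_ADVANCED = 0.95    # SRAM barely shrinks on the advanced node
COST_MATURE = 1.0              # relative cost per mm² on the mature node
COST_ADVANCED = 2.0            # relative cost per mm² on the advanced node

# Option 1: SRAM built monolithically on the advanced node.
monolithic = silicon_cost(SRAM_AREA_MATURE_MM2 * SRAM_SHRINK_ADVANCED, COST_ADVANCED)

# Option 2: SRAM on a separate, cheaper die stacked via hybrid bonding.
stacked = silicon_cost(SRAM_AREA_MATURE_MM2, COST_MATURE)

print(f"monolithic SRAM cost: {monolithic:.1f} units")
print(f"stacked SRAM cost:    {stacked:.1f} units")
```

With these made-up inputs, the small area saving from the advanced node does not come close to offsetting its higher cost per mm², which is the intuition behind keeping large SRAM banks on a separate die. The sketch ignores bonding, packaging, and yield costs, which would narrow the gap in practice.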
Now, the approach sounds sensible, considering that it would allow an advanced node like A16 (1.6nm) to be used for the main Feynman die, which contains the compute blocks (tensor units, control logic, etc.), while separate LPU dies would contain large SRAM banks. To connect these dies together, TSMC's hybrid bonding technology will prove crucial, as it enables a wide interface and lower energy per bit compared to off-package memory. To top it off, since A16 features backside power delivery, the front side would be freed up for vertical SRAM connections, ensuring a low-latency decode response.
However, this technique raises concerns about how NVIDIA will manage thermal limits, as stacking dies on top of compute that already operates at high density is a challenge in itself, and adding LPUs that focus on sustained throughput could create bottlenecks. More importantly, execution-level complications would also grow considerably with such an approach, since LPUs rely on a fixed execution order, which creates a conflict between determinism and flexibility.
Even if NVIDIA manages to resolve the hardware-level constraints, the primary concern is how CUDA would behave with LPU-style execution, which requires explicit memory placement, whereas CUDA kernels are designed around hardware abstraction. Integrating SRAM within AI architectures won't be an easy task for Team Green, as it would take an engineering marvel to ensure LPU-GPU environments are well-optimized. However, this might be the price NVIDIA is willing to pay if it wants to lead the inference segment.
