NVIDIA Working To Embed Multiple GPUs and Multiple Layers of Stacked DRAM on A Single MCM Package – Up To 256 SMs and Multiple Terabytes Per Second of Bandwidth
NVIDIA recently announced its Volta GV100 GPU, the biggest graphics chip ever created. The company hit the practical limits of the latest process node when designing the monolithic GPU, which is aimed at the compute-intensive market.
NVIDIA Plans To Cram Several GPUs and Several Stacked DRAM Dies on a Single Package in the Future
NVIDIA currently has the two fastest GPU accelerators for the compute market: last year's Tesla P100, based on Pascal, and this year's Tesla V100, based on Volta. Both chips have one thing in common: they are as big as a chip can get on their respective process nodes. The Pascal GP100 GPU measured 610mm2, while the Volta GV100 GPU, despite moving to a 12nm process from TSMC, is roughly 34% larger at 815mm2. NVIDIA's CEO Jen-Hsun Huang revealed at GTC that this is the practical limit of what's possible with today's physics and that they cannot make a chip any denser or bigger than GV100 today.
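The die-size jump from Pascal to Volta can be checked with a quick back-of-the-envelope calculation using the figures above (the variable names are my own):

```python
# Die areas quoted in the article, in square millimetres.
gp100_mm2 = 610  # Pascal GP100
gv100_mm2 = 815  # Volta GV100

# Relative growth of the Volta die over the Pascal die.
growth = (gv100_mm2 - gp100_mm2) / gp100_mm2
print(f"GV100 is {growth:.1%} larger than GP100")  # -> GV100 is 33.6% larger than GP100
```

Even at a newer process node, a third more silicon per die is a substantial step toward the reticle limit Huang describes below.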
NVIDIA's CEO Jen-Hsun Huang, GTC 2017: "The part that is really shocking is this is reticle limits. Reticle limits basically means that it is at the limit of photolithography, meaning you can't make a chip any bigger."
NVIDIA is one of the biggest power players in the GPU industry, and in recent years it has held a very tight grip on the AI and deep learning market. The launch of Volta just a year after Pascal confirms the huge demand for its GPU-based accelerators from corporations hungry for compute products. But Jen-Hsun's quote from GTC 2017 may hint at where the company is headed after Volta. NVIDIA appears to see process node limits as a bottleneck: it wants to deliver even more performance in a short span of time, but the process only allows it to go so far.
This year, the Volta die size increased vastly over Pascal. There are a few reasons for that. Unlike AMD, NVIDIA focuses on implementing different cores for specialized tasks: Volta and Pascal have dedicated FP32 and FP64 cores, and Volta goes further by housing dedicated Tensor cores, which accelerate the mixed-precision (FP16/FP32) matrix operations behind neural network and deep learning workloads. The more cores you add, the larger the die gets, and cores can only be added up to a limit. So what's the solution? MCM.
NVIDIA's MCM GPU Package Design Featured in Research Publication - A Glimpse of NVIDIA's Next GPU Accelerator For HPC?
Just recently, a research publication (via The Tech Report) was posted by NVIDIA that discusses building an MCM, or Multi-Chip Module, package. An MCM places several dies (GPU/CPU/memory/controllers) on the same interposer, interconnected via fast I/O links.
Some examples of MCM packages are NVIDIA's Volta V100 and Pascal P100 GPUs, AMD's Fiji and Vega GPUs, and even AMD's new server-aimed EPYC processors. The NVIDIA and AMD GPUs each carry only a single GPU die, but they feature multiple DRAM dies on the same package, making them MCM designs. The EPYC processors house four individual dies with 8 cores per die, interconnected via AMD's Infinity Fabric link. AMD is also working on a similar approach with its Navi GPUs.
There are multiple options when designing an MCM package. NVIDIA proposes that its MCM solution may depart from the traditional design of one monolithic GPU and a few DRAM dies on the package, in favor of multiple smaller iterations of its GPU chips paired with significantly more DRAM dies. The GPU and DRAM dies would connect to an I/O and controller chip on the package rather than having everything integrated into a single die. The key to this solution is GPMs (GPU Modules): smaller, easier-to-produce and less expensive chips that are interconnected on the package.
NVIDIA simulated the performance of a 256-SM MCM-GPU built from GPMs of 64 SMs each. The top Volta chip currently features 84 SMs with 5,376 cores; the proposed 256-SM MCM package would house 16,384 cores.
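The core counts follow directly from Volta's 64 FP32 cores per SM; a quick sketch of the arithmetic (the four-module split is implied by 256 SMs at 64 SMs per GPM, and the variable names are my own):

```python
FP32_CORES_PER_SM = 64   # Volta-style SM

# Simulated MCM-GPU: GPU modules (GPMs) of 64 SMs each, 256 SMs total.
gpms = 4
sms_per_gpm = 64
mcm_sms = gpms * sms_per_gpm

# Today's biggest chip for comparison: full GV100.
gv100_sms = 84

print("GV100 cores:  ", gv100_sms * FP32_CORES_PER_SM)  # -> 5376
print("MCM-GPU cores:", mcm_sms * FP32_CORES_PER_SM)    # -> 16384
```

In other words, the simulated package carries just over three times the cores of the largest chip NVIDIA can fabricate today.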
Over the baseline MCM-GPU, the optimized design resulted in a performance increase of 22.8%. Compared with the largest theoretically buildable monolithic GPU, assumed to house 128 SMs, the 256-SM chip performed 4.5% better, and it comes within 10% of an unbuildable but similarly sized 256-SM monolithic GPU. There are also architectural upgrades and better interconnects to be taken into consideration. NVIDIA points out that each GPM is expected to be 40-60% smaller than today's biggest GPU, assuming it is designed on a new 10nm or 7nm process node. A very basic MCM-GPU diagram is illustrated below:
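Taking the paper's reported figures at face value, the relative standings can be sketched numerically. The normalization below (largest buildable 128-SM monolithic GPU set to 1.0) is my own framing, not the paper's:

```python
# Normalize the largest buildable monolithic GPU (128 SMs) to 1.0.
mono_128 = 1.00

# The 256-SM MCM-GPU performs 4.5% better than it.
mcm_256 = mono_128 * 1.045

# "Within 10%" of the unbuildable 256-SM monolithic GPU means the MCM
# delivers at least 90% of its performance, bounding that chip's speedup.
mono_256_bound = mcm_256 / 0.90

print(f"256-SM MCM-GPU:                 {mcm_256:.3f}")
print(f"256-SM monolithic (upper bound): {mono_256_bound:.3f}")
```

Read this way, the MCM design recovers most of the performance of a hypothetical double-size monolithic chip while using only dies that can actually be manufactured.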
There's still a long way to go before we see more details on an MCM GPU from NVIDIA, but I think that after Volta, NVIDIA has its R&D team focused on such projects, and we will hear more about MCM designs featuring several GPMs at next year's GTC (GPU Technology Conference) or whenever NVIDIA reveals its new graphics roadmap.