Graphcore’s Colossus GC200 7nm Chip Competes Against The NVIDIA A100 GPU With Colossal Design & 250 TFLOPs AI Performance – 59.4 Billion Transistors In An 823mm2 Die
The AI segment is seeing rapid progress, with major tech companies pouring resources into keeping up with the demand for higher performance each year. We've seen NVIDIA and AMD actively building next-generation GPUs specifically with AI and HPC in mind, but competition has now arrived from British AI chip designer Graphcore, which has unveiled its second-generation AI chip that competes directly against NVIDIA's A100 Tensor Core GPU accelerator.
Graphcore's GC200 Is A Massive 7nm Chip For AI Tasks Which Is Designed To Compete Against NVIDIA's A100 GPU - IPU Delivers Up To 250 Teraflops of AI Compute
To that end, Graphcore has announced its new Colossus MK2 GC200 IPU, or Intelligence Processing Unit, which is designed exclusively to power machine intelligence. True to its name, the chip features a colossal design and delivers an 8x performance bump over its predecessor, the MK1.
“We’re 100% focused on silicon processors for AI, and on building systems that can plug into existing centers. Why would we want to build CPUs or GPUs if those already work well? This is just a different toolbox.” via Graphcore's CEO, Nigel Toon
The Colossus MK2 GC200 is fabricated on TSMC's 7nm process node and features a die size of 823 mm2. For comparison, that's almost as big as the NVIDIA A100 GPU accelerator, which measures 826 mm2. The chip is a behemoth not only in size but also in density, with a total of 59.4 billion transistors onboard compared to 54.2 billion transistors on the NVIDIA A100 GPU. With more transistors on a slightly smaller die, the Graphcore chip has a higher transistor density than NVIDIA's flagship accelerator.
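The density claim can be checked with quick arithmetic. The sketch below uses only the transistor counts and die sizes quoted above; the density values themselves are derived here rather than stated by either vendor, and the function and variable names are illustrative.

```python
# Figures quoted in the article. Density is derived, not vendor-stated.
CHIPS = {
    "Graphcore GC200": {"transistors_billion": 59.4, "die_mm2": 823},
    "NVIDIA A100": {"transistors_billion": 54.2, "die_mm2": 826},
}


def density_mtr_per_mm2(transistors_billion: float, die_mm2: float) -> float:
    """Return transistor density in millions of transistors per mm^2."""
    return transistors_billion * 1000 / die_mm2


densities = {name: density_mtr_per_mm2(**spec) for name, spec in CHIPS.items()}

for name, value in densities.items():
    print(f"{name}: {value:.1f} MTr/mm^2")

# Relative density advantage of the GC200 over the A100.
advantage = densities["Graphcore GC200"] / densities["NVIDIA A100"] - 1
print(f"GC200 density advantage: {advantage:.1%}")
```

On these numbers the GC200 works out to roughly 72 million transistors per mm2 against roughly 66 million for the A100, a gap of about 10%.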
To make the GC200 work, it is configured with 1472 IPU tiles, each with an IPU core and In-Processor Memory. Each IPU core runs 6 threads in parallel, which puts the total number of threads on the chip at 8832 (1472 cores x 6 threads). For memory, the chip makes use of an on-die solution which offers 900 MB of capacity per IPU and 47.5 TB/s of memory bandwidth. Graphcore has gone with a smaller-capacity but higher-bandwidth solution, and states that you can theoretically get more capacity when using several racks at once, so that the memory pool would end up larger than that of a rack composed of A100 GPUs.
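The thread count follows directly from the tile count. The short sketch below reproduces it and also estimates the per-tile share of memory and bandwidth. The per-tile figures are an assumption: they presume an even split across tiles and decimal units (1 MB = 1000 KB), neither of which is stated in the article.

```python
TILES = 1472            # IPU tiles per GC200 (from the article)
THREADS_PER_TILE = 6    # parallel threads per IPU core (from the article)
MEMORY_MB = 900         # In-Processor Memory per IPU (from the article)
BANDWIDTH_TBPS = 47.5   # memory bandwidth per IPU (from the article)

# Total hardware threads on the chip.
total_threads = TILES * THREADS_PER_TILE

# Assumption: memory and bandwidth are divided evenly across all tiles,
# using decimal units. These are rough estimates, not vendor figures.
memory_per_tile_kb = MEMORY_MB * 1000 / TILES
bandwidth_per_tile_gbps = BANDWIDTH_TBPS * 1000 / TILES

print(f"Threads per chip: {total_threads}")
print(f"Memory per tile (est.): {memory_per_tile_kb:.0f} KB")
print(f"Bandwidth per tile (est.): {bandwidth_per_tile_gbps:.1f} GB/s")
```

The small per-tile memory paired with high per-tile bandwidth illustrates the trade-off Graphcore describes: less capacity than off-chip memory, but sitting right next to each core.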
For interconnectivity, the chip uses the IPU-Exchange fabric, which provides 8 TB/s of bandwidth to all IPUs. The chip also has 10 IPU-Links, which provide 320 GB/s of chip-to-chip bandwidth. The GC200 also supports a PCIe Gen 4 (x16) interface. As for compute output, the GC200 delivers 250 TFLOPs of peak FP16 (with sparsity) and 62.5 TFLOPs of peak FP32 (with sparsity). The NVIDIA A100 GPU delivers a total of 312 TFLOPs of FP16 (624 TFLOPs with sparsity) and 19.5 TFLOPs of FP32 (156 TFLOPs with sparsity).
The IPU-Machine - A 1 PetaFlop Rack With Four GC200 IPUs
In addition to the Colossus GC200 IPU, Graphcore is also unveiling its competitor to the NVIDIA DGX A100, the IPU-M2000. This rack is composed of four GC200 IPUs, which together offer a combined memory pool of 450 GB. The CPU that powers the rack is an ARM Cortex-A quad-core SoC, and the system comes in a 1U chassis design featuring an advanced cooling system.
From the looks of it, each IPU has an aluminum fin heatsink block attached over it. Six massive heat pipes make direct contact with each heatsink block and lead to a large block at the rear of the rack, which is cooled by central cooling from the rack station.
In terms of performance metrics, Graphcore has compared eight M2000 IPU-Machine racks to a single DGX-A100. The reason for this pairing is the performance-per-dollar metric. The DGX A100 costs $199,000 (MSRP) while eight M2000 racks would cost $259,600 (MSRP). Graphcore claims that its solution offers 12x the FP32 compute, 3x the FP16 compute, and 10x the memory of NVIDIA's solution. Do note that the figures for the DGX-A100 are derived without sparsity whereas Graphcore's own numbers are derived with sparsity included.
With sparsity, the DGX-A100 stands at around 5.0 PFLOPs in FP16 versus 8 PFLOPs for the eight M2000 racks, and at 1.248 PFLOPs in FP32 versus 2.0 PFLOPs. That still gives the M2000 setup an edge of 60% in performance while costing 30% more. In addition to these performance metrics, Graphcore says that the GC200 IPU platform is super flexible in the sense that you can have up to 64,000 of these chips running together, which would deliver a massive 16 Exaflops of compute horsepower.
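The system-level numbers can be rebuilt from the per-chip figures and prices quoted earlier. The sketch below does so under one assumption not stated in the article: that a DGX A100 contains eight A100 GPUs. All per-chip values are the with-sparsity peaks, and the variable names are illustrative.

```python
# Per-chip peak throughput with sparsity, in TFLOPs (from the article).
GC200_FP16, GC200_FP32 = 250.0, 62.5
A100_FP16, A100_FP32 = 624.0, 156.0

IPUS_PER_M2000 = 4        # from the article
M2000_COUNT = 8           # from the article
A100_PER_DGX = 8          # assumption: not stated in the article

M2000_TOTAL_PRICE = 259_600   # eight M2000 racks, MSRP (from the article)
DGX_PRICE = 199_000           # one DGX A100, MSRP (from the article)

ipu_count = IPUS_PER_M2000 * M2000_COUNT

# Aggregate peak throughput in PFLOPs (1 PFLOP = 1000 TFLOPs).
graphcore_fp16_pflops = ipu_count * GC200_FP16 / 1000
graphcore_fp32_pflops = ipu_count * GC200_FP32 / 1000
dgx_fp16_pflops = A100_PER_DGX * A100_FP16 / 1000
dgx_fp32_pflops = A100_PER_DGX * A100_FP32 / 1000

fp16_edge = graphcore_fp16_pflops / dgx_fp16_pflops
fp32_edge = graphcore_fp32_pflops / dgx_fp32_pflops
price_ratio = M2000_TOTAL_PRICE / DGX_PRICE

# Scale-out claim: 64,000 IPUs at peak FP16, in EFLOPs (1 EFLOP = 1e6 TFLOPs).
scale_out_eflops = 64_000 * GC200_FP16 / 1_000_000

print(f"FP16: {graphcore_fp16_pflops} vs {dgx_fp16_pflops} PFLOPs ({fp16_edge:.2f}x)")
print(f"FP32: {graphcore_fp32_pflops} vs {dgx_fp32_pflops} PFLOPs ({fp32_edge:.2f}x)")
print(f"Price ratio: {price_ratio:.2f}x")
print(f"64,000 IPUs: {scale_out_eflops} EFLOPs")
```

Both precisions come out at about 1.6x in favor of the eight M2000 racks for about 1.3x the price, which matches the 60% and 30% figures, and the 64,000-chip configuration lands on the quoted 16 Exaflops.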
As far as availability is concerned, Graphcore states that customers can pre-order the IPU-Machine today with full volume shipments starting sometime in Q4 2020.