Meet Cerebras WSE, The World’s Largest Chip At More Than 56 Times The Size Of An NVIDIA V100


While our friendly chip giants are bickering over performance increases in the double digits, a startup called Cerebras Systems has gone ahead and shown off a prototype that offers an absolutely unbelievable transistor count increase of roughly 5,600% over the largest chip currently available: the NVIDIA V100. Bumping the transistor count from 21.1 billion to 1.2 trillion, the startup has solved key technical challenges that no one else has managed to overcome and, in doing so, built the world's first wafer-scale processor.
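The headline percentage is easy to check. The short Python sketch below uses the 21.1 billion figure for the V100 and the 1.2 trillion figure given for the WSE in the spec rundown further down; the function names are illustrative only.

```python
# Transistor counts as quoted in the article.
V100_TRANSISTORS = 21.1e9   # NVIDIA V100
WSE_TRANSISTORS = 1.2e12    # Cerebras WSE


def times_larger(new: float, old: float) -> float:
    """Return how many times larger `new` is than `old`."""
    return new / old


def percent_increase(new: float, old: float) -> float:
    """Return the relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


ratio = times_larger(WSE_TRANSISTORS, V100_TRANSISTORS)
increase = percent_increase(WSE_TRANSISTORS, V100_TRANSISTORS)
print(f"{ratio:.1f}x the transistors, an increase of about {increase:.0f}%")
```

The ratio comes out at just under 57x, which is where the rounded "5,600%" figure comes from.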

Cerebras Systems' Wafer Scale Engine (WSE): The World's First Trillion Transistor Count Chip

The Cerebras Wafer Scale Engine is the world's first wafer-scale processor. You might be wondering why no one else has done something so obvious. The reason is that the key technical challenge of cross-scribe-line communication had never been overcome by anyone else. See, current lithographic equipment is designed to pattern many small processors on a wafer; it cannot produce a single processor that spans the whole wafer. This means that scribe lines will exist one way or the other, and the individual blocks must somehow be able to communicate across them. This is the problem Cerebras has solved, and it is what lets the company claim the throne of the first trillion-transistor processor.

The Cerebras WSE covers an area of 46,225 mm² and houses 1.2 trillion transistors. All the cores are optimized for AI workloads, and the chip consumes a whopping 15 kW of power. Since all that power has to be removed as heat, the cooling system needs to be just as revolutionary as the power delivery system. Based on the company's comments about vertical cooling, I am thinking a submersion cooling system with fast-moving freon would probably be the only thing that can tame this beast. The power system would also need to be incredibly robust. According to Cerebras, the chip is around 1,000 times faster than traditional systems simply because communication can happen across the scribe lines instead of jumping through hoops (interconnect, DIMM, etc.).
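To put the area and power numbers in perspective, here is a rough back-of-the-envelope calculation in Python. The WSE figures are the ones quoted above. The 815 mm² die size for the V100 is not stated in this article; it is NVIDIA's published figure for the GV100 die and is treated as an assumption here, as is the idea that the WSE is square.

```python
WSE_AREA_MM2 = 46_225   # from the article
WSE_POWER_W = 15_000    # 15 kW, from the article
V100_AREA_MM2 = 815     # assumption: published GV100 die size, not given in the article

# Edge length if the die is square (an assumption); 215 mm x 215 mm = 46,225 mm².
side_mm = WSE_AREA_MM2 ** 0.5

# How many V100-sized dies would fit in the same silicon area.
area_ratio = WSE_AREA_MM2 / V100_AREA_MM2

# Average power density over the die, in watts per square centimetre.
power_density_w_per_cm2 = WSE_POWER_W / (WSE_AREA_MM2 / 100)

print(f"edge length:   {side_mm:.0f} mm")
print(f"area ratio:    {area_ratio:.1f}x a V100")
print(f"power density: {power_density_w_per_cm2:.1f} W/cm^2")
```

The area ratio of a little under 57x is consistent with the "more than 56 times" claim in the headline. The average power density is not extreme on its own; the difficulty is that the 15 kW total has to be delivered to and removed from a single piece of silicon.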

The WSE contains 400,000 Sparse Linear Algebra (SLA) cores. Each core is flexible, programmable, and optimized for the computations that underpin most neural networks. Programmability ensures the cores can run all algorithms in the constantly changing machine learning field. The 400,000 cores on the WSE are connected via the Swarm communication fabric in a 2D mesh with 100 Pb/s of bandwidth. Swarm is a massive on-chip communication fabric that delivers breakthrough bandwidth and low latency at a fraction of the power draw of traditional techniques used to cluster graphics processing units. It is fully configurable; software configures all the cores on the WSE to support the precise communication required for training the user-specified model. For each neural network, Swarm provides a unique and optimized communication path.
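For readers unfamiliar with the term, a 2D mesh simply means each core is wired to its immediate neighbours on a grid. The Python sketch below illustrates a generic 2D mesh on a tiny 4x4 grid. It is not a model of how Swarm actually routes traffic; the grid size and function names are hypothetical and chosen only for illustration.

```python
def mesh_neighbors(row: int, col: int, rows: int, cols: int) -> list[tuple[int, int]]:
    """Return the grid positions directly linked to (row, col) in a 2D mesh."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    # Cores on the edge of the grid have fewer than four neighbours.
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


def hop_count(src: tuple[int, int], dst: tuple[int, int]) -> int:
    """Minimum number of link traversals between two cores (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


# Hypothetical 4x4 grid, far smaller than the real chip.
corner_links = mesh_neighbors(0, 0, rows=4, cols=4)
center_links = mesh_neighbors(1, 1, rows=4, cols=4)
corner_to_corner = hop_count((0, 0), (3, 3))
print(len(corner_links), len(center_links), corner_to_corner)
```

In a mesh like this, nearby cores are only a hop or two apart, which is why mapping a neural network so that communicating layers sit next to each other matters.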

The WSE has 18 GB of on-chip memory, all accessible within a single clock cycle, and provides 9 PB/s of memory bandwidth. This is 3,000x more on-chip memory and 10,000x more memory bandwidth than the leading competitor. More cores and more local memory enable fast, flexible computation at lower latency and with less energy.
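A bit of arithmetic shows what these totals mean per core, and what the "3,000x" and "10,000x" multipliers imply about the part being compared against. The Python sketch below assumes decimal units (1 GB = 10^9 bytes) and that memory and bandwidth are spread evenly across all 400,000 cores; neither assumption is confirmed by Cerebras.

```python
CORES = 400_000
ON_CHIP_MEMORY_BYTES = 18e9    # 18 GB, assuming decimal units
MEMORY_BANDWIDTH_BPS = 9e15    # 9 PB/s, in bytes per second

# Per-core share, assuming an even split across cores.
memory_per_core_kb = ON_CHIP_MEMORY_BYTES / CORES / 1e3
bandwidth_per_core_gbs = MEMORY_BANDWIDTH_BPS / CORES / 1e9

# Figures implied for the competing chip by the quoted multipliers.
implied_rival_memory_mb = ON_CHIP_MEMORY_BYTES / 3_000 / 1e6
implied_rival_bandwidth_gbs = MEMORY_BANDWIDTH_BPS / 10_000 / 1e9

print(f"per core: {memory_per_core_kb:.0f} KB, {bandwidth_per_core_gbs:.1f} GB/s")
print(f"implied rival: {implied_rival_memory_mb:.0f} MB on-chip, "
      f"{implied_rival_bandwidth_gbs:.0f} GB/s")
```

Under these assumptions each core gets about 45 KB of local memory, and the comparison chip works out to roughly 6 MB of on-chip memory and 900 GB/s of memory bandwidth, which is in the range of a current high-end GPU.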

This would allow a massive speedup in AI applications and could reduce training times from months to just a couple of hours. There is no doubt that this is truly revolutionary, assuming the company can deliver on its promise and start shipping to customers soon. The Cerebras WSE is being manufactured on a 300 mm wafer using TSMC's 16nm process, which means it is built on near-cutting-edge technology, just one node behind giants like NVIDIA. Of course, with 84 interconnected blocks that house over 400,000 cores, the process it is manufactured on simply does not matter.

Yield and binning of the Cerebras WSE are going to be very interesting. For one, if you are using the entire wafer as a die, you are going to get either 100% yield if the design can absorb defects or 0% if it cannot. Clearly, since prototypes have been made, the design is capable of absorbing defects. In fact, the CEO stated that the design expects defects to affect around 1% to 1.5% of the functional surface area, and the microarchitecture simply reconfigures itself around the cores that are available. Furthermore, redundant cores are placed throughout the chip to minimize any performance loss. There is no information on binning right now, but it goes without saying that this is the world's most binnable design.
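The "100% or 0%" point can be illustrated with the simple Poisson yield model, in which the chance of a die having no defects at all is exp(-area × defect density). The Python sketch below is only an illustration: the defect density of 0.1 per cm² is a made-up value, the 8.15 cm² comparison die is an assumption based on the published V100 die size, and the core-loss estimate assumes defects knock out cores in proportion to area.

```python
import math


def poisson_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Probability that a die of the given area contains zero defects."""
    return math.exp(-area_cm2 * defects_per_cm2)


DEFECT_DENSITY = 0.1       # hypothetical defects per cm², for illustration only
WSE_AREA_CM2 = 462.25      # 46,225 mm² from the article
GPU_AREA_CM2 = 8.15        # assumption: V100-sized die, not given in the article

# Chance of a completely defect-free die, i.e. the yield with no redundancy.
wse_defect_free = poisson_yield(WSE_AREA_CM2, DEFECT_DENSITY)
gpu_defect_free = poisson_yield(GPU_AREA_CM2, DEFECT_DENSITY)

# Cores lost if 1% to 1.5% of the functional area is defective,
# assuming the loss is proportional to the 400,000-core total.
CORES = 400_000
cores_lost_low = round(CORES * 0.010)
cores_lost_high = round(CORES * 0.015)

print(f"defect-free WSE:      {wse_defect_free:.1e}")
print(f"defect-free GPU die:  {gpu_defect_free:.2f}")
print(f"cores lost to defects: {cores_lost_low} to {cores_lost_high}")
```

With any realistic defect density, a perfect wafer-sized die is essentially impossible, so the design has to route around a few thousand bad cores rather than hope for none.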

We are also told that the company had to develop its own manufacturing and packaging techniques, since no existing tools are designed to handle a wafer-scale processor. Not only that, the software had to be rewritten to handle over 1 trillion transistors in a single processor. Cerebras Systems is clearly a company with incredible potential, and seeing the splash it caused at Hot Chips, we cannot wait to see some test results from these Wafer Scale Engines.