Birentech Details China’s Most Powerful GPU, The Biren BR100: 1074mm2 on 7nm, 77 Billion Transistors, Up To 2.8x Faster Than NVIDIA Ampere at 550W


Earlier this month, we reported that Birentech, a China-based company, was working on its most powerful GPU to date, the Biren BR100. Based on the company's public disclosures, the BR100 is a general-purpose GPU that aims to outperform NVIDIA's A100 in AI processing. Now, at Hot Chips 34, the company is presenting more details on the specs and architecture of its Biren GPGPU lineup.

China's Fastest General-Purpose MCM GPU, The Birentech Biren BR100, Architecture Detailed

The Birentech BR100 is the flagship general-purpose GPU that China has to offer, featuring an in-house GPU architecture built on a 7nm process node and housing 77 billion transistors. The chip is packaged using TSMC's 2.5D CoWoS technology and comes packed with 300 MB of on-chip cache, 64 GB of HBM2e memory delivering 1.64 TB/s of bandwidth, and support for PCIe Gen 5.0 with the CXL interconnect protocol. The whole package measures 1074mm2, which is beyond the reticle limit of the process node and only achievable because the design is split into chiplets.


Some of the fundamentals that went into designing the BR100 GPU included:

  • Break the reticle size limit and integrate more transistors on a chip
  • One tape-out to empower multiple SKUs
  • Smaller dies for better yield and hence lower cost
  • 896 GB/s high-speed die-to-die interconnect
  • 30% more performance and 20% better yield compared with a monolithic design

Turning to the architecture itself, the Biren BR100 is made up of two chiplets, each housing 16 Streaming Processing Clusters (SPCs). Each SPC contains 16 EUs; four of these EUs form a Compute Unit (CU) that is attached to 64 KB of L1 cache (LSC), while each SPC carries an 8 MB L2 cache shared across all of its Execution Units. That adds up to 32 SPCs with 512 Execution Units, 256 MB of L2 cache, and 8 MB of L1 cache in total.
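As a sanity check on those totals, here is a minimal Python sketch of the compute hierarchy, using only the figures stated above (the constant names are ours, not Birentech's):

```python
# Compute-hierarchy totals for the Biren BR100, as described in the text.
CHIPLETS = 2
SPC_PER_CHIPLET = 16      # Streaming Processing Clusters per chiplet
EU_PER_SPC = 16           # Execution Units per SPC
EU_PER_CU = 4             # EUs grouped into one Compute Unit
L1_PER_CU_KB = 64         # LSC attached to each CU
L2_PER_SPC_MB = 8         # L2 shared across an SPC's EUs

spcs = CHIPLETS * SPC_PER_CHIPLET
eus = spcs * EU_PER_SPC
cus = eus // EU_PER_CU
l1_total_mb = cus * L1_PER_CU_KB / 1024
l2_total_mb = spcs * L2_PER_SPC_MB

print(spcs, eus, l1_total_mb, l2_total_mb)
# 32 512 8.0 256 -- matching the 32 SPCs, 512 EUs, 8 MB L1, 256 MB L2 in the text
```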

A deeper look at the Execution Unit reveals 16 streaming processing cores (V-Cores) and a single Tensor Engine (T-Core), along with 40 KB of TLR (Thread Local Register), 4 SFUs, and a TDA (Tensor Data Accelerator). Interestingly, a CU can contain 4, 8, or 16 EUs. The V-Core itself is a general-purpose SIMT processor with 16 cores supporting FP32, FP16, INT32 & INT16 along with SFU, load/store, and data-processing operations, and it handles deep-learning operations such as Batch Norm, ReLU, etc. It also features an enhanced SIMT model that can run up to 128K threads across the 32 SPCs in a super-scalar mode (static and dynamic). The T-Core's tensor design is used to accelerate AI operations such as MMA, convolution, etc.


Birentech also disclosed various performance metrics for the chip: up to 2,048 TOPS (INT8), 1,024 TFLOPS (BF16), 512 TFLOPS (TF32+), and 256 TFLOPS (FP32). Based on these figures, the chip looks set to be faster than NVIDIA's Ampere A100, at least on paper. Compared against the A100 across various HPC workloads, the BR100 would offer a 2.6x average speedup and up to a 2.8x speedup over its main competitor.
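Those peak numbers follow a simple pattern: each time operand precision roughly halves, peak throughput doubles, as is typical for tensor-engine designs. A quick check of the disclosed figures:

```python
# Peak throughput figures disclosed for the BR100 (TFLOPS, or TOPS for INT8).
peaks = {"FP32": 256, "TF32+": 512, "BF16": 1024, "INT8": 2048}

# Each step down in precision doubles the peak rate.
rates = list(peaks.values())
assert all(b == 2 * a for a, b in zip(rates, rates[1:]))
print("throughput doubles at every precision step")
```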


NVIDIA's Hopper H100, however, offers roughly 2x to 2.5x the BR100's throughput in those same metrics. The chip also supports 64-channel encoding and 512-channel decoding. As for interconnects, the chip comes with an 8-port BLink solution that offers 2.3 TB/s of external I/O bandwidth.

What's interesting is that the BR100 isn't far behind the NVIDIA H100 in overall transistor count. The H100 packs 80 billion transistors on the newer N4 process node, whereas the BR100 trails by only 3 billion transistors on 7nm, which is why the BR100 ends up with a much bigger die size.

Birentech Biren BR100
Process: 7nm
System interface: PCIe 5.0 x16, 128 GB/s, CXL support
FP32 TFLOPS (peak): 256
TF32+ TFLOPS (peak): 512
BF16 TFLOPS (peak): 1,024
INT8 TOPS (peak): 2,048
Memory: 64 GB HBM2E, 4,096-bit, 1.64 TB/s
Interconnect: 512 GB/s BLink™, supports 8 x8 ports
Secure virtual instances: Up to 8
Video codec (FHD@30fps): 64-channel HEVC/H.264 encode, 512-channel HEVC/H.264 decode
TDP: 550W
Product form: OAM module
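The listed memory bandwidth follows directly from the interface width. Assuming an effective HBM2E data rate of 3.2 Gbps per pin (our inference from the listed totals, not a figure Birentech disclosed):

```python
bus_width_bits = 4096     # HBM2E interface width from the spec table
pin_rate_gbps = 3.2       # assumed effective data rate per pin

# bits/s across the bus, converted to TB/s
bandwidth_tbs = bus_width_bits * pin_rate_gbps / 8 / 1000
print(f"{bandwidth_tbs:.2f} TB/s")  # ~1.64 TB/s, matching the table
```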

The Biren BR100 isn't the only chip the China-based company has announced. There's also the Biren BR104, which offers half the performance metrics of the BR100. Unlike the BR100's chiplet design, the BR104 is a monolithic die, and it comes in a standard PCIe form factor with a TDP of 300W.

Birentech Biren BR104
Process: 7nm
System interface: PCIe 5.0 x16, 128 GB/s, CXL support
FP32 TFLOPS (peak): 128
TF32+ TFLOPS (peak): 256
BF16 TFLOPS (peak): 512
INT8 TOPS (peak): 1,024
Memory: 32 GB HBM2E, 2,048-bit, 819 GB/s
Interconnect: 192 GB/s BLink™, supports 3 x8 ports
Secure virtual instances: Up to 4
Video codec (FHD@30fps): 32-channel HEVC/H.264 encode, 256-channel HEVC/H.264 decode
TDP: 300W
Product form: Full-height, full-length, dual-slot PCIe card
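Lining the two spec tables up confirms the "half of BR100" framing: every disclosed compute and capacity metric scales by exactly 0.5 (memory bandwidth, at 819 GB/s vs 1.64 TB/s, is also roughly half). A quick comparison using the figures above:

```python
# Figures copied from the two spec tables above.
br100 = {"FP32": 256, "TF32+": 512, "BF16": 1024, "INT8": 2048,
         "hbm2e_gb": 64, "bus_bits": 4096}
br104 = {"FP32": 128, "TF32+": 256, "BF16": 512, "INT8": 1024,
         "hbm2e_gb": 32, "bus_bits": 2048}

ratios = {k: br104[k] / br100[k] for k in br100}
assert all(r == 0.5 for r in ratios.values())
print("BR104 is exactly half of BR100 on every listed metric")
```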

The company states that a chip with 77 billion transistors can mimic the nerve cells of the human brain, and that the chip itself will be used for DNN and AI workloads, so it is more or less intended to reduce China's dependence on NVIDIA's AI GPUs.
