Birentech Details China’s Most Powerful GPU, The Biren BR100: 1074mm2 on 7nm, 77 Billion Transistors, Up To 2.8x Faster Than NVIDIA Ampere at 550W


Earlier this month, we reported that Birentech, a China-based company, was working on its most powerful GPU to date, the Biren BR100. Based on the company's public disclosures, the BR100 is a general-purpose GPU that aims to outperform NVIDIA's A100 in AI processing. Now, at Hot Chips 34, the company is presenting more details on the specs and architecture of its Biren GPGPU lineup.

China's Fastest General-Purpose MCM GPU, The Birentech Biren BR100, Architecture Detailed

The Birentech BR100 is the flagship general-purpose GPU that China has to offer, featuring an in-house GPU architecture built on a 7nm process node and housing 77 billion transistors. The chip is packaged using TSMC's 2.5D CoWoS technology and comes packed with 300 MB of on-chip cache, 64 GB of HBM2e memory delivering 1.64 TB/s of bandwidth, and support for PCIe Gen 5.0 with the CXL interconnect protocol. The whole package measures 1074mm2, which is beyond the reticle limit of the process node and only achievable because the design is split into chiplets.


Some of the fundamentals that went into designing the BR100 GPU included:

  • Break the reticle size limit and integrate more transistors on a chip
  • One tape-out to empower multiple SKUs
  • Smaller dies for better yield and hence lower cost
  • 896 GB/s high-speed die-to-die interconnect
  • 30% more performance and 20% better yield compared with a monolithic design

Turning to the architecture itself, the Biren BR100 is made up of two chiplets, each housing 16 Streaming Processing Clusters (SPCs). Each SPC contains 16 EUs; four of these EUs form a Compute Unit (CU) that is attached to 64 KB of L1 cache (LSC), while each SPC carries an 8 MB L2 cache shared across all of its Execution Units. That adds up to 32 SPCs with 512 Execution Units, 256 MB of L2 cache, and 8 MB of L1 cache in total.
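As a sanity check on those totals, here is a minimal Python sketch of the compute hierarchy, using only the figures stated above (the constant names are ours, not Birentech's):

```python
# Compute-hierarchy totals for the Biren BR100, as described in the text.
CHIPLETS = 2
SPC_PER_CHIPLET = 16      # Streaming Processing Clusters per chiplet
EU_PER_SPC = 16           # Execution Units per SPC
EU_PER_CU = 4             # EUs grouped into one Compute Unit
L1_PER_CU_KB = 64         # LSC attached to each CU
L2_PER_SPC_MB = 8         # L2 shared across an SPC's EUs

spcs = CHIPLETS * SPC_PER_CHIPLET
eus = spcs * EU_PER_SPC
cus = eus // EU_PER_CU
l1_total_mb = cus * L1_PER_CU_KB / 1024
l2_total_mb = spcs * L2_PER_SPC_MB

print(spcs, eus, l1_total_mb, l2_total_mb)
# 32 512 8.0 256 -- matching the 32 SPCs, 512 EUs, 8 MB L1, 256 MB L2 in the text
```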

A deeper look at the Execution Unit reveals 16 streaming processing cores (V-Cores) and a single Tensor Engine (T-Core), along with 40 KB of TLR (Thread Local Register), 4 SFUs, and a TDA (Tensor Data Accelerator). Interestingly, a CU can contain 4, 8, or 16 EUs. The V-Core itself is a general-purpose SIMT processor with 16 cores supporting FP32, FP16, INT32 & INT16 along with SFU, load/store, and data-processing operations, and it handles deep-learning operations such as Batch Norm, ReLU, etc. It also features an enhanced SIMT model that can run up to 128K threads across the 32 SPCs in a super-scalar mode (static and dynamic). The T-Core's tensor design is used to accelerate AI operations such as MMA, convolution, etc.


Birentech also disclosed various performance metrics for the chip: up to 2,048 TOPS (INT8), 1,024 TFLOPS (BF16), 512 TFLOPS (TF32+), and 256 TFLOPS (FP32). Based on these figures, the chip looks set to be faster than NVIDIA's Ampere A100, at least on paper. Compared against the A100 across various HPC workloads, the BR100 would offer a 2.6x average speedup and up to a 2.8x speedup over its main competitor.
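Those peak numbers follow a simple pattern: each time operand precision roughly halves, peak throughput doubles, as is typical for tensor-engine designs. A quick check of the disclosed figures:

```python
# Peak throughput figures disclosed for the BR100 (TFLOPS, or TOPS for INT8).
peaks = {"FP32": 256, "TF32+": 512, "BF16": 1024, "INT8": 2048}

# Each step down in precision doubles the peak rate.
rates = list(peaks.values())
assert all(b == 2 * a for a, b in zip(rates, rates[1:]))
print("throughput doubles at every precision step")
```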


NVIDIA's Hopper H100, however, offers roughly 2x to 2.5x the BR100's throughput in those same metrics. The chip also supports 64-channel encoding and 512-channel decoding. As for interconnects, the chip comes with an 8-port BLink solution that offers 2.3 TB/s of external I/O bandwidth.

What's interesting is that the BR100 isn't far behind the NVIDIA H100 in overall transistor count. The H100 packs 80 billion transistors on the newer N4 process node, whereas the BR100 trails by only 3 billion transistors on 7nm, which is why the BR100 ends up with a much bigger die size.

Birentech Biren BR100
Process: 7nm
System interface: PCIe 5.0 x16, 128 GB/s, CXL support
FP32 TFLOPS (peak): 256
TF32+ TFLOPS (peak): 512
BF16 TFLOPS (peak): 1,024
INT8 TOPS (peak): 2,048
Memory: 64 GB HBM2E, 4,096-bit, 1.64 TB/s
Interconnect: 512 GB/s BLink™, supports 8 x8 ports
Secure virtual instances: Up to 8
Video codec (FHD@30fps): 64-channel HEVC/H.264 encode, 512-channel HEVC/H.264 decode
TDP: 550W
Product form: OAM module
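The listed memory bandwidth follows directly from the interface width. Assuming an effective HBM2E data rate of 3.2 Gbps per pin (our inference from the listed totals, not a figure Birentech disclosed):

```python
bus_width_bits = 4096     # HBM2E interface width from the spec table
pin_rate_gbps = 3.2       # assumed effective data rate per pin

# bits/s across the bus, converted to TB/s
bandwidth_tbs = bus_width_bits * pin_rate_gbps / 8 / 1000
print(f"{bandwidth_tbs:.2f} TB/s")  # ~1.64 TB/s, matching the table
```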

The Biren BR100 isn't the only chip the China-based company has announced. There's also the Biren BR104, which offers half the performance metrics of the BR100. Unlike the BR100's chiplet design, the BR104 is a monolithic die, and it comes in a standard PCIe form factor with a TDP of 300W.

Birentech Biren BR104
Process: 7nm
System interface: PCIe 5.0 x16, 128 GB/s, CXL support
FP32 TFLOPS (peak): 128
TF32+ TFLOPS (peak): 256
BF16 TFLOPS (peak): 512
INT8 TOPS (peak): 1,024
Memory: 32 GB HBM2E, 2,048-bit, 819 GB/s
Interconnect: 192 GB/s BLink™, supports 3 x8 ports
Secure virtual instances: Up to 4
Video codec (FHD@30fps): 32-channel HEVC/H.264 encode, 256-channel HEVC/H.264 decode
TDP: 300W
Product form: Full-height, full-length, dual-slot PCIe card
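Lining the two spec tables up confirms the "half of BR100" framing: every disclosed compute and capacity metric scales by exactly 0.5 (memory bandwidth, at 819 GB/s vs 1.64 TB/s, is also roughly half). A quick comparison using the figures above:

```python
# Figures copied from the two spec tables above.
br100 = {"FP32": 256, "TF32+": 512, "BF16": 1024, "INT8": 2048,
         "hbm2e_gb": 64, "bus_bits": 4096}
br104 = {"FP32": 128, "TF32+": 256, "BF16": 512, "INT8": 1024,
         "hbm2e_gb": 32, "bus_bits": 2048}

ratios = {k: br104[k] / br100[k] for k in br100}
assert all(r == 0.5 for r in ratios.values())
print("BR104 is exactly half of BR100 on every listed metric")
```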

The company states that a chip with 77 billion transistors can mimic the nerve cells of the human brain, and that the chip itself will be used for DNN and AI workloads, so it is more or less intended to reduce China's dependence on NVIDIA's AI GPUs.
