IBM has detailed its next-generation Telum chip, part of the Z processor lineup, at Hot Chips 33. The Telum chip features a brand-new core architecture design that's geared toward AI acceleration.
IBM's Next-Gen Z Processor: 7nm Telum Chip With 22.5 Billion Transistors, 8 Cores, 5 GHz+ Clocks & 6+ TFLOPs AI Acceleration
According to IBM, the newly optimized Z core, along with its brand-new cache and multi-chip fabric hierarchy, enables over 40% per-socket performance growth. The Telum chip comprises a total of 8 cores, each with its own dedicated L2 cache. The chip supports SMT2, which gives 16 threads per chip, while a maximum configuration of 32 chips, for 256 cores and 512 threads, is possible with a 4-drawer system.
Clock speeds are said to be higher than 5 GHz, and the Telum chip comes with redesigned branch prediction featuring an integrated 1st/2nd-level BTB, dynamic BTB entry reconfiguration, and more than 270K branch target table entries. The private L2 cache is 32 MB per core and has a 19-cycle load-use latency (~3.8 ns, including TLB access).
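The quoted ~3.8 ns figure follows directly from the cycle count and the clock speed. The short Python sketch below checks that arithmetic, assuming a clock of exactly 5.0 GHz (the article only says "higher than 5 GHz"); the function name is illustrative.

```python
def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds."""
    # One cycle at f GHz lasts 1/f nanoseconds.
    return cycles / clock_ghz


# Assumption: exactly 5.0 GHz; a faster clock would shorten the latency.
l2_latency_ns = cycles_to_ns(19, 5.0)
print(f"L2 load-use latency: {l2_latency_ns:.1f} ns")  # 3.8 ns
```

At a clock above 5 GHz the same 19 cycles would take slightly less than 3.8 ns, which fits the "~" in the quoted figure.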
Moving over to the L3 and L4 caches, which are shared across cores, the IBM Z Telum chip packs a virtual on-chip 256 MB L3 cache and a virtual 2 GB L4 cache spanning up to 8 chips. The L2 caches are linked by a 320 GB/s dual-direction ring interconnect, whereas the L3 cache is distributed through L2 cooperation and has an average latency of 12 ns. Together, the virtual L3 and L4 caches provide 1.5x the cache per core of the previous generation.
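The virtual cache sizes line up with the per-core L2 figure. As a rough sketch, assuming the virtual L3 is simply the sum of the private L2 slices on one chip and the virtual L4 is the sum of the virtual L3s across 8 chips (the constant names are illustrative):

```python
L2_MB_PER_CORE = 32        # private L2 per core, as quoted
CORES_PER_CHIP = 8
MAX_CHIPS_SHARING_L4 = 8   # "across up to 8 chips"

# Assumption: virtual capacities are plain sums of the L2 slices.
virtual_l3_mb = L2_MB_PER_CORE * CORES_PER_CHIP
virtual_l4_mb = virtual_l3_mb * MAX_CHIPS_SHARING_L4

print(f"Virtual L3: {virtual_l3_mb} MB")          # 256 MB
print(f"Virtual L4: {virtual_l4_mb // 1024} GB")  # 2 GB
```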
AI acceleration performance is rated at over 6 TFLOPs per chip and over 200 TFLOPs in a 4-drawer system that packs 32 IBM Z chips. The internal matrix array features 128 tiles with 8-way FP16 SIMD and high-density multiply-and-accumulate FPUs, while the activation array is composed of 32 tiles with 8-way FP16/FP32 SIMD. A dual-chip configuration yields 116,000 inferences per second (at 1.1 ms latency), while a 32-chip configuration yields 3,600,000 inferences per second (at 1.2 ms latency).
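The per-chip and system-level TFLOPs figures can be cross-checked against each other. The sketch below treats both quoted numbers as lower bounds ("over 6", "over 200") and assumes throughput scales linearly with chip count; the function name is illustrative.

```python
PER_CHIP_TFLOPS = 6.0   # quoted as "over 6"
SYSTEM_TFLOPS = 200.0   # quoted as "over 200"


def system_tflops(chips: int, per_chip: float = PER_CHIP_TFLOPS) -> float:
    """Aggregate throughput, assuming perfectly linear scaling."""
    return chips * per_chip


# Chips implied by the two headline figures taken at face value.
implied_chips = SYSTEM_TFLOPS / PER_CHIP_TFLOPS
print(f"Implied chip count: {implied_chips:.1f}")      # 33.3

# The 32-chip configuration at exactly 6 TFLOPs per chip.
print(f"32 chips: {system_tflops(32):.0f} TFLOPs")     # 192
```

With each chip delivering a little more than 6 TFLOPs, a 32-chip system would clear the 200 TFLOPs mark.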
IBM Z Telum chips can be scaled up for even more performance as there are both single-chip and dual-chip modular designs. The 2-chip configuration features a chiplet design with 2 Telum chips and offers 16 cores, 32 threads, and 512 MB of cache.
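The dual-chip totals follow from the per-chip figures. A minimal sketch, assuming cores, threads, and cache all scale linearly with the number of chips (the `TelumConfig` class is a hypothetical helper, not an IBM API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TelumConfig:
    """Hypothetical helper that totals resources for a multi-chip module."""
    chips: int
    cores_per_chip: int = 8
    threads_per_core: int = 2   # SMT2
    l2_mb_per_core: int = 32

    @property
    def cores(self) -> int:
        return self.chips * self.cores_per_chip

    @property
    def threads(self) -> int:
        return self.cores * self.threads_per_core

    @property
    def cache_mb(self) -> int:
        # Total L2 capacity, which backs the virtual L3/L4 caches.
        return self.cores * self.l2_mb_per_core


dual = TelumConfig(chips=2)
print(dual.cores, dual.threads, dual.cache_mb)  # 16 32 512
```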
The AI accelerator on the IBM Z Telum chip provides:
- Very low and consistent inference latency
- Compute capacity for utilization at scale
- Variety of AI models ranging from traditional ML to RNNs and CNNs
- Security through enterprise-grade memory virtualization and protection
- Extensibility with future firmware and hardware updates
The IBM Z Telum chip will be fabricated on Samsung's 7nm process node and will have a die size of 530 mm². The chip will house 22.5 billion transistors and is aimed at enterprise & embedded workloads.