AMD Zen CPU Architecture Doubles Down On IPC And Floating Point Throughput – Details Surface In Linux Kernel Patch

Khalid Moammer

AMD has just uploaded a patch to the patchwork project detailing many aspects of its hotly anticipated Zen CPU microarchitecture.
The patch is titled "[x86_64] znver1 enablement", reaffirming that there will indeed be multiple generations of AMD's brand new CPU core. This particular patch covers only the first iteration of the core, which is coming out next year.

Before we dive into the juicy details that this patch reveals about AMD's Zen microarchitecture, I would like to invite everyone to check out some of the more exciting information that AMD revealed about the new core at its Financial Analyst Day event back in May, including the performance of the core and when it's coming out.
Moving on, I'd like to stress that the new details about Zen revealed in this patch, which we're covering in this piece, are very much legitimate. Significant microarchitectural detail about AMD's Excavator core surfaced two years prior to its launch via a patch very much like the one we're covering today.

What We Know So Far About AMD's Zen

We first broke the news about AMD's next generation high performance core back in September of last year, when AMD's then CEO Rory Read revealed the code name for the company's upcoming high performance x86 CPU microarchitecture. Prior to that revelation we only had knowledge of Zen's sister ARMv8 core, code named K12.

Recently, however, AMD has been slightly more generous in providing detail about its brand new creation. Three months ago AMD announced that it was preparing an entirely new line-up of FX CPUs and a brand new platform, "AM4". We learned that the new family of FX processors, code named "Summit Ridge", would feature an entirely new socket and an updated feature set including DDR4 memory support. More importantly, we learned that the new platform would feature mainstream CPUs with "high core counts" - rumored to be up to eight cores - and "SMT" support.

Two months later we learned that AMD was also working on a monstrous High Performance Computing APU with 16 Zen cores and a huge integrated GPU in addition to stacked High Bandwidth Memory. We also learned that AMD is planning to introduce high performance server CPUs with up to 32 Zen CPU cores. Hearing about all of those different SKUs was jolly exciting, but it was also quite frustrating as we had no idea what to expect from Zen. That is, until AMD revealed a whopper at its Financial Analyst Day earlier this year: Zen will have a 40% instructions-per-clock (IPC) improvement over its predecessor, "Excavator".

The Patch Allows Us To Get A Glimpse Into The Inner-Workings Of AMD's Next Generation High Performance x86 CPU Core "Zen"

Today, with the information that we've learned from the patch, we can get a better idea of what Zen looks like from a high-level design standpoint.
So let's dive straight into the new details that made their way into the patch. But first I'd like to give a shout-out to Matthias Waldhauer, AKA "Dresdenboy", who spotted the patch and reported on it in his blog.

Below is a quote of the most relevant code sections of the patch, the ones that we're certainly most interested in.

+;; Integer unit 4 ALU pipes.
+(define_cpu_unit "znver1-ieu0" "znver1_ieu")
+(define_cpu_unit "znver1-ieu1" "znver1_ieu")
+(define_cpu_unit "znver1-ieu2" "znver1_ieu")
+(define_cpu_unit "znver1-ieu3" "znver1_ieu")
+(define_reservation "znver1-ieu" "znver1-ieu0|znver1-ieu1|znver1-ieu2|znver1-ieu3")

+;; 2 AGU pipes.
+(define_cpu_unit "znver1-agu0" "znver1_agu")
+(define_cpu_unit "znver1-agu1" "znver1_agu")
+(define_reservation "znver1-agu-reserve" "znver1-agu0|znver1-agu1")

+;; Floating point unit 4 FP pipes.
+(define_cpu_unit "znver1-fp0" "znver1_fp")
+(define_cpu_unit "znver1-fp1" "znver1_fp")
+(define_cpu_unit "znver1-fp2" "znver1_fp")
+(define_cpu_unit "znver1-fp3" "znver1_fp")
+(define_reservation "znver1-fpu" "znver1-fp0|znver1-fp1|znver1-fp2|znver1-fp3")

This gives us a beautiful insight into what a Zen core looks like from a high-level design standpoint. Each core has four ALU pipes, two AGU pipes and four FP pipes. ALU is short for Arithmetic Logic Unit, AGU is short for Address Generation Unit and FP is short for Floating Point.
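The pipe counts can also be tallied mechanically from the quoted lines. What follows is a minimal sketch in Python, not part of the patch: the `count_units` helper and the regular expression are our own illustration, and the excerpt string simply repeats the `define_cpu_unit` lines quoted above.

```python
import re
from collections import Counter

# The define_cpu_unit lines quoted from the patch (diff "+" prefix kept).
PATCH_EXCERPT = '''
+(define_cpu_unit "znver1-ieu0" "znver1_ieu")
+(define_cpu_unit "znver1-ieu1" "znver1_ieu")
+(define_cpu_unit "znver1-ieu2" "znver1_ieu")
+(define_cpu_unit "znver1-ieu3" "znver1_ieu")
+(define_cpu_unit "znver1-agu0" "znver1_agu")
+(define_cpu_unit "znver1-agu1" "znver1_agu")
+(define_cpu_unit "znver1-fp0" "znver1_fp")
+(define_cpu_unit "znver1-fp1" "znver1_fp")
+(define_cpu_unit "znver1-fp2" "znver1_fp")
+(define_cpu_unit "znver1-fp3" "znver1_fp")
'''

# Matches (define_cpu_unit "<unit name>" "<automaton name>").
UNIT_RE = re.compile(r'\(define_cpu_unit\s+"([^"]+)"\s+"([^"]+)"\)')


def count_units(text):
    """Count declared CPU units, grouped by the automaton they belong to."""
    return Counter(automaton for _unit, automaton in UNIT_RE.findall(text))


counts = count_units(PATCH_EXCERPT)
for automaton, n in sorted(counts.items()):
    print(f"{automaton}: {n} pipes")
```

Grouping by the second string in each declaration is what yields the four integer, two AGU and four FP pipes described above.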

The four ALU pipes in this context represent the core's integer pipeline, and the four FP pipes represent the floating point pipeline inside the core's Floating Point Unit. The AGUs work in tandem with the integer front-end to facilitate communication between the ALUs and a 2-read, 1-write L1 cache, according to an AMD engineer's LinkedIn profile that Mr. Waldhauer spotted.

While all of this sounds mighty exciting, it can get confusing rather quickly. The best way to comprehend the high-level design of the core is to visualize it, and so that's exactly what we did.

High Level Design Of AMD's Next Big Core 'Zen'

If we create a diagram of the core's high-level design based on the Integer and Floating Point pipes mentioned in the patch then we get something that looks like this:

Rendition Of Zen’s High-Level Design According To The Linux Patch

For a better perspective we put Zen side by side with AMD's Steamroller. Unfortunately, the company has not published a block diagram for Excavator. However, according to what AMD revealed at this past Hot Chips symposium, Excavator should have a very similar high-level layout to Steamroller. Just a quick note to refresh everyone's memory: Steamroller is the CPU core that AMD introduced with its "Kaveri" 7000 series APUs. Excavator is Steamroller's successor and is the CPU core powering AMD's "Carrizo" 8000 series mobile APUs.

AMD Zen Steamroller Block Diagram
The first thing that is easily discernible is that there is only one integer cluster in a Zen core rather than two as in a Steamroller module. The two integer clusters in Steamroller are what form the two separate CPU cores / threads in each module. Zen takes on a more traditional AMD CPU layout resembling that of the Phenom and Athlon K8/K10 series cores, with a single integer cluster and one equally large floating point unit.

This is an important distinction because, in contrast, the Bulldozer family of cores achieved very high integer throughput but traded off floating point performance, since each pair of cores shared one floating point unit. Although that floating point unit was larger and more capable than the one found in AMD's previous K10 CPU core of the Phenom II line of chips, floating point performance was still lacking compared to integer, because the design was heavily weighted towards integer heavy server workloads.

Because Zen forgoes the CMT design of the Bulldozer family, it should end up with a single fetch unit and a single decode unit in the front end, as opposed to the dual decoders that were introduced with Steamroller. On the floating point side of things, Zen's floating point unit is significantly more complex than that of Steamroller and Excavator. Looking through the patch code, it quickly becomes clear that there's a lot of opportunistic sharing of resources between the four floating point pipes. This opportunistic sharing of the idle resources in each pipe should play an instrumental role in providing additional throughput via simultaneous multi-threading.

CPU Microarchitecture | AMD Phenom II / K10 | AMD BD/PD | AMD SR/XV | AMD Zen | Intel Skylake
Instruction Decode Width | 3-wide | 4-wide | 8-wide | 4-wide | 4-wide
Single Core Peak Decode Rate | 3 instructions | 4 instructions | 8 instructions | 4 instructions | 4 instructions
Dual Core Peak Decode Rate | 6 instructions | 4 instructions | 8 instructions | 8 instructions | 8 instructions
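
One way to read the table is that some designs share their decode hardware between two cores while others give each core its own decoders. The Python sketch below reproduces the table's numbers under that reading. The `peak_decode_rate` function and the shared/private flags are our own interpretation for illustration; the table itself only lists the widths and rates.

```python
def peak_decode_rate(decode_width, active_cores, shared_front_end):
    """Peak instructions decoded per cycle across the active cores.

    Assumption: a shared front end offers one pool of decode slots to all
    cores, while a private front end scales with the number of cores.
    """
    if shared_front_end:
        return decode_width
    return decode_width * active_cores


# (decode width from the table, shared front end?). The boolean flag is an
# interpretation of the table's numbers, not something the table states.
DESIGNS = {
    "AMD Phenom II / K10": (3, False),
    "AMD BD/PD": (4, True),
    "AMD SR/XV": (8, True),
    "AMD Zen": (4, False),
    "Intel Skylake": (4, False),
}

for name, (width, shared) in DESIGNS.items():
    single = peak_decode_rate(width, 1, shared)
    dual = peak_decode_rate(width, 2, shared)
    print(f"{name}: single core = {single}, dual core = {dual}")
```

Under this model Zen's dual core figure doubles because each core brings its own 4-wide decoder, whereas the Bulldozer family figures stay flat from one core to two.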

Interestingly, the two 128-bit FMAC units in the Bulldozer family can process one 128-bit SIMD instruction per cycle each, or they can fuse together to process a single 256-bit AVX instruction per cycle.
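
The arithmetic behind that fusing can be sketched in a few lines of Python. This is a simplified model under a stated assumption, namely that an operation wider than one unit occupies as many units as it needs for that cycle; the `simd_ops_per_cycle` name is ours and does not come from AMD or from the patch.

```python
def simd_ops_per_cycle(num_units, unit_width_bits, op_width_bits):
    """Peak SIMD operations per cycle when narrow units can be ganged.

    Assumption: an operation wider than one unit ties up
    ceil(op_width / unit_width) units for the cycle.
    """
    units_per_op = -(-op_width_bits // unit_width_bits)  # ceiling division
    return num_units // units_per_op


# Two 128-bit FMAC units, as described for the Bulldozer family.
print(simd_ops_per_cycle(2, 128, 128))  # each unit handles its own 128-bit op
print(simd_ops_per_cycle(2, 128, 256))  # both units fuse for one 256-bit op
```

Either way the pair moves 256 bits of SIMD work per cycle; fusing changes how that work is packaged, not how much of it there is.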

The patch code indicates that this capability to fuse and process larger SIMD instructions has been carried over to Zen, as the FP3 pipe can converge with the FP0 and/or the FP1 pipes to process AVX256 instructions. However, this isn't enough to make the core compatible with the AVX512 instruction set extension, which is currently only supported by Intel's Knights Landing Xeon Phi microarchitecture.

The wider floating point unit also means that Zen will be able to process less complex instructions at a much faster rate than Steamroller. That would translate to a boost in floating point performance, an area where AMD had historically excelled with Phenom II and other microarchitectures prior to Bulldozer.
I should mention that all the instruction set extensions that Zen supports have been published in a previous Linux enablement patch.

There was also one particularly important improvement with Zen that Mr. Waldhauer has managed to spot in a number of patents filed by AMD CPU engineers working on Zen.

A lot of the new functionality has been filed for patenting. For example there was a mention of checkpointing, which is good for quick reversion of mispredicted branches and other reasons for restarting the pipelines. Some patents suggest, that Zen might use some slightly modified Excavator branch prediction.

The branch misprediction penalty on the Bulldozer family of cores was a particularly significant one due to the deeply pipelined nature of the microarchitecture. Intel's Sandy Bridge, which was introduced to the market in the same year as Bulldozer, had an equally deep pipeline. However, with Sandy Bridge Intel introduced a micro-op cache, which contributed significantly to reducing the performance penalty of mispredicting a branch. Zen should be AMD's first CPU core to see the introduction of a technology focused solely on reducing branch misprediction penalties. And while it may not be too similar to the solution on Sandy Bridge, it will serve the same purpose.

In summary, compared to the Bulldozer family of cores, Zen has considerably more floating point throughput as well as a better way of handling mispredicted branches, coupled with a more streamlined front-end as well as a faster and more efficient cache sub-system. All of these combined have undoubtedly contributed to the massive 40% IPC improvement that AMD announced back in May.
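
To put the 40% figure in perspective, single-threaded performance scales roughly with IPC multiplied by clock speed. The Python sketch below works through that arithmetic. The `relative_performance` helper and the 0.9 clock ratio are hypothetical values chosen for illustration; AMD has not disclosed Zen's clock speeds.

```python
def relative_performance(ipc_gain, clock_ratio=1.0):
    """Speedup over the old core when performance scales as IPC x clock.

    ipc_gain is the fractional IPC improvement (0.40 for 40%) and
    clock_ratio is new clock / old clock (1.0 means the same clock speed).
    """
    return (1.0 + ipc_gain) * clock_ratio


# Same clock speed as the predecessor.
print(round(relative_performance(0.40), 2))
# Hypothetical case: the new core clocks 10% lower than the old one.
print(round(relative_performance(0.40, 0.9), 2))
```

The takeaway is that the IPC gain only carries over in full if clock speeds hold steady; any clock deficit eats into it proportionally.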

Zen IPC Gain
Zen will also be manufactured using a significantly faster, more power efficient manufacturing process with twice the transistor density of the current 28SHP process used for Steamroller and Excavator based APUs. The process also enables much better scalability from high performance enthusiast FX CPUs to low power APUs.

Finally, we should see the new core debut with a new set of enthusiast FX processors scheduled to come out in 2016 on the AM4 socket, with Zen based server chips and mainstream APUs set to follow in 2017.

