
AMD Details Carrizo APU's Energy-Efficient Design at Hot Chips 2015 – 28nm Bulk High Density Design With 3.1 Billion Transistors, 250mm2 Die


AMD announced its 6th generation Carrizo APU platform three months ago at Computex 2015. During the launch, we did a very brief technical analysis of the new Carrizo APU design, but AMD has offered even more information at its Hot Chips 2015 presentation, which focuses on the energy-efficient design of the Carrizo APU, built on a 28nm process.

AMD's 6th Generation Carrizo APUs Officially Launched and Detailed

Starting with the basic features, Carrizo is based on the 28nm process node and comes in the FP4 package. The Carrizo chips feature four x86 Excavator cores with 2 MB of L2 cache and an integrated 3rd generation GCN GPU that packs 8 graphics compute units (512 stream processors) and 2 RBs. The chips support dual-channel DDR3 memory at speeds of up to 2133 MHz and are designed to fully support the HSA 1.0 spec. The chips also integrate the southbridge on die and offer several I/O technologies along with new software tier support that we will detail in just a bit.

The Carrizo APU features a nominal 5-15% IPC gain from the new Excavator cores. This shows AMD following in Intel's footsteps, with the blue team offering a similar IPC improvement on its latest 14nm Broadwell microarchitecture while focusing on energy efficiency to make its designs better suited to efficient PCs and low-power solutions. AMD used the 28nm Bulk High Density node to build Carrizo and optimized the overall chip design by packing in 29% more transistors than Kaveri, making it denser thanks to the high-density design library. The result is a 3.1 billion transistor die whose Excavator cores use 40% less power and 23% less area than their Steamroller predecessors.

The AMD Carrizo APU packs 12 compute cores, a combination of CPU and GPU cores that are geared towards compute and work in harmony under the HSA 1.0 architecture. There are up to four x86 Excavator cores and an 8 CU GPU (64 stream processors per CU). Hardware H.265 (HEVC) decode support is new, and AMD claims 3.5 times the transcode performance of Kaveri, while the 8 GCN compute units (512 stream processors) see a 20% reduction in power consumption. The SoC design offers up to 3 display heads with support for 4K (UHD) resolution and features a separate integrated security co-processor.
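The "12 compute cores" and "512 stream processors" figures both follow from simple arithmetic on the numbers above. The snippet below is only an illustrative sketch; the function and constant names are our own and do not come from AMD.

```python
# Figures quoted in the article for the top Carrizo configuration.
CPU_CORES = 4                  # x86 Excavator cores
GPU_COMPUTE_UNITS = 8          # GCN compute units (CUs)
STREAM_PROCESSORS_PER_CU = 64  # stream processors in each CU


def compute_cores(cpu_cores: int, gpu_cus: int) -> int:
    """'Compute cores' as counted here: CPU cores plus GPU compute units."""
    return cpu_cores + gpu_cus


def stream_processors(gpu_cus: int, per_cu: int = STREAM_PROCESSORS_PER_CU) -> int:
    """Total stream processors across all compute units."""
    return gpu_cus * per_cu


total_compute_cores = compute_cores(CPU_CORES, GPU_COMPUTE_UNITS)
total_stream_processors = stream_processors(GPU_COMPUTE_UNITS)
print(total_compute_cores, total_stream_processors)  # prints: 12 512
```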

Turning specifically to the Excavator cores, we get improved and larger caches along with prefetch improvements and lower latency. Branch prediction improves through a 50% larger branch target buffer (768 entries, up from 512), and the FPU gets an accelerated flush. New instruction support includes AVX2, MOVBE, SMEP and BMI1/2, along with more power gating options to cut power when the chip is idle or not fully utilized. The most significant frequency gains come to the 15W models, which get a 25-45% frequency increase on top of a 10% IPC increase, while the 35W models rely mainly on the IPC gains, with clock speed bumps of just 0-5%.
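To put the 15W numbers in perspective, here is a rough sketch that checks the branch target buffer growth and estimates a combined speedup. It assumes performance scales as IPC times frequency, which is our simplification and not an AMD figure; real workloads that are memory-bound will see less. All names are hypothetical.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


def combined_speedup(freq_gain_pct: float, ipc_gain_pct: float) -> float:
    """Idealized speedup, ASSUMING performance = IPC x frequency."""
    return (1 + freq_gain_pct / 100.0) * (1 + ipc_gain_pct / 100.0)


# Branch target buffer: 512 -> 768 entries.
btb_growth = percent_increase(512, 768)

# 15W parts: 25-45% higher frequency with about 10% higher IPC.
low_estimate = combined_speedup(25, 10)
high_estimate = combined_speedup(45, 10)

print(f"BTB growth: {btb_growth:.0f}%")
print(f"Idealized 15W speedup: {low_estimate:.2f}x to {high_estimate:.2f}x")
```

Under that assumption, the 15W parts would land somewhere around 1.4x to 1.6x of the previous generation in the best case.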

In terms of size, the Carrizo die measures 250.04mm2 on the 28nm BHD node while Kaveri measures 245mm2 on the same process. The difference between the two chips is that Carrizo ups the transistor count to 3.1 billion from Kaveri's 2.41 billion. Fitting that many more transistors, and better x86 performance, into nearly the same die area was possible because Excavator cores are smaller than Steamroller cores, measuring just 14.48mm2 with a core transistor count of 102 million. The L1 data cache has also doubled on Carrizo to 32 KB per core from 16 KB. The overall core structure has 690 million transistors crammed into one partition, while the rest of the transistors are dedicated to GCN cores that put HSA and the compute engine to use in general-purpose computing environments.
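The density gain can be worked out directly from the die sizes and transistor counts quoted above. This is a quick back-of-the-envelope sketch; the dictionary layout and function names are ours.

```python
# Die figures as quoted in the article.
CARRIZO = {"transistors": 3.1e9, "area_mm2": 250.04}
KAVERI = {"transistors": 2.41e9, "area_mm2": 245.0}


def density_millions_per_mm2(chip: dict) -> float:
    """Average transistor density in millions of transistors per mm2."""
    return chip["transistors"] / chip["area_mm2"] / 1e6


def growth_pct(old: float, new: float) -> float:
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


carrizo_density = density_millions_per_mm2(CARRIZO)
kaveri_density = density_millions_per_mm2(KAVERI)
transistor_growth = growth_pct(KAVERI["transistors"], CARRIZO["transistors"])
area_growth = growth_pct(KAVERI["area_mm2"], CARRIZO["area_mm2"])

# Roughly 12.4 vs 9.8 million transistors per mm2.
print(f"Density: {carrizo_density:.1f} vs {kaveri_density:.1f} M/mm2")
# About 29% more transistors in about 2% more area.
print(f"Transistors: +{transistor_growth:.1f}%, area: +{area_growth:.1f}%")
```

In other words, the 29% transistor increase comes almost entirely from higher density rather than a larger die.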

AMD is also updating the GCN architecture, with 3rd generation GCN cores integrated inside Carrizo. These carry the same architectural enhancements featured on the Tonga and Fiji GPUs. The iGPU has 512 KB of L2 cache, 819 GFLOPS of compute performance and HSA acceleration via ATC. Features available on the new GCN unit include DirectX 12 support (feature level 12_0), improved tessellation performance, lossless delta color compression, an updated ISA, a high-quality scaler unit and a cache-coherent fabric interface. The GCN core also offers HSA acceleration with QoS (wavefront/compute preemption and context switching). Since this is the latest design, as featured on Tonga and Fiji, we are looking at an updated ISA with a fully cache-coherent fabric interface to the L2 cache that is dedicated to the graphics core. Surprisingly, AMD retains the full 8 ACEs (Asynchronous Compute Engines) on the Carrizo APU, the same number found on the Tonga GPU. The design also carries efficiency features in the form of graphics voltage islands, low-power implementation, efficient power gating and GPU adaptive clocking.
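The 819 GFLOPS figure lines up with the usual way peak single-precision throughput is quoted for GCN parts. The sketch below assumes each stream processor retires one fused multiply-add per clock, counted as 2 FLOPs. The GPU clock is not stated in this article, so we back it out from the quoted number rather than assert it.

```python
STREAM_PROCESSORS = 512
FLOPS_PER_SP_PER_CLOCK = 2  # ASSUMPTION: one FMA per clock, counted as 2 FLOPs
QUOTED_GFLOPS = 819         # figure quoted in the article


def peak_gflops(sps: int, clock_mhz: float,
                flops_per_clock: int = FLOPS_PER_SP_PER_CLOCK) -> float:
    """Theoretical peak throughput in GFLOPS."""
    return sps * flops_per_clock * clock_mhz / 1000.0


def implied_clock_mhz(gflops: float, sps: int,
                      flops_per_clock: int = FLOPS_PER_SP_PER_CLOCK) -> float:
    """GPU clock (MHz) that would produce the given peak GFLOPS."""
    return gflops * 1000.0 / (sps * flops_per_clock)


clock = implied_clock_mhz(QUOTED_GFLOPS, STREAM_PROCESSORS)
print(f"Implied GPU clock: about {clock:.0f} MHz")
print(f"Peak at 800 MHz: {peak_gflops(STREAM_PROCESSORS, 800):.1f} GFLOPS")
```

Under this assumption the quoted figure corresponds to a GPU clock of roughly 800 MHz.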

The Graphics Voltage Island moves the graphics block, about 33% of the die, to a separate voltage plane, away from the fabric and multimedia IPs. This new design allows independent voltage and frequency control based on graphics application activity. It makes a significant difference at steady state between the SoC and the graphics block under gaming load, since the graphics engine can request a specific voltage for itself based on application demands. The new color compression feature also saves bandwidth by compressing data before it is moved. The compression is said to remove about 40% of the usual DRAM bandwidth load for graphics rendering. The graphics engine reads and writes compressed data, compressing it during cache flush and decompressing it when needed at read return. This yields a 5-7% improvement in games for a modest silicon area increase of 0.2%.
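Why does a separate voltage plane matter? A rough sketch using the standard CMOS dynamic power approximation (power proportional to capacitance times voltage squared times frequency) shows how sensitive power is to voltage. The voltage and frequency ratios below are hypothetical illustration values, not AMD's numbers, and leakage power is ignored.

```python
def relative_dynamic_power(voltage_ratio: float, frequency_ratio: float) -> float:
    """Dynamic power relative to baseline, using P ~ C * V^2 * f.

    Capacitance and activity factor are assumed unchanged, so they cancel.
    """
    return voltage_ratio ** 2 * frequency_ratio


# HYPOTHETICAL case: the graphics island runs at 90% of the shared rail
# voltage while holding the same clock.
same_clock = relative_dynamic_power(0.90, 1.00)

# HYPOTHETICAL case: a light graphics load also allows a 20% lower clock.
lower_clock = relative_dynamic_power(0.90, 0.80)

print(f"Same clock, 10% lower voltage: {same_clock:.2f}x baseline power")
print(f"20% lower clock as well:       {lower_clock:.2f}x baseline power")
```

Because voltage enters as a square, letting the graphics block drop its own voltage independently of the rest of the SoC pays off more than a frequency reduction of the same size.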

The most interesting thing about Carrizo, aside from its technical specifications, is the design of the chip itself. For the first time, AMD is aiming for a true SoC design, eliminating the need for a separate FCH as was the case with Kaveri mobile, which requires the Bolton FCH for additional connectivity options. The FCH is integrated on the die itself and delivers Security, Display, Audio, PCI-e, SATA, SD, USB, Multimedia, UART/I2C, ClkGen and Misc I/O connectivity. Carrizo includes UVD6, VCE3 with H.264 encode and an audio co-processor, and features the display control engine "DCE11". With up to 3 display interfaces including HDMI 2.0, PCI-e Gen 3.0 x8 for discrete GPU expansion and PCI-e 3.0 x4 for GPP, the APU begins to look like a decent improvement over Kaveri from a design perspective. The FCH can deliver 4 USB 3.0 / 2.0 ports, 4 USB 2.0 ports and 2 SATA 3 ports, while the memory controller allows for dual-channel DDR3 memory rated at 2133 MHz in SO-DIMM form factor (one per channel). The AMD Carrizo APUs have been shipping to OEMs for months now and several devices based on these chips are already available in the market.
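For a sense of scale, the theoretical peak bandwidth of these interfaces can be estimated as follows. This sketch assumes the "2133 MHz" rating means DDR3-2133 (2133 million transfers per second) on standard 64-bit channels, and uses the PCIe 3.0 signaling rate of 8 GT/s per lane with 128b/130b encoding. Real-world throughput will be lower, and the function names are ours.

```python
def ddr_peak_gb_per_s(mega_transfers_per_s: float, channels: int,
                      bus_bytes: int = 8) -> float:
    """Peak DRAM bandwidth in GB/s (64-bit = 8-byte channel by default)."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


def pcie3_peak_gb_per_s(lanes: int) -> float:
    """Peak one-direction PCIe 3.0 bandwidth in GB/s.

    8 GT/s per lane, 128b/130b line encoding, 8 bits per byte.
    """
    per_lane_bytes_per_s = 8e9 * (128 / 130) / 8
    return per_lane_bytes_per_s * lanes / 1e9


memory_bw = ddr_peak_gb_per_s(2133, channels=2)
gpu_link_bw = pcie3_peak_gb_per_s(8)  # x8 link for a discrete GPU
gpp_link_bw = pcie3_peak_gb_per_s(4)  # x4 general-purpose link

print(f"Dual-channel DDR3-2133: {memory_bw:.1f} GB/s")
print(f"PCIe 3.0 x8: {gpu_link_bw:.2f} GB/s, x4: {gpp_link_bw:.2f} GB/s")
```

The CPU cores and the integrated GPU share that roughly 34 GB/s of memory bandwidth, which is part of why bandwidth-saving features such as color compression matter on an APU.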

[Image: AMD Carrizo APU die]