[IDF15] Intel’s 6th Gen Skylake Unwrapped – CPU Microarchitecture, Gen9 Graphics Core and Speed Shift Hardware P-State
Intel Skylake GPU Architecture Analysis
Moving on, we have Intel's graphics architecture analysis, which is based around their Gen9 GPU. The details cover the key information and insights on Skylake's CPU and GPU architecture. Right now, we can show you the full details of the Skylake Gen9 graphics architecture, which is divided into three *initial* tiers: GT2, GT3/e and GT4/e.
The details Intel provides on their Skylake iGPUs include a block diagram of the Core i7-6700K processor, which launched earlier this month as the flagship offering on the Z170 platform. The chip, which Intel terms an SOC (System on Chip), houses four CPU cores with a shared LLC cache. These are interconnected through the SOC Ring, a bi-directional, 32-byte wide bus that also connects to the iGPU and the System Agent. All memory transactions to and from the CPU cores and the iGPU are handled by this SOC Ring, through the System Agent and the unified DRAM controller. Intel would like to call it an SOC, but it is in fact a semi-SOC, since the PCH (Southbridge) is still housed on the motherboard, while the Northbridge has been on the CPU for quite some time now. Intel has saved a lot of die space with their 14nm CPUs, but moving the PCH onto the chip itself would require a lot of room, even more so if eDRAM is going to sit on the side of the chip package. A suitable example is the Broadwell Core i7-5775C, which houses 128 MB of eDRAM cache on the package alongside the main die. That, along with the main die's own transistors, takes up a lot of room and leaves little space for any further additions to the package.
Some of the key improvements and changes for the Skylake Gen9 graphics include:
Intel Skylake Gen9 Graphics Features:
Gen9 Memory Hierarchy Refinements:
- Coherent SVM write performance is significantly improved via new LLC cache management policies.
- The available L3 cache capacity has been increased to 768 Kbytes per slice (512 Kbytes for application data).
- The sizes of both L3 and LLC request queues have been increased. This improves latency hiding to achieve better effective bandwidth relative to the architecture's theoretical peak.
- In Gen9, EDRAM now acts as a memory-side cache between LLC and DRAM. Also, the EDRAM memory controller has moved into the system agent, adjacent to the display controller, to support power efficient and low latency display refresh.
- Texture samplers now natively support an NV12 YUV format for improved surface sharing between compute APIs and media fixed function units.
Gen9 Compute Capability Refinements:
- Preemption of compute applications is now supported at a thread level, meaning that compute threads can be preempted (and later resumed) midway through their execution.
- Round robin scheduling of threads within an execution unit.
- Gen9 adds new native support for the 32-bit float atomic operations of min, max, and compare/exchange. Also, the performance of all 32-bit atomics is improved for kernel scenarios that issue multiple atomics back to back.
- 16-bit floating point capability is improved with native support for denormals and gradual underflow.
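To see what denormals and gradual underflow buy in practice, here is a minimal Python sketch that emulates IEEE 754 half precision on the CPU with the standard library's `struct` module. It does not run on Gen9 hardware. The helper names `to_fp16` and `flush_to_zero` are our own, introduced only for illustration, and the flush-to-zero path is an assumed model of hardware without denormal support.

```python
import struct

FP16_MIN_NORMAL = 2.0 ** -14     # smallest normal half-precision value
FP16_MIN_SUBNORMAL = 2.0 ** -24  # smallest denormal (subnormal) value


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value, keeping denormals."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def flush_to_zero(x: float) -> float:
    """Hypothetical model of hardware WITHOUT denormal support:
    any result below the smallest normal value collapses to zero."""
    y = to_fp16(x)
    return 0.0 if abs(y) < FP16_MIN_NORMAL else y


# Two distinct, perfectly normal FP16 values that sit close together.
a = to_fp16(1.5 * FP16_MIN_NORMAL)
b = to_fp16(1.0 * FP16_MIN_NORMAL)

gradual = to_fp16(a - b)        # difference lands in the denormal range
flushed = flush_to_zero(a - b)  # difference is lost entirely

print("a != b:", a != b)
print("a - b with gradual underflow:", gradual)
print("a - b with flush-to-zero:    ", flushed)
```

With gradual underflow, two different values never produce a difference of zero, which is the property that native denormal support preserves for 16-bit kernels.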
Gen9 Product Configuration Flexibility:
- Gen9 has been designed to enable products with 1, 2 or 3 slices.
- Gen9 adds new power gating and clock domains for more efficient dynamic power management. This can particularly improve low power media playback modes.
Intel Skylake GT2/e Graphics With 24 EUs and Optional eDRAM
The first Gen9 graphics core that we are going to talk about is GT2, which is found in the HD Graphics 530 featured on both unlocked Skylake desktop processors. As was the case with the Haswell processors, slices of the new graphics block can be added or removed to form different graphics SKUs for a range of products. Each slice is composed of several subslices, which in turn contain the foundation of the Gen9 graphics block, the EU (Execution Unit). Each EU is an SMT / IMT combination (Simultaneous and Fine-Grained Interleaved Multi-Threading) with multiple SIMD ALUs shared across multiple threads. Each thread gets 128 general purpose registers of 32 bytes each, which makes a 4 Kbyte general purpose register file (GRF) per thread. Since Gen9 has 7 threads per EU, this amounts to 28 Kbytes of GRF on each EU. The computation is handled by a pair of SIMD FPUs, each of which can execute up to four 32-bit floating-point or integer operations, or up to eight 16-bit floating-point or integer operations, while retaining FP64 double precision compute capability. The integration of 16-bit floating point is new in Skylake processors, with twice the operational speed of FP32, a similar path to what NVIDIA is planning to take with their Pascal graphics processors next year.
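The per-EU numbers above can be checked with a few lines of arithmetic. The sketch below is ours, not Intel's. It reads the 128 registers as a per-thread figure, which is what makes the 4 Kbyte and 28 Kbyte totals work out. It also makes two assumptions that the text does not state: a multiply-add is counted as two floating-point operations, and FP16 runs on twice as many lanes as FP32 (our reading of "twice the operational speed").

```python
# Figures quoted in the text above.
REGISTERS_PER_THREAD = 128  # general purpose registers
BYTES_PER_REGISTER = 32
THREADS_PER_EU = 7
FPUS_PER_EU = 2
FP32_LANES_PER_FPU = 4

# Assumptions (not stated in the text).
OPS_PER_MULTIPLY_ADD = 2                     # multiply-add counted as two ops
FP16_LANES_PER_FPU = 2 * FP32_LANES_PER_FPU  # "twice the speed of FP32"


def grf_bytes_per_thread() -> int:
    """Register file size seen by a single hardware thread."""
    return REGISTERS_PER_THREAD * BYTES_PER_REGISTER


def grf_bytes_per_eu() -> int:
    """Total register file size across all threads of one EU."""
    return grf_bytes_per_thread() * THREADS_PER_EU


def flops_per_clock_per_eu(lanes_per_fpu: int) -> int:
    """Peak floating-point operations one EU can issue per clock."""
    return FPUS_PER_EU * lanes_per_fpu * OPS_PER_MULTIPLY_ADD


print("GRF per thread:", grf_bytes_per_thread() // 1024, "Kbytes")
print("GRF per EU:    ", grf_bytes_per_eu() // 1024, "Kbytes")
print("FP32 ops/clock/EU:", flops_per_clock_per_eu(FP32_LANES_PER_FPU))
print("FP16 ops/clock/EU:", flops_per_clock_per_eu(FP16_LANES_PER_FPU))
```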
The GT2 graphics chip has three subslices with 8 EUs per subslice. This makes a 24 EU slice, which is connected to the LLC cache through the SOC Ring we talked about earlier. Each subslice comes with a local thread dispatcher, and the subslices together form a single unified slice. They also come with the sampler (read-only memory fetch) unit that is used for the sampling of tiled (or non-tiled) texture and image surfaces. The sampler comes with its own L1 and L2 cache. The data port is a memory load/store unit, while the GTI (Graphics Technology Interface) works as a gateway between the Gen9 iGPU and the rest of the chip.
The HD Graphics 530 housed inside the Core i7-6700K has a base clock of 350 MHz and a boost clock of 1150 MHz. With 24 Execution Units and an improved design, we have seen an increase in overall graphics performance. More surprisingly, the 6700K die also houses an optional eDRAM controller that can support 64 MB to 128 MB of eDRAM (L4) cache at frequencies of up to 1.6 GHz to increase bandwidth, reduce latency and improve performance on faster iGPUs. Such high-bandwidth memory is quite unnecessary on a 6700K class processor, but the optional controller does suggest that we may find some GT2 chips with embedded DRAM.
Intel Skylake GT3/e Graphics With 48 EUs and eDRAM
The Skylake GT3 graphics also come in two variants, one with eDRAM and one without it. The chip houses 48 Execution Units that are partitioned into two slices, each slice consisting of three subslices with 8 EUs per subslice. Each slice has its own L3 data cache and a unified memory interface. The GT3e variants that come later this year will house up to 64 MB of L4 cache.
Intel Skylake GT4/e Graphics With 72 EUs and eDRAM
Finally, we have the fastest graphics chip that Intel has ever made, the GT4 class integrated graphics chip. This Gen9 core combines three slices of 24 EUs, where each slice is composed of three subslices of 8 EUs each. With three L3 data caches combined through the unified local memory interface in a large package, the chip will house up to 128 MB of L4 cache (eDRAM). The chip will be featured on Iris Pro class processors, with increased graphics performance compared to traditional iGPU based processors. At a 1 GHz clock, the GT4e graphics chip can pump out 1152 GFlops of compute performance, and that is without taking into account the performance of the processing cores.
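The way slices scale into the three tiers, and where the 1152 GFlops figure comes from, can be sketched in a few lines of Python. The `Gen9Config` class and its method names are hypothetical, made up for this illustration. The figure of 16 FP32 operations per clock per EU is our assumption (two FPUs, four lanes each, with a multiply-add counted as two operations), not something spelled out in the text above.

```python
from dataclasses import dataclass

SUBSLICES_PER_SLICE = 3
EUS_PER_SUBSLICE = 8
# Assumption: 2 FPUs x 4 FP32 lanes x 2 ops (multiply-add) per EU per clock.
FP32_OPS_PER_CLOCK_PER_EU = 16


@dataclass(frozen=True)
class Gen9Config:
    """Hypothetical model of a Gen9 product tier built from whole slices."""
    name: str
    slices: int

    @property
    def eus(self) -> int:
        return self.slices * SUBSLICES_PER_SLICE * EUS_PER_SUBSLICE

    def peak_fp32_gflops(self, clock_mhz: int) -> float:
        """Theoretical peak: EUs x ops per clock x clock (MHz -> GHz)."""
        return self.eus * FP32_OPS_PER_CLOCK_PER_EU * clock_mhz / 1000


GT2 = Gen9Config("GT2", slices=1)
GT3 = Gen9Config("GT3", slices=2)
GT4 = Gen9Config("GT4", slices=3)

for cfg in (GT2, GT3, GT4):
    gflops = cfg.peak_fp32_gflops(1000)
    print(f"{cfg.name}: {cfg.eus} EUs, {gflops:.0f} GFlops at 1 GHz")
```

Under these assumptions, `GT4.peak_fp32_gflops(1000)` gives the 1152 GFlops quoted above, and `GT2.peak_fp32_gflops(1150)` gives 441.6 GFlops for the HD 530 at its 1150 MHz boost clock.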
We have seen the Core i7-6700K and Core i5-6600K iGPUs come close to the really low-end discrete graphics cards available, such as the Radeon R5 230 or the GeForce GT 720/730, yet neither NVIDIA nor AMD has released any low-end discrete parts in their latest Radeon 300 series and GeForce 900 series lineups. Now AMD has a reason for not releasing such cards, as the APUs they ship already come with GCN powered graphics, but discrete graphics card market share in major markets such as China has fallen drastically for these two GPU makers, and more people are buying or aiming at high-end graphics cards. This raises the question of whether the low-end discrete graphics market is going to end in a couple of years.
Neither AMD nor NVIDIA has focused on performance improvements in the low-end sector. Their cards below the Radeon R7 360/260 or the GTX 950/750 are quite poor in terms of performance and retail at around $50 US, so a user who goes with an integrated graphics solution is better off, since that $50 US is saved for additional upgrades. The market here was wide open for AMD and Intel to leverage their own iGPUs, and they have done so, but Intel seems to be doing it better. The chips Intel currently offers with this level of graphics are only found on expensive processors, but within a few generations we could see the same or better performance scale down to mid-tier chips that are the equivalent of an AMD APU retailing at around $150 US - $200 US. This would give users a decent processor with graphics performance on par with a GeForce GTX 750 Ti class graphics card, which would make for an ideal budget PC build with low power consumption.
The concept of onboard graphics chips has been around for a long time. There was a time when all three giants used to integrate these chips on motherboards, but with the shrinking desktop market, they had to revise their GPU strategies and focus their paths. NVIDIA went the discrete and mobility route, ATi merged with AMD and offered discrete solutions, mobility solutions and APUs, while Intel went all out on the integrated route. Now when we look at the discrete graphics card market, we see NVIDIA dominating with around 70-75% and AMD holding 25-30% up till Q4 '14 (based on figures compiled by Beyond3D). However, when we take a look at overall GPU market share, which includes discrete, integrated and mobility chips, we see a bigger divide. The figures from Jon Peddie Research up till Q4 '14 show AMD at 13.61% market share, NVIDIA at 15% and Intel dominating the market with an insane 71.39%.
Now we know these numbers are not representative of the existing market, as we have since seen AMD and NVIDIA launching insanely high-performance graphics cards and Intel introducing two new architectures with new graphics capabilities. But even if AMD and NVIDIA do gain graphics share, it won't come anywhere near Intel's dominance in the market, and do note that Intel is known for their processors and chipsets, not their graphics chips. So what led to this gain? Integration across the board: Intel has secured lots of AIB in the mobility world, and that is where the market growth is. In just a short span of time, Intel is on par with AMD's Carrizo APU, which houses their latest Excavator CPU cores and GCN 1.2 graphics core. Intel has a recipe for disaster cooked up for AMD and NVIDIA in the entry to mid-range mobility market, and possibly even in the discrete GPU market when we look into the next 3-5 years. But don't be surprised: Intel has certain limits to what they can do on existing silicon. Unlike discrete GPUs, integrated solutions have to share their cooling and power budget with the CPU, and the demand keeps on increasing. Intel certainly won't be able to tackle AMD or NVIDIA in the mid to high-end discrete GPU market unless they come up with their own solution, unlike the failed Larrabee, which now serves as the foundation of their Xeon Phi (MIC HPC Accelerators) line.
|Chip Name|GPU Core|GFlops (GPU Only)|GFlops (Whole Package)|
|---|---|---|---|
|AMD Radeon R7 360|Tobago Pro|1536 GFlops|N/A|
|NVIDIA GeForce GTX 750 Ti|Maxwell GM107|1389 GFlops|N/A|
|AMD Radeon R7 250X|Cape Verde XT|1216 GFlops|N/A|
|Intel Skylake Gen9 GT4/e|Intel Iris Pro 580|1152 GFlops @ 1 GHz|TBC|
|NVIDIA GeForce GTX 750|Maxwell GM107|1044 GFlops|N/A|
|AMD Radeon R9 M370X|Venus XT|992 GFlops|N/A|
|Intel Skylake Gen9 GT3/e|Intel Iris 560/570?|884 GFlops (Estimation)|TBC|
|AMD Carrizo FX-8800P|GCN 1.2|819 GFlops|1070 GFlops|
|Intel Core i7-5775C|Intel Iris Pro 6200|768 GFlops @ 1 GHz|883 GFlops|
|AMD Kaveri A10-7850K|GCN 1.1|737 GFlops|856 GFlops|
|Intel Core i7-5557U|Intel Iris 6100|724 GFlops|845 GFlops|
|AMD Richland A10-6800K|VLIW4|648 GFlops|779 GFlops|
|Intel Skylake Core i7-6700K|Intel HD 530|442 GFlops|TBC|
|Intel Haswell Core i7-4790K|Intel HD 4600|400 GFlops|512 GFlops|
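To put the "GTX 750 Ti class" comparison into numbers, the short Python sketch below normalizes a few of the GPU-only figures from the table against the GeForce GTX 750 Ti. The values are copied from the table; the `relative_to` helper is our own illustration. Keep in mind that these are theoretical peak figures, so the ratios only hint at relative compute throughput and say nothing definitive about game performance.

```python
# GPU-only GFlops figures copied from the table above (a subset of rows).
GPU_GFLOPS = {
    "NVIDIA GeForce GTX 750 Ti": 1389,
    "Intel Skylake Gen9 GT4/e": 1152,
    "Intel Skylake Gen9 GT3/e": 884,   # listed as an estimation
    "AMD Carrizo FX-8800P": 819,
    "Intel Core i7-5775C": 768,
    "Intel Skylake Core i7-6700K": 442,
}
BASELINE = "NVIDIA GeForce GTX 750 Ti"


def relative_to(baseline: str, table: dict) -> dict:
    """Express every entry as a fraction of the baseline's GFlops."""
    base = table[baseline]
    return {name: gflops / base for name, gflops in table.items()}


ratios = relative_to(BASELINE, GPU_GFLOPS)
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} {ratio:6.1%} of {BASELINE}")
```

On paper, the GT4/e part lands at roughly 83% of the GTX 750 Ti, while the HD 530 in the Core i7-6700K sits at about a third.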
All this talk goes to show that the integrated graphics cores from AMD and Intel might actually change the discrete market as we know it by reducing the need for entry-level discrete graphics. AMD also has future plans to scale multi-TFlops GPUs down into APUs, which are termed HPC APUs. Designed specifically for the server space, these high-end "TFlops class" SOCs from Intel, AMD and even NVIDIA, if they get to enhance GPGPU performance alongside their Denver CPUs, could rival the likes of high-end discrete cards. Now call me skeptical, but I don't expect to see these parts for several years. Once they do arrive, things are going to become a lot more interesting in the graphics world.