NVIDIA’s 64-Bit Denver CPU Architecture Details Unveiled – Dual Custom ARMv8 Cores Clocked at 2.50 GHz
At Hot Chips, NVIDIA has unveiled the first architecture details of its custom-designed 64-bit Denver CPU, which is also its first high-performance SOC design. It has been almost eight months since NVIDIA launched the Tegra K1 SOC, which pairs a Cortex-A15 processor with 192 Kepler GPU cores and which NVIDIA positions ahead of competing chips in both performance and power efficiency.
NVIDIA's 64-Bit Denver CPU Architecture Details Unveiled
The first Tegra K1 variant, based on the 32-bit ARM Cortex-A15 core, has made a name for itself and appears in popular devices such as the Xiaomi MiPad and the NVIDIA Shield Tablet, which is the company's reference design and its latest Shield-branded gaming device. However, we have known since launch that there would be two variants of the Tegra K1 SOC: one with the 32-bit ARM core and the other with the 64-bit Denver CPU. Theoretically, Project Denver’s dual-core design should be much more powerful than the 4+1 Cortex-A15 based variant. The ‘Super Dual Core’, as NVIDIA calls it, is a highly efficient design built on ARMv8-A, the first version of the ARM architecture to support 64-bit. A major indicator of its power efficiency is that, while the 4+1 variant includes a low-power core for non-intensive applications, the Denver variant has only the two cores.
Denver is a dual-core design at heart, featuring a 7-way superscalar microarchitecture paired with 192 Kepler GPU cores. It includes a 128 KB 4-way L1 instruction cache, a 64 KB 4-way L1 data cache and a 2 MB 16-way L2 cache. Denver also makes use of the new Dynamic Code Optimization, which converts frequently used software routines into dense, highly tuned microcode-equivalent routines. For this purpose, a 128 MB optimization cache has been set aside in main memory, which reduces the need to re-optimize software routines.
As part of the Dynamic Code Optimization process, Denver looks across a window of hundreds of instructions and unrolls loops, renames registers, removes unused instructions, and reorders the code in various ways for optimal speed. This effectively doubles the performance of the base-level hardware through the conversion of ARM code to highly optimized microcode routines and increases the execution energy efficiency. via NVIDIA
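To make the optimization-cache idea concrete, here is a minimal sketch of how a routine might be optimized once and then reused. It is a toy model and not NVIDIA's implementation: the `OptimizationCache` class, the `hot_threshold` parameter and the "drop the no-ops" optimizer are all hypothetical stand-ins chosen for illustration.

```python
from collections import Counter


class OptimizationCache:
    """Toy model: hot routines are optimized once, then served from a cache."""

    def __init__(self, hot_threshold=3):
        # Hypothetical tuning knob: executions needed before a routine is "hot".
        self.hot_threshold = hot_threshold
        self.run_counts = Counter()
        self.optimized = {}  # routine address -> optimized instruction tuple
        self.optimizations_performed = 0

    def optimize(self, routine):
        # Stand-in for a real optimizer: it only removes no-op instructions.
        self.optimizations_performed += 1
        return tuple(op for op in routine if op != "nop")

    def fetch(self, address, routine):
        """Return the instruction sequence to execute for this routine."""
        if address in self.optimized:
            return self.optimized[address]  # reuse without re-optimizing
        self.run_counts[address] += 1
        if self.run_counts[address] >= self.hot_threshold:
            self.optimized[address] = self.optimize(routine)
            return self.optimized[address]
        return tuple(routine)  # cold code runs as-is


cache = OptimizationCache(hot_threshold=3)
loop_body = ["load", "nop", "add", "nop", "store"]
results = [cache.fetch(0x1000, loop_body) for _ in range(5)]
```

In this sketch the first two calls return the routine unchanged, the third triggers the single optimization pass, and later calls are served from the cache, which is the property the 128 MB optimization cache is meant to provide.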
Turning to the technical details, the material presented at Hot Chips shows that the Denver CPU has its own native instruction set and translates ARMv8 instructions into that ISA for execution. As reported by TechReport:
- Binary translation is for real. Yes, the Denver CPU runs its own native instruction set internally and converts ARMv8 instructions into its own internal ISA on the fly. The rationale behind doing so is the opportunity for dynamic code optimization. Denver can analyze ARM code just before execution and look for places where it can bundle together multiple instructions (that don't depend on one another) for execution in parallel. Binary translation has been used by some interesting CPU architectures in the past, including, famously, Transmeta's x86-compatible effort. It's also used for emulation of non-native code in a number of applications. Denver's binary translation layer runs in software, at a lower level than the operating system, and stores commonly accessed, already optimized code sequences in a 128MB cache stored in main memory. Optimized code sequences can then be recalled and replayed when they are used again.
- Execution is wide but in-order. Denver attempts to save power and reap the benefits of dynamic code optimization by eschewing power-hungry out-of-order execution hardware in favor of a simpler in-order engine. That execution engine is very wide: seven-way superscalar and thus capable of processing as many as seven operations per clock cycle. Denver's peak instruction throughput should be very high. The tougher question is what its typical throughput will be in end-user workloads, which can be variable enough and contain enough dependencies to challenge dynamic optimization routines. In other words, Denver's high peak throughput could be accompanied by some fragility when it encounters difficult instruction sequences. via TechReport
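The bundling step described in the two points above can be sketched as a greedy pass over the instruction stream. This is only an illustration under simplifying assumptions: `Instr` and `bundle_in_order` are hypothetical names, each instruction has one destination register, only read-after-write and write-after-write hazards are checked, and program order is never changed. Denver's real optimizer also reorders code, which this sketch does not attempt.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str    # register written by the instruction
    srcs: tuple  # registers read by the instruction


def bundle_in_order(instrs, width=7):
    """Greedily pack consecutive instructions into bundles of up to `width`.

    An instruction joins the current bundle only if it neither reads nor
    writes a register already written inside that bundle.
    """
    bundles, current, written = [], [], set()
    for ins in instrs:
        depends = ins.dest in written or any(s in written for s in ins.srcs)
        if current and (depends or len(current) == width):
            bundles.append(current)      # close the bundle and start a new one
            current, written = [], set()
        current.append(ins)
        written.add(ins.dest)
    if current:
        bundles.append(current)
    return bundles


program = [
    Instr("r1", ("r0",)),       # r1 = f(r0)
    Instr("r2", ("r0",)),       # independent of the first instruction
    Instr("r3", ("r1", "r2")),  # needs r1 and r2, so it starts a new bundle
    Instr("r4", ("r0",)),       # independent of r3, shares its bundle
]
sizes = [len(b) for b in bundle_in_order(program)]

# Eight mutually independent instructions exceed the seven-wide limit.
independent = [Instr(f"r{i}", ("r0",)) for i in range(1, 9)]
wide_sizes = [len(b) for b in bundle_in_order(independent)]
```

The dependent instruction in `program` forces a bundle boundary, which is the kind of "difficult instruction sequence" that keeps typical throughput below the seven-per-clock peak.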
Performance numbers were also presented for the Denver CPU, in which it is pitted against a Haswell Celeron 2955U, the iPhone 5s (A7 Cyclone), Krait 400 (8974-AA) and Bay Trail (Celeron N2910) processors. In all benchmarks, the 64-bit Denver-powered Tegra K1 turns out faster than the mobile chips, while the 15W Haswell CPU, which does have an edge in some benchmarks, runs roughly on par with it. The power draw of the Denver-based Tegra K1 is not known, though it is expected to be lower than what we have seen on the 32-bit variant, and seeing it perform on par with PC-level chips is impressive. NVIDIA has stated that its dual-core Denver CPU can surpass quad-core and octa-core mobile processors on most mobile workloads while delivering high power efficiency. The 64-bit Tegra K1 aims to deliver PC-class performance in the mobile world. NVIDIA says that mobile devices based on the Denver CPU will arrive later this year and that it is already developing for the next version of Android, "L", on Tegra K1.
NVIDIA Tegra K1 64-Bit Denver CPU Specifications:
Specification | NVIDIA Tegra K1 64-Bit | NVIDIA Tegra K1 32-Bit | NVIDIA Tegra 4 | NVIDIA Tegra 3 |
Codename | Logan | Logan | Wayne | Kal-El |
ARM Cores | 2 Core (Multi-Thread) | 4+1 | 4+1 | 4 Core |
ARM Architecture | 64-bit ARM v8 (Custom) | 32-bit Cortex A15 | 32-bit Cortex A15 | 32-bit Cortex A9 |
GPU Architecture | Kepler | Kepler | GeForce GPU | GeForce GPU |
GPU Cores | 192 Core | 192 Core | 72 Core | 12 Core |
Process | 28nm | 28nm | 28nm HPL | 40nm LPG |
Core Frequency | 2.5 GHz | 2.3 GHz | 1.9 GHz | 1.2 GHz |
Memory Size | 8 GB | 8 GB | 4 GB | 2 GB |
Memory Type | DDR3L / LPDDR3 | DDR3L / LPDDR3 | DDR3L / LPDDR3 | DDR3 / LPDDR2 |
Cache | 128K + 64K L1 | 32K + 32K L1 | 32K + 32K L1 | - |
Launch | 2014 | 2014 | 2013 | 2012 |
The performance numbers below were compiled by forum members at Beyond3D for easier comparison. Each score is relative to the Cortex-A15 based Tegra K1 (1.00x):
DMIPS
- Baytrail (Celeron N2910): 0.45x
- S800 (Krait 400 8974AA): 0.95x
- Tegra K1 (R3 Cortex A15): 1.00x
- A7 (Cyclone): 1.30x
- Haswell (Celeron 2955U): 1.00x
- Tegra K1 (Denver): 1.80x
SPECInt 2K
- Baytrail (Celeron N2910): 0.70x
- S800 (Krait 400 8974AA): 0.60x
- Tegra K1 (R3 Cortex A15): 1.00x
- A7 (Cyclone): 0.90x
- Haswell (Celeron 2955U): 1.30x
- Tegra K1 (Denver): 1.45x
SPECFP 2K
- Baytrail (Celeron N2910): 0.85x
- S800 (Krait 400 8974AA): 0.80x
- Tegra K1 (R3 Cortex A15): 1.00x
- A7 (Cyclone): N/A
- Haswell (Celeron 2955U): 1.95x
- Tegra K1 (Denver): 1.75x
AnTuTu 4
- Baytrail (Celeron N2910): N/A
- S800 (Krait 400 8974AA): 0.80x
- Tegra K1 (R3 Cortex A15): 1.00x
- A7 (Cyclone): 0.70x
- Haswell (Celeron 2955U): N/A
- Tegra K1 (Denver): 1.00x
Geekbench 3 Single-Core
- Baytrail (Celeron N2910): 0.65x
- S800 (Krait 400 8974AA): 0.80x
- Tegra K1 (R3 Cortex A15): 1.00x
- A7 (Cyclone): 1.20x
- Haswell (Celeron 2955U): 1.20x
- Tegra K1 (Denver): 1.65x
Google Octane v2.0
- Baytrail (Celeron N2910): 0.70x
- S800 (Krait 400 8974AA): 0.65x
- Tegra K1 (R3 Cortex A15): 1.00x
- A7 (Cyclone): 0.70x
- Haswell (Celeron 2955U): 1.45x
- Tegra K1 (Denver): 1.30x
16MB Memcpy (GB/s)
- Baytrail (Celeron N2910): 0.85x
- S800 (Krait 400 8974AA): 0.80x
- Tegra K1 (R3 Cortex A15): 1.00x
- A7 (Cyclone): 1.15x
- Haswell (Celeron 2955U): 1.55x
- Tegra K1 (Denver): 1.40x
16MB Memset (GB/s)
- Baytrail (Celeron N2910): 0.40x
- S800 (Krait 400 8974AA): 0.75x
- Tegra K1 (R3 Cortex A15): 1.00x
- A7 (Cyclone): 0.80x
- Haswell (Celeron 2955U): 0.65x
- Tegra K1 (Denver): 1.05x
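As a rough way to summarize the lists above, the sketch below divides Denver's relative score by another chip's score in each benchmark and takes a geometric mean over the benchmarks where both have a result. The figures are copied from the lists; the helper names `relative` and `geomean` are illustrative, and the geometric mean is just one reasonable way to average ratios, not something the source uses.

```python
import math

# Relative scores from the lists above (Tegra K1 Cortex-A15 = 1.00x).
# None marks a benchmark with no published result (N/A).
scores = {
    "DMIPS":        {"A7": 1.30, "Haswell": 1.00, "Denver": 1.80},
    "SPECInt 2K":   {"A7": 0.90, "Haswell": 1.30, "Denver": 1.45},
    "SPECFP 2K":    {"A7": None, "Haswell": 1.95, "Denver": 1.75},
    "AnTuTu 4":     {"A7": 0.70, "Haswell": None, "Denver": 1.00},
    "Geekbench 3":  {"A7": 1.20, "Haswell": 1.20, "Denver": 1.65},
    "Octane v2.0":  {"A7": 0.70, "Haswell": 1.45, "Denver": 1.30},
    "16MB Memcpy":  {"A7": 1.15, "Haswell": 1.55, "Denver": 1.40},
    "16MB Memset":  {"A7": 0.80, "Haswell": 0.65, "Denver": 1.05},
}


def relative(table, chip, reference):
    """Per-benchmark ratio chip/reference, skipping benchmarks with N/A."""
    return {
        bench: row[chip] / row[reference]
        for bench, row in table.items()
        if row[chip] is not None and row[reference] is not None
    }


def geomean(values):
    """Geometric mean, the usual average for performance ratios."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


denver_vs_haswell = relative(scores, "Denver", "Haswell")
denver_vs_a7 = relative(scores, "Denver", "A7")
haswell_wins = [b for b, r in denver_vs_haswell.items() if r < 1.0]

print(f"Denver vs Haswell geomean: {geomean(denver_vs_haswell.values()):.2f}x")
print(f"Denver vs A7 geomean: {geomean(denver_vs_a7.values()):.2f}x")
print("Haswell leads in:", ", ".join(haswell_wins))
```

On these figures Denver leads the Haswell Celeron in four of the seven benchmarks both chips completed and trails in SPECFP 2K, Octane v2.0 and 16MB Memcpy, while it leads the A7 in every benchmark where both have a score.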