NVIDIA’s 64-Bit Denver CPU Architecture Details Unveiled – Dual Custom ARMv8 Cores Clocked at 2.50 GHz

NVIDIA has unveiled the first architecture details of its custom-designed 64-bit Denver CPU, which is also its first high-performance SoC design, at Hot Chips. It has been almost eight months since NVIDIA launched the Tegra K1 SoC, which pairs a Cortex-A15 processor with 192 Kepler GPU cores and offers strong performance and power efficiency against competing chips.


The first Tegra K1 variant, based on the 32-bit Cortex-A15 core, has made a name for itself and features in some hot-selling devices such as the Xiaomi MiPad and the NVIDIA Shield Tablet, the company's reference design and latest Shield-branded "handheld" gaming device. However, we have known since launch that there were always supposed to be two variants of the Tegra K1 SoC: one with the 32-bit ARM cores and the other featuring the 64-bit Denver CPU. In theory, Project Denver's dual-core design should be much more powerful than the 4+1 Cortex-A15 variant. The 'Super Dual Core', as NVIDIA calls it, is a highly efficient design built on ARMv8-A, the first version of the ARM architecture to support 64-bit. One indicator of its power efficiency is that while the 4+1 variant includes a low-power companion core for non-intensive applications, the Denver variant has only the two cores.

At its heart, Denver is a dual-core CPU with a 7-way superscalar microarchitecture, paired with 192 Kepler GPU cores in the Tegra K1. It includes a 128 KB 4-way L1 instruction cache, a 64 KB 4-way L1 data cache and a 2 MB 16-way L2 cache. Denver also makes use of the new Dynamic Code Optimization, which converts frequently used software routines into dense, highly tuned microcode-equivalent routines. These are stored in a 128 MB optimization cache in main memory, which reduces the need to re-optimize the same software routines.

As part of the Dynamic Code Optimization process, Denver looks across a window of hundreds of instructions and unrolls loops, renames registers, removes unused instructions, and reorders the code in various ways for optimal speed. This effectively doubles the performance of the base-level hardware through the conversion of ARM code to highly optimized microcode routines and increases the execution energy efficiency. via NVIDIA
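To make the idea of an optimization cache more concrete, here is a minimal Python sketch. It is not NVIDIA's implementation: the class name `OptimizationCache`, the `hot_threshold` execution counter, the entry-count `capacity` and the LRU eviction policy are all assumptions made for illustration, since the presentation only says that frequently used routines are optimized and kept in a 128 MB cache in main memory.

```python
from collections import OrderedDict


class OptimizationCache:
    """Toy model of a cache of optimized routines, keyed by code address.

    Assumptions (not from the source): a routine becomes "hot" after
    `hot_threshold` executions, and the least recently used entry is
    evicted once more than `capacity` routines are cached.
    """

    def __init__(self, capacity, hot_threshold=3):
        self.capacity = capacity
        self.hot_threshold = hot_threshold
        self.exec_counts = {}        # address -> times run unoptimized
        self.cache = OrderedDict()   # address -> optimized routine (LRU order)
        self.optimizations = 0       # how often the optimizer was invoked

    def run(self, address, routine, optimize):
        # Fast path: replay the already optimized routine.
        if address in self.cache:
            self.cache.move_to_end(address)
            return self.cache[address]()

        # Slow path: count the execution and optimize once it is hot.
        count = self.exec_counts.get(address, 0) + 1
        self.exec_counts[address] = count
        if count >= self.hot_threshold:
            self.cache[address] = optimize(routine)
            self.optimizations += 1
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        return routine()


def identity_optimizer(routine):
    """Placeholder optimizer: a real one would emit tuned native code."""
    return routine


if __name__ == "__main__":
    oc = OptimizationCache(capacity=2)
    for _ in range(10):
        oc.run(0x1000, lambda: 42, identity_optimizer)
    print("optimizer invoked", oc.optimizations, "time(s)")
```

The point of the sketch is the amortization: the optimizer runs once per hot routine, and every later call replays the cached result instead of re-optimizing.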

Coming to the technical details, the material presented at Hot Chips shows that the Denver CPU has its own native instruction set and translates ARMv8 instructions into that internal ISA. As reported by TechReport:

  • Binary translation is for real. Yes, the Denver CPU runs its own native instruction set internally and converts ARMv8 instructions into its own internal ISA on the fly. The rationale behind doing so is the opportunity for dynamic code optimization. Denver can analyze ARM code just before execution and look for places where it can bundle together multiple instructions (that don't depend on one another) for execution in parallel. Binary translation has been used by some interesting CPU architectures in the past, including, famously, Transmeta's x86-compatible effort. It's also used for emulation of non-native code in a number of applications. Denver's binary translation layer runs in software, at a lower level than the operating system, and stores commonly accessed, already optimized code sequences in a 128MB cache stored in main memory. Optimized code sequences can then be recalled and replayed when they are used again.
  • Execution is wide but in-order. Denver attempts to save power and reap the benefits of dynamic code optimization by eschewing power-hungry out-of-order execution hardware in favor of a simpler in-order engine. That execution engine is very wide: seven-way superscalar and thus capable of processing as many as seven operations per clock cycle. Denver's peak instruction throughput should be very high. The tougher question is what its typical throughput will be in end-user workloads, which can be variable enough and contain enough dependencies to challenge dynamic optimization routines. In other words, Denver's high peak throughput could be accompanied by some fragility when it encounters difficult instruction sequences. via TechReport
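The bundling step described in the two points above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Denver's actual algorithm: the `Op` tuple, the register names and the greedy in-order `pack_bundles` function are hypothetical, and the only figure taken from the article is the seven-wide issue width. An operation joins the current bundle only if it has no register dependency on an operation already in it.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    """One simplified instruction: a destination and its source registers."""
    dest: str
    srcs: Tuple[str, ...]


def pack_bundles(ops: List[Op], width: int = 7) -> List[List[Op]]:
    """Greedily pack ops, in program order, into bundles of independent ops.

    A new bundle starts when the current one is full or when the next op
    conflicts with it. The hazard check is deliberately conservative.
    """
    bundles: List[List[Op]] = []
    current: List[Op] = []
    written = set()  # registers written by ops in the current bundle
    read = set()     # registers read by ops in the current bundle

    for op in ops:
        conflict = (
            any(src in written for src in op.srcs)  # read-after-write
            or op.dest in written                   # write-after-write
            or op.dest in read                      # write-after-read
        )
        if current and (conflict or len(current) == width):
            bundles.append(current)
            current, written, read = [], set(), set()
        current.append(op)
        written.add(op.dest)
        read.update(op.srcs)

    if current:
        bundles.append(current)
    return bundles


if __name__ == "__main__":
    program = [
        Op("r1", ("r0",)),
        Op("r2", ("r0",)),       # independent of r1, shares a bundle
        Op("r3", ("r1", "r2")),  # needs r1 and r2, starts a new bundle
        Op("r4", ("r3",)),       # needs r3, starts another bundle
    ]
    for i, bundle in enumerate(pack_bundles(program)):
        print(i, [op.dest for op in bundle])
```

The sketch also shows where the "fragility" concern comes from: a chain of dependent operations ends up with one operation per bundle, so most of a seven-wide machine would sit idle on such code.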

Performance numbers were also presented for the Denver CPU, in which it is pitted against a Haswell Celeron 2955U, the iPhone 5s (A7 Cyclone), Krait 400 (8974-AA) and Bay Trail (Celeron N2910) processors. In the benchmarks shown, the Denver-powered 64-bit Tegra K1 matches or beats the mobile chips, while the 15 W Haswell CPU, which leads in some benchmarks and trails in others, runs roughly on par with it.

The power draw of the Denver-based Tegra K1 is not known. It is expected to be lower than that of the 32-bit variant, and seeing it perform on the level of PC-class chips is impressive. NVIDIA has stated that its dual-core Denver CPU can surpass quad-core and octa-core mobile processors on most mobile workloads while delivering high power efficiency. The 64-bit Tegra K1 aims to deliver PC-class performance in the mobile world, and NVIDIA says that mobile devices based on the Denver CPU will arrive later this year and that it is already developing the next version of Android, "L", on Tegra K1.

NVIDIA Tegra K1 64-Bit Denver CPU Specifications:

Specification | NVIDIA Tegra K1 64-Bit | NVIDIA Tegra K1 32-Bit | NVIDIA Tegra 4 | NVIDIA Tegra 3
Codename | Logan | Logan | Wayne | Kal-El
ARM Cores | 2 Core (Multi-Thread) | 4+1 | 4+1 | 4 Core
ARM Architecture | 64-bit ARMv8 (Custom) | 32-bit Cortex-A15 | 32-bit Cortex-A15 | 32-bit Cortex-A9
GPU Architecture | Kepler | Kepler | GeForce GPU | GeForce GPU
GPU Cores | 192 Core | 192 Core | 72 Core | 12 Core
Process | 28nm | 28nm | 28nm HPL | 40nm LPG
Core Frequency | 2.5 GHz | 2.3 GHz | 1.9 GHz | 1.2 GHz
Memory Size | 8 GB | 8 GB | 4 GB | 2 GB
Cache | 128 KB + 64 KB L1 | 32 KB + 32 KB L1 | 32 KB + 32 KB L1 | -
Launch | 2014 | 2014 | 2013 | 2012

The performance numbers have been compiled by forum members over at Beyond3D for easier reading, with each score expressed relative to the Cortex-A15 Tegra K1 (1.00x):


  • Baytrail (Celeron N2910): 0.45x
  • S800 (Krait 400 8974AA): 0.95x
  • Tegra K1 (R3 Cortex A15): 1.00x
  • A7 (Cyclone): 1.30x
  • Haswell (Celeron 2955U): 1.00x
  • Tegra K1 (Denver): 1.80x


  • Baytrail (Celeron N2910): 0.70x
  • S800 (Krait 400 8974AA): 0.60x
  • Tegra K1 (R3 Cortex A15): 1.00x
  • A7 (Cyclone): 0.90x
  • Haswell (Celeron 2955U): 1.30x
  • Tegra K1 (Denver): 1.45x


  • Baytrail (Celeron N2910): 0.85x
  • S800 (Krait 400 8974AA): 0.80x
  • Tegra K1 (R3 Cortex A15): 1.00x
  • A7 (Cyclone): N/A
  • Haswell (Celeron 2955U): 1.95x
  • Tegra K1 (Denver): 1.75x

AnTuTu 4

  • Baytrail (Celeron N2910): N/A
  • S800 (Krait 400 8974AA): 0.80x
  • Tegra K1 (R3 Cortex A15): 1.00x
  • A7 (Cyclone): 0.70x
  • Haswell (Celeron 2955U): N/A
  • Tegra K1 (Denver): 1.00x

Geekbench 3 Single-Core

  • Baytrail (Celeron N2910): 0.65x
  • S800 (Krait 400 8974AA): 0.80x
  • Tegra K1 (R3 Cortex A15): 1.00x
  • A7 (Cyclone): 1.20x
  • Haswell (Celeron 2955U): 1.20x
  • Tegra K1 (Denver): 1.65x

Google Octane v2.0

  • Baytrail (Celeron N2910): 0.70x
  • S800 (Krait 400 8974AA): 0.65x
  • Tegra K1 (R3 Cortex A15): 1.00x
  • A7 (Cyclone): 0.70x
  • Haswell (Celeron 2955U): 1.45x
  • Tegra K1 (Denver): 1.30x

16MB Memcpy (GB/s)

  • Baytrail (Celeron N2910): 0.85x
  • S800 (Krait 400 8974AA): 0.80x
  • Tegra K1 (R3 Cortex A15): 1.00x
  • A7 (Cyclone): 1.15x
  • Haswell (Celeron 2955U): 1.55x
  • Tegra K1 (Denver): 1.40x

16MB Memset (GB/s)

  • Baytrail (Celeron N2910): 0.40x
  • S800 (Krait 400 8974AA): 0.75x
  • Tegra K1 (R3 Cortex A15): 1.00x
  • A7 (Cyclone): 0.80x
  • Haswell (Celeron 2955U): 0.65x
  • Tegra K1 (Denver): 1.05x
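
Because every score is relative to the Cortex-A15 Tegra K1, dividing two chips' scores gives a direct head-to-head ratio. The short Python sketch below does this for Denver against the Haswell Celeron 2955U and summarizes the result with a geometric mean. It uses only the benchmarks that are named above; the dictionary layout and the helper names `head_to_head` and `geometric_mean` are our own, and the choice of a geometric mean is an assumption rather than something from the presentation.

```python
import math

# Relative scores copied from the named lists above (None = N/A).
SCORES = {
    "AnTuTu 4": {"Haswell": None, "Denver": 1.00},
    "Geekbench 3 Single-Core": {"Haswell": 1.20, "Denver": 1.65},
    "Google Octane v2.0": {"Haswell": 1.45, "Denver": 1.30},
    "16MB Memcpy": {"Haswell": 1.55, "Denver": 1.40},
    "16MB Memset": {"Haswell": 0.65, "Denver": 1.05},
}


def head_to_head(scores, chip, rival):
    """Return {benchmark: chip/rival} for benchmarks where both have a score."""
    result = {}
    for name, row in scores.items():
        if row.get(chip) is not None and row.get(rival) is not None:
            result[name] = row[chip] / row[rival]
    return result


def geometric_mean(values):
    """Geometric mean, the usual way to average ratios."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    ratios = head_to_head(SCORES, "Denver", "Haswell")
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.2f}x")
    print(f"Geometric mean: {geometric_mean(ratios.values()):.2f}x")
```

A ratio above 1.00x means Denver is ahead in that test. With only four usable benchmarks, the summary figure should be read as a rough indication rather than a verdict.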