
NVIDIA GeForce RTX 4060 Review Ft. MSI, GALAX & PNY – $249 Would’ve Made This GPU Perfect

Hassan Mujtaba

NVIDIA Ada GPU - Ada Streaming Multiprocessor, Ada GPC & Ada GPUs Deep Dive

Let's take a trip down the road to Ada. In 2016, NVIDIA announced its Pascal GPUs, which would soon feature across its top-to-bottom GeForce 10 series lineup. With Maxwell, NVIDIA had gained a lot of experience in the efficiency department, an area it had been focusing on since its Kepler GPUs.

Then, with Turing, rather than offering another standard leap in the rasterization performance of its GPUs, NVIDIA took a different approach & introduced two key technologies in its line of consumer GPUs: AI-assisted acceleration with the Tensor Cores, and hardware-level acceleration for ray tracing with its brand-new RT Cores.


Then came Ampere with its brand new Samsung 8nm fabrication process, and NVIDIA added even more to its gaming graphics lineup. In the Ampere GPU architecture, NVIDIA provided its latest Ampere SM along with next-gen FP32, INT32, Tensor Cores, and RT cores. The focus was to boost both rasterization and ray tracing capabilities to new heights.

Now enter Ada, a brand-new architecture that aims to take everything from the first two generations of RTX GPUs and perfect it. The architecture is designed for speed, and at that it excels. So let's look at the architecture in detail. The following are the main highlights of the Ada Lovelace GPU architecture:

  • Revolutionary New Architecture: NVIDIA Ada architecture GPUs deliver outstanding performance for graphics, AI, and compute workloads with exceptional architectural and power efficiency. After the baseline design for the Ada SM was established, the chip was scaled up to shatter records. Manufacturing innovations and materials research enabled NVIDIA engineers to craft a GPU with 76.3 billion transistors and 18,432 CUDA Cores capable of running at clocks over 2.5 GHz while maintaining the same 450W TGP as the prior generation flagship GeForce RTX 3090 Ti GPU. The result is the world’s fastest GPU with the power, acoustics, and temperature characteristics expected of a high-end graphics card.
  • New Ada RT Core for Faster Ray Tracing: For decades, rendering ray-traced scenes with physically correct lighting in real-time has been considered the holy grail of graphics. At the same time, the geometric complexity of environments and objects continues to increase as 3D games and graphics continually strive to provide the most accurate representations of the real world. The Ada RT Core has been enhanced to deliver 2x faster ray-triangle intersection testing and includes two important new hardware units. An Opacity Micromap Engine speeds up ray tracing of alpha-tested geometry by a factor of 2x, and a Displaced Micro-Mesh Engine generates Displaced Micro-Triangles on-the-fly to create additional geometry. The Micro-Mesh Engine provides the benefit of increased geometric complexity without the traditional performance and storage costs of complex geometries.
  • Shader Execution Reordering: NVIDIA Ada GPUs support Shader Execution Reordering, which dynamically organizes & reorders shading workloads to improve RT shading efficiency. This improves performance by up to 44% in Cyberpunk 2077 with Ray Tracing Overdrive Mode.
  • NVIDIA DLSS 3: The Ada architecture features an all-new Optical Flow Accelerator and AI frame generation that boosts DLSS 3’s frame rates up to 2x over the previous DLSS 2.0 while maintaining or exceeding native image quality. Compared to traditional brute-force graphics rendering, DLSS 3 is ultimately up to 4x faster while providing low system latency.

The NVIDIA Ada Lovelace AD107 GPU features up to 3 GPCs (Graphics Processing Clusters), the same GPC count as the GA106 GPU. Each GPC consists of 4 TPCs with 2 SMs per TPC, the same TPC layout as the existing chips. Each SM (Streaming Multiprocessor) houses four sub-cores, which is also the same as the GA102 GPU. The FP32 & INT32 core configuration carries over the dual-datapath design: each SM includes 64 dedicated FP32 units, while the combined FP32+INT32 units go up to 128. This is because half of the FP32 units share their datapath with the INT32 units, so the 64 dedicated FP32 cores sit apart from the 64 combined FP32/INT32 cores.

So in total, each sub-core consists of 16 dedicated FP32 units plus 16 combined FP32/INT32 units for a total of 32 units. Each SM packs 64 dedicated FP32 units plus 64 combined FP32/INT32 units for a total of 128 units. And since there are a total of 24 SM units (8 per GPC), we are looking at a total of 3,072 cores.
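To put those figures together, here's a quick back-of-the-envelope tally in Python; the numbers simply restate the GPC/TPC/SM breakdown described above and are purely illustrative.

```python
# Back-of-the-envelope tally of the full AD107 shader configuration
# described above: 3 GPCs, 8 SMs per GPC, 4 sub-cores per SM.
GPCS = 3
SMS_PER_GPC = 8
SUBCORES_PER_SM = 4

# Per sub-core: 16 dedicated FP32 units plus 16 combined FP32/INT32 units.
FP32_PER_SUBCORE = 16
FP32_INT32_PER_SUBCORE = 16

sms = GPCS * SMS_PER_GPC                                                       # 24 SMs
cores_per_sm = SUBCORES_PER_SM * (FP32_PER_SUBCORE + FP32_INT32_PER_SUBCORE)  # 128 cores
total_cores = sms * cores_per_sm                                               # 3,072 CUDA cores

print(f"SMs: {sms}, cores per SM: {cores_per_sm}, total CUDA cores: {total_cores:,}")
```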

Moving over to the cache, this is another segment where NVIDIA has given a big boost over the existing Ampere GPUs. The L2 cache has been increased to 24 MB, a 12x increase over the Ampere GA107 GPU, which hosts just 2 MB of L2 cache. The cache is shared across the entire GPU. The GPU also features up to 48 ROPs for the full die.

The Ada Lovelace GPUs also carry the latest 4th Generation Tensor Cores and 3rd Generation RT (ray tracing) Cores, which help boost DLSS & ray tracing performance to the next level. The NVIDIA GeForce RTX 4060 makes use of the full AD107 die.

NVIDIA AD107 'RTX 4060' Gaming GPU Block Diagram:

NVIDIA AD107 'Ada Lovelace' Gaming GPU 'SM' Block Diagram:

NVIDIA GeForce RTX 4060

  • 15 TFLOPS of peak single-precision (FP32) performance
  • 30 TFLOPS of peak half-precision (FP16) performance
  • 242 Tensor TFLOPs with sparsity
  • 35 RT-TFLOPs

At the heart of the NVIDIA GeForce RTX 4060 graphics card lies the Ada Lovelace AD107 GPU. The GPU measures 146.0 mm² and utilizes the TSMC 4N process node, an optimized version of TSMC's 5nm (N5) node designed for the green team. The GPU features 18.9 billion transistors.
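As a quick sanity check on the headline FP32 figure listed above, peak single-precision throughput can be approximated as CUDA cores × 2 FLOPs per clock (one fused multiply-add) × boost clock. The ~2.46 GHz value below is NVIDIA's reference boost clock for the RTX 4060 and is an assumption of this sketch; board-partner cards such as the MSI, GALAX and PNY models in this review may ship with their own factory clocks, so treat this as a rough estimate.

```python
# Rough estimate of the peak FP32 figure quoted above, assuming the
# RTX 4060's reference boost clock of ~2.46 GHz.
cuda_cores = 3072
boost_clock_ghz = 2.46
flops_per_core_per_clock = 2            # one fused multiply-add = 2 FP ops

peak_fp32_tflops = cuda_cores * flops_per_core_per_clock * boost_clock_ghz / 1000
print(f"Peak FP32: ~{peak_fp32_tflops:.1f} TFLOPS")   # ~15.1 TFLOPS
```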

Massive L2 Cache Resolves Memory Bandwidth Bottlenecks

Coming back to the memory, the 128-bit bus interface might seem like a downgrade over the 192-bit bus on the previous-gen 60-class cards, but NVIDIA states that the effective bandwidth of the RTX 4060 is increased to 453 GB/s, a 25.8% uplift over the RTX 3060 12 GB. This is made possible by upgrading the L2 cache from 3 MB to 24 MB, an 8x increase.
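For reference, here's how those numbers line up, sketched in Python. The raw figures assume the reference memory specs (17 Gbps GDDR6 on the RTX 4060, 15 Gbps on the RTX 3060 12 GB); the 453 GB/s "effective" number is NVIDIA's own cache-adjusted figure and cannot be derived from the bus width alone.

```python
# Raw bandwidth = bus width (bytes) x data rate. The "effective" figure is
# NVIDIA's cache-adjusted claim, not something computed from the bus alone.
def raw_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    return bus_width_bits / 8 * data_rate_gbps

rtx_4060_raw = raw_bandwidth_gb_s(128, 17)    # 272 GB/s (assumed 17 Gbps GDDR6)
rtx_3060_raw = raw_bandwidth_gb_s(192, 15)    # 360 GB/s (assumed 15 Gbps GDDR6)
rtx_4060_effective = 453                      # NVIDIA's quoted effective bandwidth

uplift = (rtx_4060_effective / rtx_3060_raw - 1) * 100
print(f"RTX 4060 raw: {rtx_4060_raw:.0f} GB/s vs RTX 3060 12 GB raw: {rtx_3060_raw:.0f} GB/s")
print(f"Effective uplift over the RTX 3060 12 GB: ~{uplift:.1f}%")   # ~25.8%
```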

Increasing the L2 cache allows NVIDIA to overcome some of the bandwidth limitations & memory bottlenecks associated with using a narrower bus interface. When the cores are at work, they need a fast and effective channel to move data through, and the L1 cache is the closest, lowest-latency lane, sitting right next to them. But sitting that close to the cores means its size can't grow by much, and if the cores can't find the data they want in the L1 cache, they move over to the L2 cache, which sits right next to the L1 and has a larger capacity. This cache resides on the same GPU die and is connected via a high-speed interconnect across all GPCs to move data back and forth.

NVIDIA GeForce RTX 4060 Ti Prototype With 2 MB L2 Cache (128-bit bus):

NVIDIA GeForce RTX 4060 Ti Reference With 32 MB L2 Cache (128-bit bus):

If the data is found in the L2 cache, that's considered a cache hit, but if the cores still can't find the data in the L2 cache, that's a cache miss, and the cores need to go off the GPU & access the main memory pool (VRAM) to find the data. This taxes the memory subsystem, leading to a bandwidth bottleneck. NVIDIA's solution to this bottleneck is to increase the L2 cache, giving the GPU cores more room to search before burdening the VRAM behind the limited 128-bit bus interface.
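To make the hit/miss terminology concrete, here's a minimal sketch of that lookup path; the caches are plain Python sets used purely for illustration, not a model of the actual hardware.

```python
# Illustrative lookup path: check the SM-local L1, then the shared L2, and
# only on an L2 miss go out to VRAM over the memory bus.
def fetch(address: int, l1: set, l2: set) -> str:
    if address in l1:
        return "L1 hit"
    if address in l2:
        l1.add(address)          # promote the line into L1 for next time
        return "L2 hit"
    # L2 miss: the request is serviced by VRAM, then fills both cache levels
    l2.add(address)
    l1.add(address)
    return "L2 miss -> VRAM access"

l1_cache, l2_cache = set(), set()
print(fetch(0x1000, l1_cache, l2_cache))   # cold: "L2 miss -> VRAM access"
print(fetch(0x1000, l1_cache, l2_cache))   # warm: "L1 hit"
```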

NVIDIA showcases how the increased L2 cache helps the GeForce RTX 4060 Ti by pitting the memory subsystem load of a reference 32 MB L2 variant against that of a prototype 2 MB L2 variant. Both cards have the same 128-bit bus interface; the prototype ties 512 KB of L2 cache to each 32-bit memory controller, while the reference card ties 8 MB to each.

The 32 MB L2 variant was able to reduce memory traffic by 40% to 60% compared to the 2 MB variant. The 2 MB variant filled its entire pool of L2 cache quickly, which meant more data had to go out to the VRAM, adding to the traffic burden, whereas the 32 MB variant not only suffered fewer cache misses but also sent less traffic to the VRAM. So even with a 128-bit bus interface, you are getting far higher effective bandwidth, and far better GPU performance, than a traditional 128-bit card.
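The effect NVIDIA is describing can be modeled very roughly as below; the hit rates are assumptions chosen only to land inside the 40% to 60% range NVIDIA quotes, not measured values for either card.

```python
# Toy model: with the same request stream, a higher L2 hit rate means fewer
# requests ever reach VRAM. Hit rates are illustrative assumptions only.
def vram_traffic(requests: int, l2_hit_rate: float) -> float:
    """Requests that miss the L2 cache and must be serviced by VRAM."""
    return requests * (1.0 - l2_hit_rate)

requests = 1_000_000
small_l2 = vram_traffic(requests, l2_hit_rate=0.35)   # hypothetical 2 MB L2 card
large_l2 = vram_traffic(requests, l2_hit_rate=0.65)   # hypothetical 32 MB L2 card

reduction = (1 - large_l2 / small_l2) * 100
print(f"VRAM traffic reduction: ~{reduction:.0f}%")    # ~46% with these assumed hit rates
```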

You can find additional information about our hardware review process and ethics policy here.
