Exclusive: The Nvidia and AMD DirectX 12 Editorial – Complete DX12 Graphic Card List with Specifications, Asynchronous Shaders and Hardware Features Explained


The ASync Question: Does Nvidia Support it?

All right, now that we have gotten that out of the way, let's begin with the heart of the current controversy: Asynchronous Shaders and the AotS (Ashes of the Singularity) benchmarks. To make this simpler, let me list the benchmarks tested and their basic configurations here, along with the DirectX 11 and DirectX 12 average frames per second:

  • Nvidia Test 1: Core i7 5960X + GeForce GTX Titan X
    • DX11: 45.7 fps
    • DX12: 42.7 fps
  • Nvidia Test 2: Core i5 3570K + GeForce GTX 770
    • DX11: 19.7 fps
    • DX12: 55.3 fps
  • AMD Test 1: Athlon X4 860K + Radeon R7 370
    • DX11: 12.4 fps
    • DX12: 15.9 fps
  • AMD Test 2: Xeon CPU E3-1230 + Radeon R9 Fury
    • DX11: 20.5 fps
    • DX12: 41.1 fps
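
For reference, the relative change from DX11 to DX12 in each configuration can be worked out directly from these averages. The short Python sketch below does only that arithmetic; the dictionary keys are labels made up here for readability, and the fps values are copied from the list above.

```python
# Average fps copied from the list above, as (DX11, DX12) pairs.
# The keys are just readable labels chosen for this sketch.
results = {
    "GTX Titan X + Core i7 5960X": (45.7, 42.7),
    "GTX 770 + Core i5 3570K": (19.7, 55.3),
    "R7 370 + Athlon X4 860K": (12.4, 15.9),
    "R9 Fury + Xeon E3-1230": (20.5, 41.1),
}


def percent_change(dx11_fps, dx12_fps):
    """Relative change from DX11 to DX12, in percent of the DX11 figure."""
    return (dx12_fps - dx11_fps) / dx11_fps * 100.0


changes = {name: percent_change(*fps) for name, fps in results.items()}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

Only the Titan X configuration comes out negative, at roughly 6.6 percent below its DX11 figure.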

Explaining the AotS benchmarks with what we know of DirectX 12

These benchmarks were mostly taken at face value, and the usual flame war erupted over the raw numbers. The problem is that when talking about an API that eliminates overhead, we need to look at the context as well. In both AMD tests the numbers rise, but in the first test the processor is obviously a bottleneck, so the configuration had a lot of untapped potential. In the second test, the processor was once again arguably a bottleneck, since Xeons are clocked fairly low.

In the second Nvidia test, DX12 also performs as expected when a reasonably powerful GPU is coupled with a decent processor. In all three of these scenarios, the processor bottleneck was eliminated.
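
The bottleneck argument can be made concrete with a toy model. The Python sketch below assumes that the CPU and GPU work on a frame in parallel and that the slower of the two sets the frame time. The millisecond figures and the 50 percent overhead cut are hypothetical values chosen for illustration, not measurements from any of the tests.

```python
def frames_per_second(cpu_ms, gpu_ms):
    """Frame rate when the slower of CPU and GPU sets the pace (toy assumption)."""
    return 1000.0 / max(cpu_ms, gpu_ms)


def fps_before_and_after(cpu_ms, gpu_ms, cpu_overhead_cut):
    """Frame rate before and after removing a fraction of the CPU-side cost."""
    before = frames_per_second(cpu_ms, gpu_ms)
    after = frames_per_second(cpu_ms * (1.0 - cpu_overhead_cut), gpu_ms)
    return before, after


# Hypothetical per-frame costs in milliseconds, invented for this sketch.
cpu_bound = fps_before_and_after(cpu_ms=50.0, gpu_ms=20.0, cpu_overhead_cut=0.5)
gpu_bound = fps_before_and_after(cpu_ms=10.0, gpu_ms=20.0, cpu_overhead_cut=0.5)

print("CPU-bound system:", cpu_bound)  # frame rate rises once CPU cost drops
print("GPU-bound system:", gpu_bound)  # frame rate stays where the GPU caps it
```

In this model a lower-overhead API can only help a system that was waiting on the CPU. A system already limited by the GPU sees no change, and nothing in the model predicts a drop.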

The actual anomalous test is the first one, with the incredibly powerful CPU and incredibly powerful GPU. Theoretically, this configuration has very little bottleneck - if any. DirectX 12 would not be expected to yield any major performance increase, because the configuration is already very near its maximum potential. But the funny thing is, switching to DX12 actually results in a lower frame rate than before. That is something that shouldn't have happened. To understand just what is going on, we need to look at what was happening behind the scenes.

An overview of Synchronous and Asynchronous Shaders in different GPU Architectures

Now what exactly are Asynchronous Shaders? Traditionally, there is one graphics queue available for work to be scheduled. Whatever work needs to be done is scheduled in serial order in that queue. The problem with this approach is that it usually results in bottlenecking, with the GPU not working at its full capacity. For understanding's sake, you can imagine the queue as a thread. And as you might know, a multi-threaded approach to computation is the future. So Asynchronous Shaders is basically where there is:

  • the standard Graphics Queue, and
  • an additional 'Compute Queue' for computational tasks.

A copy queue is also available, but since that is irrelevant to our current topic, I won't be going into that.
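
The difference between one serial queue and a graphics queue plus a compute queue can be sketched with a toy timing model. The Python below is an idealized best case and not a description of how any real GPU schedules work: it assumes the tasks are independent and that the GPU has enough idle units to run compute work fully in parallel with graphics work. The task names and durations are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    kind: str          # "graphics" or "compute"
    duration_ms: float


# Hypothetical per-frame workload; names and durations are made up.
frame = [
    Task("shadow pass", "graphics", 3.0),
    Task("light culling", "compute", 2.0),
    Task("main pass", "graphics", 6.0),
    Task("post-processing", "compute", 2.5),
]


def single_queue_time(tasks):
    """One queue: every task waits for the one before it."""
    return sum(t.duration_ms for t in tasks)


def two_queue_time(tasks):
    """Graphics and compute queues drained side by side (idealized overlap)."""
    graphics = sum(t.duration_ms for t in tasks if t.kind == "graphics")
    compute = sum(t.duration_ms for t in tasks if t.kind == "compute")
    return max(graphics, compute)


print("single queue:", single_queue_time(frame), "ms")
print("graphics + compute queues:", two_queue_time(frame), "ms")
```

The second figure is a lower bound. Dependencies between tasks, or a GPU that is already fully occupied by graphics work, would push the real frame time back toward the serial figure.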

Now contrary to popular belief, Nvidia's Maxwell 2.0 does support "Asynchronous Shaders". Do bear in mind that documentation on these things is very limited - most of it comes from engineer comments and documentation on HyperQ (Nvidia's multiple queue implementation). The following data shows the Queue Engines of various AMD and Nvidia architectures:

  • AMD GCN 1.0 - 1 Graphics Queue + 16 Compute Queues (7900 series, 7800 series, 280 series, 270 series, 370 series)
  • AMD GCN 1.0 - 1 Graphics Queue + 8 Compute Queues (7700 series, 250 series)
  • AMD GCN 1.1 - 1 Graphics Queue + 64 Compute Queues (R9 290, R9 390 series)
  • AMD GCN 1.1 - 1 Graphics Queue + 16 Compute Queues (HD 7790, R7 260 series, R7 360)
  • AMD GCN 1.2 - 1 Graphics Queue + 64 Compute Queues (R9 285/380, R9 Fury series, R9 Nano)
  • Nvidia Kepler - 1 Graphics Queue, Mixed Mode not Supported (32 Pure Compute)
  • Nvidia Maxwell 1.0 - 1 Graphics Queue, Mixed Mode not Supported (32 Pure Compute)
  • Nvidia Maxwell 2.0 - 1 Graphics Queue + 31 Compute Queues
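
For readers who want to query this list, here is one way to encode it as data in Python. The type name, field names and helper functions are assumptions made for this sketch. The queue counts are transcribed from the list above and carry the same caveat about limited documentation.

```python
from typing import NamedTuple


class QueueEngine(NamedTuple):
    vendor: str
    architecture: str
    graphics_queues: int
    compute_queues: int
    mixed_mode: bool  # compute queues usable alongside the graphics queue?


# Transcribed from the list above; one row per bullet.
QUEUE_ENGINES = [
    QueueEngine("AMD", "GCN 1.0", 1, 16, True),
    QueueEngine("AMD", "GCN 1.0", 1, 8, True),
    QueueEngine("AMD", "GCN 1.1", 1, 64, True),
    QueueEngine("AMD", "GCN 1.1", 1, 16, True),
    QueueEngine("AMD", "GCN 1.2", 1, 64, True),
    QueueEngine("Nvidia", "Kepler", 1, 32, False),       # pure compute only
    QueueEngine("Nvidia", "Maxwell 1.0", 1, 32, False),  # pure compute only
    QueueEngine("Nvidia", "Maxwell 2.0", 1, 31, True),
]


def mixed_mode_architectures(vendor):
    """Architectures of one vendor listed with compute queues next to graphics."""
    return sorted({e.architecture for e in QUEUE_ENGINES
                   if e.vendor == vendor and e.mixed_mode})


def fewest_mixed_mode_compute_queues(vendor):
    """Smallest compute queue count among a vendor's mixed mode entries."""
    return min(e.compute_queues for e in QUEUE_ENGINES
               if e.vendor == vendor and e.mixed_mode)


print(mixed_mode_architectures("Nvidia"))
print(fewest_mixed_mode_compute_queues("AMD"))
```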

There are two ways the extra compute queues can be used: a "Pure Compute" mode, which is expensive because it requires a mode switch, and a "Mixed Mode", which is what Asynchronous Shaders is all about. In all AMD GPUs with ASync enabled, the card will be running 1 Graphics Queue and at least 8 Compute Queues. This means that in-game tasks that require compute can be offloaded onto the GPU (if, and only if, it has horsepower to spare). This naturally translates to the GPU becoming more autonomous where the CPU is the bottleneck or where the GPU is not being used to its full potential.

As you can see, Maxwell 2.0 does support mixed mode, with up to 31 compute queues in conjunction with its primary graphics queue. No other Nvidia architecture has this capability without a complete mode switch. Now this is where things get a bit muddy. Since there is no clear documentation, and since Nvidia has yet to release an official statement on the matter, it is alleged (read: not confirmed in any way) that Nvidia's mixed mode requires the use of a software scheduler, which is why it is actually expensive to deploy even on Maxwell 2.0.

Different architectural approaches to achieving the same result: gaming excellence

There is something else that we have to consider too. The chips currently employing the Maxwell 2.0 architecture are the GM200, GM204 and GM206. These chips were not designed to be compute intensive. AMD's architecture, on the other hand, has always been exceptional in terms of compute. So using compute threads to supplement the graphics threads will always be better on a Radeon. That is a fact.

Picture credits: Nvidia


However, the question remains (and is currently unanswered) whether Nvidia cards need ASync to achieve their maximum potential at all. There is no evidence to suggest that the Maxwell architecture would benefit from ASync. There is no evidence to suggest it wouldn't benefit either. But if we are to trust each vendor to know its own architecture, then Nvidia has, over the past few generations, focused on creating graphics processors that specialize in single precision and gaming performance.

Double precision and compute have taken something of a back seat since the Fermi era. Dynamic Parallelism is one example of a compute technology present in post-Fermi architectures, but such features are usually only ever used in the HPC sector. This is also one of the reasons why gamers should still focus on the maximum potential, or the raw frames per second achieved by the graphics card, instead of focusing on the performance gained by tapping into previously untapped potential with DX12.