AMD’s Secret DirectX 12 Weapon That Nvidia Had To Trade Off – Demystifying Async Compute
Ashes of The Singularity has been a major topic of interest for PC gamers for some time due to its native support for the low-level DirectX 12 API. The beta has just been released and we've spent some time testing an especially interesting DirectX 12 feature that the game supports, dubbed explicit multi-GPU. This feature allows any two DX12-compatible GPUs, including integrated GPUs, regardless of vendor or capability, to work together to drive the framerate of your game even higher, essentially pooling their resources and combining their efforts to deliver a greater level of performance.
The results are quite interesting and you should check them out if you haven't already. Another major DirectX 12 feature that Ashes of The Singularity supports is Asynchronous Compute, otherwise known as async compute/computing or async shaders/shading. AMD has long touted the DirectX 12 capabilities of its GCN architecture, often citing async compute as an advantage for its Radeon graphics cards, specifically those based on the GCN graphics architecture, which include the HD 7000 series and subsequent generations all the way up to the Fury series and 300 series.
Like us, many other tech publications have been busy benchmarking AoTS and so far the results seem to be overwhelmingly in AMD's favor. This is primarily the result of a joint effort between the company and the developers, who have worked together on a comprehensive DirectX 12 async compute implementation.
Ashes of the Singularity - DirectX 12 Benchmark, Async Compute Enabled by Tomshardware.com
Ashes of the Singularity - DirectX 12 Benchmark, Async Compute Enabled by Anandtech.com
These findings do go hand-in-hand with some of the basic performance goals of async shading, primarily that async shading can improve GPU utilization. At 4096 stream processors the Fury X has the most ALUs out of any card on these charts, and given its performance in other games, the numbers we see here lend credit to the theory that RTG isn’t always able to reach full utilization of those ALUs, particularly on Ashes. In which case async shading could be a big benefit going forward.
Nvidia Confirms, Async Compute Support Is Missing From GeForce Drivers For Ashes Of The Singularity
NVIDIA sent a note over this afternoon letting us know that asynchronous shading is not enabled in their current drivers, hence the performance we are seeing here. Unfortunately they are not providing an ETA for when this feature will be enabled.
Fun FACT of the day: Async Compute is NOT enabled on the driver-side with public Game Ready Drivers. You need app-side + driver-side!
— Sean Pelletier (@PellyNV) February 24, 2016
We were frankly surprised to find out that Nvidia has not yet implemented async compute in its GeForce drivers to this date for Ashes of The Singularity, despite news six months ago that it was actively working with the developers to get it done. The company hasn't offered any sort of timetable yet either as to when we should expect to see this delivered.
It should be noted that there's a bit of a history here with Nvidia and Async Compute. Fable Legends was another game that supported this feature, yet GeForce graphics cards did not support it in a typical fashion. We've already published our detailed investigative report into the bizarre behavior that the feature exhibits in the game on Nvidia graphics cards. That was despite the fact that the feature was implemented in a much more limited fashion in Fable Legends, whereas Oxide Games makes liberal use of the technology in its real-time strategy title AoTS.
This more extensive use may require a level of complexity which simply can't be delivered through software alone and could explain why the company hasn't released a compatible driver yet. Unfortunately the company's silence on the matter has fostered a culture of unhealthy speculation around the issue. It is very important to understand that there are distinct, intrinsic architectural differences between Nvidia's Maxwell architecture and AMD's GCN that play a crucial role in all of this.
Joel Hruska, Extremetech.com
Every bit of independent research on this topic has confirmed that AMD and Nvidia have profoundly different asynchronous compute capabilities. Nvidia’s own slides illustrate this as well. Nvidia cards cannot handle asynchronous workloads the way that AMD’s can, and the differences between how the two cards function when presented with these tasks can’t be bridged with a few quick driver optimizations or code tweaks. Beyond3D forum member and GPU programmer Ext3h has written a guide to the differences between the two platforms — it’s a work-in-progress, but it contains a significant amount of useful information.
1 Additional queues are scheduled in software. Only memory limits apply.
2 One 3D engine plus up to 8 compute engines running concurrently.
3 Since GCN 1.1, each compute engine can seamlessly interleave commands from 8 asynchronous queues.
4 Compute and 3D engine can not be active at the same time as they utilize a single function unit.
The Hyper-Q interface used for CUDA is in fact supporting concurrent execution, but it's not compatible with the DX12 API.
If it was used, there would be a hardware limit of 32 asynchronous compute queues in addition to the 3D engine.
5 Execution slots dynamically shared between all command types.
6 Execution slots reserved for compute commands.
7 Execution slots are reserved for use by the graphics command processors.
According to Nvidia, GM20x chips should be able to lift the reservation dynamically. This behaviour appears to be limited to CUDA and Hyper-Q.
8 Execution slots dynamically shared between each 8 compute queues since GCN 1.1.
9 SMX/SMM units can only execute one type of wavefront at a time. A full L1, local shared memory and scheduler flush is required to switch mode. This is most likely due to using a single shared memory block to provide L1 and LSHM in compute mode.
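To make note 9 a little more concrete, here is a minimal sketch that counts how often a mode switch, and therefore a flush, would be triggered for a given order of work. It is only a toy model: the command lists are hypothetical, the function name `count_mode_switches` is ours, and real scheduling happens inside the driver and hardware where it cannot be observed this directly.

```python
def count_mode_switches(commands):
    """Count transitions between "3d" and "compute" in a command stream.

    Under the assumption taken from note 9, each transition would force
    a flush before the unit can run the other type of work.
    """
    switches = 0
    previous = None
    for kind in commands:
        if previous is not None and kind != previous:
            switches += 1
        previous = kind
    return switches


# Hypothetical streams with the same amount of work in a different order.
interleaved = ["3d", "compute"] * 4
batched = ["3d"] * 4 + ["compute"] * 4

print("interleaved:", count_mode_switches(interleaved))
print("batched:", count_mode_switches(batched))
```

The same eight commands cause far fewer switches when they are grouped by type, which is one way to see why finely interleaved graphics and compute work is awkward on hardware that has to flush between the two.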
The developer went on to conclude that the current situation is a mess. On one hand, Nvidia offers a solution that, while unintuitive and rudimentary (a choice made to reduce the power consumption and overall size of its GPUs), can still deliver real-world benefits. On the other hand, AMD offers a solution that's more comprehensive and flexible, which aids developers directly. However, it comes at the cost of additional hardware inside the graphics processors, and subsequently at the cost of higher power consumption and increased manufacturing costs that are the direct result of adding more transistors to deliver this functionality.
It becomes very clear that this was simply a matter of the two vendors favoring one trade-off over the other from the very beginning. Nvidia's drive to push power efficiency underlines one of the sacrifices that it chose to make to achieve this. In this particular case it was a compromise at the cost of async compute.
For the future, I hope that Nvidia will get on par with AMD regarding multi engine support. AMD is currently providing a far more intuitive approach which aids developers directly.
This will come at an increased power consumption as the flexibility naturally requires more redundancy in hardware, but will most likely increase GPU utilization throughout the industry while accelerating development. The ultimate goal is still a common standard where you don't have to care much about hardware implementation details, the same way as x86 CPUs have matured over the course of the past 25 years.
Ashes of the Singularity - DirectX 12 Benchmark, Async Compute Enabled by Extremetech.com
Extremetech's Joel Hruska offers an interesting perspective on the current state of DirectX 12's async compute support from Nvidia and AMD as well as the hardware capabilities of their current graphics architectures in his article. He also pointed to some excerpts from the reviewer's guide to shine light on how the developers deal with optimizing for either vendor.
We have created a special branch where not only can vendors see our source code, but they can even submit proposed changes. That is, if they want to suggest a change our branch gives them permission to do so…
This branch is synchronized directly from our main branch so it’s usually less than a week from our very latest internal main software development branch. IHVs are free to make their own builds, or test the intermediate drops that we give our QA.
Oxide primarily optimizes at an algorithmic level, not for any specific hardware. We also take care to avoid the proverbial known “glass jaws” which every hardware has. However, we do not write our code or tune for any specific GPU in mind. We find this is simply too time consuming, and we must run on a wide variety of GPUs. We believe our code is very typical of a reasonably optimized PC game.
When asked about the decision to turn on async compute by default, Dan Baker of Oxide Games had this to say:
“Async compute is enabled by default for all GPUs. We do not want to influence testing results by having different default setting by IHV, we recommend testing both ways, with and without async compute enabled. Oxide will choose the fastest method to default based on what is available to the public at ship time.”
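Testing both ways, as Oxide recommends, boils down to comparing the average frame rate of two runs. The snippet below is a small sketch of that arithmetic. The function name and the frame rates are made up for illustration and are not benchmark results.

```python
def percent_change(fps_async_off, fps_async_on):
    """Relative change in average frame rate when async compute is enabled.

    Positive values mean the run with async compute was faster.
    """
    if fps_async_off <= 0:
        raise ValueError("baseline frame rate must be positive")
    return (fps_async_on - fps_async_off) * 100.0 / fps_async_off


# Hypothetical numbers, not measured results.
print(percent_change(40.0, 50.0))   # a card that gains from async compute
print(percent_change(50.0, 45.0))   # a card that loses performance
```

A negative result is a hint that, for that particular card and driver, leaving the feature off is the faster setting.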
Our Take, DirectX 12 Asynchronous Compute : What It Is And Why It Matters
AMD has clearly been a far more vocal proponent of Async Compute than its rival. The company put this hardware feature under the limelight for the very first time two years ago and attention has been directed towards it more so last year as the imminent launch of the DirectX 12 API was looming ever closer. Prior to that the technology remained, for the most part, out of sight.
Asynchronous Shaders/Compute, or what’s otherwise known as Asynchronous Shading, is one of the more exciting hardware features that DirectX 12 and Vulkan, as well as Mantle before them, exposed. This feature allows tasks to be submitted and processed by shader units inside GPUs (what Nvidia calls CUDA cores and AMD dubs Stream Processors) simultaneously and asynchronously in a multi-threaded fashion. In layman's terms it's similar to simultaneous multi-threading on CPUs, what Intel dubs Hyper-Threading. It works to fill the gaps in the engine by making sure that as much of the chip's hardware as possible is kept busy to drive performance up, and that nothing is left idling with nothing to do.
One would’ve thought that with multiple thousands of shader units inside modern GPUs that proper multi-threading support would have already existed in DX11. In fact one would argue that comprehensive multi-threading is crucial to maximize performance and minimize latency. But the truth is that DX11 only supports very basic multi-threading techniques that can’t fully take advantage of the thousands of shader units inside modern GPUs. This meant that GPUs could never reach their full potential as many of their resources would be left untapped.
Multithreaded graphics in DX11 does not allow for multiple tasks to be scheduled simultaneously without adding considerable complexity to the design. This meant that a great number of GPU resources would spend their time with no task to process because the command stream simply can’t keep up. This in turn meant that GPUs could never be fully utilized, leaving a deep well of untapped performance and potential that programmers could not reach.
Other complementary technologies attempted to improve the situation by enabling prioritization of important tasks over others. Graphics pre-emption allowed for the prioritization of tasks but it did not solve the fundamental problem, as it could not enable multiple tasks to be submitted and handled simultaneously, independently of one another. A crude analogy would be that what graphics pre-emption does is merely add a traffic light to the road rather than add an additional lane.
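The traffic light analogy can be put into numbers with a toy model. In the sketch below, prioritization is simplified to moving the urgent task to the front of a single lane (real pre-emption interrupts work that is already running). The task names and durations are hypothetical.

```python
def run_in_order(tasks):
    """Run (name, duration_ms) tasks one after another on a single lane.

    Returns a dict mapping each task name to its finish time in ms.
    """
    finish = {}
    clock = 0.0
    for name, duration in tasks:
        clock += duration
        finish[name] = clock
    return finish


def run_with_priority(tasks, urgent):
    """Same single lane, but the urgent task is moved to the front."""
    # sorted() is stable, so the remaining tasks keep their original order.
    ordered = sorted(tasks, key=lambda task: task[0] != urgent)
    return run_in_order(ordered)


# Hypothetical frame: names and durations are made up for illustration.
frame = [("shadows", 4.0), ("lighting", 6.0), ("timewarp", 2.0)]

plain = run_in_order(frame)
prioritized = run_with_priority(frame, "timewarp")

print("timewarp finishes at", plain["timewarp"], "vs", prioritized["timewarp"])
print("frame finishes at", max(plain.values()), "vs", max(prioritized.values()))
```

The urgent task finishes sooner, but the frame as a whole takes exactly as long as before, because there is still only one lane.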
Out of this problem a solution was born, one that’s very effective and readily available to programmers with DX12, Vulkan and Vulkan's spiritual predecessor, AMD's Mantle. It’s called Asynchronous Shaders and, just as we’ve explained above, it enables a genuine multi-threaded approach to graphics. It allows for tasks to be processed simultaneously and independently of one another, so that each one of the multiple thousands of shader units inside a modern GPU can be put to as much use as possible to drive performance and power efficiency up.
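To illustrate the gap-filling idea, here is a deliberately idealized sketch. It assumes a frame made of graphics passes that each keep only a fraction of the shader units busy, plus a batch of compute work measured in the time it would need with the whole GPU to itself. The class, function names and numbers are all hypothetical, and the model ignores contention for memory bandwidth and caches.

```python
from dataclasses import dataclass


@dataclass
class GraphicsPass:
    name: str
    duration_ms: float   # wall-clock time the pass occupies the GPU
    utilization: float   # fraction of shader units it keeps busy (0 to 1)


def serial_frame_time(passes, compute_ms):
    """Graphics first, then compute on the whole GPU: nothing overlaps."""
    return sum(p.duration_ms for p in passes) + compute_ms


def async_frame_time(passes, compute_ms):
    """Compute work soaks up idle shader units while graphics passes run.

    Whatever does not fit into the idle capacity runs after the last pass.
    """
    idle_capacity = sum(p.duration_ms * (1.0 - p.utilization) for p in passes)
    leftover = max(0.0, compute_ms - idle_capacity)
    return sum(p.duration_ms for p in passes) + leftover


# Hypothetical frame; the figures are invented for illustration only.
passes = [
    GraphicsPass("shadow maps", 4.0, 0.5),
    GraphicsPass("g-buffer", 6.0, 0.75),
    GraphicsPass("lighting", 5.0, 1.0),
]

print("serial:", serial_frame_time(passes, 3.0))
print("async: ", async_frame_time(passes, 3.0))
```

In this model the benefit depends entirely on how much idle capacity the graphics passes leave behind. A frame that already saturates the shader units has nothing to gain.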
However to enable this feature the GPU must be built from the ground up to support it. In AMD’s Graphics Core Next based GPUs this feature is enabled through the Asynchronous Compute Engines, ACE units, integrated into each GPU. These are structures which are built inside the chip itself and they serve as the multi-lane highway by which tasks are delivered to the stream processors.
AMD's DirectX 12 Secret Weapon - Async Compute Engines
Since GCN 1.1, each ACE is capable of handling eight queues, and every GCN GPU features multiple Async Compute Engines. ACEs debuted with AMD’s first GCN (GCN 1.0) based GPU, code named Tahiti (HD 7970), in late 2011, which featured two Asynchronous Compute Engines.
They were originally used primarily for compute workloads rather than games because no API existed at the time that could directly access them. Today however ACEs can take on a more prominent role in gaming through modern APIs such as DirectX 12 and Vulkan. So in every sense this has been AMD's dormant secret DirectX 12 weapon from the very beginning.
Last year AMD showcased a demo for this hardware feature which demonstrated a performance improvement of 46% in VR workloads. So far however, Nvidia has not talked much, if at all, about the feature other than saying six months ago that support was on the way.
Speaking of GPUs in general, modern GPU architectures of the day like GCN, which powers the current flock of gaming consoles and AMD's roster of graphics cards, or Maxwell, which powers Nvidia's latest Tegra mobile processors and its array of GTX 900 series graphics cards, have grown to accumulate far more similarities than differences. Even so, different hardware will always exhibit different architectural traits.
There will always be one thing that a specific architecture does better than the other. This diversity is dictated by the needs of the market and the diversity of the great minds through which this wonderful technology that we enjoy today has been conceived. The semantics will always be there, and while it can be fun to discuss and debate them, looking at the whole picture will be the only way forward to any substantial progress.
| Feature | AMD | Nvidia |
|---|---|---|
| 3D queue support | Yes | Yes |
| Compute queue support | Yes | Yes |
| 3D queue limit | N/A | N/A |
| Compute queue limit | 64 (GCN 1.2); 64 (GCN 1.1); 2 (GCN 1.0) | |
| Multi engine concurrency | 1+8 (GCN 1.2); 1+8 (GCN 1.1); 1+2 (GCN 1.0) | |
| Compute shader concurrency on 3D engine | 64/128 (GCN 1.2); 32/64 (GCN 1.1); 64 (GCN 1.0) | 1 (900 series); 1 (700 series) |
| 3D shader concurrency on 3D engine | 64/128 (GCN 1.2); 32/64 (GCN 1.1); 64 (GCN 1.0) | 31 (900 series); 31 (780 series); 5/10/15 (remainder of 700 series) |
| Compute shader concurrency on compute engine | 32/64 (GCN 1.2); 32 (GCN 1.1); 64 (GCN 1.0) | 32 (900 series); 32 (780 series); 6/11/16 (remainder of 700 series) |
| Mixed 3D/Compute wavefront interleaving | Yes | Limited |
GCN 1.2 : R9 Fury X, Fury, Nano, R9 380 series & R9 285.
GCN 1.1 : R9 390 series, R9 290 series, R7 360, R7 260 series & HD 7790.
GCN 1.0 : All other AMD graphics cards from the HD 7000 series onwards.
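The AMD compute queue limits in the table can be reproduced with simple arithmetic: the number of compute engines multiplied by the queues each engine can interleave. The sketch below uses the engine counts from the "multi engine concurrency" row and the figure of 8 queues per engine since GCN 1.1 from note 3. The value of one queue per engine for GCN 1.0 is our own inference from the table's limit of 2, not something the guide states outright.

```python
# Compute engines (ACEs) per generation, from the "multi engine
# concurrency" row: one 3D engine plus N compute engines.
COMPUTE_ENGINES = {"GCN 1.0": 2, "GCN 1.1": 8, "GCN 1.2": 8}

# Queues each compute engine can interleave. 8 since GCN 1.1 (note 3).
# The 1 for GCN 1.0 is an assumption inferred from the queue limit of 2.
QUEUES_PER_ENGINE = {"GCN 1.0": 1, "GCN 1.1": 8, "GCN 1.2": 8}


def compute_queue_limit(generation):
    """Total hardware compute queues for a given GCN generation."""
    return COMPUTE_ENGINES[generation] * QUEUES_PER_ENGINE[generation]


for generation in ("GCN 1.0", "GCN 1.1", "GCN 1.2"):
    print(generation, compute_queue_limit(generation))
```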