DirectX 12 Async Shaders An Advantage For AMD And An Achilles Heel For Nvidia Explains Oxide Games Dev
Oxide Games have pinpointed Asynchronous Shaders as one of the main reasons AMD hardware showed significant gains over Nvidia in DX12, specifically in the recently launched DX12 benchmark for the developer's real-time strategy title Ashes of the Singularity, which is set for release next year. The benchmark, which Oxide Games is adamant accurately represents the game's performance, has been available to download for free since earlier this month. We have already run this test on a variety of graphics cards from both Nvidia and AMD and published our results in an article earlier this month.
What we, and other publications, found was that AMD GPUs consistently showed significantly greater performance gains than their Nvidia counterparts, and in many instances the AMD cards matched or outperformed more expensive Nvidia offerings. The Nvidia results, on the other hand, were inconsistent to say the least: in some instances we, and other publications, registered a performance loss with Nvidia hardware running the DX12 version of the benchmark compared to DX11. We learned later on that this came down to a hardware feature called Asynchronous Shaders/Compute.
DirectX 12 Asynchronous Compute : What It Is And Why It Matters
Asynchronous Shaders/Compute, otherwise known as Asynchronous Shading, is one of the more exciting hardware features that DirectX 12, Vulkan and Mantle before them exposed. This feature allows tasks to be submitted to and processed by the shader units inside GPUs (what Nvidia calls CUDA cores and AMD dubs stream processors) simultaneously and asynchronously, in a multi-threaded fashion.
One would have thought that, with thousands of shader units inside modern GPUs, proper multi-threading support would already have existed in DX11. In fact, one could argue that comprehensive multi-threading is crucial to maximizing performance and minimizing latency. But the truth is that DX11 only supports basic multi-threading methods that can't fully take advantage of the thousands of shader units inside modern GPUs. This meant that GPUs could never reach their full potential, until now.
Multi-threaded graphics in DX11 does not allow multiple tasks to be scheduled simultaneously without adding considerable complexity to the design. This meant that a great number of GPU resources would sit idle with no task to process, because the command stream simply couldn't keep up. This in turn meant that GPUs could never be fully utilized, leaving a deep well of untapped performance and potential that programmers could not reach.
Other complementary technologies attempted to improve the situation by enabling prioritization of important tasks over others. Graphics pre-emption allowed tasks to be prioritized, but just like multi-threaded graphics in DX11 it did not solve the fundamental problem, as it could not enable multiple tasks to be submitted and handled simultaneously and independently of one another. A crude analogy would be that graphics pre-emption merely adds a traffic light to the road rather than an additional lane.
Out of this problem a solution was born, one that's very effective and readily available to programmers with DX12, Vulkan and Mantle. It's called Asynchronous Shaders, and just as we've explained above, it enables a genuinely multi-threaded approach to graphics: tasks can be processed simultaneously and independently of one another, so that each of the thousands of shader units inside a modern GPU can be put to as much use as possible to improve performance.
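The benefit described above can be sketched with a toy model. This is not real GPU code, and all the numbers and function names below are made up for illustration: it simply contrasts a single serial queue, where compute work waits for graphics work to finish, with an async setup where independent compute work fills idle shader-unit time inside the graphics pass.

```python
# Toy model of why asynchronous shaders help. A frame has a graphics
# pass plus an independent compute pass (e.g. post-processing).
# Serial submission (DX11-style) pays for both in full; async
# submission (DX12-style) hides the compute pass in the idle gaps of
# the graphics pass, and only the overflow extends the frame.
# All timings are invented for illustration.

def serial_frame_ms(graphics_ms, compute_ms):
    """One queue: compute starts only after graphics completes."""
    return graphics_ms + compute_ms

def async_frame_ms(graphics_ms, graphics_idle_ms, compute_ms):
    """Two queues: compute fills idle shader-unit time first."""
    overflow = max(0.0, compute_ms - graphics_idle_ms)
    return graphics_ms + overflow

graphics_ms = 10.0  # hypothetical graphics pass duration
idle_ms = 3.0       # hypothetical idle time hidden inside that pass
compute_ms = 2.5    # hypothetical independent compute work

print(serial_frame_ms(graphics_ms, compute_ms))          # 12.5
print(async_frame_ms(graphics_ms, idle_ms, compute_ms))  # 10.0
```

In this model the async frame is faster whenever the GPU has any idle time to fill, which is exactly the resource-utilization argument the paragraphs above make.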
However, to enable this feature the GPU must be built from the ground up to support it. In AMD's Graphics Core Next based GPUs it is enabled through the Asynchronous Compute Engines, structures built directly into the GPU itself that serve as the multi-lane highway by which tasks are delivered to the stream processors.
Each ACE is capable of handling eight queues, and every GCN based GPU has a minimum of two ACEs. More modern chips such as the R9 285 and R9 290/290X have eight ACEs. ACEs debuted with AMD's first GCN based GPU, code-named Tahiti, in late 2011. They were originally added mainly to handle compute tasks, because the graphics APIs of the time could not leverage them. Today, however, ACEs take on a more important role in graphics processing in addition to compute.
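The queue capacities above work out to a simple multiplication; the snippet below just makes the arithmetic from the paragraph explicit (the constant and function are our own, purely illustrative):

```python
# Each ACE handles 8 compute queues, per the figures above.
QUEUES_PER_ACE = 8

def total_compute_queues(num_aces):
    """Total compute queues exposed by a GCN GPU with num_aces ACEs."""
    return num_aces * QUEUES_PER_ACE

print(total_compute_queues(2))  # minimum GCN config: 16 queues
print(total_compute_queues(8))  # R9 285 / 290 / 290X: 64 queues
```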
Asynchronous Shaders Can Provide A 46% Performance Uplift on AMD Hardware With DX12
To showcase the performance advantage that this feature can bring to the table, AMD demoed it via a LiquidVR sample five months ago. The demo ran at 245 FPS with Asynchronous Shaders off and post-processing disabled. After post-processing was enabled, performance dropped to 158 FPS. Finally, when Asynchronous Shaders and post-processing were both enabled, the average went back up to 230 FPS, approximately a 46% uplift over the async-off result. While this is likely a best-case improvement, it isn't too far off the 30% performance boost that Oxide Games mentioned other devs achieving with this feature on consoles.
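The 46% figure follows directly from the two post-processing numbers in the demo:

```python
# Uplift implied by AMD's LiquidVR demo numbers: 158 FPS with
# post-processing but async shaders off, 230 FPS with both enabled.
fps_async_off = 158
fps_async_on = 230

uplift = (fps_async_on - fps_async_off) / fps_async_off
print(f"{uplift:.1%}")  # ~45.6%, rounded to the 46% quoted above
```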
This isn't all just a theoretical exercise either; a number of games have already been released with Asynchronous Shaders implemented, including Battlefield 4, Infamous Second Son and The Tomorrow Children on the PS4, and Thief when running under Mantle on the PC. Ashes of the Singularity will obviously be joining that list soon as well. AMD always likes to point out that the consoles and the PC share the same GCN graphics architecture, so whatever is achieved on one platform can be taken to the other.
DirectX 12 Async Shaders A Big Advantage For AMD Over Its Rival Nvidia, Oxide Games Explain
Since then an Oxide Games dev has shed light on the issue in a couple of posts on overclock.net, which we covered yesterday, and followed up with more details in an additional comment today:
In regards to the purpose of Async compute, there are really 2 main reasons for it:
1) It allows jobs to be cycled into the GPU during dormant phases. It can vaguely be thought of as the GPU equivalent of hyper-threading. Like hyper-threading, it really depends on the workload and GPU architecture as to how important this is. In this case, it is used for performance. I can't divulge too many details, but GCN can cycle in work from an ACE incredibly efficiently. Maxwell's scheduler has no analog, just as a non-hyper-threaded CPU has no analog feature to a hyper-threaded one.
2) It allows jobs to be cycled in completely out of band with the rendering loop. This is potentially the more interesting case since it can allow gameplay to offload work onto the GPU as the latency of work is greatly reduced. I'm not sure of the background of Async Compute, but it's quite possible that it is intended for use on a console as sort of a replacement for the Cell processors on a PS3. On a console environment, you really can use them in a very similar way. This could mean that jobs could even span frames, which is useful for longer, optional computational tasks.
It didn't look like there was a hardware defect to me on Maxwell, just some unfortunate complex interaction with software scheduling trying to emulate it, which appeared to incur some heavy CPU costs. Since we were trying to use it for #1, not #2, it made little sense to bother. I don't believe there is any specific requirement that Async Compute be required for D3D12, but perhaps I misread the spec.
Previous comments:
I suspect that one thing that is helping AMD on GPU performance is D3D12 exposes Async Compute, which D3D11 did not. Ashes uses a modest amount of it, which gave us a noticeable perf improvement. It was mostly opportunistic where we just took a few compute tasks we were already doing and made them asynchronous, Ashes really isn't a poster-child for advanced GCN features.
Our use of Async Compute, however, pales in comparison to some of the things the console guys are starting to do. Most of those haven't made their way to the PC yet, but I've heard of developers getting 30% GPU performance by using Async Compute. Too early to tell, of course, but it could end up being pretty disruptive in a year or so as these GCN-built and -optimized engines start coming to the PC. I don't think Unreal titles will show this very much though, so likely we'll have to wait to see. Has anyone profiled Ark yet?
In the end, I think everyone has to give AMD a lot of credit for not objecting to our collaborative effort with Nvidia even though the game had a marketing deal with them. They never once complained about it, and it certainly would have been within their rights to do so. (Complain, anyway; we would have still done it.)
P.S. There is no war of words between us and Nvidia. Nvidia made some incorrect statements, and at this point they will not dispute our position if you ask their PR; that is, they are not disputing anything in our blog. I believe the initial confusion arose because Nvidia PR was putting pressure on us to disable certain settings in the benchmark; when we refused, I think they took it a little too personally.
AFAIK, Maxwell doesn't support Async Compute, at least not natively. We disabled it at the request of Nvidia, as it was much slower to try to use it than not to.
Whether or not Async Compute is better is subjective, but it definitely does buy some performance on AMD's hardware. Whether it is the right architectural decision for Maxwell, or is even relevant to its scheduler, is hard to say.
According to Oxide Games, what has seemingly helped propel AMD hardware in the DX12 version of the game's benchmark is the company's Asynchronous Compute feature found in the GCN architecture. With a well-designed implementation and proper optimization we may see DX12 games approach that 46% performance uplift figure from Async Shaders alone, and that is a fairly exciting prospect.