Nvidia GTX 1080 DirectX 12 Async Compute Performance Tested & Detailed – Closing The Gap With AMD’s GCN Architecture

Khalid Moammer
Posted May 18, 2016

Nvidia’s GTX 1080 and Pascal architecture do much to narrow the DirectX 12 Async Compute gap with AMD, but is it enough? Nvidia has been on the receiving end of some criticism over the last couple of years for inadequate async compute support in its GTX 900 series graphics cards. The purpose of this DirectX 12 feature is to dramatically improve utilization of the resources and horsepower inside modern GPUs.

It does this by eliminating gaps and unnecessary downtime in the pipeline, running multiple kernels concurrently so that graphics and compute workloads execute simultaneously. The feature is notably present in some of the most talked about DirectX 12 games out there, including Ashes of the Singularity and the new Hitman, where Nvidia cards were consistently outshone by competing Radeons from AMD.

Nvidia Brings Faster Pre-emption & Dynamic Load Balancing With Pascal & The GTX 1080

With Pascal, Nvidia is introducing key improvements in the architecture to address this, achieved via improved pre-emption and faster context switching. Nvidia gave several examples of the types of workloads that will benefit from the improvements introduced with Pascal. They include physics and audio processing, some image post-processing effects and asynchronous timewarp, which is responsible for accurate head positioning within a VR environment.


Pascal is now able to dynamically overlap workloads in the pipeline: things such as PhysX and post-processing steps can now be layered on top of the graphics pipeline via dynamic load balancing. This was cumbersome on Maxwell, where it had to be done via static partitioning in software. The improvement helps reduce gaps in the pipeline and improves utilization inside Pascal GPUs.

Graphics and compute workloads are assigned to specific blocks in Nvidia GPUs. With Maxwell, this meant that the developer had to more or less guess how the compute and graphics workloads should be split via static partitioning, deciding up front how much of the GPU’s resources graphics would get and how much compute would get. Unfortunately this would result in a highly inefficient mixed workload unless the developer was able to get the ratio just right, which is extremely difficult. Otherwise either the graphics or the compute part of the workload would finish first, and a portion of the GPU’s resources would sit idle until the other part was done before starting on the next thing.

Dynamic load balancing allows the workload to be partitioned dynamically. This means that the developer doesn’t really have to guess anymore because the GPU will now take over this responsibility. If either the graphics or the compute workload is completed first, the remaining workload gets picked up and distributed across the GPU resources that have finished their work. The result is that all parts of the GPU are put to work and nothing goes to waste, completing the task faster and improving performance as a result.
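
To make the difference between a fixed split and a dynamic one concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the unit count, the work amounts and the function names are all hypothetical, work is assumed to divide perfectly across units, and nothing here reflects how Nvidia's hardware scheduler is actually implemented.

```python
def static_partition_time(graphics_work, compute_work, units, graphics_share):
    """Frame time when the graphics/compute split is fixed up front (toy model)."""
    graphics_units = units * graphics_share
    compute_units = units - graphics_units
    graphics_time = graphics_work / graphics_units
    compute_time = compute_work / compute_units
    # The frame is only done when the slower side finishes;
    # the units on the faster side sit idle in the meantime.
    return max(graphics_time, compute_time)


def dynamic_balance_time(graphics_work, compute_work, units):
    """Idealized frame time when idle units immediately pick up leftover work."""
    return (graphics_work + compute_work) / units


# Hypothetical numbers, chosen only for illustration.
UNITS = 20
GRAPHICS_WORK = 120.0  # arbitrary work units
COMPUTE_WORK = 40.0

for share in (0.5, 0.75, 0.9):
    t = static_partition_time(GRAPHICS_WORK, COMPUTE_WORK, UNITS, share)
    print(f"static split, {share:.0%} graphics: {t:.2f} time units")

t_dynamic = dynamic_balance_time(GRAPHICS_WORK, COMPUTE_WORK, UNITS)
print(f"dynamic balancing: {t_dynamic:.2f} time units")
```

In this toy model the static split only matches the dynamic case when the developer guesses the ratio exactly right (75% graphics for these made-up numbers); any other guess leaves part of the GPU waiting.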

Time critical workloads pose a different challenge from balancing workload and resource distribution, and this is where Pascal’s improved pre-emption capability comes into play. An example of a time critical task would be asynchronous timewarp in a VR environment. If the GPU fails to complete it before the next display refresh, the entire frame will be dropped, which is literally the worst thing that could happen in a VR environment because it completely disrupts immersion and can induce motion sickness.

With Maxwell, pre-emption is only available at the draw call level. This means that an asynchronous timewarp operation can only pre-empt the GPU and start once all the work in the draw call already in progress is completed. Any given draw call includes many polygons, often hundreds, each of which covers hundreds of pixels. Think of it as a dinner situation where you need to finish your steak, your mashed potatoes and your veggies before you can take a sip of water. You need that sip of water right now, but you can only get to it after you finish your entire plate.

In this analogy, Pascal allows you to stop for a second, take a sip of water and pick up where you left off with that delicious, delicious steak. The Pascal graphics architecture is the very first to include pixel level pre-emption. This means that rather than having to request an asynchronous timewarp at the draw call level, it can now be done at the pixel level. The GPU doesn’t have to wait until all the work in a given draw call (hundreds of polygons that include hundreds of pixels each) is completed before it can start working on the time sensitive asynchronous timewarp operation, or any other operation for that matter. This switch takes roughly a tenth of a millisecond according to Nvidia.
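
The effect of pre-emption granularity on a deadline can be sketched with a few lines of Python. This is a simplified illustration, not a model of the real hardware: the draw call length, the timewarp cost and the time left until the display refresh are hypothetical numbers, and the only figure taken from the article is the roughly 0.1 ms switch time that Nvidia quotes for Pascal.

```python
def wait_draw_call_level(draw_call_ms, elapsed_ms):
    """Draw call level pre-emption: the in-flight draw call must finish first."""
    return draw_call_ms - elapsed_ms


def wait_pixel_level(switch_ms=0.1):
    """Pixel level pre-emption: only the context switch itself has to be paid.

    The 0.1 ms default is the approximate switch time quoted by Nvidia.
    """
    return switch_ms


def makes_refresh(wait_ms, timewarp_ms, time_to_refresh_ms):
    """True if the timewarp can wait, run and still finish before the refresh."""
    return wait_ms + timewarp_ms <= time_to_refresh_ms


# Hypothetical scenario, for illustration only.
DRAW_CALL_MS = 3.0        # total length of the draw call in flight
ELAPSED_MS = 0.5          # how far into it the timewarp request arrives
TIMEWARP_MS = 1.0         # cost of the timewarp work itself
TIME_TO_REFRESH_MS = 2.0  # time left until the next display refresh

coarse = wait_draw_call_level(DRAW_CALL_MS, ELAPSED_MS)
fine = wait_pixel_level()

print("draw call level makes the refresh:",
      makes_refresh(coarse, TIMEWARP_MS, TIME_TO_REFRESH_MS))
print("pixel level makes the refresh:",
      makes_refresh(fine, TIMEWARP_MS, TIME_TO_REFRESH_MS))
```

With these made-up numbers the draw call level case has to wait out the rest of a long draw call and misses the refresh, while the pixel level case pays only the switch cost and makes it comfortably.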

Dynamic load balancing and improved pre-emption both improve the performance of async compute code considerably on Pascal compared to Maxwell. In principle, though, this is not exactly the same as asynchronous shading or asynchronous compute, because Pascal still can’t execute async code concurrently without pre-emption. This is quite different from AMD’s GCN architecture, which has Asynchronous Compute Engines that enable the execution of multiple kernels concurrently without pre-emption.

AMD has long touted the asynchronous compute capabilities of its GCN graphics architecture. The company built what it calls ACEs, Asynchronous Compute Engines, into its hardware. They are available in all of AMD’s GCN architecture based graphics cards, including the now more than four year old HD 7970.
What Nvidia is doing with preemption and dynamic load balancing right now, while not exactly async compute, can be used to accomplish similar goals.

With all of this being said, it’s important to point out that what really matters at the end of the day and what everyone is after is performance. AMD’s and Nvidia’s architectures have long had different and distinct characteristics and strong suits. Asynchronous compute is without question an important ingredient in the DirectX 12 performance formula.

Have Nvidia’s efforts with Pascal done enough to minimize the DirectX 12 Async Compute gap that exists between its GPUs and AMD’s dedicated hardware approach with ACEs? That’s the million dollar question. Testing is well underway by some of the industry’s most respectable publications to get some answers to that very question. Thankfully, as of right now we have some useful early data that can help us shed some light on this.

Credit: Computerbase.de

Testing was conducted in Ashes of The Singularity at 4K, 2560×1440 and 1920×1080 in three configurations: DirectX 11, DirectX 12 with Async Compute disabled, and DirectX 12 with Async Compute enabled. At 4K the GTX 1080 lost performance with DirectX 12 compared to DirectX 11. It also gained no performance with Async Compute turned on vs off. In comparison, the GTX 980 Ti was also slower in DirectX 12 compared to 11 and actually lost performance with Async Compute turned on vs off. AMD’s R9 Fury X gained performance with DirectX 12 compared to 11 and gained more performance once Async Compute was enabled.

[Chart: DirectX 12 Async Compute Performance - 4K, GTX 1080, Fury X, GTX 980 Ti]

At 2560×1440 and 1920×1080 Nvidia’s GTX 980 Ti behaved very much as it did at 4K, losing performance with DirectX 12 compared to 11 and losing more performance with Async Compute on. The GTX 1080 on the other hand showed gains with DirectX 12 compared to 11. Unlike the 4K results it benefited from Async Compute and saw a 3% gain at 2560×1440 and a 1% gain at 1920×1080 with it on vs off. AMD’s R9 Fury X again saw performance improvements with DirectX 12 compared to 11. It also showed much more significant gains with Async Compute that amounted to 10% at 4K, 9% at 2560×1440 and 4% at 1920×1080.

[Chart: DirectX 12 Async Compute Performance - 1440p, GTX 1080, Fury X, GTX 980 Ti]

In summary, AMD’s R9 Fury X saw much greater benefit at all resolutions from the DirectX 12 API itself and from Async Compute compared to the GTX 1080 and 980 Ti. The 1080 showed modest gains with Async Compute and DirectX 12 but did not exhibit any performance regression like the GTX 980 Ti. So the improvements introduced with Pascal definitely helped but were not quite enough to close the gap that exists between Nvidia’s and AMD’s hardware here. In fact, it’s quite eye opening to see the GTX 1080, which is 30% faster than the R9 Fury X at 4K, only managing to squeeze past the Fury X by 9% in DirectX 12.
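
As a rough back-of-the-envelope check on those last two figures, the short Python sketch below works out how much ground the Fury X makes up relative to the GTX 1080. It assumes the 30% figure is the GTX 1080's typical 4K lead outside this DirectX 12 test and that the 9% figure is its lead in this test; both are treated as simple ratios of frame rates, and the function name is just illustrative.

```python
def relative_shift(lead_before, lead_after):
    """How much the trailing card gains on the leader when the lead shrinks.

    Both arguments are performance ratios (leader / trailer), e.g. 1.30 for a
    30% lead.
    """
    return lead_before / lead_after


# Figures quoted in the article, interpreted as frame rate ratios (assumption).
LEAD_TYPICAL_4K = 1.30  # GTX 1080 vs R9 Fury X, 30% faster
LEAD_DX12 = 1.09        # GTX 1080 vs R9 Fury X in this DirectX 12 test, 9% faster

shift = relative_shift(LEAD_TYPICAL_4K, LEAD_DX12)
print(f"Fury X gains about {(shift - 1) * 100:.0f}% relative to the GTX 1080")
```

Under those assumptions the Fury X closes roughly a fifth of a performance tier on the GTX 1080 when moving to this DirectX 12 workload, which is why a 9% lead stands out against a 30% one.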

So there you have it folks: an initial look at what we could expect from Pascal with DirectX 12 and Async Compute. Hopefully we can paint a clearer picture as more benchmarking data becomes available to us over time and as more DirectX 12 games are released.

 

 
