[Exclusive] Asynchronous Compute Investigated On Nvidia And AMD in Fable Legends DX12 Benchmark, Not Working on Maxwell

Oct 2, 2015 at 11:54pm EDT

What Does This All Mean?

Disclaimer: The GPUs we use in performance benchmarks are the ones we personally own. We can't afford everything, nor do manufacturers give us much. The Fury X is likewise not readily available in the US, so until we receive a sample from the magical unicorn fairies, these will have to do.

A big hearty thanks to Shaun Walsh for helping with GPUView and explaining some deep programmatic concepts.

[Update]

AMD has apparently been able to offload up to 30% of a workload to async compute, making the 18% seen here a rather good example of async compute usage.

There seems to have been quite the murmur regarding the performance of the Fable Legends benchmark. Certainly the graphics latency tests reveal some interesting information as reported by the benchmark, though they perhaps don't tell the entire story about performance.

Async Compute is enabled, just not fully being utilized to the greatest extent possible.

It's important to keep in mind that the Fable Legends benchmark is a test using software that's undeniably in the beta phase. It's not representative of final performance, and due to the closed nature, it's not even really indicative of how the actual game will play.

As we know, NVIDIA currently doesn’t support Asynchronous Compute fully, or at least the current driver implementation isn’t able to schedule these tasks correctly. Thus it’s been argued that GPUs with async compute support could, or even should, have a larger advantage.

Microsoft and Lionhead Studios have assured us that asynchronous compute is indeed activated for the test, across the board, and that it doesn't turn off based on the presence of a card that doesn't support it. They've also given us a statement on just how much of the compute pipeline is used. It covers dynamic lighting, and async compute is even used for instanced foliage. In essence, they've told us that compute is being used in rather healthy doses. We'll see just how much of the compute queue is actually used as opposed to what Lionhead says is being used.

For cards that might not support it, however, those tasks are simply put into the normal render queue instead of a compute queue. How does this affect performance? Unfortunately it doesn’t quite scale linearly, and as we’ll see shortly, async compute is indeed working, though even on AMD very little of the workload is being offloaded to the available ACEs, and only one ACE is actually being used at that.
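To make the difference between the two paths concrete, here is a minimal sketch of a toy timing model. It is our own illustration, not anything Lionhead or the GPU vendors have published. It assumes the offloaded share either overlaps perfectly with rendering (the best case for async compute) or is simply appended to the render queue. The function name and the 10 ms frame are hypothetical; the 18% and 30% shares are the figures mentioned in the update above. Real hardware shares execution units between queues, so actual gains would be smaller than this upper bound.

```python
def ideal_frame_time(total_ms, offload_fraction, async_supported):
    """Toy model: frame time when a share of the work may overlap with rendering."""
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    compute_ms = total_ms * offload_fraction
    render_ms = total_ms - compute_ms
    if async_supported:
        # Best case: the compute queue runs entirely alongside the render queue.
        return max(render_ms, compute_ms)
    # Fallback: compute work is placed in the render queue and runs serially.
    return render_ms + compute_ms


# Hypothetical 10 ms frame; 18% and 30% are the shares quoted in the update.
for fraction in (0.18, 0.30):
    serial = ideal_frame_time(10.0, fraction, async_supported=False)
    overlapped = ideal_frame_time(10.0, fraction, async_supported=True)
    print(f"{fraction:.0%} offloaded: {serial:.1f} ms serial, "
          f"{overlapped:.1f} ms overlapped (at best {serial / overlapped:.2f}x)")
```

Even in this idealized model, a small offloaded share can only ever buy back that same small share of the frame, which is one reason to expect modest differences in this benchmark.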

Initially we wanted to see if we could simply turn off async ourselves within the benchmark to see if there’s any appreciable difference in performance. Unfortunately, the settings have been hardcoded into the benchmark, likely to keep things even across the board for a less controversial test. So then we resorted to using the trusty GPUView, which is a tool by a former Microsoft intern. First we capture log data from Microsoft’s Event Tracing framework and analyze it within GPUView.

For the test we'll be using the configuration below. We'll also explore CPU usage on this i5-6600K as well as an i7-5960X, with the Nano as the driving GPU for consistency's sake. The Nano will be compared against a Titan X for the GPU tests. All the tests will be run at 1080p to limit GPU bottlenecks.

Component	Selection
CPU	Intel Core i5-6600K
Motherboard	ASRock Z170 Extreme 4
Power Supply	EVGA SuperNOVA 1300 G2
HDD	SanDisk Extreme II 120GB
Storage Disk	Seagate 2TB
Memory	16GB Crucial Ballistix DDR4 2400
Monitor	Dell P2715Q
Video Cards	AMD R9 Fury, AMD R9 Fury Nano, GeForce GTX Titan X
Operating System	Windows 10 64-Bit

First up we'll take a look into how AMD handles the Fable Legends benchmark. Then we'll delve into NVIDIA's take, and finally we'll look at CPU utilization.

Async Shader Usage

Looking at the GPU we can see that there are four separate queues being used with Fable Legends running. We have the normal 3D render queue, which has tasks stacked, though they're only run one at a time, not concurrently. Underneath we see two copy queues, which are two instances of the SDMA (System DMA) engine, and one compute queue. The interesting thing is that only one compute queue is being utilized here, and within it only one thread. The _0 indicates that more than one can be recognized, though the others aren’t being shown, or even used, here.

The presence of the compute queue is the key to identifying async compute usage. That's not to say that it isn't turned on within the game and the engine itself, because it is, but it merely indicates if async is actually being used by the driver and thus the GPU. In practical terms it just shows that some process somewhere on your system is making use of compute resources. If we see an operation down there that's the same color as what's in the 3D render queue (indicating they're from the same program), then it means that compute is being used by the benchmark.

There are two streams happening, but of course only one can be executed. Also, the compute work is being put into the render stream in the next frame, after it's been completed. What's interesting is that there is actually very little happening in the compute queue. We're looking at a total of 18.91%. This is in stark contrast to what Lionhead has told us. But keep in mind that while we aren't seeing much in this pre-defined demo, that doesn't mean that the game itself will act in the same way.
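For readers who want to reason about a figure like this themselves, here is a minimal sketch of one way to express queue usage: the share of a capture window during which a queue had at least one packet running. This is our own assumption about how to read such a number, not a description of how GPUView arrives at it, and the packet timings below are hypothetical rather than taken from our capture.

```python
def busy_share(packets, window_start, window_end):
    """Fraction of [window_start, window_end] covered by at least one packet."""
    if window_end <= window_start:
        raise ValueError("window must have positive length")
    # Clip packets to the window and drop those entirely outside it.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in packets
        if end > window_start and start < window_end
    )
    busy = 0.0
    cursor = window_start
    for start, end in clipped:
        # Skip any portion already counted by an overlapping earlier packet.
        start = max(start, cursor)
        if end > start:
            busy += end - start
            cursor = end
    return busy / (window_end - window_start)


# Hypothetical packet (start, end) times in milliseconds, not real capture data.
compute_packets = [(0.0, 1.0), (4.0, 5.5), (5.0, 6.0)]
print(f"compute queue busy: {busy_share(compute_packets, 0.0, 16.0):.2%}")
```

Overlapping packets are merged before summing so that the same slice of time is never counted twice.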

Moving on to NVIDIA's Maxwell-based hardware, we're going to use the same test setup but with a Titan X to analyze the effects of async compute, or at least see if it's able to be taken advantage of here.

Analyzing the GPUView graph we see one 3D render queue, two copy queues and one compute queue. Oddly, that compute queue is never made use of by the Fable Legends benchmark, but is there for another background task not associated with the benchmark at all.

Unfortunately, GPUView assigns colors per application, and it decided that dark blue and black were good choices for packets from the same program that serve different purposes. They're difficult to tell apart, but we can see two different shades here that are associated with the same program. The dark blue likely represents the majority of the render queue, while the black is something else entirely, or at least started out as a request for another type of queue but was placed in the render queue instead.

What we do find, however, is that the Titan X is likely allowing the benchmark's request for async compute to go through, but those workloads are instead placed directly into the 3D render queue. So async is still on, and NVIDIA's driver is aware of it; it's just not scheduling it as would be proper. What might be happening is that some other, still efficient method of dealing with those specific types of requests is being used instead.

CPU Usage

CPU utilization is an important question to ask as well. It tells us just how hard a CPU has to work in concert with the GPU in order to provide a better and more visceral experience. DX12 and async compute allow some tasks that the CPU would otherwise complete to be offloaded onto the compute queue. For this test we're going to look at total CPU utilization using Task Manager, which is a surprisingly robust utility. I run the benchmark with Task Manager open in the background, alt-tab to Task Manager, then use the clipping tool to acquire the screenshot. After this we'll go much deeper into actual core utilization with HWiNFO64, a robust and very reliable tool. For this, I start HWiNFO64, then start the benchmark.

Skylake, i5-6600K

Of course we'll start with a graph from Task Manager, just to see the overall usage as reported by Windows itself.

The Skylake results seem to be a mixed bag. The performance is adequate, certainly, but when we actually look at how much of the time the CPU is busy, overall utilization is quite low. The benchmark isn't really CPU limited, then, even with only four total threads.

Looking a little deeper using HWiNFO64, we can see that all four threads are following very similar patterns of utilization. The threads seem to hover between 26% and 80%, working hard the entire time. Usage is being spread across the cores fairly evenly here. You may be wondering why the time is a bit longer here than with the i7-5960X down below. That's because it took longer to load the benchmark itself, since it resides on a platter-based HDD here as opposed to a PCIe-based SSD on the i7-5960X rig.
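If you want to pull the same kind of per-thread range out of your own run, here is a minimal sketch, assuming the sensor readings were saved to a CSV log. The column names and sample values below are hypothetical stand-ins; a real HWiNFO64 log will have different headers and far more columns, so the `column_marker` argument would need adjusting.

```python
import csv
import io

# Hypothetical log excerpt; real column names and values will differ.
SAMPLE_LOG = """Time,Thread 0 Usage [%],Thread 1 Usage [%]
12:00:01,26.0,31.5
12:00:02,80.0,64.2
12:00:03,55.3,47.9
"""


def usage_ranges(log_text, column_marker="Usage [%]"):
    """Return {column: (min, max)} for columns whose name contains column_marker."""
    reader = csv.DictReader(io.StringIO(log_text))
    ranges = {}
    for row in reader:
        for name, value in row.items():
            if name is None or column_marker not in name:
                continue
            sample = float(value)
            low, high = ranges.get(name, (sample, sample))
            ranges[name] = (min(low, sample), max(high, sample))
    return ranges


for name, (low, high) in usage_ranges(SAMPLE_LOG).items():
    print(f"{name}: {low:.1f}% to {high:.1f}%")
```

Reporting the minimum and maximum per thread mirrors how the ranges are quoted in this article; an average would hide the spikes.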

What about the trusty Haswell-E, though, how does that fare with this benchmark?

Haswell-E, i7-5960X

The obligatory Task Manager info screen, showing a slightly low usage. That isn't quite the whole story however.

To delve a bit deeper, we'll use HWiNFO64 once again. Haswell-E seems to be quite the busy beast when paired with that same R9 Fury Nano. Thread usage isn't constant among the 16 threads, though one in particular gets to around 65% utilization, with the rest hovering between 1% and 54%. The mess of spaghetti below shows you the state of the various threads. Yes, it's certainly a lot of info, but the main point to take home is that utilization seems to revolve around a few cores, while the rest stay less used.

Benchmarking keeps performance with an R9 Nano at ~76FPS, a number that's only a few frames away from what it was using Skylake. It's curious that the performance is very similar despite having more cores to work with. Also, the i5-6600K doesn't appear to be going into any low-power states, and the utilization is consistent throughout the test. The same is not necessarily true for the i7-5960X.

Conclusions

It's quite possible that the underlying game itself was optimized with the Xbox One in mind, which has significantly fewer async compute units than current high-end PC hardware. Thus only a few in-game graphical assets are actually being pushed to the async compute queue, owing to the much smaller amount of resources available. And we're not entirely sure what kind of requests from the engine are being given to the async compute queue, only that there's clearly some measure of rendering happening there. These tasks could be very efficient in the normal pipeline for NVIDIA, given that its performance is similar to that of AMD's hardware. Any number of things could be going on under the hood.

The official statement is that compute is a large part of the workload, and that they're able to offload more than just lighting to that particular queue. This could potentially help alleviate bottlenecks and shortcomings. What we actually see, however, is that async compute is barely used at this point. What's actually being rendered down there is unknown, but whatever it is, it's not very much of the total work output.

Similarly, CPU usage seems to be very mixed. Skylake provides a great showing and certainly is not a limiting factor for any high-end GPU with this benchmark. Utilization, though, is far higher across the board than with Haswell-E, which has significantly more threads at its disposal. It's curious how the performance is actually largely the same. What's going on underneath is a mystery, though both are handling their work without issue with similar performance results. So at least there's that.

Now, this is just an analysis of a closed benchmark that isn't indicative of the behavior of the actual final build of the game. With more action on screen and more assets to light, render and make pretty, we could see an infusion of work in that compute queue, making good use of the async compute capabilities in DX12. At the moment, however, it's not actually there.

I'm excited for the final game, though, because the final product will indeed likely take advantage of the different DX12 components that we're all looking forward to. Async compute is one such technology that has the ability, if implemented fully and properly, to provide more complex scenes. Just imagine a resurgence of a texture compression algorithm similar to S3TC that can be done faster and more efficiently in the compute queue instead of in memory. We could have larger, and better looking textures without much of a performance hit. Not to mention all the other pretties we could have on screen.
