AMD Vega Demoed, Outperforms Nvidia’s GTX 1080 – Features 8GB of HBM2 & 512GB/s Of Bandwidth

Dec 12, 2016

AMD has just demoed the gaming capability of an upcoming enthusiast Radeon graphics card powered by its next-gen Vega GPU, and the results are in. In a head-to-head performance showdown with an overclocked GTX 1080, an equally specced Vega-equipped system outperformed its counterpart running Doom at 4K in Vulkan.

AMD Radeon Instinct MI25

AMD Showcases Vega In Action – Running Doom In Vulkan At 4K & DeepBench GEMM

At the event, AMD showcased two systems featuring graphics cards powered by its next-generation Vega GPU. The first had a Radeon Instinct MI25 graphics accelerator equipped with 16GB of HBM2, second-generation high bandwidth memory. This system was used to demonstrate Vega’s capabilities in deep learning, which were quite impressive. The MI25 outperformed Nvidia’s Tesla P100 accelerator, based on the GP100 GPU, at two key AI workloads.


The second system was configured with the consumer version of Vega, equipped with 8GB of HBM2. However, unlike the MI25, AMD was very secretive about what this consumer card looked like. It was not shown to the press, and to keep its appearance a secret all ventilation and fan inlets were taped shut, which obviously deprived the machine of any airflow. The folks over at PCGamesHardware.de made a point of noting that the graphics card may well have been throttling, as AMD’s secrecy measures made things quite toasty inside.

AMD Vega demo system, photo courtesy of pcgameshardware.de

Additionally, the demo was conducted using an ordinary Fiji (Radeon R9 Fury X, Fury and Nano) driver with an additional debugging layer. No Vega-optimized driver was used. Despite this, the consumer Vega graphics card was able to outperform a GTX 1080 running at 1911MHz by 10%. It is worth noting, however, that Doom’s Vulkan implementation has been shown to run faster on AMD GPUs.

That said, with optimized drivers and proper cooling it’s likely that AMD will squeeze more performance out of Vega before launch. The folks over at pcgameshardware.de have also confirmed that this is the very same graphics card, “687F:C1”, that we spotted mingling with GTX 1080s on the AOTS benchmark leaderboard a couple of weeks back.


Vega’s Confirmed Specs

Members of the press inside the demo room were able to spot some key specifications pertaining to Vega by taking a look at the expanded statistics in Doom. 8GB of HBM2 memory for the consumer version of Vega was confirmed.

Additionally, an employee let slip a key specification that wasn’t supposed to be made public yet: Vega 10 features 512GB/s of memory bandwidth. The memory capacity and bandwidth are clear indications that Vega 10 has a 2048-bit wide memory interface, half that of its older sibling, Fiji. However, because HBM2 is rated at twice the speed of HBM1, Vega 10 is able to achieve the same 512GB/s of memory bandwidth.
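The bandwidth reasoning above can be checked with a few lines of arithmetic. This is only a sketch: the per-pin data rates (1 Gbit/s for first-generation HBM, 2 Gbit/s for HBM2) are the commonly quoted figures for those standards rather than numbers AMD stated at the event, and the function name is ours.

```python
def memory_bandwidth_gbs(bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s: bus width (bits) x per-pin rate (Gbit/s) / 8 bits per byte."""
    return bus_width_bits * pin_rate_gbps / 8


# Assumed per-pin rates: 1 Gbit/s for HBM1 (Fiji), 2 Gbit/s for HBM2 (Vega 10).
fiji_gbs = memory_bandwidth_gbs(bus_width_bits=4096, pin_rate_gbps=1.0)
vega10_gbs = memory_bandwidth_gbs(bus_width_bits=2048, pin_rate_gbps=2.0)

print(f"Fiji:    {fiji_gbs:.0f} GB/s")    # 512 GB/s
print(f"Vega 10: {vega10_gbs:.0f} GB/s")  # 512 GB/s
```

Half the bus width at double the per-pin rate lands on the same 512GB/s, which is why the leaked figure points to a 2048-bit interface.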

In terms of graphics horsepower, the Vega 10 powered MI25 accelerator is rated at a staggering 12.5 TERAFLOPS of single precision floating point compute and double that in half precision FP16 compute. That’s 1.5 TERAFLOPS more than Nvidia’s Tesla P100 accelerator, powered by the monstrous 610mm² GP100 GPU and 2.5 TERAFLOPS more than the GTX 1080.

The MI25 is a professional, passively cooled product. The gaming oriented variant of Vega, equipped with more aggressive cooling solutions and running at higher clock speeds, would naturally be expected to achieve an even higher figure.

Vega’s Next Generation Compute Unit Architecture

Vega is based on a brand new graphics architecture, the particulars of which we detailed briefly in our exclusive piece about Vega 10 and Vega 11. AMD confirmed in today’s announcement what we brought you back in October: Vega makes use of a brand new compute unit design called NCU, short for Next Compute Unit.

AMD hasn’t discussed any details pertaining to the new design. However, we’re going to give you an exclusive high-level look at NCU. This new architecture holds several key advantages over its predecessor. Chief among them is that each Vega NCU is now capable of simultaneously processing variable-length wavefronts. To understand why this is such a big deal, we have to look at AMD’s current GCN implementation.

In AMD’s current GCN implementation, each compute unit has four 16-wide vector SIMD units, capable of executing four 16-wide wavefronts (a wavefront being a group of threads) over four cycles, in addition to one scalar unit capable of executing one instruction per cycle. The scalar unit is delegated time-critical tasks for which the four-cycle turnaround of the SIMD units isn’t sufficient.

Unfortunately, these 16-wide SIMD units work exactly the same no matter how small a wavefront they’re fed. Executing a 4-wide wavefront takes just as long as executing a 16-wide one, leaving the other 12 ALUs inside the SIMD idle. And as graphics workloads are inherently non-uniform, it’s effectively impossible to find a scenario where all 16-wide SIMD units are fully occupied at any given time.
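To put a number on that waste, here is a minimal sketch of lane utilization under the model this article describes (a 16-lane SIMD that takes the same time regardless of wavefront width). The function name and the sample widths are illustrative assumptions, not AMD figures.

```python
SIMD_WIDTH = 16  # lanes (ALUs) per SIMD unit, per the description above


def lane_utilization(wavefront_width: int, simd_width: int = SIMD_WIDTH) -> float:
    """Fraction of SIMD lanes doing useful work for one wavefront."""
    if not 0 < wavefront_width <= simd_width:
        raise ValueError("wavefront width must be between 1 and the SIMD width")
    return wavefront_width / simd_width


# Illustrative wavefront widths only.
for width in (16, 8, 4):
    idle = SIMD_WIDTH - width
    print(f"{width:>2}-wide wavefront: {lane_utilization(width):.0%} busy, {idle} lanes idle")
```

In this model a 4-wide wavefront keeps only a quarter of the SIMD busy, which is the 12 idle ALUs mentioned above.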

Variable Width SIMDs, Getting More Performance Out Of Fewer Cycles

This is no longer the case in AMD’s new GCN implementation inside Vega. The V9 architecture includes clever new schedulers and coherency subsystems that allow several smaller wavefronts to be executed simultaneously inside any SIMD that’s able to accommodate the workload. In effect, this allows each NCU to finish considerably more work in the same amount of time compared to its predecessor, while also freeing up valuable cache and memory resources for other compute units.
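AMD has not said how the new schedulers work, so the following is purely a toy model of the idea: it treats each SIMD pass as a 16-lane bin and packs small wavefronts into it with a simple first-fit-decreasing heuristic. The heuristic, the function names and the sample workload are all our assumptions, chosen only to show why co-scheduling small wavefronts cuts the number of passes.

```python
from typing import List

SIMD_WIDTH = 16  # lanes per SIMD pass, per the description above


def passes_fixed(widths: List[int]) -> int:
    """Old behaviour in this model: every wavefront occupies a full pass on its own."""
    return len(widths)


def passes_packed(widths: List[int], simd_width: int = SIMD_WIDTH) -> int:
    """Hypothetical packing: first-fit-decreasing of wavefronts into shared passes."""
    free_lanes: List[int] = []  # remaining free lanes in each pass opened so far
    for width in sorted(widths, reverse=True):
        if not 0 < width <= simd_width:
            raise ValueError("wavefront width must be between 1 and the SIMD width")
        for i, free in enumerate(free_lanes):
            if width <= free:
                free_lanes[i] -= width  # fits alongside earlier wavefronts
                break
        else:
            free_lanes.append(simd_width - width)  # open a new pass
    return len(free_lanes)


workload = [4, 4, 8, 16, 2, 6, 4, 4]  # illustrative wavefront widths (48 lanes of work)
print("one wavefront per pass:", passes_fixed(workload))
print("packed into shared passes:", passes_packed(workload))
```

For this made-up workload the packed model needs three passes instead of eight. Real hardware would face constraints this sketch ignores, such as register pressure and instruction divergence between the co-scheduled wavefronts.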

AMD Vega architecture

It’s very hard to predict how much of a difference an improvement this large in resource utilization and CU occupancy will make, given how unpredictable and inherently variable graphics workloads are. Which brings us neatly to Vega’s rumored specs.

Vega, The Rumored Specs

One of the few things that AMD has not talked about regarding Vega’s specifications to date is the number of GCN stream processors it actually has. Vega 10 is believed to have 4096 GCN stream processors, according to the LinkedIn page of a leading engineer, which leaked earlier this year.

Assuming that this figure is accurate, Vega 10 would have to operate at a frequency 20% higher than Polaris 10 to achieve the 12.5 TFLOPS of the Radeon Instinct MI25. We’re talking 1520MHz+ on a passively cooled enterprise GPU, a clock speed that only a few overclocked, mostly liquid-cooled RX 480 cards can achieve. None of AMD’s current or past professional-grade graphics cards or accelerators come close to that. We’ve also never seen such a large hike in clock speeds from one graphics generation to another on the same process node generation.
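The clock speed estimate above follows from the usual peak-throughput formula: stream processors x 2 FLOPs per clock (one fused multiply-add) x frequency. A quick sketch, assuming the rumored 4096 stream processors and taking the RX 480’s 1266MHz reference boost clock as the Polaris 10 baseline (our choice of baseline, not a figure from AMD’s presentation):

```python
def required_clock_mhz(tflops: float, stream_processors: int,
                       flops_per_sp_per_clock: int = 2) -> float:
    """Clock (MHz) needed to hit a peak FP32 rating, counting an FMA as 2 FLOPs."""
    flops_per_clock = stream_processors * flops_per_sp_per_clock
    return tflops * 1e12 / flops_per_clock / 1e6


RX480_BOOST_MHZ = 1266  # assumption: RX 480 reference boost clock as the Polaris 10 baseline

clock_mhz = required_clock_mhz(tflops=12.5, stream_processors=4096)
uplift = clock_mhz / RX480_BOOST_MHZ - 1

print(f"Required clock: {clock_mhz:.0f} MHz")
print(f"Increase over RX 480 boost: {uplift:.1%}")
```

With these inputs the required clock works out to roughly 1526MHz, about 20% above the RX 480’s boost clock.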

AMD Vega 10 & Vega 11 GPUs

| Graphics Card | Radeon R9 Fury X | Radeon RX 480 | Radeon RX Vega Series | Radeon RX Vega (Gaming) | Radeon RX Vega Frontier Edition | Radeon RX Vega Pro Duo |
| --- | --- | --- | --- | --- | --- | --- |
| GPU | Fiji XT | Polaris 10 | Vega 11 | Vega 10 | Vega 10 | 2x Vega 10 |
| Process Node | 28nm | 14nm FinFET | FinFET | FinFET | FinFET | FinFET |
| Stream Processors | 4096 | 2304 | TBA | TBA | 4096 | Up to 8192 |
| Performance | 8.6 TFLOPS / 8.6 (FP16) TFLOPS | 5.8 TFLOPS / 5.8 (FP16) TFLOPS | TBA | >13 TFLOPS / >25 (FP16) TFLOPS | ~13 TFLOPS / ~25 (FP16) TFLOPS | TBA |
| Memory | 4GB HBM | 8GB GDDR5 | TBA | TBA | 16GB HBM2 | TBA |
| Memory Bus | 4096-bit | 256-bit | TBA | 2048-bit | 2048-bit | 4096-bit |
| Bandwidth | 512GB/s | 256GB/s | TBA | TBA | 480GB/s | TBA |
| TDP | 275W | 150W | TBA | TBA | TBA | TBA |
| Launch | 2015 | 2016 | 2017 | June-July 2017 | June 2017 | TBA |

It’s more plausible that this 20% improvement actually comes from the IPC (instructions per clock) improvement of the new architecture. In fact, it’s not unlikely that the MI25 runs at an even lower frequency than the RX 480, especially considering it’s a 300W, passively cooled enterprise part. That would indicate that 20%+ of the chip’s performance stems directly from architecture-based enhancements.

Whether that’s actually the case or not remains to be seen. A combination of IPC uplift and higher clock speeds is probably the most plausible scenario.
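The trade-off between the two explanations can be sketched numerically. Following this article’s framing, the snippet below asks how much extra per-clock throughput (relative to the conventional 2 FLOPs per stream processor per clock) would be needed to reach 12.5 TFLOPS at a given frequency. The 4096 stream processor count is the rumored figure, and the candidate clocks are illustrative picks of ours, not leaked specifications.

```python
TARGET_TFLOPS = 12.5      # MI25 rating quoted above
STREAM_PROCESSORS = 4096  # rumored figure, not confirmed by AMD
BASELINE_FLOPS_PER_SP = 2 # one fused multiply-add per clock


def per_clock_uplift_needed(clock_mhz: float) -> float:
    """Per-clock throughput multiplier needed to reach the target at a given clock."""
    baseline_tflops = STREAM_PROCESSORS * BASELINE_FLOPS_PER_SP * clock_mhz * 1e6 / 1e12
    return TARGET_TFLOPS / baseline_tflops


# Illustrative clocks: RX 480 boost, a midpoint, and the all-clock-speed scenario.
for clock in (1266, 1400, 1526):
    print(f"{clock} MHz -> needs {per_clock_uplift_needed(clock):.2f}x per-clock throughput")
```

At the RX 480’s 1266MHz the whole gap (about 1.2x) would have to come from per-clock gains, at around 1526MHz none of it would, and anything in between corresponds to the combined scenario.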
