The One Vision That Intel, AMD And Nvidia Are All Chasing – Why Heterogeneous Computing Is The Future

Khalid Moammer

The Solution To The Problem

The answer was GPUs, parallel processors which can easily continue to scale with Moore's law.
The industry could continue its reliance on the growth of transistor counts and densities enabled by Moore's Law to improve the performance of each design by increasing the number of parallel processors, instead of attempting to push the frequencies or complexity of a handful of CPU cores. This meant that computing performance could continue to scale for the foreseeable future.


Parallel processors had existed for years but were limited to a few applications. They have long been used in High Performance Computing (HPC) and graphics processing, among other applications. Graphics processing was an obvious target for parallel processors: if you needed to process colors for millions of pixels tens of times every second, a CPU was simply not going to cut it, while thousands of smaller, slower and more efficient processors were perfect for such an application.

But there was a catch. Not all code can run well on GPUs, and the vast majority of computer programmers in the world were either used to programming for a single fast serial processor or had learned programming on such a device. After all, the entire industry had relied on CPUs for several decades. If the industry was going to turn to parallel computing, it had to make writing code for these types of processors less challenging and more accessible for programmers.

From A Beautifully Simple Concept To An Industry-Wide Vision: Heterogeneous Computing

All of Intel's past actions and future roadmaps are strong indicators that this is the future the company envisions: a future where CPUs and GPUs work seamlessly together to address the new challenges of CPU performance scaling and to solve problems that CPUs and GPUs simply cannot solve separately.
AMD made it very clear from 2006 that their goal was to build the ultimate heterogeneous processor.
The company was able to put its vision into practice more quickly through the Heterogeneous System Architecture (HSA) Foundation, molding it into an industry-wide strategy that it hopes will bear fruit. The HSA Foundation was brought into existence as a collective industry effort to chase the untapped potential of heterogeneous designs and begin a new era of computing in which performance would scale again at the rate of Silicon Valley's golden age.

AMD Forms The HSA Foundation

HSA stands for Heterogeneous System Architecture. To understand what this foundation is all about, we need to take a few steps back. AMD's goal of building the ultimate heterogeneous processor meant that the company had its work cut out for it. Its vision for the next era of computing has a far-reaching effect on the entire industry, which made industry-wide collaboration crucial to the success of any effort to bring this vision to reality. Luckily for AMD, many companies shared its aspirations. Industry giants such as Samsung, Qualcomm, ARM, Imagination Technologies, MediaTek and Texas Instruments joined AMD in its efforts, and the HSA Foundation was born.

So what exactly is HSA & how does it solve the problem?

HSA is a relatively old concept based on a simple idea: run each piece of code on the processor that can execute it fastest and most efficiently. Serial code with a lot of branches and conditionals is well suited to the CPU because that is the fastest and most efficient processor for this type of code. On the other hand, code that is fairly short, less conditional and massively parallel, such as the code used in graphics to calculate what color each pixel on the screen should be, is well suited to a graphics processor.

GPUs differ from traditional CPUs in several key characteristics. CPUs generally devote far more resources to instruction decode and branch prediction because they tend to deal with more complex, branchy code. GPUs, on the other hand, are designed with a heavy emphasis on execution resources, because they deal with code that is relatively less complex and data that is massively more parallel. The weight therefore falls on the execution engines rather than on a front end that has to cope with the complexity of serial code.

A great, yet simple, example of a heterogeneous system is a gaming computer. The graphics processor does the heavy lifting for graphics, while the CPU deals with API communication, audio processing, artificial intelligence and gameplay physics such as bullet trajectories, hit boxes and so on. Now think of HSA as a significantly more sophisticated and versatile system based on the very same concept. Instead of the GPU and CPU working on two completely different tasks, such as graphics and AI, the processors can now share the exact same task, such as physics, with each processor taking care of a different stage of it. The stages that complete faster on the CPU are done by the CPU, and the stages that are more appropriate for the GPU are handled by the GPU.
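
The idea of splitting one task into stages can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the function names (run_serial, run_parallel, integrate, resolve_collision, physics_step) are hypothetical, a plain loop stands in for the CPU, and a thread pool stands in for the GPU. It does not use any real HSA API and says nothing about actual performance.

```python
from concurrent.futures import ThreadPoolExecutor

DT = 0.5  # time step; an arbitrary value chosen for the illustration


def run_serial(stage, items):
    """'CPU' stand-in: apply a branch-heavy stage to items one after another."""
    return [stage(item) for item in items]


def run_parallel(stage, items, workers=4):
    """'GPU' stand-in: apply the same small function to every item concurrently.

    A thread pool is used purely as a placeholder for a parallel processor.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(stage, items))  # map preserves input order


def integrate(body):
    """Data-parallel stage: each body advances independently of the others."""
    position, velocity = body
    return (position + velocity * DT, velocity)


def resolve_collision(body):
    """Branchy stage: clamp a body to the floor and reflect its velocity."""
    position, velocity = body
    if position < 0.0:
        return (0.0, -velocity)
    return (position, velocity)


def physics_step(bodies):
    """One physics task, two stages, each sent to the better-suited 'processor'."""
    moved = run_parallel(integrate, bodies)
    return run_serial(resolve_collision, moved)
```

Calling physics_step([(1.0, 2.0), (0.25, -1.0)]) moves both bodies in the parallel stage and then lets the serial stage bounce the second one off the floor.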

Luckily, this concept works exceptionally well because the majority of software out there has a healthy mix of serial and parallel workloads, making the heterogeneous processor the ideal candidate for a lot of software.

An example of such a task is building a suffix array. Suffix arrays are used in a variety of workloads, such as full-text index search, lossless data compression and bioinformatics.
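
For readers unfamiliar with the data structure, here is a minimal Python sketch of a suffix array and a full-text lookup on top of it. The function names are hypothetical, and the construction is the naive sort-all-suffixes approach chosen for readability; it is not the method used in any HSA demonstration, and production implementations use much faster algorithms.

```python
def build_suffix_array(text):
    """Return the start index of every suffix of text, in sorted suffix order.

    Naive construction: sorting compares whole suffixes, so this is slow on
    large inputs and is meant only to show what the structure contains.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted start positions of pattern, found by binary search."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])
```

For the text "banana", the suffix array is [5, 3, 1, 0, 4, 2], and searching for "ana" returns positions [1, 3].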


Though it sounds simple, there are multiple challenges that need to be overcome first, both on the hardware and software side.

The Software

Heterogeneous systems won't show their true benefits until capable and compatible software is available. Members of the HSA Foundation don't expect developers to start writing code directly in the foundation's instruction set language, which is called HSAIL. That is why they're going to rely on three major languages to deliver the benefits of HSA: OpenCL 2.0, Java and Microsoft's C++ AMP.

OpenCL 2.0 has already been ratified and it takes advantage of many aspects of HSA systems, such as memory coherency between the CPU and GPU via hUMA, which completely eliminates the need for copies between the CPU and GPU to maintain coherency. Thus, code written in OpenCL 2.0 that takes that into account will show greatly improved performance.

Java will support HSA systems via APARAPI, which compiles Java code to OpenCL. Much like Java, C++ AMP doesn't take advantage of HSA systems directly. However, a compiler that generates HSAIL has been in the works for quite some time and was finally released in July of this year. Java will address HSA directly with Java 9 by generating HSAIL from Java bytecode. Java 9's final release and public availability are set for September of next year.

The Hardware

Updated memory protocols were among the first orders of business for the foundation. Heterogeneous Uniform Memory Access (hUMA) was developed to significantly reduce the overhead previously needed for adequate communication between the various processors in a heterogeneous system. The technology completely eliminates the need for data copies between the CPU and GPU to achieve full memory coherency. Those copies were undoubtedly the most significant bottleneck in a heterogeneous system.

This means that various HSA compatible processors (GPUs, CPUs and audio processors, for example) can share data seamlessly. The insurmountable wall of overhead that was standing in front of the concept of HSA was finally torn down. The relevant code can now travel freely between different execution engines inside the system without any penalty. This was a crucial step in allowing the most appropriate processor for the task to gain quick access to the data it needs. This is what would eventually lead to the best possible performance and efficiency in an HSA compatible processor.
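
The difference between copying data and sharing it can be illustrated with a loose software analogy. The Python sketch below is an assumption-laden illustration only: the function names are hypothetical, both "processors" are ordinary Python code in one process, and a memoryview stands in for pointer-based sharing. It does not exercise hUMA or any real CPU/GPU interface.

```python
def scale_with_copies(host_data, factor):
    """Copy-based model: stage data in a separate 'device' buffer, then copy back.

    Returns the number of bytes moved between the two buffers.
    """
    device_buffer = bytearray(host_data)           # copy 1: host -> device
    for i in range(len(device_buffer)):
        device_buffer[i] = (device_buffer[i] * factor) % 256
    host_data[:] = device_buffer                   # copy 2: device -> host
    return 2 * len(host_data)


def scale_shared(host_data, factor):
    """Shared model: the 'device' works through a view of the very same bytes.

    Returns the number of bytes moved, which is zero because nothing is copied.
    """
    with memoryview(host_data) as view:            # a view, not a copy
        for i in range(len(view)):
            view[i] = (view[i] * factor) % 256
    return 0
```

Both functions leave the caller's bytearray in the same final state; the only difference is how many bytes had to be moved to get there, which is the overhead the paragraphs above describe.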


AMD delivered the first HSA-capable hardware with Kaveri in early 2014. This year, the company introduced Carrizo, which is the first HSA 1.0 compliant design in the industry.

It's not just the HSA Foundation that has been working on bringing its hardware and software up to par. Intel and Nvidia have also been working toward the same goal. Right behind the HSA Foundation, both Nvidia and Intel have independently developed their own implementations of unified memory access.

Nvidia's Push For Heterogeneous Software & Hardware

Nvidia introduced the capability to share virtual memory between the CPU and GPU with CUDA 6 and its Maxwell graphics architecture. The technology doesn't offer hardware-level unified memory access like AMD's hUMA, which was introduced with Kaveri APUs and the GCN 1.1 graphics architecture. What it does is simplify the method by which programmers address CPU and GPU memory. What it does not do, however, is allow the processors to share data via pointers, which is essential to addressing the most significant bottleneck in a heterogeneous design. Without reducing the huge latency imposed by data copies, no tangible performance or efficiency gains can be realized from a heterogeneous processor. So despite the memory pool being "virtually" unified in software, performance-draining data copies still have to be made between the CPU and GPU.

Full-blown hardware-level unified memory is something that Nvidia still has in the works. It will officially debut with the Pascal GPU architecture and the Tegra SOC that integrates it next year. It's important to remember that unified memory is an SOC feature. Both the CPU and the GPU have to be on the same chip to achieve this functionality. Sharing memory across two physically distant processors is not feasible due to the great amount of latency associated with it.

The code name for Nvidia's SOC with unified memory is still unknown. What we do know is that it's going to succeed the Tegra X1, most likely sometime next year.

Intel's Foray Into Heterogeneous Computing

Intel currently offers an extension to its Haswell SOCs that enables unified virtual memory, very much like what Nvidia has done with CUDA 6 and Maxwell. Intel even released a whitepaper on heterogeneous system architectures using OpenCL, very much like what the HSA Foundation is aiming to achieve with OpenCL 2.0. However, unlike Nvidia, Intel managed to follow AMD's Kaveri last year with its own fully memory-coherent design in Broadwell processors featuring the company's Gen8 graphics architecture.


Compute Architecture of Intel Processor Graphics Gen8
These new mechanisms can be used to maintain memory coherency and consistency for fine grained sharing throughout the memory hierarchy between CPU cores and devices. Moreover, the same virtual addresses can be shared seamlessly across devices. Such memory sharing is application programmable through emerging heterogeneous compute APIs such as the shared virtual memory (SVM) features specified in OpenCL 2.0. The net effect is that pointer-rich data-structures can be shared directly between application code running on CPU cores with application code running on Intel® Processor Graphics, without programmer data structure marshalling or cumbersome software translation techniques.

This is very much the essence of heterogeneous CPU/GPU compute. The introduction of memory coherency with Kaveri and Broadwell last year signals the very beginning of the second phase in the next era of computing: the industry has gone from the basic premise of combining the GPU and CPU on the same piece of silicon in 2011 to full memory coherency between the two engines in 2014.

What Does It All Mean?

Once you examine all the major players carefully, a crystal-clear image of the entire industry moving towards heterogeneous computing appears. AMD, Nvidia and Intel are all addressing the same challenges. As is usual with such cases, AMD chose to go for the open standard industry-wide route, where the entire industry (or as much of it as possible) collaborates to achieve a common goal. Nvidia chose to go for the proprietary route, while Intel took a more awkward position in the middle. They're making sure their hardware is going to be up to snuff, but are leaving a lot of the industry-wide software challenges for the industry to deal with rather than address them directly like the HSA foundation is doing. Of course, there are exceptions to this, but they remain very specific and quite limited in scope.

So to summarize what all this means, we only need to take a look at one of AMD's more intriguing press releases, in which the company talks about how it's going to improve the power efficiency of its processors by a factor of 25X, or to 2500% of the baseline, in just five and a half years' time. Heterogeneous computing is listed as the number one breakthrough that will drive this enormous power efficiency improvement.

It can be quite difficult to wrap our heads around a figure as large as 2500%. But what we're essentially being told is that by 2020, a 4W processor would have the computational capacity to rival the 100W processors of today. And that's certainly a future worth getting excited about.
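
The arithmetic behind that comparison is simple enough to check. The short Python sketch below uses hypothetical helper names and assumes that "25X power efficiency" means the same amount of work for one twenty-fifth of the power; the yearly figure it derives is an implied average and is not a number taken from AMD's announcement.

```python
def equivalent_power(power_watts, efficiency_gain):
    """Power needed to match a processor after an efficiency gain of N times."""
    return power_watts / efficiency_gain


def percent_of_baseline(efficiency_gain):
    """Express an N-times gain as a percentage of the starting efficiency."""
    return efficiency_gain * 100


def implied_yearly_gain(efficiency_gain, years):
    """Average year-over-year multiplier that compounds to the total gain."""
    return efficiency_gain ** (1 / years)
```

Here equivalent_power(100, 25) gives 4.0 watts and percent_of_baseline(25) gives 2500, matching the figures above, while implied_yearly_gain(25, 5.5) shows how quickly efficiency would have to compound each year to get there.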

 
