NVIDIA Titan V Reportedly Producing Errors in Scientific Simulations

Mar 25

NVIDIA’s prosumer-oriented $3000 Titan V reportedly suffers from a memory bug that causes it to produce erroneous results in scientific simulation workloads. Built on the company’s latest Volta architecture, the Titan V is powered by the largest GPU NVIDIA has ever made: the 815mm², 21.1 billion transistor GV100.

The Titan V, which was introduced late last year, is the most powerful discrete graphics card on the market today. It’s also the most expensive Titan NVIDIA has ever released. According to an engineer who spoke with The Register, the Titan V cannot reliably produce correct results under specific conditions. The card is said to suffer from a bug that causes it to produce different results when repeatedly running the same calculations.

One example cited is a simulation of an interaction between a protein and an enzyme, run several times with identical inputs. These calculations are supposed to produce identical results every time. However, two of the four Titan V cards the engineer tested would throw errors when running the same simulation.
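The kind of check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the engineer's actual test: `fingerprint`, `check_reproducibility` and `toy_simulation` are hypothetical names, and the toy workload runs on the CPU as a stand-in for a real GPU simulation. The idea is to run the same deterministic workload several times, hash the exact bit patterns of the output, and flag any run that disagrees with the majority.

```python
import hashlib
import struct
from collections import Counter
from typing import Callable, List, Sequence


def fingerprint(values: Sequence[float]) -> str:
    """Hash the exact 64-bit patterns of the results so a single-bit difference is visible."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


def check_reproducibility(simulate: Callable[[], Sequence[float]], runs: int = 5) -> List[int]:
    """Run the same workload `runs` times and return the indices of runs
    whose output differs from the most common result."""
    digests = [fingerprint(simulate()) for _ in range(runs)]
    majority, _ = Counter(digests).most_common(1)[0]
    return [i for i, digest in enumerate(digests) if digest != majority]


def toy_simulation() -> List[float]:
    """Hypothetical stand-in workload: a damped oscillator integrated with a fixed step."""
    x, v = 1.0, 0.0
    trajectory = []
    for _ in range(1000):
        a = -x - 0.1 * v
        v += a * 0.01
        x += v * 0.01
        trajectory.append(x)
    return trajectory


if __name__ == "__main__":
    bad_runs = check_reproducibility(toy_simulation, runs=4)
    print("runs that disagree with the majority:", bad_runs)
```

A bitwise comparison like this only makes sense for workloads that are deterministic by design. Some parallel GPU code legitimately varies in the last few bits between runs because floating-point additions are performed in a different order, so a real test would need to rule that out first.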

Issue Believed to Be Due to a Memory Design Flaw

The issue is thought to stem from a memory design flaw. According to an unnamed industry veteran who spoke with The Register, NVIDIA may be pushing the Titan V hardware to its limits, or perhaps beyond them. And unlike proper workstation graphics cards, such as the Quadro line and AMD’s Radeon Pro, NVIDIA has disabled error-correcting memory on the Titan V. These two factors combined, the veteran believes, could be why the Titan V suffers from memory read errors when dealing with such large datasets in memory.
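To see why an uncorrected memory read error matters, consider what one flipped bit does to a 64-bit floating-point value. The following Python sketch is purely illustrative and says nothing about how the Titan V actually fails; `flip_bit` is a hypothetical helper that toggles one bit in the IEEE 754 representation of a number.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit (0 = least significant, 63 = sign) toggled
    in its IEEE 754 double-precision representation."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    # Lowest mantissa bit, lowest exponent bit, and sign bit of 1.0.
    for bit in (0, 52, 63):
        print(f"bit {bit:2d} flipped: 1.0 -> {flip_bit(1.0, bit)!r}")
```

Depending on which bit is hit, the damage ranges from a change in the last decimal place to a halved value or a flipped sign. ECC memory exists to detect and correct this class of error before the wrong value reaches the calculation.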

Scientists rely on hardware to produce reliable data; otherwise they cannot trust the results of their tests. Errors of this kind make the Titan V unsuitable for tasks where precision is a key requirement. A calculator that can’t add up is useless, and if scientists can’t trust the results from a Titan V, they can’t afford to run simulations on it.

NVIDIA offered the following comment to The Register:

“All of our GPUs add correctly. Our Tesla line, which has ECC [error-correcting code memory], is designed for these types of large scale, high performance simulations. Anyone who does experience issues should contact support@nvidia.com.”