
NVIDIA Accelerates AI Inferencing With Pascal Based Tesla P40 and Tesla P4 GPU Accelerators – Also Announces 10W Drive PX 2 Board


NVIDIA has announced its latest Pascal based Tesla P40 and Tesla P4 GPU accelerators. The new cards are designed to accelerate AI / neural network inferencing, with a claimed boost of up to 45x over CPUs and around 4x over past generation GPUs. The GPU accelerators are backed by software tools that are meant to deliver a large increase in overall efficiency.

NVIDIA Tesla P40 and Tesla P4 Announced - Accelerating AI / Deep Neural Network Inferences

NVIDIA has created a platform for deep learning with its latest Tesla cards. The platform is segmented into training and inferencing GPUs. For AI training, NVIDIA offers the Tesla P100 with the fastest compute performance available to date in both FP16 and FP64. Together with the DIGITS training system and deep learning frameworks, this raises efficiency and performance. On the other hand, we have the inferencing cards, and this line is powered by the Tesla P40 and Tesla P4 accelerators.

The Tesla P4 and P40 are specifically designed for inferencing, which uses trained deep neural networks to recognize speech, images or text in response to queries from users and devices. Based on the Pascal architecture, these GPUs feature specialized inference instructions based on 8-bit (INT8) operations, delivering 45x faster response than CPUs and a 4x improvement over GPU solutions launched less than a year ago. via NVIDIA
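To make the idea of 8-bit inference instructions more concrete, here is a minimal Python sketch that emulates a four-way INT8 dot product with a 32-bit accumulator, the kind of operation CUDA exposes on these Pascal parts as the `__dp4a` intrinsic. The helper names (`pack_int8x4`, `unpack_int8x4`, `dp4a`) are made up for this illustration and only model the arithmetic, not how the hardware implements it.

```python
import struct

def pack_int8x4(values):
    """Pack four signed 8-bit integers into one 32-bit word (little-endian)."""
    return struct.unpack("<I", struct.pack("<4b", *values))[0]

def unpack_int8x4(word):
    """Unpack a 32-bit word back into four signed 8-bit integers."""
    return struct.unpack("<4b", struct.pack("<I", word))

def dp4a(a_word, b_word, acc):
    """Emulate a 4-way INT8 dot product added to a 32-bit accumulator.

    Hypothetical helper for illustration only.
    """
    a = unpack_int8x4(a_word)
    b = unpack_int8x4(b_word)
    total = acc + sum(x * y for x, y in zip(a, b))
    # Wrap the result into the signed 32-bit range.
    return (total + 2**31) % 2**32 - 2**31

a = pack_int8x4([1, -2, 3, -4])
b = pack_int8x4([5, 6, -7, 8])
result = dp4a(a, b, 10)
print(result)
```

Because one such operation handles four multiply-accumulates on data that fits in a single 32-bit register, it is plausible that peak INT8 throughput lands at roughly four times the FP32 figure.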

Replacing the Tesla M40 and Tesla M4, the Pascal based accelerators come with DeepStream SDK and TensorRT support. The two inferencing cards are based on the GP102 and GP104 GPUs, both of which are also used in NVIDIA's GeForce and Quadro products. Let's take a look at the specifications for these cards:

NVIDIA Tesla P40 "Pascal GP102" Specifications:

The Tesla P40 is the faster part of the two, featuring a full-fledged GP102 GPU core. The card has 3840 CUDA cores and 24 GB of GDDR5 memory. Clock speeds are maintained at 1303 MHz base and 1531 MHz boost. The memory is clocked at 7.2 GHz effective, which delivers 346 GB/s of bandwidth over a 384-bit interface. The chip packs 12 TFLOPs of FP32 and 47 TOPs of INT8 compute performance in a 250W TDP package. Like the Tesla M40 before it, the P40 comes in a passively cooled form factor.
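The headline figures can be sanity-checked from the clocks, core count and bus width. The sketch below assumes the usual conventions: bandwidth is the effective memory clock times the bus width in bytes, peak FP32 counts one fused multiply-add (2 FLOPs) per CUDA core per boost clock, and INT8 is taken as 4x the FP32 rate. The function names are illustrative.

```python
def memory_bandwidth_gbs(effective_clock_ghz, bus_width_bits):
    """Peak memory bandwidth in GB/s: transfers per second times bytes per transfer."""
    return effective_clock_ghz * bus_width_bits / 8

def peak_fp32_tflops(cuda_cores, boost_clock_mhz):
    """Peak FP32 throughput, assuming 2 FLOPs (one FMA) per core per clock."""
    return 2 * cuda_cores * boost_clock_mhz * 1e6 / 1e12

# Tesla P40 figures quoted in the article.
p40_bandwidth = memory_bandwidth_gbs(7.2, 384)
p40_fp32 = peak_fp32_tflops(3840, 1531)
p40_int8 = 4 * p40_fp32  # assumption: INT8 runs at 4x the FP32 rate

print(f"Bandwidth: {p40_bandwidth:.1f} GB/s")
print(f"FP32:      {p40_fp32:.2f} TFLOPs")
print(f"INT8:      {p40_int8:.1f} TOPs")
```

The results round to the 346 GB/s, 12 TFLOPs and 47 TOPs quoted for the card.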

NVIDIA Tesla P4 "Pascal GP104" Specifications:

The Tesla P4, on the other hand, features the GP104 core. It has the full 2560 CUDA cores but runs at much lower clock speeds of 810 MHz base and 1063 MHz boost. This has to do with the small form factor in which the card is offered, as it is designed for blade servers. The P4 also comes in a 50-75W package, which is much lower than the GTX 1080's 180W TDP. The GTX 1080 features the same core count but has higher clock speeds, reaching up to 2 GHz. The P4 is clocked at roughly half that rate, hence the higher power efficiency.

The rest of the specifications include 8 GB of video memory. The memory is clocked at 6 GHz, which delivers 192 GB/s of bandwidth over a 256-bit bus. The compute performance for this card is rated at 5.5 TFLOPs (FP32) and 22 DLTOPs (INT8). No price has been announced for the Tesla P40 or Tesla P4, but they are expected to hit the market through OEM channels in late Q4 (October-November) 2016.

NVIDIA Tesla P40 and Tesla P4 Specifications:

| Product Name | Tesla M4 | Tesla M40 | Tesla P4 | Tesla P40 |
| --- | --- | --- | --- | --- |
| GPU Architecture | Maxwell GM206 | Maxwell GM200 | Pascal GP104 | Pascal GP102 |
| GPU Process | 28nm | 28nm | 16nm FinFET | 16nm FinFET |
| CUDA Cores | 1024 | 3072 | 2560 | 3840 |
| Boost Clock | 1072 MHz | 1114 MHz | 1063 MHz | 1531 MHz |
| FP32 Compute | 2.20 TFLOPs | 7.00 TFLOPs | 5.50 TFLOPs | 12.0 TFLOPs |
| INT8 Compute | N/A | N/A | 22 DLTOPs | 47 DLTOPs |
| Memory Clock | 5.5 GHz | 6.0 GHz | 6.0 GHz | 7.2 GHz |
| Memory Bus | 128-bit | 384-bit | 256-bit | 384-bit |
| Memory Bandwidth | 88.0 GB/s | 288.0 GB/s | 192.0 GB/s | 346 GB/s |
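For a quick read on the generational jump, the table's FP32 and memory bandwidth figures can be turned into ratios between each Pascal card and the Maxwell card it replaces. This is only a back-of-the-envelope sketch using the numbers listed above. The dictionary layout and the `uplift` helper are illustrative.

```python
# FP32 compute (TFLOPs) and memory bandwidth (GB/s) as listed in the table.
specs = {
    "Tesla M4":  {"fp32_tflops": 2.2,  "bandwidth_gbs": 88.0},
    "Tesla M40": {"fp32_tflops": 7.0,  "bandwidth_gbs": 288.0},
    "Tesla P4":  {"fp32_tflops": 5.5,  "bandwidth_gbs": 192.0},
    "Tesla P40": {"fp32_tflops": 12.0, "bandwidth_gbs": 346.0},
}

def uplift(new, old, key):
    """Ratio of a spec between a new card and the card it replaces."""
    return specs[new][key] / specs[old][key]

for new, old in [("Tesla P4", "Tesla M4"), ("Tesla P40", "Tesla M40")]:
    fp32 = uplift(new, old, "fp32_tflops")
    bandwidth = uplift(new, old, "bandwidth_gbs")
    print(f"{new} vs {old}: {fp32:.2f}x FP32, {bandwidth:.2f}x bandwidth")
```

On paper the FP32 gains are larger than the bandwidth gains for both cards, and the new INT8 path has no counterpart in the Maxwell columns.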

Software Tools for Faster Inferencing

Complementing the Tesla P4 and P40 are two software innovations to accelerate AI inferencing: NVIDIA TensorRT and the NVIDIA DeepStream SDK.

TensorRT is a library created for optimizing deep learning models for production deployment that delivers instant responsiveness for the most complex networks. It maximizes throughput and efficiency of deep learning applications by taking trained neural nets — defined with 32-bit or 16-bit operations — and optimizing them for reduced precision INT8 operations.
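As a rough illustration of what "optimizing for reduced precision INT8" involves, the sketch below quantizes floating point weights and activations to 8-bit integers with a single symmetric scale per tensor, accumulates the dot product in integers, and rescales the result. This is not the TensorRT API, and the library's own calibration is likely more sophisticated. `quantize_int8` and `int8_dot` are hypothetical helpers.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def int8_dot(a, b):
    """Dot product computed on quantized integers, then rescaled to float."""
    qa, scale_a = quantize_int8(a)
    qb, scale_b = quantize_int8(b)
    acc = sum(x * y for x, y in zip(qa, qb))  # integer accumulation
    return acc * scale_a * scale_b

weights = [0.5, -1.0, 0.25, 0.75]
activations = [2.0, 1.0, -4.0, 0.5]

exact = sum(w * x for w, x in zip(weights, activations))
approx = int8_dot(weights, activations)
print(f"FP32 result: {exact:.4f}")
print(f"INT8 result: {approx:.4f}")
```

The integer result tracks the floating point one closely for this toy input. For real networks, how the scale is chosen generally determines how much accuracy survives the conversion.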

NVIDIA DeepStream SDK taps into the power of a Pascal server to simultaneously decode and analyze up to 93 HD video streams in real time compared with seven streams with dual CPUs. This addresses one of the grand challenges of AI: understanding video content at-scale for applications such as self-driving cars, interactive robots, filtering and ad placement. Integrating deep learning into video applications allows companies to offer smart, innovative video services that were previously impossible to deliver.

NVIDIA Offers 10W, Palm-Sized Energy-Efficient AI Computer for Self-Driving Cars

NVIDIA also announced a new Drive PX 2 board for self-driving cars. While the original design uses two Parker SoCs, the new model is a single-chip design. With a TDP of just 10W and a much smaller board footprint, the new board makes the AI supercomputer more affordable.

"Baidu and NVIDIA are leveraging our AI skills together to create a cloud-to-car system for self-driving," said Liu Jun, vice president of Baidu. "The new, small form-factor DRIVE PX 2 will be used in Baidu's HD map-based self-driving solution for car manufacturers." via NVIDIA

The new single-processor DRIVE PX 2 will be available to production partners in the fourth quarter of 2016. DriveWorks software and the DRIVE PX 2 configuration with two SoCs and two discrete GPUs are available today for developers working on autonomous vehicles.

NVIDIA Drive PX 2 Single Chip Board: