
NVIDIA Launches Tesla M40 and Tesla M4 GPUs For Data Centers – Tegra X1 Powered Jetson TX1 Module Announced Too


NVIDIA has made several announcements this week, covering new products in its Tesla lineup along with a Tegra-based launch. A total of three new products have been announced for the data center and Tegra (Android/Linux) development markets. All three are powered by NVIDIA's current-generation Maxwell architecture and are aimed at the emerging machine learning and deep learning markets.

NVIDIA New Maxwell Tesla Cards - GM200 Powered Tesla M40 and GM206 Powered Tesla M4

The launch of the new Tesla cards comes two months after NVIDIA launched its first Maxwell-based Tesla cards. The Tesla M60 and Tesla M6 were launched to power GRID but were also available to buy through NVIDIA's OEM partners. Those first two cards were based on the GM204 core, and the Tesla M60 was the first dual-chip Maxwell offering, featuring two full GM204 GPUs. Today, NVIDIA is launching its first GM200 and GM206 powered Tesla parts, the Tesla M40 and Tesla M4. Dubbed "Hyperscale Accelerators", these new cards target the deep learning sector, which NVIDIA has focused on heavily since 2014.

Together, they enable developers to use the powerful Tesla Accelerated Computing Platform to drive machine learning in hyperscale data centers and create unprecedented AI-based applications.

"The artificial intelligence race is on," said Jen-Hsun Huang, co-founder and CEO of NVIDIA. "Machine learning is unquestionably one of the most important developments in computing today, on the scale of the PC, the internet and cloud computing. Industries ranging from consumer cloud services, automotive and health care are being revolutionized as we speak.

"Machine learning is the grand computational challenge of our generation. We created the Tesla hyperscale accelerator line to give machine learning a 10X boost. The time and cost savings to data centers will be significant," he said. via NVIDIA

So we have two new products, starting with the heavyweight Tesla M40, powered by the full GM200 core with 3072 CUDA cores, 192 TMUs and 96 ROPs. The card is configured to run at a boost clock of 1140 MHz. The Tesla M40 features 12 GB of GDDR5 VRAM on a 384-bit bus interface, clocked at an effective memory frequency of 6.00 GHz, which gives a total available bandwidth of 288.0 GB/s. The card has a peak FP32 throughput of 7.00 TFLOPs and just 0.21 TFLOPs of double precision (FP64) throughput, because Maxwell GPUs lack dedicated double precision hardware. That is going to change with the upcoming Pascal GPUs, which are built for peak compute performance and aimed at the HPC markets that require higher computational throughput. The card has a TDP of 250W, is powered by one 8-pin and one 6-pin connector, and comes with passive cooling, since the servers these cards are installed in provide the airflow needed to keep them stable under load.

The second card is the Tesla M4, a surprisingly tiny card and the first Tesla card to use the GM206 GPU core. It comes in a low-profile form factor and is the smallest Maxwell card we have seen yet that features the full GM206 GPU core. The card features 1024 CUDA cores, 64 TMUs and 32 ROPs. Boost clocks reach up to 1075 MHz. The Tesla M4 features 4 GB of GDDR5 memory on a 128-bit bus interface, clocked at an effective frequency of 5.5 GHz for 88.0 GB/s of bandwidth. The card is passively cooled and has a configurable TDP of 50W to 75W. Its peak performance is rated at 2.2 TFLOPs (FP32) and 0.07 TFLOPs (FP64).
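As a rough sanity check, the peak figures quoted for both cards can be reproduced from the core counts, clocks and bus widths given above. The following is a minimal Python sketch, not NVIDIA's own methodology. The function names are made up for illustration, it assumes one fused multiply-add (2 FLOPs) per CUDA core per clock, and it assumes a Maxwell FP64 rate of 1/32 of FP32, a ratio the article itself does not state.

```python
# Assumption: FP64 runs at 1/32 of the FP32 rate on these Maxwell parts.
FP64_RATIO = 1 / 32


def peak_fp32_tflops(cuda_cores: int, boost_mhz: float) -> float:
    """Peak FP32 rate, assuming one FMA (2 FLOPs) per core per clock."""
    return cuda_cores * 2 * boost_mhz * 1e6 / 1e12


def memory_bandwidth_gbs(bus_width_bits: int, effective_ghz: float) -> float:
    """Bandwidth in GB/s: bytes moved per transfer times the effective rate."""
    return bus_width_bits / 8 * effective_ghz


# Figures taken from the article text.
cards = {
    "Tesla M40": {"cores": 3072, "boost_mhz": 1140, "bus_bits": 384, "mem_ghz": 6.0},
    "Tesla M4": {"cores": 1024, "boost_mhz": 1075, "bus_bits": 128, "mem_ghz": 5.5},
}

for name, card in cards.items():
    fp32 = peak_fp32_tflops(card["cores"], card["boost_mhz"])
    fp64 = fp32 * FP64_RATIO
    bandwidth = memory_bandwidth_gbs(card["bus_bits"], card["mem_ghz"])
    print(f"{name}: {fp32:.2f} TFLOPs FP32, ~{fp64:.2f} TFLOPs FP64, {bandwidth:.1f} GB/s")
```

With the article's numbers this works out to about 7.0 TFLOPs and 288 GB/s for the Tesla M40, and about 2.2 TFLOPs and 88 GB/s for the Tesla M4, in line with the quoted specifications.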

NVIDIA believes that machine learning is an emerging market, and that is where these two cards are aimed. The target workloads include video transcoding, media processing, data analytics and deep learning inference. NVIDIA has also introduced its new NVIDIA Hyperscale Suite, which is meant to maximize utilization of the cards in such workloads. It offers real-time accelerated services for developers, optimized GPU support in the FFmpeg video processing framework and efficient image compute engines for dynamic image resizing at scale. NVIDIA has not announced pricing for these two products, but the Tesla M40 will be available in late Q4 (end of 2015) and the M4 will be available in Q1 2016.

NVIDIA Tesla Maxwell Lineup:

| Grid 2.0 Board Name | NVIDIA Tesla M60 | NVIDIA Tesla M40 | NVIDIA Tesla M10 | NVIDIA Tesla M6 | NVIDIA Tesla M4 |
|---|---|---|---|---|---|
| GPU Cores | 2048 x 2 (Dual Config), 4096 CUDA Cores | 3072 CUDA Cores | 2560 CUDA Cores | 1536 CUDA Cores | 1024 CUDA Cores |
| Memory | 16 GB GDDR5 (8 GB x 2) | 12 GB GDDR5 | 32 GB GDDR5 | 8 GB GDDR5 | 4 GB GDDR5 |
| Memory Bus | 256-bit x 2 | 384-bit | 128-bit x 4 | 256-bit | 128-bit |
| Max Users | 36 | Deep Learning Focused | 64 | 18 | Deep Learning Focused |
| H.264 (1080P @ 30 FPS) Streams | 2-32 | Deep Learning Focused | 28 | 1-16 | Deep Learning Focused |
| Form Factor | Dual-Slot PCI-Express | Dual Slot PCI-Express (Passive Cooling) | Dual Slot PCI-Express (Passive Cooling) | MXM Card | Single Slot PCI-Express (Low Profile Passive Cooling) |

NVIDIA Jetson TX1 Announced - Tegra X1 Maxwell Powered Module and Development Kit

The third product is the Jetson TX1, a Tegra X1, Maxwell-powered module and development kit. As the successor to the Tegra K1 based Jetson TK1, the Jetson TX1 improves in all possible ways, and the most notable difference is that the platform now comes as a credit card sized module rather than a full M-ATX form factor board. Aimed at smaller developers working on relatively small projects such as embedded systems and even mobile devices, the Jetson board offers all the hardware needed to begin development on such projects. The Jetson TX1 is offered as the small module, which is a complete working system on its own, and as a second variant that adds a separate board with the necessary I/O.

Jetson TX1 is the first embedded computer designed to process deep neural networks -- computer software that can learn to recognize objects or interpret information. This new approach to program computers is called machine learning and can be used to perform complex tasks such as recognizing images, processing conversational speech, or analyzing a room full of furniture and finding a path to navigate across it. Machine learning is a groundbreaking technology that will give autonomous devices a giant leap in capability. via NVIDIA

NVIDIA Jetson TX1 Specifications:

  • GPU: 1 teraflops, 256-core Maxwell architecture-based GPU offering best-in-class performance
  • CPU: 64-bit ARM A57 CPUs
  • Video: 4K video encode and decode
  • Camera: Support for 1400 megapixels/second
  • Memory: 4GB LPDDR4; 25.6 gigabytes/second
  • Storage: 16GB eMMC
  • Wi-Fi/Bluetooth: 802.11ac 2x2 Bluetooth ready
  • Networking: Gigabit Ethernet
  • OS Support: Linux for Tegra
  • Size: 50mm x 87mm, slightly smaller than a credit card

The NVIDIA Tegra X1 SoC is built on a 20nm process and uses ARM CPU cores, while the graphics side is powered by the ultra-efficient Maxwell core. The Tegra X1 (formerly known as Tegra Erista) features eight 64-bit ARM CPU cores with a full-fledged Maxwell GPU that has 2 SMM units enabled on the die, giving 256 CUDA cores. The CPU side combines four Cortex-A57 and four Cortex-A53 64/32-bit cores in two clusters on the same die, while the GPU delivers 1.0 TFLOPs of compute in 16-bit workloads (FP16) and around 500 GFLOPs in 32-bit workloads (FP32). The Jetson TX1 module consumes 10W of peak power while delivering the advertised throughput.

Compared to the 192 CUDA cores on the Kepler-based Tegra K1, Maxwell cores offer 40% better performance and twice the efficiency, delivering increased speed in gaming and other GPGPU applications on devices based on the Tegra X1 chip. At a high level, the Maxwell architecture is similar to its predecessor, Kepler, in that it is built from fundamental compute cores called CUDA cores along with Streaming Multiprocessors (SMs), Polymorph Engines, Warp Schedulers, Texture Caches and other hardware elements. But each hardware block on Maxwell has been optimized and upgraded with an intensive focus on power efficiency.

Specifications-wise, the 2 SMMs of the Maxwell GPU give a total of 256 CUDA cores with 16 ROPs and 16 texture units. The clock speed isn’t mentioned, but the chip pumps out a good 16 GTexels/s fill rate. The Maxwell GPU is also manufactured on the 20nm process, which should deliver improved energy efficiency compared to desktop variants. The memory clock is maintained at 1.6 GHz, delivering 25.6 GB/s of bandwidth, and the GPU has a 256 KB L2 cache. NVIDIA's Jetson TX1 module comes with 4 GB of LPDDR4 memory clocked at 3200 MHz effective, a 16GB eMMC flash module, 2x2 802.11ac / Bluetooth connectivity and a Gigabit Ethernet controller. The board that is offered separately has plenty of I/O options, including WiFi, Bluetooth, HDMI, an M.2 SSD slot, USB ports, a PCI-e 2.0 x4 slot, a 5 MP camera interface and an Ethernet port.

The Jetson TX1 is expected to hit retail on 16th November, with pre-orders starting from 12th November. The retail kits will be available for $599 US, or $299 for education. The stand-alone module is expected to go on sale later in Q1 2016 for $299 and only 1000 units will be available.
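Although the GPU clock is not stated, the quoted fill rate and compute figures hint at it. The following Python sketch is a back-of-the-envelope estimate, not an official specification. It assumes each texture unit produces one texel per clock, each CUDA core retires one FMA (2 FLOPs) per clock, FP16 runs at twice the FP32 rate, and the LPDDR4 interface is 64 bits wide. None of these assumptions is stated in the article, and the function names are illustrative only.

```python
# Figures quoted in the article.
TEXTURE_UNITS = 16
TEXEL_FILL_GTEXELS = 16.0
CUDA_CORES = 256
MEM_TRANSFER_MTS = 3200  # LPDDR4 at a 1.6 GHz clock, double data rate

# Assumption: 64-bit memory interface (not stated in the article).
MEM_BUS_BITS = 64


def implied_gpu_clock_mhz(fill_gtexels: float, texture_units: int) -> float:
    """Clock implied by the fill rate, assuming one texel per TMU per clock."""
    return fill_gtexels * 1000 / texture_units


def fp32_gflops(cuda_cores: int, clock_mhz: float) -> float:
    """Peak FP32 rate, assuming one FMA (2 FLOPs) per core per clock."""
    return cuda_cores * 2 * clock_mhz / 1000


def bandwidth_gbs(bus_bits: int, transfer_mts: float) -> float:
    """Memory bandwidth in GB/s from bus width and transfer rate."""
    return bus_bits / 8 * transfer_mts / 1000


clock = implied_gpu_clock_mhz(TEXEL_FILL_GTEXELS, TEXTURE_UNITS)
fp32 = fp32_gflops(CUDA_CORES, clock)
fp16 = fp32 * 2  # assumption: double-rate FP16
bandwidth = bandwidth_gbs(MEM_BUS_BITS, MEM_TRANSFER_MTS)

print(f"Implied GPU clock: {clock:.0f} MHz")
print(f"FP32: {fp32:.0f} GFLOPs, FP16: {fp16:.0f} GFLOPs")
print(f"Memory bandwidth: {bandwidth:.1f} GB/s")
```

Under these assumptions the implied GPU clock is about 1000 MHz, which gives roughly 512 GFLOPs of FP32 and about 1 TFLOPs of FP16 compute, consistent with the figures quoted for the Tegra X1, while the memory bandwidth works out to 25.6 GB/s.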

NVIDIA Jetson TX1 Module and Development Kit: