Groq's tensor streaming processor (TSP) silicon is now available to accelerate customers' AI workloads in the cloud. Cloud provider Nimbix offers the Groq hardware as an on-demand machine learning acceleration service, for "selected customers" only. Groq joins Graphcore as one of only two accelerator startups whose hardware is commercially available for customers to use through a cloud service.
Nimbix's CEO, Steve Hebert, stated: "Groq's simplified processing architecture is unique, providing unprecedented, deterministic performance for compute-intensive workloads, and is an exciting addition to our cloud-based AI and Deep Learning platform."
Groq's TSP chip, which launched last fall, is capable of 1,000 TOPS (1 peta-operations per second). Groq's recently published results show that the chip can achieve 21,700 inferences per second on ResNet-50 v2. According to Groq, this more than doubles the performance of GPU-based systems. The results posted by Groq indicate that its architecture is one of the fastest, and possibly the fastest, commercially available neural network processors.
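To put those figures in perspective, the short Python sketch below converts the quoted numbers into related quantities. The helper names are illustrative and not part of any Groq tooling. The per-inference time is simply the reciprocal of the quoted throughput, so it is an average, not a measured latency. The GPU figure is only the ceiling implied by the "more than doubles" claim, not a published result.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the article.
# Function names are hypothetical helpers, not a Groq API.

TERA = 10**12
PETA = 10**15


def tops_to_ops_per_second(tops):
    """Convert tera-operations per second to operations per second."""
    return tops * TERA


def average_microseconds_per_inference(inferences_per_second):
    """Reciprocal of throughput: average time per inference, in microseconds."""
    return 1e6 / inferences_per_second


def implied_baseline_ceiling(inferences_per_second, speedup):
    """Upper bound on a baseline's throughput, given a claimed minimum speedup."""
    return inferences_per_second / speedup


if __name__ == "__main__":
    peak = tops_to_ops_per_second(1_000)
    print("Peak ops/s:", peak, "(equals 1 peta-op/s:", peak == PETA, ")")
    print("Average time per ResNet-50 inference (us):",
          round(average_microseconds_per_inference(21_700), 2))
    print("Implied GPU throughput ceiling (inferences/s):",
          implied_baseline_ceiling(21_700, 2.0))
```

At 21,700 inferences per second, the average works out to roughly 46 microseconds per image, and "more than double" implies the GPU systems used for comparison delivered fewer than 10,850 inferences per second.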
Jonathan Ross, Groq's co-founder and CEO, stated: "These ResNet-50 results are a validation that Groq's unique architecture and approach to machine learning acceleration delivers substantially faster inference performance than our competitors." He also stated, "These real-world proof points, based on industry-standard benchmarks and not simulations or hardware emulation, confirm the measurable performance gains for machine learning and artificial intelligence applications made possible by Groq's technologies."
One key feature is that Groq's performance advantage doesn't rely on batching. Batching is a common technique in the data center, where multiple data samples are processed at a time to improve throughput. According to Groq, its architecture can reach peak performance even at batch = 1, a common requirement for inference applications that work on a stream of data arriving in real time. While the TSP offers a 2.5x latency advantage over GPUs at large batch sizes, Groq has stated that at batch = 1 the latency advantage is closer to 17x.
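The trade-off behind batching can be illustrated with a toy cost model. The Python sketch below assumes that processing a batch costs a fixed overhead plus a per-sample cost. This model, the function names, and every number in it are hypothetical illustrations, not measurements of Groq or GPU hardware.

```python
# Toy model of batching: batch time = fixed overhead + per-sample cost * batch size.
# All parameters are made-up values chosen only to show the shape of the trade-off.


def batch_time_ms(batch_size, fixed_overhead_ms, per_sample_ms):
    """Time to process one batch, which is also the latency seen by each sample in it."""
    return fixed_overhead_ms + per_sample_ms * batch_size


def throughput_per_second(batch_size, fixed_overhead_ms, per_sample_ms):
    """Samples completed per second when batches run back to back."""
    return batch_size / batch_time_ms(batch_size, fixed_overhead_ms, per_sample_ms) * 1000.0


if __name__ == "__main__":
    # Hypothetical device with a large fixed cost per batch: it needs big
    # batches to amortize that cost, and latency grows as the batch grows.
    for batch in (1, 8, 64):
        print("batch", batch,
              "latency_ms", round(batch_time_ms(batch, 2.0, 0.1), 2),
              "throughput_per_s", round(throughput_per_second(batch, 2.0, 0.1), 1))

    # Hypothetical device with no fixed cost per batch: throughput is already
    # at its peak at batch = 1, so nothing is gained by waiting to fill a batch.
    for batch in (1, 8, 64):
        print("batch", batch,
              "throughput_per_s", round(throughput_per_second(batch, 0.0, 0.1), 1))
```

In this model, a device with a high fixed cost gains throughput only by accepting higher latency per sample. A device with no fixed cost reaches the same throughput at batch = 1, which is the property Groq claims for its architecture.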