Stanford DAWNBench v1 Results: Intel’s Xeon Takes Inference Crown, NVIDIA V100 And Google TPU Achieve New Performance Milestones
Researchers at Stanford have posted the results of the DAWNBench benchmark and competition, and they contain some interesting numbers that show how much of a difference optimization can make to training time and cost. The results suggest there is no single all-round winner when it comes to AI workloads; instead, the honors are split between Intel's Xeon, Google's TPU v2 and NVIDIA's graphics processors.
Intel Xeon-only configuration takes the inference latency and cost efficiency throne
While I would urge anyone seriously interested to head over to the results page and read it in its entirety, we have taken the liberty of picking out some of the juicier bits and posting them below. The community achieved some truly impressive feats of performance optimization and cost efficiency. Where previously it took more than 10 days to train ImageNet, it can now be done in just under 31 minutes using half a Google TPU v2 pod, a speedup of 477x.
The inference latency and cost champion, on the other hand, turned out to be Intel Xeon Scalable processors (no GPUs), which processed 10,000 images for a mere $0.02 at a latency of 9.96 milliseconds. The researchers used an Intel-optimized version of Caffe; the closest competitor used an NVIDIA K80 GPU along with 4 CPUs, at a cost of $0.07 and a latency of 29.4 ms. That is an impressive achievement: roughly 3x lower latency and 3.5x lower cost using only CPUs.
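To make the comparison concrete, here is a minimal sketch that works out the ratios from the figures quoted above. The dictionary layout and the `ratio` helper are illustrative assumptions, not part of DAWNBench itself, and the K80 cost is assumed to cover the same 10,000 images.

```python
# Figures quoted in the article for ImageNet inference.
IMAGES = 10_000
xeon = {"cost_usd": 0.02, "latency_ms": 9.96}   # Intel Xeon Scalable, CPU only
k80 = {"cost_usd": 0.07, "latency_ms": 29.4}    # NVIDIA K80 GPU plus 4 CPUs


def ratio(baseline, improved):
    """How many times smaller `improved` is than `baseline`."""
    return baseline / improved


latency_ratio = ratio(k80["latency_ms"], xeon["latency_ms"])
cost_ratio = ratio(k80["cost_usd"], xeon["cost_usd"])
xeon_cost_per_image = xeon["cost_usd"] / IMAGES

print(f"latency: {latency_ratio:.2f}x lower")
print(f"cost: {cost_ratio:.1f}x lower")
print(f"Xeon cost per image: ${xeon_cost_per_image:.6f}")
```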
Team from fast.ai achieves results faster than advertised by NVIDIA using 8x V100s and sets new CIFAR10 record
Another highlight of the event was the team from fast.ai, which used an innovative method to drastically reduce training time and, using 8x V100 GPUs, set a new speed record for CIFAR10 training. The approach feeds the network low-resolution images at the start to reduce processing time, then gradually increases the resolution. This cuts training time without compromising the final accuracy of the model.
The fast.ai team achieved a 52x speedup using the NVIDIA V100s, dropping the training time from 2 hours 31 minutes all the way down to 2 minutes and 54 seconds. In doing so, they also reduced the cost from $8.35 to $0.26. They even demonstrated that you can train a model on CIFAR10 in a reasonable amount of time for free using nothing but Google Colaboratory.
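The idea of starting with small images and growing them over training can be sketched as a simple resolution schedule. This is only an illustration: the function name, the three-phase split, and the sizes 16, 24 and 32 are assumptions chosen for the example, not the settings the team actually used.

```python
def resolution_for_epoch(epoch, total_epochs, sizes=(16, 24, 32)):
    """Return the image side length (in pixels) to train with at `epoch`.

    Training is split into len(sizes) roughly equal phases, and each phase
    uses the next, larger resolution. The sizes here are hypothetical.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch must be in [0, total_epochs)")
    phase = epoch * len(sizes) // total_epochs
    return sizes[phase]


def relative_pixel_cost(size, full_size):
    """Fraction of the full-resolution pixel count processed at `size`."""
    return (size * size) / (full_size * full_size)


# Example: a 6-epoch run moves through the three resolutions in order.
schedule = [resolution_for_epoch(e, 6) for e in range(6)]
early_cost = relative_pixel_cost(schedule[0], schedule[-1])
print(schedule)
print(early_cost)
```

Because the pixel count grows with the square of the side length, the early epochs in this sketch touch only a quarter of the pixels of the final ones, which is where the higher initial throughput comes from.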
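The 52x figure can be checked directly from the quoted training times, and the same arithmetic gives the cost reduction. A small sketch, with variable names chosen for illustration:

```python
# Training times quoted in the article, converted to seconds.
before_s = 2 * 3600 + 31 * 60   # 2 hours 31 minutes
after_s = 2 * 60 + 54           # 2 minutes 54 seconds

# Training costs quoted in the article, in US dollars.
before_usd = 8.35
after_usd = 0.26

time_speedup = before_s / after_s
cost_reduction = before_usd / after_usd

print(f"time speedup: {time_speedup:.1f}x")
print(f"cost reduction: {cost_reduction:.1f}x")
```

The time ratio comes out at about 52x, matching the article, while the cost falls by roughly 32x.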
Other curated highlights from DAWNBench v1:
- For ImageNet inference, Intel submitted the best result in both cost and latency. Using an Intel optimized version of Caffe on high performance AWS instances, they reduced per image latency to 9.96 milliseconds and processed 10,000 images for $0.02.
- ResNet50 can now be trained on ImageNet in as little as 30 minutes with checkpointing and 24 minutes without checkpointing using half of a Google TPUv2 Pod, representing a 477x speed-up!
- The cheapest submission for ResNet50 on ImageNet ran in 8 hours 53 minutes for a total of $58.53 on a Google TPUv2 machine using TensorFlow 1.8.0-rc1, which is a 19x cost improvement over our best seed entry that used 8 Nvidia K80 GPUs on AWS.
- Other hardware and cloud providers weren’t far behind! Using PyTorch with 8 Nvidia V100 GPUs on AWS, fast.ai was able to train ResNet50 in 2 hours 58 minutes for a total of $72.50 with a progressive resizing technique from “Progressive Growing of GANs for Improved Quality, Stability, and Variation” and “Enhanced Deep Residual Networks for Single Image Super-Resolution” that increased the resolution of images over training to get higher throughput (images per second) at the beginning without loss in final accuracy.
- With only CPUs, Intel used 128 AWS instances with 36 cores each to train ImageNet in 3 hours and 26 minutes.
- On CIFAR10, starting from a ResNet164 from “Identity Mappings in Deep Residual Networks” that trained in 2 hours and 31 minutes on an Nvidia P100, training time fell to 2 minutes and 54 seconds thanks to fast.ai and their student team. Using a custom Wide ResNet architecture and 8 Nvidia V100s, they achieved a 52x speed-up.
- The team from fast.ai also dropped training cost from $8.35 to $0.26. Going even further they showed you can train a model on CIFAR10 in a reasonable amount of time for free using Google Colaboratory.
via DAWNBench v1, Stanford