Nvidia Maxwell Architecture Analysis – Delivering Double the Performance Per Watt on 28 NM

Usman Pirzada • Feb 18, 2014 at 08:57pm EST

OK, the NDA has finally lifted and the official architecture documentation is up. It's finally time to take an in-depth look at Maxwell. It manages to double the performance per watt while staying on the same 28nm process, which, there is no other way to put it, is nothing short of a miracle. Let's see just how exactly it manages that.

Maxwell 28nm Miracle - How the Architecture makes Doubling Performance Per Watt Possible

Firstly, as I am sure most of you are aware, Maxwell does not work on SMX units. It works on SMMs, short for Maxwell Streaming Multiprocessors. Each SMM houses 128 CUDA cores as opposed to the 192 housed by an SMX. Unlike the Kepler architecture, where the CUDA cores sit in one monolithic block, Maxwell splits the CUDA cores of each SMM into 4 subsets, almost 4 separate "cores" within the SMM. Let's call these "major cores" (as opposed to CUDA cores) to avoid confusion. Do realize that this only refers to the 1st generation of Maxwell, and the division by four could change in the next generation. The major cores tactic allows Nvidia to achieve much higher efficiency and, by Nvidia's figures, raise performance per core to 135% of Kepler's, a gain of roughly 35%. Take a look at this diagram of Maxwell SMMs.
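To put the split into numbers, here is a minimal Python sketch. It uses only the core counts quoted above and reads the 135% figure as 1.35 times Kepler's per-core throughput, which is an assumption about how that number should be interpreted. The class and function names are made up for illustration and do not come from any Nvidia tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamingMultiprocessor:
    """Illustrative model of one SM; the field names are hypothetical."""
    name: str
    cuda_cores: int
    partitions: int  # independently scheduled blocks ("major cores")

    @property
    def cores_per_partition(self) -> int:
        return self.cuda_cores // self.partitions


KEPLER_SMX = StreamingMultiprocessor("SMX", cuda_cores=192, partitions=1)
MAXWELL_SMM = StreamingMultiprocessor("SMM", cuda_cores=128, partitions=4)

# Assumption: the 135% figure means 1.35x Kepler's per-core throughput.
PER_CORE_RATIO = 1.35


def kepler_equivalent_cores(sm: StreamingMultiprocessor, per_core_ratio: float) -> float:
    """Express an SM's throughput as a number of Kepler-class cores."""
    return sm.cuda_cores * per_core_ratio


smm_equivalent = kepler_equivalent_cores(MAXWELL_SMM, PER_CORE_RATIO)
smm_vs_smx = smm_equivalent / KEPLER_SMX.cuda_cores

print(f"CUDA cores per major core: {MAXWELL_SMM.cores_per_partition}")
print(f"One SMM relative to one SMX: {smm_vs_smx:.0%}")
```

Under that reading, a 128-core SMM lands at roughly 90% of a 192-core SMX while using a third fewer cores, which is where the efficiency argument comes from.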

Since we already know that each SMM has 128 CUDA cores, simple maths tells you this block diagram is of the GTX 750 Ti (128 × 5 = 640, the 750 Ti's CUDA core count). But the thing we are interested in is the division. Notice how each SMM is divided into 4 dedicated "major cores". This is one of the biggest architectural changes since Kepler, whose SMX consisted of just one big sheet. Let's zoom in, straight into a single SMM.
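The same arithmetic works in both directions, as this short Python sketch shows. The function names are hypothetical helpers, and the only inputs are the 128 cores per SMM and the 640-core total quoted above.

```python
CORES_PER_SMM = 128  # first-generation Maxwell, per the figures above


def total_cuda_cores(smm_count: int) -> int:
    """CUDA core count of a chip built from `smm_count` SMMs."""
    return smm_count * CORES_PER_SMM


def smm_count_for(total_cores: int) -> int:
    """Infer the number of SMMs from a chip's CUDA core count."""
    count, remainder = divmod(total_cores, CORES_PER_SMM)
    if remainder:
        raise ValueError(f"{total_cores} is not a multiple of {CORES_PER_SMM}")
    return count


print(total_cuda_cores(5))   # 5 SMMs in the diagram
print(smm_count_for(640))    # SMMs implied by a 640-core part
```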

If you were to count the CUDA cores you would count exactly 128. It is also very interesting how they have divided up the memory interface width (bus) between the major cores, giving 32 bits to each. The memory interface width of course adds up to 128 bits. Now here's the interesting part. There are two L1 caches, and each is shared by two major cores along with 4 texture units. The 64 KB of shared memory is shared between all 4 major cores, i.e. the entire SMM.
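The sharing arrangement is easier to see laid out as data. The Python sketch below is only an illustrative model of the description above: the dictionary keys and the numbering of major cores and L1 caches are assumptions made for the example, not anything taken from Nvidia's documentation.

```python
MAJOR_CORES = 4
CUDA_CORES_PER_MAJOR_CORE = 32
L1_CACHES = 2
TEXTURE_UNITS_PER_L1 = 4
SHARED_MEMORY_KB = 64  # a single pool for the whole SMM


def build_smm_layout() -> list[dict]:
    """Map each major core to the L1 cache and shared memory it uses."""
    major_cores_per_l1 = MAJOR_CORES // L1_CACHES
    layout = []
    for core_id in range(MAJOR_CORES):
        layout.append({
            "major_core": core_id,
            "cuda_cores": CUDA_CORES_PER_MAJOR_CORE,
            # Neighbouring pairs of major cores share one L1 cache.
            "l1_cache": core_id // major_cores_per_l1,
            # Every major core sees the same shared-memory pool.
            "shared_memory_kb": SHARED_MEMORY_KB,
        })
    return layout


layout = build_smm_layout()
total_cores = sum(entry["cuda_cores"] for entry in layout)
texture_units = L1_CACHES * TEXTURE_UNITS_PER_L1

for entry in layout:
    print(entry)
print(f"CUDA cores per SMM: {total_cores}, texture units per SMM: {texture_units}")
```

Major cores 0 and 1 end up on one L1 cache and 2 and 3 on the other, while the totals come back to 128 CUDA cores and 8 texture units per SMM.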

Here is the Kepler SMX in comparison:

By this point the major revolution of the Maxwell architecture should be becoming clear: division, division and more division. You might also have noticed that, unlike in the Kepler SMX, each warp scheduler has control over only its own major core. Nothing is shared between the 4 major cores except the FP64 and texture units (by the warp schedulers). With strength in numbers taken to an art form, it raises an interesting question: would the same division tactics yield the same benefits in other architectures? It also raises the question of whether, if we were somehow able to split the 128 CUDA cores into not 4 but 128 major cores with 1 bit each, we would have the perfectly efficient architecture.

In conclusion, I would also like to mention that there is something in the Maxwell architecture that Nvidia is not telling us. It is the 'secret sauce' approach, if you will; though it's childish, no one can argue with its effectiveness.


About the author: PC Hardware and Technology Enthusiast, Blood of Silicon (1 nm),
