A Taiwanese company has announced a new PCIe AI accelerator card that it says can run 700B-parameter LLMs locally at just 240W, potentially reducing the need for large GPU clusters.
Taiwanese Company Unveils Its PCIe AI Accelerator That Devalues Large-Scale AI Installations By Running 700B LLMs on A Single Card
Skymizer, a Taiwan-based company specializing in AI software and hardware, has announced a brand-new solution, the HTX301. Designed for on-prem AI, the HTX301 comes as a PCIe add-in card and is claimed to offer large-scale levels of AI performance at a sub-250W TDP.
Some of the highlights of the card include:
- Run 700B-parameter model inference on a single PCIe card.
- Purpose-built decode acceleration paired with unified prefill/decode orchestration.
- On-prem AI with data sovereignty, deterministic latency, and fixed infrastructure cost.
The company says that the HTX301 PCIe AI accelerator is its first inference chip built on the HyperThought platform, which features its next-generation LPU IP. The platform is purpose-built for LLMs, with performance and power efficiency in mind.
The HTX301 looks like a standard PCIe card, featuring a single chip with memory arranged around it. The company explains that each board will feature six HTX301 chips. Despite being based on an older 28 nm process, the LPU is said to deliver exceptional results, such as 30 tokens/second with just 0.5 TOPS of compute at 100 GB/s of memory bandwidth. The LPU is also highly scalable, which allows for various design options.
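To put the 30 tokens/second figure in context, here is a rough back-of-the-envelope sketch in Python. It assumes that decode is memory-bandwidth-bound and that every generated token streams the full set of weights from DRAM once. Skymizer has not stated which model the figure refers to, so the 7B parameter count below is purely our assumption, and the function names are our own.

```python
def bytes_per_token(bandwidth_gb_s: float, tokens_per_s: float) -> float:
    """Bytes read from DRAM per generated token, assuming decode is bandwidth-bound."""
    return bandwidth_gb_s * 1e9 / tokens_per_s


def ops_per_token(tops: float, tokens_per_s: float) -> float:
    """Compute budget per generated token at the quoted throughput."""
    return tops * 1e12 / tokens_per_s


def implied_bits_per_weight(bandwidth_gb_s: float, tokens_per_s: float,
                            params_billion: float) -> float:
    """Average bits per weight if each token reads every weight exactly once."""
    total_bits = bytes_per_token(bandwidth_gb_s, tokens_per_s) * 8
    return total_bits / (params_billion * 1e9)


# Figures quoted by Skymizer: 30 tokens/s, 0.5 TOPS, 100 GB/s.
print(f"{bytes_per_token(100, 30) / 1e9:.2f} GB read per token")   # 3.33
print(f"{ops_per_token(0.5, 30) / 1e9:.2f} GOPs per token")        # 16.67

# ASSUMPTION: a 7B-parameter model (not confirmed for this figure).
print(f"{implied_bits_per_weight(100, 30, 7):.2f} bits per weight")  # 3.81
```

Under those assumptions, the quoted numbers would line up with a 7B-class model stored at a little under 4 bits per weight, though that is our reading and not something the company has confirmed.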
The Octa-Core LPU achieves 240 tokens/second in Llama2 7B prefill, and the company can connect multiple chips together for up to 1200 tokens/second on the same model, with additional support for models of up to 700B parameters.
The PCIe card also features up to 384 GB of memory. The card uses standard LPDDR4 and LPDDR5 DRAM, so nothing fancy such as LPDDR5X, HBM, or GDDR6/7. The design targets lower parameter counts and lower DRAM bandwidth requirements. Skymizer's HTX301 architecture also employs efficient compression techniques such as:
- Weight (long-term memory) compression that is claimed to outperform open-source llama.cpp by 9% to 17.8%.
- KV cache (short-term memory) compression with minimal perplexity loss (from less than 0.06% up to 3.52%).
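The compression matters because of simple capacity math. The sketch below estimates how many bits per parameter a 700B model can use and still fit in the card's 384 GB. It assumes decimal gigabytes, and the `reserved_fraction` parameter (memory set aside for the KV cache and activations) is a hypothetical knob of ours, not a figure from Skymizer.

```python
def max_bits_per_param(memory_gb: float, params_billion: float,
                       reserved_fraction: float = 0.0) -> float:
    """Largest average bits per parameter that still fits in the given memory.

    reserved_fraction is a HYPOTHETICAL share of memory kept free for
    the KV cache and activations; it is not a vendor-supplied number.
    """
    usable_bytes = memory_gb * 1e9 * (1.0 - reserved_fraction)
    return usable_bytes * 8 / (params_billion * 1e9)


def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Weight footprint in GB (decimal) at a given precision."""
    return params_billion * bits_per_param / 8


# 700B parameters in 384 GB, with nothing reserved for the KV cache.
print(f"{max_bits_per_param(384, 700):.2f} bits/param")        # 4.39

# Same, with an assumed 10% of memory reserved for KV cache and activations.
print(f"{max_bits_per_param(384, 700, 0.10):.2f} bits/param")  # 3.95

# For comparison: uncompressed 16-bit weights vs. 4-bit weights.
print(f"{weights_gb(700, 16):.0f} GB at 16-bit")               # 1400
print(f"{weights_gb(700, 4):.0f} GB at 4-bit")                 # 350
```

In other words, a 700B model at 16-bit precision would need roughly 1.4 TB for weights alone, so fitting it on a single 384 GB card implies weights stored at around 4 bits per parameter or less.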
Power characteristics are also a standout, with the card consuming just 240W, less than half the 600W of leading PCIe AI accelerators such as the NVIDIA RTX PRO 6000 Blackwell and the AMD Instinct MI350P.
Skymizer is claiming some big numbers and will be previewing the HTX301 at Computex this year, so we will definitely visit their booth and see if the claims hold up. Overall, this sounds like an impressive AI solution (on paper), one that could prompt entry-level enterprises to stick with local servers instead of investing in the cloud for their AI needs.
