Chip startup Taalas appears to have found a way around LLM response latency and performance limits by building dedicated hardware that 'hardwires' AI models into silicon.
Taalas Claims 10x Higher TPS With Meta's Llama 3.1 8B LLM, Along With 20x Lower Production Costs
In today's world of AI compute, latency is emerging as a massive constraint for compute providers, mainly because, in an agentic environment, the competitive edge lies in tokens-per-second (TPS) figures and how quickly a task gets done. One solution the industry is pursuing is integrating SRAM into its offerings, and companies like Cerebras and Groq are already exploring it. Taalas, however, has taken a rather intriguing route: pivoting away from general-purpose computing towards ASICs built for specific LLMs.
Founded 2.5 years ago, Taalas developed a platform for transforming any AI model into custom silicon. From the moment a previously unseen model is received, it can be realized in hardware in only two months. The resulting Hardcore Models are an order of magnitude faster, cheaper, and lower power than software-based implementations.
- Taalas
According to the company, its approach rests on two principles. The first is specializing AI workloads at the hardware level, which literally means mapping an LLM's specific neural network onto the silicon itself so that the infrastructure is optimized for each model. The second is what the company calls "merging storage and computation", where the focus is on overcoming the memory wall and the data-communication overhead of a general-purpose system.

With this design, all computation happens at "DRAM-level" density to speed up on-chip communication, which is one of the reasons Taalas says it has addressed the latency problem with LLMs. The solution does not rely on advanced cooling, HBM, packaging, or complex integration; instead, all the innovation happens within the silicon design itself. Taalas has also showcased its first product, called HC1, which integrates Meta's Llama 3.1 8B LLM. The performance results are 'shocking', to say the least.

Taalas claims 10x the TPS of today's "high-end" infrastructure while achieving 20x lower production costs. You might think that latency and performance constraints are solved here, but let's look at the HC1 chip from a technical angle. It is built on TSMC's 6nm node with a die size of up to 815 mm², which is almost the size of NVIDIA's H100 die. The HC1 hosts an eight-billion-parameter model, while today's frontier LLMs scale up to one trillion parameters. In other words, a single die of this size is far too small for frontier models, so Taalas would need to rework its silicon strategy to serve them.
The only way to scale up is a cluster-based approach, and according to Taalas, it has already done this with DeepSeek's R1, reaching 12,000 TPS per user in a 30-chip configuration. So, the primary constraints now lie in market adoption and the business model. Because the approach is hardwired, each chip is tied to a specific LLM, with no option to change model weights, but given the startup's speed figures, it isn't a bad bet.
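To put the scaling gap in rough numbers, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above (eight billion parameters on one die of up to 815 mm²) and assumes, purely for illustration, that per-chip parameter capacity stays at the HC1 level. The function names are hypothetical, and real multi-chip configurations may differ considerably, as the reported 30-chip DeepSeek R1 setup suggests.

```python
# Figures quoted in the article
HC1_PARAMS = 8_000_000_000   # Llama 3.1 8B hosted on a single HC1
HC1_DIE_MM2 = 815            # reported maximum die size in mm^2


def naive_chip_count(model_params: int, params_per_chip: int = HC1_PARAMS) -> int:
    """Chips needed if each chip held as many parameters as HC1 (an assumption)."""
    # Integer ceiling division: a partially filled chip still counts as one chip.
    return (model_params + params_per_chip - 1) // params_per_chip


def naive_silicon_area_mm2(model_params: int) -> int:
    """Total die area under the same assumption, using the 815 mm^2 figure."""
    return naive_chip_count(model_params) * HC1_DIE_MM2


if __name__ == "__main__":
    frontier = 1_000_000_000_000  # "one trillion parameters"
    print(naive_chip_count(frontier))        # 125
    print(naive_silicon_area_mm2(frontier))  # 101875
```

Under this naive assumption, a one-trillion-parameter model would need on the order of 125 HC1-class dies, which is why a cluster-based design is the natural next step.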