Cerebras Launches Game-Changing AI Inference Service

Cerebras Systems has unveiled its new AI inference service, promising a dramatic leap in performance and cost efficiency. The Cerebras Inference API is designed to run Meta’s Llama 3.1 models at unprecedented speeds—up to 1,800 tokens per second for the 8B model and 450 tokens per second for the 70B model. The company says this is 20 times faster than current NVIDIA GPU-based solutions, positioning Cerebras as a potential game-changer in AI inference.
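For developers, access looks much like any other OpenAI-compatible chat endpoint. The sketch below is illustrative only: the base URL and model identifier are assumptions for the sake of example, not values confirmed in this article.

```python
# Minimal sketch of calling an OpenAI-compatible chat endpoint such as
# Cerebras Inference. The base URL and model name are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="llama3.1-8b",  # assumed identifier for Llama 3.1 8B
    messages=[
        {"role": "user", "content": "Explain wafer-scale inference in one sentence."}
    ],
)
print(response.choices[0].message.content)
```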

What sets Cerebras Inference apart is its ability to deliver these speeds without sacrificing accuracy. The service uses full 16-bit precision, maintaining the integrity of the original models while significantly reducing the cost of AI operations. Developers can access this power at just 10 cents per million tokens for the 8B model and 60 cents for the 70B model, a fraction of the cost of traditional hyperscaler options.
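At those rates, costs are easy to estimate. The sketch below works through the arithmetic using the prices quoted above; the monthly token volume is a hypothetical workload, not a benchmark.

```python
# Back-of-envelope cost estimate using the per-million-token prices
# quoted in the article ($0.10 for 8B, $0.60 for 70B).
PRICE_PER_MTOK = {"llama-3.1-8b": 0.10, "llama-3.1-70b": 0.60}

def monthly_cost(model: str, tokens_per_month: float) -> float:
    """Cost in USD for a given monthly token volume."""
    return PRICE_PER_MTOK[model] * tokens_per_month / 1_000_000

# Hypothetical workload: 2 billion tokens per month on each model.
for model in PRICE_PER_MTOK:
    print(f"{model}: ${monthly_cost(model, 2e9):,.2f}/month")
# llama-3.1-8b: $200.00/month
# llama-3.1-70b: $1,200.00/month
```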

The key to Cerebras’ breakthrough lies in its custom Wafer Scale Engine-3 (WSE-3), the world’s largest AI processor. Unlike GPUs, which must stream model weights in from external memory and so run into bandwidth limitations, the WSE-3 stores the entire model in on-chip SRAM, bypassing the bottleneck that typically slows down AI inference. That on-chip memory delivers an aggregate bandwidth of 21 petabytes per second, dwarfing the capabilities of even the most advanced GPUs.
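A rough back-of-envelope shows why this matters: in autoregressive decoding, each generated token requires streaming the model’s weights through the processor, so memory bandwidth sets a hard ceiling on single-stream speed. The sketch below uses simplified assumptions (16-bit weights, no batching, an illustrative ~3.3 TB/s figure for high-end GPU memory) and is not a benchmark.

```python
# Simplified model of why memory bandwidth caps decode speed:
# each new token requires reading every weight once, so
# tokens/s <= bandwidth / bytes_of_weights.
def max_tokens_per_sec(params: float, bandwidth: float, bytes_per_param: int = 2) -> float:
    """Upper bound on single-stream tokens/s, ignoring compute and KV cache."""
    return bandwidth / (params * bytes_per_param)

GPU_HBM = 3.3e12    # ~3.3 TB/s, illustrative high-end GPU memory bandwidth
WSE3_SRAM = 21e15   # 21 PB/s, the on-chip figure cited above

for name, bw in [("GPU HBM", GPU_HBM), ("WSE-3 SRAM", WSE3_SRAM)]:
    print(f"{name}: ~{max_tokens_per_sec(70e9, bw):,.0f} tokens/s ceiling for a 70B model")
# GPU HBM: ~24 tokens/s ceiling
# WSE-3 SRAM: ~150,000 tokens/s ceiling
```

The observed 450 tokens per second for the 70B model sits well below the SRAM ceiling, since compute and other overheads dominate; the point is simply that on-chip storage removes the bandwidth wall that caps GPUs at a few dozen tokens per second per stream.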

Cerebras’ CEO, Andrew Feldman, described the impact of this technology as comparable to the transition from dial-up to broadband internet. “This level of speed transforms AI applications, enabling real-time processing and opening up new possibilities for AI-driven innovation,” Feldman said.

Artificial Analysis, an independent AI benchmarking firm, has verified Cerebras’ claims, noting that the performance of Cerebras Inference is unmatched in the current market. According to Micah Hill-Smith, co-founder and CEO of Artificial Analysis, “Cerebras Inference not only sets a new standard in speed but also offers a price-performance ratio that breaks the chart.”

Cerebras Inference is available across three pricing tiers: Free, Developer, and Enterprise. The Free Tier offers generous usage limits for anyone interested in exploring the service, while the Developer Tier provides flexible deployment options at competitive rates. The Enterprise Tier offers fine-tuned models and custom service level agreements, catering to organizations with sustained AI workloads.

With additional models and greater capacity on the horizon, Cerebras is positioning itself as a formidable competitor in the AI inference market, challenging the dominance of GPU-based solutions and opening the door to new AI applications that were previously constrained by hardware limitations. The company is also in talks with major cloud providers about deploying its model-loaded chips more widely.

For developers and enterprises alike, Cerebras Inference offers a compelling combination of speed, accuracy, and cost efficiency, marking a significant step forward in the evolution of AI technology.

Chris McKay is the founder and chief editor of Maginative. His thought leadership in AI literacy and strategic AI adoption has been recognized by top academic institutions, media, and global brands.
