OpenAI Taps Cerebras to Speed Up Real-Time AI at Scale

OpenAI partners with Cerebras to add 750 MW of low-latency AI compute, aiming to speed up real-time inference and scale faster, more responsive AI workloads through 2028.

CIOL Bureau

As artificial intelligence pushes deeper into real-time use cases, OpenAI is reworking how its models respond, not by changing algorithms, but by rethinking the hardware underneath.


The company has announced a partnership with Cerebras, bringing 750 megawatts of ultra-low-latency AI compute into OpenAI’s platform over the next several years. The capacity will be integrated into OpenAI’s inference stack in phases, with rollout planned through 2028.

The move reflects a growing focus on inference performance as AI models shift from static responses to interactive, agent-driven workloads where speed directly affects user experience.

Why Low-Latency Inference Matters Now

Behind every AI interaction, whether generating code, answering complex queries, or running multi-step agents, there is a feedback loop. A request goes in, the model processes it, and a response comes back. As models grow larger and tasks become more interactive, delays in that loop become increasingly visible.
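To make that loop concrete, here is a minimal sketch (not from the article) of how round-trip latency in such a request-response cycle might be measured. The call_model function is a hypothetical placeholder, not OpenAI's or Cerebras's actual API.

```python
# Hypothetical sketch: timing the request/response loop described above.
# `call_model` stands in for whatever inference endpoint a deployment uses.
import time
import statistics

def call_model(prompt: str) -> str:
    # Placeholder for a real inference call; here it only simulates work.
    time.sleep(0.05)
    return f"response to: {prompt}"

def measure_latency(prompts: list[str]) -> None:
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        call_model(prompt)
        latencies.append(time.perf_counter() - start)
    # Median and worst-case latency are what interactive, agent-driven
    # workloads feel most directly.
    print(f"median: {statistics.median(latencies) * 1000:.1f} ms")
    print(f"max:    {max(latencies) * 1000:.1f} ms")

if __name__ == "__main__":
    measure_latency(["step 1", "step 2", "step 3"])
```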

OpenAI said the Cerebras integration is designed to shorten that loop.

“When AI responds in real time, users do more with it, stay longer, and run higher-value workloads,” the company noted, outlining why response speed is becoming a strategic lever rather than a technical footnote.

What Cerebras Brings to the Stack

Cerebras builds purpose-built AI systems optimised for long outputs and fast inference. Its approach differs from conventional GPU-based architectures by consolidating massive compute, memory, and bandwidth onto a single wafer-scale chip, reducing data-movement bottlenecks.

According to Cerebras, large language models running on its low-latency inference system can respond up to 15 times faster than GPU-based systems, an advantage that becomes critical for real-time applications.


For OpenAI, Cerebras adds a new class of infrastructure rather than replacing existing systems.

“OpenAI’s compute strategy is to build a resilient portfolio that matches the right systems to the right workloads. Cerebras adds a dedicated low-latency inference solution to our platform. That means faster responses, more natural interactions, and a stronger foundation to scale real-time AI to many more people,” said Sachin Katti of OpenAI.

A Multi-Year, Phased Deployment

The partnership is structured as a multi-year agreement, with the 750 MW capacity coming online in multiple tranches through 2028. The phased approach allows OpenAI to gradually expand Cerebras-backed inference across workloads as demand grows.

Cerebras described the deployment as the largest high-speed AI inference system of its kind, underscoring how inference, once overshadowed by training, has become central to AI economics and user engagement. 

“Just as broadband transformed the internet, real-time inference will transform AI, enabling entirely new ways to build and interact with AI models,” said Andrew Feldman, co-founder and CEO of Cerebras.

A Long-Running Technical Alignment

While the partnership is new, OpenAI and Cerebras are not unfamiliar collaborators. The two companies have engaged in research discussions since 2017, sharing the view that hardware architecture must evolve alongside model scale.


This alignment has now translated into production infrastructure, signalling OpenAI’s intent to diversify beyond traditional GPU-heavy stacks as AI workloads fragment into training, batch inference, and real-time interaction.

For enterprises building on OpenAI’s platform, the shift is subtle but consequential. Faster inference can unlock use cases where latency previously made AI impractical: live coding assistance, conversational agents that reason step-by-step, and interactive decision systems that operate in real time.

Rather than positioning the deal as raw capacity expansion, OpenAI is framing it as workload optimisation: matching infrastructure to the kind of AI users increasingly expect.


As real-time AI becomes table stakes, the infrastructure race is no longer just about who trains the biggest models, but who can make them respond fast enough to matter.