
Microsoft and NVIDIA say Azure is now running the first at-scale, production cluster of NVIDIA’s GB300 NVL72 for OpenAI—purpose-built to push larger, more capable “reasoning” models and serve them faster. The system marries rack-scale memory with fifth-gen NVLink and 800 Gb/s networking to make dozens of cabinets behave like one enormous accelerator.
Key Points
- Rack-scale GB300 NVL72: 72 Blackwell Ultra GPUs + 36 Grace CPUs per rack.
- 130 TB/s NVLink inside the rack; 800 Gb/s per-GPU network to scale out.
- ~21 TB of HBM3e per rack; up to ~40 TB “fast” coherent memory.
- GB300 set new MLPerf reasoning records; optimized for test-time scaling.
The GB300 NVL72 is NVIDIA’s rack-scale platform for the “AI reasoning” era—where models spend more compute at inference time to think through multi-step tasks. Each rack ties 72 Blackwell Ultra GPUs and 36 Grace CPUs into a single 72-GPU NVLink domain, exposing a unified pool of fast memory and 130 TB/s of intra-rack bandwidth for giant contexts and longer chains of thought.
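To put the intra-rack figure in perspective, here is a back-of-envelope sketch in plain Python, using only the numbers quoted above; the 10 GB transfer size is an invented example, not a measured workload:

```python
# Back-of-envelope: how the 130 TB/s aggregate NVLink figure breaks down per GPU.
GPUS_PER_RACK = 72
AGGREGATE_NVLINK_TBPS = 130                  # TB/s across the whole NVL72 domain

per_gpu_tbps = AGGREGATE_NVLINK_TBPS / GPUS_PER_RACK
print(f"~{per_gpu_tbps:.1f} TB/s of NVLink bandwidth per GPU")       # ~1.8 TB/s

# Moving a 10 GB activation/KV slice between two GPUs in the same NVLink domain
# (hypothetical transfer size) takes on the order of:
slice_gb = 10
print(f"~{slice_gb / (per_gpu_tbps * 1000) * 1e3:.1f} ms per {slice_gb} GB transfer")
```

That works out to roughly 1.8 TB/s per GPU, in line with fifth-generation NVLink, which is why sharded layers can trade state inside the rack without treating it as a network hop.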
Azure’s cluster then stitches racks together with NVIDIA’s Quantum-X800 InfiniBand, giving each GPU up to 800 Gb/s of network throughput. That’s what lets OpenAI fan out decoding and prefill across racks while keeping latency tight—essential for interactive agents and long-context workloads.
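A rough sense of what 800 Gb/s per GPU buys in a disaggregated setup; the 50 GB KV-cache size below is a hypothetical example, not an OpenAI workload number:

```python
# Why 800 Gb/s per GPU matters when prefill and decode run on different racks.
LINK_GBPS = 800                        # per-GPU scale-out bandwidth, in gigabits/s
link_gbytes_per_s = LINK_GBPS / 8      # = 100 GB/s

kv_cache_gb = 50                       # hypothetical KV cache for one long prompt
transfer_s = kv_cache_gb / link_gbytes_per_s
print(f"~{transfer_s:.1f} s to hand a {kv_cache_gb} GB KV cache from a prefill "
      "rack to a decode rack over a single 800 Gb/s link")
```

In practice such handoffs are striped across many links and overlapped with compute, but the arithmetic shows why per-GPU bandwidth, not just aggregate switch capacity, is the number that keeps latency tight.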
A single NVL72 rack aggregates roughly 21 TB of HBM3e across those GPUs, and—counting CPU-GPU coherent memory—can reach ~40 TB of “fast” memory. That footprint helps serve larger models and longer prompts without constant offloading, which translates to higher tokens-per-second and fewer stalls.
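The same kind of arithmetic applies to the memory figures; the 1-trillion-parameter model below is a hypothetical example used only to show scale:

```python
# What ~21 TB of HBM3e per rack means in practice.
RACK_HBM_TB = 21
GPUS_PER_RACK = 72
per_gpu_hbm_gb = RACK_HBM_TB * 1000 / GPUS_PER_RACK
print(f"~{per_gpu_hbm_gb:.0f} GB of HBM3e per GPU")           # ~292 GB

# Hypothetical example: weights of a 1-trillion-parameter model stored in FP4
# (4 bits = 0.5 bytes per parameter), ignoring KV cache, activations, and overhead.
params = 1e12
fp4_weights_tb = params * 0.5 / 1e12
print(f"FP4 weights: ~{fp4_weights_tb:.1f} TB, i.e. "
      f"{fp4_weights_tb / RACK_HBM_TB:.0%} of one rack's HBM3e")
```

Under those assumptions the weights take up only a small slice of the pool; the rest is what absorbs long-context KV caches instead of spilling to slower tiers.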
On benchmarks, NVIDIA’s GB300 NVL72 recently set records in MLPerf Inference v5.1’s new reasoning tests, including higher DeepSeek-R1 throughput than prior-generation GB200 NVL72 (Blackwell) systems. The platform is also tuned for FP4 formats (e.g., NVFP4) and Dynamo-style disaggregated serving—choices aimed squarely at lowering inference cost for massive models.
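For readers unfamiliar with block-scaled 4-bit formats, the sketch below shows the general idea: store each value in 4 bits plus a small per-block scale. It is a simplified integer variant for illustration only, not NVIDIA’s actual NVFP4 encoding or any production kernel:

```python
import numpy as np

def quantize_block_4bit(x: np.ndarray, block: int = 16):
    """Quantize to signed 4-bit codes (-7..7) with one scale per block of values."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0       # per-block scale factor
    scale = np.where(scale == 0, 1.0, scale)                 # avoid divide-by-zero
    codes = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return codes, scale

def dequantize_block_4bit(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (codes * scale).reshape(-1)

weights = np.random.randn(1024).astype(np.float32)
codes, scale = quantize_block_4bit(weights)
recon = dequantize_block_4bit(codes, scale)
print("mean abs error:", np.abs(weights - recon).mean())
# ~4 bits per value plus one scale per 16 values, versus 16 bits for FP16:
# roughly 4x less memory and memory traffic per weight, at some accuracy cost.
```

That memory-traffic reduction is the lever: for inference-bound workloads, moving a quarter of the bytes per weight translates fairly directly into more tokens per second per GPU.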
Power and grid stability are becoming first-class engineering problems at this scale. GB300 introduces rack PSUs with onboard energy storage and coordinated power-smoothing, cutting peak demand spikes by up to 30%—useful when thousands of GPUs ramp up or down in lockstep. Expect to see more of this kind of grid-aware design in hyperscale AI builds.
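The smoothing idea itself is simple to sketch: store energy locally and cap what the grid sees during synchronized ramps. The toy model below uses invented numbers and is not how GB300’s power hardware is actually implemented:

```python
def smooth(load_mw, grid_cap_mw, battery_mwh):
    """Cap grid draw at grid_cap_mw while the local battery has energy left."""
    grid_draw, soc = [], battery_mwh        # soc: stored energy remaining, in MWh
    for p in load_mw:                       # one sample per minute
        deficit_mwh = max(0.0, p - grid_cap_mw) / 60
        if deficit_mwh <= soc:              # battery covers the excess this minute
            soc -= deficit_mwh
            grid_draw.append(min(p, grid_cap_mw))
        else:                               # battery exhausted: spike hits the grid
            grid_draw.append(p)
    return grid_draw

# A cluster idling at 70 MW that ramps to 100 MW for 10 minutes in lockstep:
load = [70] * 5 + [100] * 10 + [70] * 5
print(max(smooth(load, grid_cap_mw=85, battery_mwh=5)))   # grid peak: 85 MW, not 100
```

The real systems coordinate this across thousands of PSUs and also smooth the downward ramp, but the principle is the same: the facility, not the utility, absorbs the sharpest edges of the load.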
The move also fits a broader capacity strategy. Microsoft has been locking up supply—both in its own datacenters and via “neocloud” partners—to keep GPU-hungry projects fed, even as it readies new GB300 rollouts globally. That procurement posture, plus Azure’s GB200 footprint, is how Microsoft intends to shorten training cycles and bring bigger models to market faster.
Competitive context: CoreWeave and other clouds have begun lighting up GB300 racks, and Google Cloud plans to offer GB300 NVL72 instances as well. But Azure’s claim here is about production scale for frontier inference—and having that in OpenAI’s backyard could tilt the near-term pace of model upgrades and new agent capabilities.
As these clusters multiply, the bottlenecks will keep shifting—from chips, to memory, to networking, and now to power. Today’s win is silicon and systems; tomorrow’s constraint might be megawatts.