How Baseten Uses NVIDIA Blackwell to Achieve 5x AI Inference Throughput
Baseten, the AI inference platform pooling GPUs from 10+ cloud providers for some of the world’s fastest-growing AI companies, adopted NVIDIA Blackwell GPUs on Google Cloud alongside NVIDIA Dynamo and TensorRT-LLM. The result: 5x higher throughput for high-traffic endpoints, up to 225% better price-performance serving DeepSeek-R1 and Llama 4, and up to 38% lower latency for large language model serving.
Impact
5x: Throughput improvement for high-traffic endpoints
Up to 225%: Price-performance improvement for reasoning models
Up to 38%: Reduction in LLM serving latency
<5 minutes: GPU provisioning speed
Challenge
Baseten needed to serve frontier reasoning models like DeepSeek-R1 and Llama 4 in production without making unacceptable tradeoffs between latency and cost. Its previous GPU infrastructure could not handle the massive context windows and extended inference compute of reasoning models at competitive price-performance.
Solution
Baseten became the first company to adopt NVIDIA Blackwell GPUs on Google Cloud, pairing them with NVIDIA Dynamo for multi-node inference orchestration and TensorRT-LLM for hardware-optimized model serving. The combination delivered a 5x throughput improvement, up to 225% better price-performance on reasoning models, and up to 38% lower latency.
Tools & Technologies
Full Story
Baseten operates a global AI inference platform that aggregates GPU capacity from more than 10 cloud providers across dozens of regions into a unified pool. The company’s customers are AI-native companies running production workloads on state-of-the-art large language models—and their demands are non-negotiable: low latency, high throughput, and cost efficiency, all at scale. Baseten’s orchestration layer abstracts away the complexity of managing geographically distributed GPU infrastructure, turning a fragmented set of cloud instances into a single fungible compute pool.
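To make the idea of a single fungible compute pool concrete, here is a minimal Python sketch. It is not Baseten's orchestration layer: the `CapacitySlice` type, the `allocate` function, the provider and region labels, and the greedy largest-first policy are all hypothetical, chosen only to show how capacity scattered across providers can be treated as one pool.

```python
from dataclasses import dataclass


@dataclass
class CapacitySlice:
    """Free GPU capacity at one provider/region pair (hypothetical model)."""
    provider: str   # illustrative provider label, not a real account name
    region: str     # illustrative region label
    free_gpus: int  # GPUs currently available in this slice


def allocate(pool: list[CapacitySlice], needed: int) -> list[tuple[str, str, int]]:
    """Greedily draw GPUs from the slices with the most free capacity.

    Returns a plan of (provider, region, gpu_count) entries and updates
    the pool in place. Raises ValueError if the pool is too small.
    """
    if needed > sum(s.free_gpus for s in pool):
        raise ValueError("not enough capacity in the pool")
    plan = []
    for s in sorted(pool, key=lambda s: s.free_gpus, reverse=True):
        if needed == 0:
            break
        take = min(s.free_gpus, needed)
        if take:
            plan.append((s.provider, s.region, take))
            s.free_gpus -= take
            needed -= take
    return plan


# Three made-up slices; the caller asks for 20 GPUs without naming a provider.
pool = [
    CapacitySlice("cloud-a", "us-east", 8),
    CapacitySlice("cloud-b", "eu-west", 16),
    CapacitySlice("cloud-c", "ap-south", 4),
]
plan = allocate(pool, 20)
print(plan)
```

The caller states only how many GPUs it needs; which provider or region supplies them is decided inside the pool. A production scheduler would presumably also weigh latency, price, and hardware type, none of which this sketch models.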
As AI models grew in size and reasoning capability, balancing latency against cost when serving them in production became increasingly difficult. Models like DeepSeek-R1 require enormous GPU memory and generate “thinking tokens” that dramatically increase inference compute. Llama 4 Scout’s 10-million-token context window created additional memory pressure. Before adopting NVIDIA Blackwell, Baseten had to make difficult tradeoffs between user latency and inference costs when serving these models, limiting what it could offer customers at competitive price points.
Baseten became the first company to adopt A4 VMs with NVIDIA Blackwell GPUs on Google Cloud, pairing them with the NVIDIA Dynamo inference framework and NVIDIA TensorRT-LLM. NVIDIA Dynamo manages multi-node inference serving across the global GPU pool, while TensorRT-LLM optimizes model execution on Blackwell hardware. The platform can now provision thousands of GPUs in under five minutes using multi-cloud capacity management built on the NVIDIA CUDA architecture.
The performance gains were substantial. Baseten can now serve five times as many user requests for custom models using the same number of GPUs. For frontier reasoning models like DeepSeek-R1 and Llama 4, price-performance improved by up to 225%. Latency for large model serving dropped by up to 38%, directly improving user experience and adoption for Baseten’s customers.
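The three headline figures use different conventions (a multiplier, a percentage improvement, and a percentage reduction), so it can help to put them on a common footing. The short Python sketch below does that arithmetic. It assumes that "improved by up to 225%" means an additive increase over the baseline and that "dropped by up to 38%" means a reduction relative to the baseline; the source does not define either metric precisely, and the helper names are illustrative.

```python
def multiplier_from_improvement(percent: float) -> float:
    """An X% improvement over a baseline of 1.0 gives 1 + X/100."""
    return 1 + percent / 100


def fraction_remaining_after_reduction(percent: float) -> float:
    """An X% reduction leaves 1 - X/100 of the baseline."""
    return 1 - percent / 100


# Assumed reading: 225% better price-performance is 3.25x the baseline.
price_performance_multiplier = multiplier_from_improvement(225)

# Assumed reading: 38% lower latency leaves about 0.62 of the baseline.
latency_fraction = fraction_remaining_after_reduction(38)

# 5x the requests on the same GPUs means each request uses 1/5 of the GPU share.
gpu_share_per_request = 1 / 5

print(f"price-performance: {price_performance_multiplier:.2f}x baseline")
print(f"latency: {latency_fraction:.2f} of baseline")
print(f"GPU share per request: {gpu_share_per_request:.2f} of baseline")
```

Under these assumptions, the best-case figures correspond to roughly 3.25 times the work per dollar, latency at about 62% of its previous value, and one fifth of the GPU capacity per request. All three are "up to" figures, so typical workloads may see less.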
The Blackwell adoption positions Baseten at the frontier of inference infrastructure at a moment when AI model complexity—and the business value customers derive from it—is compounding rapidly. By removing the hardware and orchestration constraints that previously forced cost-latency tradeoffs, Baseten can serve cutting-edge models reliably and economically at scale.