How Baseten Uses NVIDIA Blackwell to Achieve 5x AI Inference Throughput

Baseten, an AI inference platform that pools GPUs from more than 10 cloud providers for some of the world’s fastest-growing AI companies, adopted NVIDIA Blackwell GPUs on Google Cloud alongside NVIDIA Dynamo and TensorRT-LLM. The result: 5x higher throughput for high-traffic endpoints, up to 225% better price-performance when serving DeepSeek-R1 and Llama 4, and up to 38% lower latency for large language model serving.

Impact

5x: throughput improvement for high-traffic endpoints

Up to 225%: price-performance improvement for reasoning models

Up to 38%: reduction in LLM serving latency

<5 minutes: GPU provisioning time

Challenge

Baseten needed to serve frontier reasoning models such as DeepSeek-R1 and Llama 4 in production without unacceptable tradeoffs between latency and cost. Its previous GPU infrastructure could not handle the massive context windows and extended inference compute of reasoning models at competitive price-performance.

Solution

Baseten became the first company to adopt NVIDIA Blackwell GPUs on Google Cloud, pairing them with NVIDIA Dynamo for multi-node inference orchestration and TensorRT-LLM for hardware-optimized model serving. The combination delivered a 5x throughput improvement, up to 225% better price-performance on reasoning models, and up to 38% lower latency.

Full Story

Baseten operates a global AI inference platform that aggregates GPU capacity from more than 10 cloud providers across dozens of regions into a unified pool. The company’s customers are AI-native companies running production workloads on state-of-the-art large language models—and their demands are non-negotiable: low latency, high throughput, and cost efficiency, all at scale. Baseten’s orchestration layer abstracts away the complexity of managing geographically distributed GPU infrastructure, turning a fragmented set of cloud instances into a single fungible compute pool.
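The idea of a single fungible pool can be illustrated with a small sketch. Nothing here reflects Baseten's actual orchestration layer: the `GpuOffer` record, the `pool_capacity` and `allocate` helpers, the provider names, and the greedy placement rule are all hypothetical, chosen only to show how per-provider capacity might be viewed and consumed as one pool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuOffer:
    """Free capacity reported by one provider in one region (hypothetical shape)."""
    provider: str
    region: str
    gpu_type: str
    free_gpus: int


def pool_capacity(offers):
    """Collapse per-provider offers into total free GPUs per GPU type."""
    totals = defaultdict(int)
    for offer in offers:
        totals[offer.gpu_type] += offer.free_gpus
    return dict(totals)


def allocate(offers, gpu_type, needed):
    """Greedily take GPUs of one type, largest offers first (an assumed policy).

    Returns a list of (provider, region, count) placements, or raises
    ValueError if the pooled capacity cannot cover the request.
    """
    candidates = sorted(
        (o for o in offers if o.gpu_type == gpu_type),
        key=lambda o: o.free_gpus,
        reverse=True,
    )
    placements = []
    remaining = needed
    for offer in candidates:
        if remaining == 0:
            break
        take = min(offer.free_gpus, remaining)
        if take > 0:
            placements.append((offer.provider, offer.region, take))
            remaining -= take
    if remaining > 0:
        available = needed - remaining
        raise ValueError(f"only {available} of {needed} {gpu_type} GPUs available")
    return placements


# Illustrative data only; provider names and counts are made up.
offers = [
    GpuOffer("cloud-a", "us-east", "blackwell", 16),
    GpuOffer("cloud-b", "eu-west", "blackwell", 8),
    GpuOffer("cloud-c", "us-west", "other-gpu", 32),
]
print(pool_capacity(offers))
print(allocate(offers, "blackwell", 20))
```

The caller asks for a GPU type and a count, not for a specific cloud, which is the sense in which the pool is fungible. A production scheduler would also have to weigh factors such as region latency, price, and interconnect, which this sketch ignores.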

As AI models grew in size and reasoning capability, balancing latency against cost in production became increasingly difficult. Models like DeepSeek-R1 require enormous GPU memory and generate “thinking tokens” that dramatically increase inference compute. Llama 4 Scout’s 10-million-token context window created additional memory pressure. Before adopting NVIDIA Blackwell, Baseten had to make difficult tradeoffs between user latency and inference costs when serving these models, limiting what it could offer customers at competitive price points.
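A rough sense of why long contexts strain GPU memory comes from the standard key-value cache estimate for transformer attention: two tensors (keys and values) per layer, each sized by KV heads, head dimension, and sequence length. The sketch below applies that formula. The configuration values are hypothetical placeholders, not the published architecture of Llama 4 Scout or DeepSeek-R1, and the estimate assumes plain full attention with 16-bit cache values; models that use chunked or compressed attention need less.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size for one sequence under full attention.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


# Hypothetical model configuration, for illustration only.
cfg = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for tokens in (8_192, 1_000_000, 10_000_000):
    gib = kv_cache_bytes(tokens, **cfg) / 2**30
    print(f"{tokens:>12,} tokens -> {gib:,.1f} GiB of KV cache")
```

Because the estimate grows linearly with sequence length, a context a thousand times longer needs roughly a thousand times more cache memory per request, before counting the model weights themselves.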

Baseten became the first company to adopt A4 VMs with NVIDIA Blackwell GPUs on Google Cloud, pairing them with the NVIDIA Dynamo inference framework and NVIDIA TensorRT-LLM. NVIDIA Dynamo manages multi-node inference serving across the global GPU pool, while TensorRT-LLM optimizes model execution on Blackwell hardware. The platform can now provision thousands of GPUs in under five minutes using multi-cloud capacity management built on the NVIDIA CUDA architecture.

The performance gains were substantial. Baseten can now serve five times as many user requests for custom models using the same number of GPUs. For frontier reasoning models like DeepSeek-R1 and Llama 4, price-performance improved by up to 225%. Latency for large model serving dropped by up to 38%, directly improving user experience and adoption for Baseten’s customers.
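To make these percentages concrete, the following sketch converts them into multipliers. It rests on an interpretation the source does not spell out: that "225% better" means the baseline plus 225% (a 3.25x ratio) and that a "38% reduction" leaves 62% of the original latency. The helper names and the 1,000 ms baseline latency are hypothetical.

```python
def improvement_to_multiplier(percent_improvement):
    """Read 'X% better' as baseline plus X%, e.g. 100% better -> 2.0x."""
    return 1 + percent_improvement / 100


def reduction_to_fraction(percent_reduction):
    """Read 'X% lower' as the fraction of the baseline that remains."""
    return 1 - percent_reduction / 100


price_perf_multiplier = improvement_to_multiplier(225)  # upper bound from the text
latency_fraction = reduction_to_fraction(38)            # best case from the text
gpu_fraction_for_same_load = 1 / 5                      # 5x requests per GPU

baseline_latency_ms = 1000  # hypothetical baseline, not a reported figure
print(f"price-performance: {price_perf_multiplier:.2f}x of baseline")
print(f"latency: {baseline_latency_ms * latency_fraction:.0f} ms from {baseline_latency_ms} ms")
print(f"GPUs needed for the same load: {gpu_fraction_for_same_load:.0%} of baseline")
```

Under this reading, the 5x throughput figure means the same request volume could be served with about one fifth of the GPUs, while the "up to" figures are best cases, not averages.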

The Blackwell adoption positions Baseten at the frontier of inference infrastructure at a moment when AI model complexity—and the business value customers derive from it—is compounding rapidly. By removing the hardware and orchestration constraints that previously forced cost-latency tradeoffs, Baseten can serve cutting-edge models reliably and economically at scale.
