Production-grade distributed inference built for how LLMs work.

llm-d orchestrates inference workloads across your cluster — bringing LLM-aware routing, disaggregated serving, and tiered KV caching to the Kubernetes primitives you already run.

llm-d is a CNCF Sandbox project

Validated in production.

Performance gains from production deployments and partner benchmarks.

3× higher output throughput vs round-robin (Llama 3.1 70B · Tesla / Red Hat)

70% higher tokens/sec with prefill/decode disaggregation (GPT-OSS on NVIDIA B200 · AWS)

40% reduction in TTFT with predicted-latency scheduling (NVIDIA GPUs · Google)

13.9× throughput with hierarchical KV offloading (4× NVIDIA H100, 250 concurrent users)

50k tokens/sec cluster throughput with Wide Expert-Parallelism (16×16 NVIDIA B200)

Explore performance analysis

Start with the pattern that matches your bottleneck.

Each guide is a tested deployment pattern with concrete configuration — pick your path to production inference.

LLM-Aware Load Balancing

Route every request to the replica that will serve it fastest.

llm-d's endpoint picker scores each replica in real time across four signals: prefix cache locality, KV-cache utilization, queue depth, and predicted latency. Each request is dispatched to the replica with the lowest expected tail latency — delivering order-of-magnitude p99 improvements over round-robin routing, with no additional hardware.
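As a rough illustration of multi-signal scoring, the sketch below combines the four signals named above into a single score and picks the best replica. The `Replica` fields, the `WEIGHTS` values, and the normalization are all hypothetical choices made for this example; they are not llm-d's actual scorer or its configuration.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    prefix_hit_ratio: float      # share of the prompt already cached, 0..1
    kv_utilization: float        # share of KV-cache in use, 0..1
    queue_depth: int             # requests waiting on this replica
    predicted_latency_ms: float  # output of some latency predictor


# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"prefix": 0.4, "kv": 0.2, "queue": 0.2, "latency": 0.2}


def score(r: Replica, max_queue: int, max_latency_ms: float) -> float:
    """Higher is better. Each term is normalized into the 0..1 range."""
    queue_term = 1.0 - r.queue_depth / max_queue if max_queue else 1.0
    latency_term = (
        1.0 - r.predicted_latency_ms / max_latency_ms if max_latency_ms else 1.0
    )
    return (
        WEIGHTS["prefix"] * r.prefix_hit_ratio
        + WEIGHTS["kv"] * (1.0 - r.kv_utilization)
        + WEIGHTS["queue"] * queue_term
        + WEIGHTS["latency"] * latency_term
    )


def pick_replica(replicas: list[Replica]) -> Replica:
    """Return the replica with the highest combined score."""
    if not replicas:
        raise ValueError("no replicas available")
    max_queue = max(r.queue_depth for r in replicas)
    max_latency = max(r.predicted_latency_ms for r in replicas)
    return max(replicas, key=lambda r: score(r, max_queue, max_latency))


replicas = [
    Replica("a", prefix_hit_ratio=0.9, kv_utilization=0.5, queue_depth=2, predicted_latency_ms=120),
    Replica("b", prefix_hit_ratio=0.1, kv_utilization=0.3, queue_depth=1, predicted_latency_ms=100),
    Replica("c", prefix_hit_ratio=0.0, kv_utilization=0.9, queue_depth=8, predicted_latency_ms=400),
]
chosen = pick_replica(replicas)
```

With these made-up inputs, replica `a` wins despite a slightly longer queue than `b`, because its warm prefix cache outweighs the other signals. Round-robin would ignore that locality entirely.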

Explore LLM-aware routing

Serving Large Language Models

Scale prompt processing and token generation independently.

Prefill and decode have fundamentally different resource profiles. llm-d splits them across dedicated worker pools and transfers KV-cache between phases over RDMA via NIXL. The result is faster time to first token (TTFT), more predictable time per output token (TPOT), and better GPU utilization across the cluster.
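The control flow of disaggregated serving can be sketched in a few lines. Everything here is a toy, in-process model: `PrefillWorker`, `DecodeWorker`, `DisaggregatedServer`, and `KVCache` are hypothetical names, and the KV-cache handoff is a plain Python object pass standing in for the RDMA transfer described above.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class KVCache:
    request_id: str
    num_tokens: int  # tokens whose keys/values are cached


class PrefillWorker:
    def __init__(self, name: str):
        self.name = name

    def prefill(self, request_id: str, prompt_tokens: list[int]) -> KVCache:
        # Prefill processes the whole prompt in one pass.
        return KVCache(request_id, len(prompt_tokens))


class DecodeWorker:
    def __init__(self, name: str):
        self.name = name

    def decode(self, kv: KVCache, max_new_tokens: int) -> list[str]:
        # Decode emits one token per step and grows the cache each time.
        out = []
        for step in range(max_new_tokens):
            out.append(f"tok{step}")  # placeholder for a sampled token
            kv.num_tokens += 1
        return out


class DisaggregatedServer:
    """Round-robins requests over two independently sized pools."""

    def __init__(self, prefill_pool, decode_pool):
        self._prefill = cycle(prefill_pool)
        self._decode = cycle(decode_pool)

    def generate(self, request_id, prompt_tokens, max_new_tokens):
        p, d = next(self._prefill), next(self._decode)
        kv = p.prefill(request_id, prompt_tokens)
        # Stand-in for the KV-cache transfer between phases.
        tokens = d.decode(kv, max_new_tokens)
        return {"prefill": p.name, "decode": d.name,
                "tokens": tokens, "kv_tokens": kv.num_tokens}


server = DisaggregatedServer(
    prefill_pool=[PrefillWorker("p0")],
    decode_pool=[DecodeWorker("d0"), DecodeWorker("d1")],
)
first = server.generate("r1", [1, 2, 3], max_new_tokens=2)
second = server.generate("r2", [4, 5], max_new_tokens=1)
```

The point of the split is that the two pools can be sized separately: this toy setup runs one prefill worker against two decode workers, and either side could be scaled without touching the other.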

See how disaggregation works

Advanced KV-Cache Management

Cache at memory speed. Spill at storage cost.

llm-d extends KV-cache beyond accelerator HBM through a configurable storage hierarchy: HBM, CPU memory, local SSD, and shared remote storage (in progress). Hot prefixes stay close to the accelerator; cold prefixes spill to cheaper tiers automatically. You serve longer contexts and higher concurrency without adding GPUs.
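One way to picture the spill behavior is a chain of LRU caches, fastest tier first. The sketch below is an assumption-laden toy: `TieredKVCache`, its entry-count capacities, and the promote-on-read policy are hypothetical and only illustrate the idea of hot entries staying near the accelerator while cold ones move down.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy LRU hierarchy: least-recently-used entries spill to the next tier."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_entries), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # spilled past the last tier: the entry is dropped
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)  # mark as most recently used
        while len(store) > cap:
            cold_key, cold_value = store.popitem(last=False)  # evict the LRU entry
            self._insert(level + 1, cold_key, cold_value)

    def put(self, key, value):
        for _, _, store in self.tiers:
            store.pop(key, None)  # drop any stale copy in any tier
        self._insert(0, key, value)

    def get(self, key):
        """Return (value, tier_found_in) and promote the entry to the top tier."""
        for name, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(0, key, value)
                return value, name
        return None, None

    def where(self, key):
        for name, _, store in self.tiers:
            if key in store:
                return name
        return None


cache = TieredKVCache([("hbm", 2), ("cpu", 2), ("ssd", 2)])
for prefix in ("a", "b", "c"):
    cache.put(prefix, f"kv-{prefix}")
tier_before = cache.where("a")   # "a" was pushed out of the top tier
value, served_from = cache.get("a")
tier_after = cache.where("a")    # reading "a" promoted it again
```

A real hierarchy would size tiers in bytes or blocks rather than entry counts and would weigh transfer cost before promoting, but the eviction chain has the same shape.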

Configure tiered caching

Operational Excellence

Scale for the load you have, on the hardware you have.

Two complementary patterns, both built on Kubernetes primitives. The Horizontal Pod Autoscaler (HPA) scales replicas using live inference signals from the endpoint picker: queue depth and request counts. The Workload Variant Autoscaler routes across model variants on heterogeneous hardware to meet SLOs at the lowest cost.
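To make the HPA half concrete, the sketch below applies the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler, `desired = ceil(current_replicas * current_metric / target_metric)`, to a queue-depth signal. The target of 5 queued requests per replica and the replica bounds are hypothetical values for illustration, not recommended settings.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,   # e.g. average queue depth per replica
    target_metric: float,    # hypothetical target for that metric
    min_replicas: int = 1,
    max_replicas: int = 16,
    tolerance: float = 0.1,  # ratios this close to 1.0 do not trigger scaling
) -> int:
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: leave as is
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


scale_up = desired_replicas(4, current_metric=12.0, target_metric=5.0)
hold = desired_replicas(4, current_metric=5.2, target_metric=5.0)
scale_down = desired_replicas(4, current_metric=1.0, target_metric=5.0)
capped = desired_replicas(4, current_metric=100.0, target_metric=5.0, max_replicas=8)
```

With four replicas averaging 12 queued requests against a target of 5, the rule asks for ten replicas. A reading of 5.2 falls inside the tolerance band and changes nothing.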

Set up autoscaling

See all guides