LLM-Aware Load Balancing
Route every request to the replica that will serve it fastest.
llm-d's endpoint picker scores each replica in real time across four signals: prefix cache locality, KV-cache utilization, queue depth, and predicted latency. Each request is dispatched to the replica with the lowest expected tail latency, delivering order-of-magnitude p99 improvements over round-robin routing with no additional hardware.
Explore LLM-aware routing
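To make the four-signal scoring concrete, here is a minimal Python sketch of how a picker might combine the signals into one cost per replica and choose the lowest. It is an illustration under stated assumptions, not llm-d's implementation: the `Replica` fields, the `WEIGHTS` values, and the `cost` and `pick_replica` functions are all hypothetical names, and the equal-weight linear sum is simply one plausible way to combine the signals.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical snapshot of one replica's state (not an llm-d type)."""
    name: str
    prefix_hit_ratio: float      # fraction of the prompt prefix already cached, 0..1
    kv_utilization: float        # fraction of KV-cache memory in use, 0..1
    queue_depth: int             # requests waiting on this replica
    predicted_latency_ms: float  # latency estimate for this request


# Assumed weights, chosen only for illustration; a real deployment would tune them.
WEIGHTS = {"prefix": 1.0, "kv": 1.0, "queue": 1.0, "latency": 1.0}


def cost(r: Replica, max_queue: int, max_latency_ms: float) -> float:
    """Lower is better. Every term is normalized to roughly the 0..1 range."""
    return (
        WEIGHTS["prefix"] * (1.0 - r.prefix_hit_ratio)  # cache misses cost more
        + WEIGHTS["kv"] * r.kv_utilization              # fuller caches cost more
        + WEIGHTS["queue"] * (r.queue_depth / max_queue)
        + WEIGHTS["latency"] * (r.predicted_latency_ms / max_latency_ms)
    )


def pick_replica(replicas: list[Replica]) -> Replica:
    """Return the replica with the lowest combined cost."""
    if not replicas:
        raise ValueError("no replicas available")
    # Guard against division by zero when every queue or estimate is zero.
    max_queue = max(max(r.queue_depth for r in replicas), 1)
    max_latency_ms = max(r.predicted_latency_ms for r in replicas) or 1.0
    return min(replicas, key=lambda r: cost(r, max_queue, max_latency_ms))


if __name__ == "__main__":
    candidates = [
        Replica("a", prefix_hit_ratio=0.9, kv_utilization=0.50, queue_depth=2, predicted_latency_ms=120.0),
        Replica("b", prefix_hit_ratio=0.0, kv_utilization=0.20, queue_depth=1, predicted_latency_ms=300.0),
        Replica("c", prefix_hit_ratio=0.5, kv_utilization=0.95, queue_depth=8, predicted_latency_ms=400.0),
    ]
    print(pick_replica(candidates).name)
```

With these made-up numbers, replica `a` is chosen even though `b` has a shorter queue and a emptier cache, because its high prefix cache hit ratio and low predicted latency outweigh the difference. Round-robin routing would ignore all four signals.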