Ray Serve Upgrade Delivers 88% Lower Latency for AI Inference at Scale

Jessie A Ellis
Mar 24, 2026 16:58

Anyscale announces major Ray Serve optimizations with HAProxy and gRPC, achieving 11.1x throughput gains for LLM inference workloads on enterprise deployments.





Anyscale has shipped substantial performance upgrades to Ray Serve that slash P99 latency by up to 88% and boost throughput by 11.1x for large language model inference workloads. The improvements, available in Ray 2.55+, address scaling bottlenecks that have plagued enterprise AI deployments running latency-sensitive applications.

The upgrades center on two architectural changes: HAProxy integration for ingress traffic and direct gRPC communication between deployment replicas. Both bypass Python-based components that previously created chokepoints under heavy load.

What the Numbers Show

In benchmark testing of a deep learning recommendation model pipeline, the optimized configuration pushed throughput from 490 to 1,573 queries per second while cutting P99 latency by 75%. At 400 concurrent users, the performance gap widened dramatically as Ray Serve’s default Python proxy saturated while HAProxy continued scaling.
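To make the reported metrics concrete, here is a small pure-Python illustration of what these figures measure. The latency samples are invented for demonstration, not Anyscale's benchmark data; only the 490 and 1,573 QPS figures come from the article.

```python
# Illustrative only: how figures like "P99 latency" and "throughput gain"
# are typically computed. The sample latencies are made up.

def percentile(samples, pct):
    """Nearest-rank percentile: the value below which pct% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds
latencies_ms = [12, 15, 14, 13, 200, 16, 14, 15, 13, 450]
p99 = percentile(latencies_ms, 99)

# Throughput gain from the recommendation-model benchmark in the article
gain = 1573 / 490
print(f"P99 latency: {p99} ms")         # dominated by the slowest requests
print(f"Throughput gain: {gain:.1f}x")  # roughly 3.2x
```

The point the P99 metric captures: a handful of slow outliers barely move the average, but they define tail latency, which is exactly where a saturated Python proxy hurts most.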

For LLM inference specifically, the results proved even more striking. Running GPT-class models on H100 GPUs at 256 concurrent users per replica, throughput scaled linearly with replica count when using HAProxy—something the default configuration couldn’t achieve as the Python process hit its ceiling.


Streaming workloads saw 8.9x throughput improvements, while unary request patterns hit the full 11.1x gain.

Technical Architecture Shift

The core problem: Ray Serve’s default proxy runs on Python’s asyncio, which struggles at high concurrency. HAProxy, written in C and battle-tested across production systems globally, handles the same traffic with significantly less overhead.

The second optimization targets inter-deployment communication. Previously, when one deployment called another, Ray Serve routed everything through Ray Core’s actor task system—useful for complex orchestration but overkill for simple request-response patterns. The new gRPC option establishes direct channels between replica actors, serializing with protobuf instead of going through Ray’s object store.

Benchmarks show gRPC alone delivers 1.5x throughput improvement for unary calls and 2.4x for streaming at equivalent latency targets.

Enterprise Implications

These aren’t academic improvements. Companies running recommendation systems, real-time fraud detection, or customer-facing LLM applications have consistently hit Ray Serve’s scaling limits. The collaboration with the Google Kubernetes Engine team that drove these optimizations suggests enterprise demand was substantial enough to prioritize the work.

A single environment variable—RAY_SERVE_USE_GRPC_BY_DEFAULT—enables the gRPC transport. HAProxy activation requires cluster-level configuration but integrates with existing Kubernetes deployments.
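A minimal sketch of flipping that flag in-process before Serve starts. The variable name comes from the article; the accepted value ("1" here) is an assumption, so check the Ray Serve docs for your version. In a Kubernetes deployment you would set it in the container spec instead.

```python
# Enable the direct gRPC transport before Ray Serve starts.
# The variable name is from the article; the value "1" is an assumption
# about how the boolean flag is parsed.
import os

os.environ["RAY_SERVE_USE_GRPC_BY_DEFAULT"] = "1"

# Any serve.run(...) launched after this point in the same process
# (or any Ray worker inheriting this environment) picks up the flag.
print(os.environ["RAY_SERVE_USE_GRPC_BY_DEFAULT"])
```

Because environment variables are read at startup, the flag must be in place before the Serve controller and replicas launch; exporting it after the cluster is up would have no effect on running replicas.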

Anyscale is working toward making both optimizations the default for all inter-deployment communication, with an RFC currently under discussion. For teams already running Ray Serve in production, the upgrade path is straightforward: update to Ray 2.55+ and flip the appropriate flags.

The benchmark code is publicly available on GitHub for teams wanting to validate performance gains against their specific workloads before deploying.
