
Author

Ashar Mirza - VoicePing Inc.

Recap: The Problem

In Part 1, we identified the bottleneck: our FastAPI service used multiprocessing workers with IPC queues to distribute translation tasks. This created:
  • Queue serialization overhead
  • GPU compute contention between worker processes
  • Spiky GPU utilization pattern
Baseline: 2.2 RPS at 25 concurrent requests.
The path forward: eliminate multiprocessing and use vLLM’s batch inference.

Attempt 2: Static Batching

We implemented static batching within the existing worker processes.

Implementation

# Within worker process
import time
from queue import Empty  # queue.get(timeout=...) raises queue.Empty on timeout

MAX_BATCH_SIZE = 16
BATCH_TIMEOUT = 0.05  # 50ms

while True:
    batch_keys = []
    batch_tasks = []

    # Collect first task (blocking)
    first_key = queue.get()
    batch_keys.append(first_key)
    batch_tasks.append(tasks[first_key])

    # Try to collect more tasks (non-blocking with timeout)
    batch_start = time.time()
    while len(batch_keys) < MAX_BATCH_SIZE:
        time_remaining = BATCH_TIMEOUT - (time.time() - batch_start)
        if time_remaining <= 0:
            break
        try:
            key = queue.get(timeout=time_remaining)
            batch_keys.append(key)
            batch_tasks.append(tasks[key])
        except Empty:
            break

    # Process batch using vLLM
    results = translation_provider.translate_batch(
        texts=[t.text for t in batch_tasks],
        source_langs=[t.source_lang for t in batch_tasks],
        target_langs=[t.target_lang for t in batch_tasks]
    )
Key points:
  • Batch size: 16 requests
  • Timeout: 50ms (don’t wait indefinitely for full batch)
  • vLLM processes multiple sequences together
  • Still uses multiprocessing workers

Results

Figure 1: Static batching delivers significant throughput and response time improvements

Nearly a 3x throughput improvement. Per-request inference time dropped from 452 ms to 171 ms.

Trade-offs

Pros:
  • Massive throughput gains
  • GPU better utilized
  • Simple implementation
Cons:
  • Head-of-line blocking: All requests wait for the slowest one
  • With variable-length inputs, short translations wait for long ones
  • Example: [50 tokens, 50 tokens, 200 tokens] – the first two wait for the 200-token translation (see the sketch below)
This was good progress, but we wanted to eliminate the head-of-line blocking issue.
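
To make the head-of-line cost concrete, here is a back-of-the-envelope sketch; the 5 ms/token decode time is an illustrative assumption, not one of our measurements.

# Illustrative arithmetic only: the per-token decode time below is a made-up
# constant to show the shape of the problem, not a measured value.
DECODE_TIME_PER_TOKEN = 0.005  # seconds per generated token (hypothetical)

output_lengths = [50, 50, 200]  # tokens to generate for each request in the batch

# Static batching: every request is held until the longest one finishes decoding.
static_batch_latency = max(output_lengths) * DECODE_TIME_PER_TOKEN
per_request_static = [static_batch_latency] * len(output_lengths)

# Per-request completion if each request could leave as soon as it is done.
per_request_independent = [n * DECODE_TIME_PER_TOKEN for n in output_lengths]

print(per_request_static)       # [1.0, 1.0, 1.0] -> short requests pay for the long one
print(per_request_independent)  # [0.25, 0.25, 1.0] -> short requests could return early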

Attempt 3: Continuous Batching

The solution: vLLM’s AsyncLLMEngine with continuous batching.

What is Continuous Batching?

Unlike static batching, continuous batching composes batches dynamically:
  • New requests join mid-generation
  • Completed requests leave immediately (don’t wait for others)
  • Batch composition updates every token
  • vLLM’s AsyncLLMEngine handles this automatically
No head-of-line blocking. Short translations return as soon as they’re done (a client-side sketch after the implementation below illustrates this).

Implementation

import uuid

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

engine_args = AsyncEngineArgs(
    model=model_id,
    max_num_seqs=64,  # Initial attempt
    max_num_batched_tokens=16384,
    gpu_memory_utilization=0.3,
    enable_chunked_prefill=True,
)

engine = AsyncLLMEngine.from_engine_args(engine_args)

# Illustrative sampling settings for translation (deterministic decoding)
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

@app.post("/translate")
async def translate(request: TranslateRequest):
    result_generator = engine.generate(
        request.text,
        sampling_params,
        request_id=str(uuid.uuid4()),
    )

    # Drain the stream; the last yielded output contains the full translation
    async for output in result_generator:
        final_output = output
    return TranslateResponse(translation=final_output.outputs[0].text)
Architecture change:
  • AsyncLLMEngine used directly in FastAPI
  • vLLM handles batching internally via continuous batching engine
  • Pure async/await throughout
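
One way we could sanity-check the "completed requests leave immediately" behavior from the client side is to fire a mix of short and long requests concurrently and watch when each returns. The sketch below assumes the /translate endpoint shown above, a request body containing only text, and httpx as the HTTP client; none of that is prescribed by vLLM itself.

# Client-side sketch: send short and long requests at the same time.
# With continuous batching the short ones should come back first;
# under static batching they would all finish together.
import asyncio
import time

import httpx  # assumed client library for this sketch

BASE_URL = "http://localhost:8000"  # wherever the FastAPI service is running

async def timed_translate(client: httpx.AsyncClient, text: str) -> None:
    start = time.perf_counter()
    resp = await client.post(f"{BASE_URL}/translate", json={"text": text})
    resp.raise_for_status()
    print(f"{len(text):>5} chars -> finished after {time.perf_counter() - start:.2f}s")

async def main() -> None:
    short = "Hello, how are you today?"
    long = "This is a much longer paragraph that keeps going. " * 40
    async with httpx.AsyncClient(timeout=120) as client:
        await asyncio.gather(*(timed_translate(client, t) for t in (short, short, long)))

if __name__ == "__main__":
    asyncio.run(main())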

Testing Reality Check

Initial Results (Uniform Inputs)

We tested with standard uniform-length inputs (similar lengths):
Figure 2: Continuous batching with uniform inputs showing impressive 15 RPS throughput

15 RPS vs 2.2 baseline – nearly 7x improvement. This looked great.

Variable-Length Inputs (Reality)

Then we tested with realistic variable-length inputs (10-200 tokens, mixed short and long).

Baseline re-run with variable inputs:
  • Very heavy load: 1.1 RPS (vs 2.2 RPS with uniform inputs)
  • Even the baseline performed worse with realistic data
Continuous batching with variable inputs:
  • Very heavy load: 3.5 RPS, and that was after tuning down to max_num_seqs=16 (details below)
  • Nowhere near the 15 RPS the engine had delivered with uniform inputs
Figure 3: Performance gap between uniform test data and realistic variable-length inputs
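
For anyone reproducing the comparison, here is a rough sketch of how the two workloads can be shaped; the word list is a stand-in for real prompts, and word count is used as a crude proxy for token count (our actual load-test corpus differs).

import random

random.seed(42)  # deterministic test set

WORDS = "the quick brown fox jumps over the lazy dog".split()

def make_prompt(n_tokens: int) -> str:
    # Words as a rough stand-in for tokens; close enough for load shaping.
    return " ".join(random.choice(WORDS) for _ in range(n_tokens))

# Uniform workload: every request is roughly the same length.
uniform_inputs = [make_prompt(100) for _ in range(500)]

# Realistic workload: mixed short and long requests, 10-200 tokens.
variable_inputs = [make_prompt(random.randint(10, 200)) for _ in range(500)]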

Configuration Tuning

The poor performance with max_num_seqs=64 led us to analyze vLLM’s internal metrics.

What We Found

# vLLM Prometheus metrics we monitored:
# - vllm:time_to_first_token_seconds (TTFT)
# - vllm:time_per_output_token_seconds (decode time)
# - vllm:gpu_cache_usage_perc (KV cache utilization)
# - vllm:num_requests_running / waiting (queue depth)
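
As a sketch, these series can be read off any Prometheus-format metrics endpoint; the URL below and the use of prometheus_client's text parser are assumptions about the deployment, not part of the service code above.

# Sketch: scrape a Prometheus-format /metrics endpoint and print the
# vLLM series we tracked. The endpoint location is deployment-specific.
import urllib.request

from prometheus_client.parser import text_string_to_metric_families

WATCHED = {
    "vllm:time_to_first_token_seconds",
    "vllm:time_per_output_token_seconds",
    "vllm:gpu_cache_usage_perc",
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
}

raw = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
for family in text_string_to_metric_families(raw):
    if family.name in WATCHED:
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)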
The issue:
  • Actual workload: 2-20 concurrent requests per server (production peak ~20 per server)
  • Configuration: max_num_seqs=64
  • Result: 60+ empty slots creating overhead
What happens with oversized config:
  • KV cache pre-allocated for 64 sequences
  • vLLM scheduler manages 64 slots but only uses 5-10
  • Decode time per token increases
  • Memory wasted on unused sequence slots
  • Scheduler overhead for empty slots

Tuning Approach

Following the vLLM continuous batching tuning guide:
  1. Measure actual concurrent request distribution in production
  2. Start with max_num_seqs=1, gradually increase: 2 → 4 → 8 → 16 → 32
  3. Monitor decode time and tail latency at each step
  4. Stop when performance degrades
max_num_seqs | Result
8  | Good latency, but throughput limited
16 | Best balance
32 | Decode time increased, tail latency worse
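
The sweep itself can be scripted along these lines; run_load_test is a placeholder for whatever benchmark harness you use (ours reported p95 latency and mean decode time per token for a fixed request mix), and the 10% degradation threshold is just one reasonable stopping rule.

# Sketch of the tuning sweep. Plug a real benchmark harness into
# run_load_test(); the random values returned here are synthetic
# placeholders that only make the loop below executable as a demo.
import random

def run_load_test(max_num_seqs: int) -> tuple[float, float]:
    """Return (p95 latency in seconds, mean decode time per token in seconds)."""
    random.seed(max_num_seqs)
    return random.uniform(0.5, 3.0), random.uniform(0.01, 0.05)

best = None
previous_p95 = float("inf")

for candidate in (1, 2, 4, 8, 16, 32):
    p95, decode = run_load_test(candidate)
    print(f"max_num_seqs={candidate}: p95={p95:.2f}s, decode={decode * 1000:.1f}ms/token")
    if p95 > previous_p95 * 1.1:  # stop once tail latency degrades noticeably
        break
    best = candidate
    previous_p95 = p95

print("chosen max_num_seqs:", best)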

Final Configuration

from translation_lib.config import AsyncVLLMTranslationProvider

provider = AsyncVLLMTranslationProvider(
    model_name=model_id,
    revision=model_revision,
    gpu_memory_utilization=0.3,  # ~10GB on RTX 5090
    max_num_seqs=16,  # Right-sized to actual workload per server
    huggingface_token=hf_token,
    supported_language_pairs=None,  # Multilingual model
)

await provider.initialize_engine()

Configuration Rationale

max_num_seqs=16:
  • Production peak: ~20 concurrent requests per server
  • Testing: Validated up to 25 concurrent
  • Provides headroom without wasting resources
  • Scheduler overhead matched to actual load
max_num_batched_tokens=8192:
  • Reduced from our initial 16384
  • Better suited for our average sequence lengths
  • Reduces memory pressure
gpu_memory_utilization=0.3:
  • Allocates ~10GB VRAM for model + KV cache on RTX 5090 (32GB)
  • Tracked via vllm:gpu_cache_usage_perc
  • Balanced for our configuration
The principle: match configuration to your actual workload, not theoretical limits.
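
For readers not using our internal wrapper, roughly the same configuration expressed directly with vLLM's engine arguments would look like the sketch below (assuming a recent vLLM release; the wrapper above forwards these values to the engine).

from vllm import AsyncEngineArgs, AsyncLLMEngine

model_id = "..."  # same translation model as above

# Final tuned values expressed as raw vLLM engine arguments.
engine_args = AsyncEngineArgs(
    model=model_id,
    max_num_seqs=16,              # sized to ~20 concurrent requests per server
    max_num_batched_tokens=8192,  # reduced from the initial 16384
    gpu_memory_utilization=0.3,   # ~10 GB on a 32 GB RTX 5090
    enable_chunked_prefill=True,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)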

Figure 4: Throughput progression through all optimization attempts

Production Results

We deployed the optimized configuration to production (RTX 5090 GPUs).

Before vs After

Metric | Before (Multiprocessing) | After (Optimized AsyncLLM) | Change
Throughput | 9.0 RPS | 16.4 RPS | +82%
GPU Utilization | Spiky (93% → 0% → 93%) | Consistent 90-95% | Stable

Figure 5: Production deployment results showing 82% throughput improvement

Figure 6: P95 latency improvements across optimization attempts

Figure 7: Response time evolution with variable-length inputs

The improvement held in production. From 9 RPS to 16.4 RPS under real traffic.

Summary

What Worked

vLLM’s continuous batching
  • AsyncLLMEngine handles batching automatically
  • No manual batch collection overhead
  • Direct async/await integration with FastAPI
Right-sized configuration
  • max_num_seqs=16 (matched actual workload per server)
  • Not 64 (theoretical max that created overhead)
  • gpu_memory_utilization=0.3 for 10GB allocation
Tested with realistic data
  • Variable-length inputs exposed configuration issues
  • Uniform test data gave misleading 15 RPS
Monitored vLLM metrics
  • KV cache usage
  • Decode time per token
  • Queue depth
  • Guided configuration decisions

Complete Journey

Approach | Throughput | vs Baseline | Notes
Baseline (multiprocessing) | 2.2 RPS | - | IPC overhead, GPU contention
Two workers | 2.0 RPS | -9% | Made it worse
Static batching | 5.9 RPS | +168% | Head-of-line blocking
Async (64, uniform) | 15.0 RPS | +582% | Misleading test data
Async (16, variable) | 3.5 RPS | +59% | Realistic, but tuning needed
Final optimized | 10.7 RPS | +386% | Staging validation
Production | 16.4 RPS | +82% | Real traffic, RTX 5090 (vs the 9.0 RPS production baseline)