vLLM reference knowledge

A machine-readable knowledge base that your AI SRE agent can query as a tool. It captures the operational facts about vLLM that a model's training data does not reliably contain, modeled from first-party sources and the practitioner long tail.
Reference sources: 37
Concepts modeled: 59
Connections: 126

Reference sources (37)
The primary material behind the knowledge — official docs, releases, pull requests, issues, and research. Every item links to its real source.
Release notes (4)
Version releases with the changes and fixes that ship in them.
Operational guides (2)
Deployment and tuning guides for running vLLM in production.
Research papers (1)
Primary research behind vLLM's memory and throughput claims.
Benchmark reports (1)
Published performance runs with their methodology captured.
Coverage probes (1)
Searches run to map what the corpus does and does not yet cover.
  • Existing Schema coverage probe for vLLM KV cache OOM

Conceptual knowledge (59)
The operational understanding modeled on top of those sources — failure modes, metrics, parameters, benchmarks, architecture, and mitigations, cross-linked into a graph.
Failure modes & risks (13)
Known defects, coverage gaps, and operational hazards to watch for.
  • docs guidance versus dynamic default logic
  • gpu_memory_utilization ineffective on sliced GPU stacks
  • long-context prefill OOM below advertised max model length
  • max_num_batched_tokens must exceed max_model_len when chunked prefill is disabled
  • Missing first-party KV cache capacity calculator
  • Missing routing policy tradeoff matrix
  • Missing workload-class to recommended-config matrix
  • Model Runner V2 rejection-sampling acceptance-rate gap versus MRV1
  • MTP=1 hang on DeepSeek V4 when persistent_topk path is active
  • Production Stack benchmark platform not yet published
  • ROCm DSV4-Flash dense KV cache pool materialization
  • warmup prefill kernel memory regression
  • WSL2 CUDA overhead allocator mismatch
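
One hazard in the list above, the relationship between max_num_batched_tokens and max_model_len when chunked prefill is disabled, lends itself to a pre-flight check before an engine is launched. The following is a minimal sketch, not vLLM's own validation code. The function name check_batched_tokens is hypothetical, and the sketch assumes a value equal to max_model_len is acceptable; switch to a strict comparison if your vLLM version rejects equality.

```python
def check_batched_tokens(max_num_batched_tokens: int,
                         max_model_len: int,
                         enable_chunked_prefill: bool) -> list:
    """Return a list of human-readable problems (empty when the config looks fine).

    Hypothetical helper: it mirrors the constraint named in the list above,
    it does not call into vLLM.
    """
    problems = []
    # Without chunked prefill, a full-length prompt has to fit in one batch,
    # so the batch token budget must cover the longest allowed sequence.
    if not enable_chunked_prefill and max_num_batched_tokens < max_model_len:
        problems.append(
            f"max_num_batched_tokens ({max_num_batched_tokens}) is smaller than "
            f"max_model_len ({max_model_len}) while chunked prefill is disabled"
        )
    return problems


if __name__ == "__main__":
    for issue in check_batched_tokens(2048, 8192, enable_chunked_prefill=False):
        print("config problem:", issue)
```

Returning a list rather than raising keeps the helper usable inside a larger config linter that collects several findings before reporting.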
Architecture & components (13)
Engine subsystems, stack components, and how serving traffic flows.
  • KEDA autoscaling
  • KEDA autoscaling on vLLM waiting requests
  • KV cache manager
  • persistent_topk path in DSA sparse-attention indexer
  • prefix aware routing
  • Production Stack Helm chart
  • Production Stack router
  • Production Stack router CLI
  • ROCm AITER MLA sparse attention path
  • route by KV cache hit rate
  • route by shared prompt prefix
  • upstream vLLM engine
  • warmup prefill kernels path
Parameters & defaults (10)
Tunable settings, their defaults, safe ranges, and default drift.
  • >8192 throughput guidance
  • 2048 smaller-value ITL tuning example
  • 512 chunked-prefill default in v0.4.2 docs
  • chunked prefill decode-priority scheduling
  • enable_chunked_prefill
  • gpu_memory_utilization
  • kv_cache_dtype
  • max_num_batched_tokens
  • max_num_batched_tokens default history
  • max_num_seqs
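
To show how gpu_memory_utilization and kv_cache_dtype interact, here is a back-of-envelope KV cache estimate. It is a sketch under stated assumptions, not vLLM's sizing logic: it assumes standard multi-head or grouped-query attention (one key and one value vector per KV head per layer), and it takes weight size and non-cache overhead as inputs that you measure yourself. All function and parameter names are hypothetical, and the real engine determines its cache size by profiling at startup.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The factor 2 covers the key and the value tensor. dtype_bytes is
    2 for fp16/bf16 caches and 1 for an 8-bit cache dtype.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def estimate_kv_token_capacity(gpu_total_bytes: int,
                               gpu_memory_utilization: float,
                               weight_bytes: int,
                               overhead_bytes: int,
                               bytes_per_token: int) -> int:
    """Rough number of tokens the KV cache can hold (never negative)."""
    budget = int(gpu_total_bytes * gpu_memory_utilization)
    free_for_cache = budget - weight_bytes - overhead_bytes
    return max(free_for_cache, 0) // bytes_per_token


if __name__ == "__main__":
    GIB = 2 ** 30
    # Illustrative shapes only, not taken from any specific model card.
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                                   head_dim=128, dtype_bytes=2)
    capacity = estimate_kv_token_capacity(80 * GIB, 0.90, 16 * GIB,
                                          2 * GIB, per_token)
    print(f"{per_token} bytes/token, about {capacity} cacheable tokens")
```

Halving dtype_bytes halves the per-token footprint, which is why an 8-bit cache dtype roughly doubles the estimated capacity for the same memory budget.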
Benchmarks & workloads (10)
Benchmark methods, claims, and the workload classes they apply to.
  • decode-heavy benchmark workload
  • high-concurrency traffic spike
  • long-context prefill
  • offline inference throughput benchmark
  • online serving throughput benchmark
  • prefill-heavy benchmark workload
  • ShareGPT benchmark workload
  • single-batch latency benchmark
  • vLLM 0.6.0 performance-update experiment context
  • vLLM 0.6.0 throughput and TPOT improvement claim
Metrics & signals (7)
The numbers to watch and what healthy versus unhealthy looks like.
  • KV block lifecycle metrics
  • KV cache usage percentage
  • output token throughput
  • request throughput
  • time per output token
  • time to first token
  • vllm:num_requests_waiting
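
A signal such as vllm:num_requests_waiting is usually read from the engine's Prometheus metrics endpoint. The sketch below shows one way to pull a single gauge out of Prometheus text exposition output using only the standard library. The helper name sum_metric and the sample payload are made up for illustration, and the simple regular expression assumes label values contain no closing brace.

```python
import re

# One exposition sample line: name, optional {labels}, then the value.
_SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)

# Hypothetical payload, shaped like Prometheus text exposition output.
SAMPLE = """\
# HELP vllm:num_requests_waiting Example help text.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo-model"} 7.0
vllm:num_requests_running{model_name="demo-model"} 3.0
"""


def sum_metric(exposition_text: str, metric_name: str) -> float:
    """Sum every sample of metric_name across all label sets."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_LINE.match(line)
        if match and match.group("name") == metric_name:
            total += float(match.group("value"))
    return total


if __name__ == "__main__":
    print(sum_metric(SAMPLE, "vllm:num_requests_waiting"))
```

Summing across label sets gives a per-endpoint queue depth when one endpoint reports several label combinations; an autoscaler or alert would compare that number against a threshold chosen for the workload.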
Hardware & compatibility (5)
GPU profiles, dependency constraints, and supported-model limits.
  • A100 and H100 benchmark hardware
  • DeepSeek V4
  • Hugging Face Transformers dependency constraint
  • Kubernetes cluster with GPU support
  • Prometheus observability stack
Mitigations & remedies (1)
Actions that relieve a known failure mode once you hit it.
  • enforce eager execution