
feat(deps): update dependency vllm-charts (v0.14.1 → v0.15.1) #1663

Open

chaplain-grimaldus[bot] wants to merge 1 commit into main from renovate/vllm-charts-0.x

Conversation


chaplain-grimaldus bot (Contributor) commented Jan 29, 2026

This PR contains the following updates:

| Package | Update | Change |
| --- | --- | --- |
| vllm-charts | minor | v0.14.1 → v0.15.1 |

Warning

Some dependencies could not be looked up. Check the Dependency Dashboard for more information.


Release Notes

vllm-project/vllm (vllm-charts)

v0.15.1

Compare Source

v0.15.1 is a patch release with security fixes, RTX Blackwell GPU support fixes, and bug fixes.

Security

Highlights

Bugfix Hardware Support
  • RTX Blackwell (SM120): Fixed NVFP4 MoE kernel support for RTX Blackwell workstation GPUs. Previously, NVFP4 MoE models would fail to load on these GPUs (#​33417)
  • FP8 kernel selection: Fixed FP8 CUTLASS group GEMM to properly fall back to Triton kernels on SM120 GPUs (#​33285)
Model Support
  • Step-3.5-Flash: New model support (#​33523)
Bugfix Model Support
  • Qwen3-VL-Reranker: Fixed model loading (#​33298)
  • Whisper: Fixed FlashAttention2 with full CUDA graphs (#​33360)
Performance
  • torch.compile cold-start: Fixed regression that increased cold-start compilation time (Llama3-70B: ~88s → ~22s) (#​33441)
  • MoE forward pass: Optimized by caching layer name computation (#​33184)
Bug Fixes
  • Fixed prefix cache hit rate of 0% with GPT-OSS style hybrid attention models (#​33524)
  • Enabled Triton MoE backend for FP8 per-tensor dynamic quantization (#​33300)
  • Disabled unsupported Renormalize routing methods for TRTLLM per-tensor FP8 MoE (#​33620)
  • Fixed speculative decoding metrics crash when no tokens generated (#​33729)
  • Disabled fast MoE cold start optimization with speculative decoding (#​33624)
  • Fixed ROCm skinny GEMM dispatch logic (#​33366)
Dependencies
  • Pinned LMCache >= v0.3.9 for API compatibility (#​33440)

New Contributors 🎉

Full Changelog: vllm-project/vllm@v0.15.0...v0.15.1

v0.15.0

Compare Source

Highlights

This release features 335 commits from 158 contributors (39 new)!

Model Support
Engine Core
  • Async scheduling + Pipeline Parallelism: --async-scheduling now works with pipeline parallelism (#​32359).
  • Mamba prefix caching: Block-aligned prefix caching for Mamba/hybrid models with --enable-prefix-caching --mamba-cache-mode align. Achieves ~2x speedup by caching Mamba states directly (#30877). A minimal launch sketch follows this list.
  • Session-based streaming input: New incremental input support for interactive workloads like ASR. Accepts async generators producing StreamingInput objects while maintaining KV cache alignment (#​28973).
  • Model Runner V2: VLM support (#​32546), architecture improvements.
  • LoRA: Inplace loading for memory efficiency (#​31326).
  • AOT compilation: torch.compile inductor artifacts support (#​25205).
  • Performance: KV cache offloading redundant load prevention (#​29087), FlashAttn attention/cache update separation (#​25954).
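A minimal sketch of enabling the Mamba prefix caching mentioned in the Engine Core list above. The two CLI flags come from the release note; the model name is a placeholder, and launching via Python's subprocess is just one convenient way to pass them.

```python
# Sketch: start a vLLM server with block-aligned Mamba prefix caching (vLLM >= 0.15.0).
# The flags --enable-prefix-caching and --mamba-cache-mode align are taken from the
# release notes; the model name below is a placeholder for any Mamba/hybrid checkpoint.
import subprocess

subprocess.run([
    "vllm", "serve", "your-org/your-mamba-hybrid-model",  # placeholder model id
    "--enable-prefix-caching",         # turn on prefix caching
    "--mamba-cache-mode", "align",     # block-aligned Mamba state caching (~2x on shared prefixes)
])
```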
Hardware & Performance
NVIDIA
  • Blackwell defaults: FlashInfer MLA is now the default MLA backend on Blackwell, with TRTLLM as default prefill (#​32615).
  • MoE performance: 1.2-2% E2E throughput improvement via grouped topk kernel fusion (#​32058), NVFP4 small-batch decoding improvement (#​30885), faster cold start for MoEs with torch.compile (#​32805).
  • FP4 kernel optimization: Up to 65% faster FP4 quantization on Blackwell (SM100F) using 256-bit loads, ~4% E2E throughput improvement (#​32520).
  • Kernel improvements: topk_sigmoid kernel for MoE routing (#​31246), atomics reduce counting for SplitK skinny GEMMs (#​29843), fused cat+quant for FP8 KV cache in MLA (#​32950).
  • torch.compile: SiluAndMul and QuantFP8 CustomOp compilation (#​32806), Triton prefill attention performance (#​32403).
AMD ROCm
  • MoRI EP: High-performance all2all backend for Expert Parallel (#​28664).
  • Attention improvements: Shuffle KV cache layout and assembly paged attention kernel for AiterFlashAttentionBackend (#​29887).
  • FP4 support: MLA projection GEMMs with dynamic quantization (#​32238).
  • Consumer GPU support: Flash Attention Triton backend on RDNA3/RDNA4 (#​32944).
Other Platforms
  • TPU: Pipeline parallelism support (#​28506), backend option (#​32438).
  • Intel XPU: AgRsAll2AllManager for distributed communication (#​32654).
  • CPU: NUMA-aware acceleration for TP/DP inference on ARM (#​32792), PyTorch 2.10 (#​32869).
  • Whisper: torch.compile support (#​30385).
  • WSL: Platform compatibility fix for Windows Subsystem for Linux (#​32749).
Quantization
  • MXFP4: W4A16 support for compressed-tensors MoE models (#​32285).
  • Non-gated MoE: Quantization support with Marlin, NVFP4 CUTLASS, FP8, INT8, and compressed-tensors (#​32257).
  • Intel: Quantization Toolkit integration (#​31716).
  • FP8 KV cache: Per-tensor and per-attention-head quantization via llmcompressor (#​30141).
API & Frontend
  • Responses API: Partial message generation (#​32100), include_stop_str_in_output tuning (#​32383), prompt_cache_key support (#​32824).
  • OpenAI API: skip_special_tokens configuration (#​32345).
  • Score endpoint: Flexible input formats with data_1/data_2 and queries/documents (#32577). A request sketch follows this list.
  • Render endpoints: New endpoints for prompt preprocessing (#​32473).
  • Whisper API: avg_logprob and compression_ratio in verbose_json segments (#​31059).
  • Security: FIPS 140-3 compliant hash option for enterprise/government users (#​32386), --ssl-ciphers CLI argument (#​30937).
  • UX improvements: Auto api_server_count based on dp_size (#​32525), wheel variant auto-detection during install (#​32948), custom profiler URI schemes (#​32393).
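To illustrate the more flexible score-endpoint inputs noted above, here is a hedged sketch against a locally running OpenAI-compatible vLLM server. The queries/documents field names come from the release note; the host, port, and model id are placeholders.

```python
# Sketch: call the vLLM score endpoint using the queries/documents aliases
# mentioned in the v0.15.0 notes. Host, port, and model name are placeholders.
import json
import urllib.request

payload = {
    "model": "your-org/your-reranker-model",                 # placeholder reranker/cross-encoder
    "queries": ["what is vllm?"],                            # alias accepted per release notes
    "documents": ["vLLM is a fast LLM inference engine."],
}

req = urllib.request.Request(
    "http://localhost:8000/score",                           # assumes a local `vllm serve` instance
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))                                   # relevance scores per query/document pair
```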
Dependencies
Breaking Changes & Deprecations
  • Metrics: Removed deprecated vllm:time_per_output_token_seconds metric - use vllm:inter_token_latency_seconds instead (#32661). A migration-check sketch follows this list.
  • Environment variables: Removed deprecated environment variables (#​32812).
  • Quantization: DeepSpeedFp8 removed (#​32679), RTN removed (#​32697), HQQ deprecated (#​32681).
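Because the old inter-token-latency metric name is gone, Prometheus and Grafana queries need to be updated. The sketch below scrapes a running server's /metrics endpoint and reports which of the two names is actually exposed; the default localhost:8000 address is an assumption.

```python
# Sketch: check which inter-token-latency metric a running vLLM server exposes,
# to confirm dashboards have been migrated for v0.15.0+.
import urllib.request

OLD = "vllm:time_per_output_token_seconds"   # removed in v0.15.0
NEW = "vllm:inter_token_latency_seconds"     # replacement metric

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:  # default address assumed
    text = resp.read().decode()

for name in (OLD, NEW):
    print(f"{name}: {'exposed' if name in text else 'absent'}")
```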
Bug Fixes
  • Speculative decoding: Eagle draft_model_config fix (#​31753).
  • DeepSeek: DeepSeek-V3.1 + DeepGEMM incompatible scale shapes fix (#​32361).
  • Distributed: DP+MoE inference fix via CpuCommunicator (#​31867), P/D with non-MoE DP fix (#​33037).
  • EPLB: Possible deadlock fix (#​32418).
  • NIXL: UCX memory leak fix by exporting UCX_MEM_MMAP_HOOK_MODE=none (#32181). A manual-workaround sketch follows this list.
  • Structured output: Outlines byte fallback handling fix (#​31391).
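The NIXL fix works by exporting UCX_MEM_MMAP_HOOK_MODE=none before UCX initializes. On releases prior to this fix the same mitigation can be applied manually in the launching process, as in this sketch; the variable name is from the release note, and the model id is a placeholder.

```python
# Sketch: manual mitigation for the NIXL/UCX memory leak on releases before v0.15.0.
# v0.15.0 sets this variable itself; setting it here only matters for older versions.
import os
import subprocess

# Must be set before UCX is initialized, i.e. before vLLM/NIXL starts.
os.environ["UCX_MEM_MMAP_HOOK_MODE"] = "none"

# Launch the server in a child process that inherits the environment.
subprocess.run(["vllm", "serve", "your-org/your-model"])  # placeholder model id
```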

New Contributors 🎉

Full Changelog: vllm-project/vllm@v0.14.1...v0.15.0


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Renovate Bot.

chaplain-grimaldus bot force-pushed the renovate/vllm-charts-0.x branch from 0708e9a to 67997aa on February 4, 2026 at 19:17
chaplain-grimaldus bot changed the title from "feat(deps): update dependency vllm-charts (v0.14.1 → v0.15.0)" to "feat(deps): update dependency vllm-charts (v0.14.1 → v0.15.1)" on Feb 4, 2026