Refactor prefix caching #4618
Open
grimoire wants to merge 20 commits into
Pull request overview
This PR refactors PyTorch prefix caching to support multimodal identity, SSM/state-cache checkpoints, routed-expert replay, and observability while keeping unsupported VL/spec/sliding-window paths disabled.
Changes:
- Adds multimodal-aware prefix-cache keys and content hashing for supported PyTorch VL paths.
- Adds SSM prefix-cache checkpoint lifecycle across scheduler, trie, state manager, input creation, model forward, and engine loop.
- Adds routed-expert replay, prefix-cache metrics, CLI/config options, and LongBenchV2 eval wiring.
Reviewed changes
Copilot reviewed 38 out of 38 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| lmdeploy/pytorch/paging/block_trie.py | Adds multimodal keys, SSM checkpoint indexing/lifecycle, routed-expert replay, and stats rollback support. |
| lmdeploy/pytorch/paging/scheduler.py | Coordinates tentative prefix matches, rollback, SSM state availability, and migration match fix. |
| lmdeploy/pytorch/paging/state_manager.py | Splits shared state slots into runtime and checkpoint ownership. |
| lmdeploy/pytorch/engine/inputs_maker.py | Builds compact SSM restore/save offsets and adjusts long-context multimodal chunking. |
| lmdeploy/pytorch/engine/model_agent/agent.py | Copies SSM checkpoint state before/after model forward. |
| lmdeploy/pytorch/engine/engine_loop.py | Publishes checkpoint saves/restores around forward prefetching. |
| lmdeploy/pytorch/engine/cache_engine.py | Adds validated state-cache slot copy support. |
| lmdeploy/pytorch/messages.py | Adds prefix-cache metadata/state and multimodal clamp/hash helpers. |
| lmdeploy/pytorch/multimodal/data_type.py | Adds deterministic multimodal content hashing. |
| lmdeploy/pytorch/model_inputs.py | Carries compact SSM restore/save offsets through model inputs. |
| lmdeploy/pytorch/config.py | Adds prefix-cache SSM budget and decode checkpoint interval config. |
| lmdeploy/messages.py | Exposes new PyTorch engine config fields. |
| lmdeploy/pytorch/engine/executor/base.py | Keeps SSM prefix cache enabled and disables prefix cache for spec decoding. |
| lmdeploy/pytorch/engine/config_builder.py | Propagates new prefix-cache config fields. |
| lmdeploy/pytorch/engine/engine.py | Precomputes multimodal hashes when prefix caching is enabled. |
| lmdeploy/pytorch/block.py | Removes prefix-cache node state from logical block container. |
| lmdeploy/pytorch/paging/seq_states/states.py | Releases/discards prefix-cache state on sequence free. |
| lmdeploy/pytorch/paging/eviction_helper/recompute_eviction_helper.py | Evicts SSM checkpoints to recover runtime state slots. |
| lmdeploy/pytorch/strategies/ar/model_inputs.py | Preserves SSM save offsets during decoding reindex. |
| lmdeploy/pytorch/strategies/ar/step_inputs.py | Threads SSM save offsets through AR input reindexing. |
| lmdeploy/pytorch/strategies/ar_spec/step_inputs.py | Threads SSM save offsets through AR-spec reindexing. |
| lmdeploy/pytorch/strategies/dllm/step_inputs.py | Threads SSM save offsets through DLLM reindexing. |
| lmdeploy/vl/model/preprocess_utils.py | Expands single/per-video video multimodal items. |
| lmdeploy/serve/core/vl_async_engine.py | Enables VL prefix caching only for supported PyTorch new-preprocess paths. |
| lmdeploy/pytorch/models/qwen3_vl.py | Normalizes multimodal metadata construction. |
| lmdeploy/metrics/loggers.py | Exports prefix-cache hit rate to Prometheus/log metrics. |
| lmdeploy/cli/utils.py | Adds CLI arguments for SSM prefix-cache state settings. |
| lmdeploy/cli/serve.py | Wires new prefix-cache CLI settings into API server config. |
| lmdeploy/cli/cli.py | Adds new prefix-cache CLI settings to chat parser. |
| eval/eval.py | Adds LongBenchV2 selection and optional judger handling. |
| eval/config.py | Adds LongBenchV2 config and tunable OpenCompass parameters. |
| autotest/utils/run_restful_chat.py | Normalizes an MCQ docstring character. |
| tests/pytorch/paging/test_block_trie.py | Adds coverage for multimodal keys, routed experts, and SSM checkpoint lifecycle. |
| tests/pytorch/paging/test_scheduler.py | Adds SSM state scheduling, rollback, migration, and long-context prefix-cache tests. |
| tests/pytorch/engine/test_inputs_maker.py | Adds long-context and compact SSM offset tests. |
| tests/pytorch/engine/test_executor_base.py | Adds executor/config coverage for prefix-cache state settings. |
| tests/pytorch/engine/test_cache_engine.py | Adds state-cache copy validation tests. |
| tests/test_lmdeploy/test_vl/test_preprocess_utils.py | Adds video multimodal expansion tests. |
Summary
This PR extends PyTorch prefix caching so it is correct for multimodal/VLM requests and SSM/state-cache models, and adds best-effort routed-expert replay for prefix-cache hits.
The main goals are:
Key Changes
Multimodal Prefix-Cache Identity
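As a rough illustration of moving from token-only keys to token + multimodal extra hashes, the sketch below chains a per-block hash and mixes in the identity of any multimodal item that overlaps the block. It is not the PR's implementation: `MultiModalSpan`, `hash_content`, and `block_keys` are hypothetical names, SHA-256 is an assumed hash, and only the field names `start`, `end`, `modality`, and `content_hash` come from this description.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class MultiModalSpan:
    """Hypothetical stand-in for one multimodal item in a prompt."""
    start: int         # first token position covered by the item
    end: int           # one past the last covered token position
    modality: str      # e.g. "image" or "video"
    content_hash: str  # deterministic digest of the raw content


def hash_content(data: bytes) -> str:
    """Deterministic content digest (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def block_keys(token_ids, spans, block_size):
    """Return one chained key per full block of tokens.

    Each key covers the parent key, the block's token ids, and every
    multimodal span that overlaps the block.
    """
    keys = []
    parent = ""
    for i in range(len(token_ids) // block_size):
        lo, hi = i * block_size, (i + 1) * block_size
        h = hashlib.sha256()
        h.update(parent.encode())
        h.update(repr(tuple(token_ids[lo:hi])).encode())
        for s in spans:
            if s.start < hi and s.end > lo:
                # Positions are stored relative to the block start.
                extra = f"{s.modality}:{s.content_hash}:{s.start - lo}:{s.end - lo}"
                h.update(extra.encode())
        parent = h.hexdigest()
        keys.append(parent)
    return keys


# Same placeholder tokens, different image content.
tokens = [1, 2, 3, 4, 9, 9, 9, 9, 5, 6, 7, 8]
img_a = [MultiModalSpan(4, 8, "image", hash_content(b"cat"))]
img_b = [MultiModalSpan(4, 8, "image", hash_content(b"dog"))]
keys_a = block_keys(tokens, img_a, block_size=4)
keys_b = block_keys(tokens, img_b, block_size=4)
```

With token-only keys the two prompts would share all three blocks, because the image placeholder tokens are identical. Here only the first block (before the image) matches; the image block and every block chained after it get distinct keys.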
- `start`/`end`, `modality`, `content_hash`
- `BlockTrie` keys from token-only matching to token + multimodal extra hashes.
- `HistoryMultiModals` paths.
SSM / State-Cache Prefix Caching
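For the sorted/coalesced offset copies that this section attributes to `StateCacheEngine.copy_caches()`, the general idea can be sketched in plain Python: sort the (source, destination) slot pairs and merge neighbours that are contiguous on both sides, so that a few range copies replace many single-slot copies. The helpers `coalesce_copies` and `apply_runs` are hypothetical and do not reflect the actual signature or internals of `copy_caches()`. The sketch assumes source and destination slots are disjoint.

```python
def coalesce_copies(pairs):
    """Merge (src, dst) slot pairs into (src_start, dst_start, length) runs.

    Pairs are sorted first; a pair extends the previous run only when
    both its source and its destination continue that run.
    """
    runs = []
    for src, dst in sorted(pairs):
        if runs:
            s, d, n = runs[-1]
            if src == s + n and dst == d + n:
                runs[-1] = (s, d, n + 1)
                continue
        runs.append((src, dst, 1))
    return runs


def apply_runs(slots, runs):
    """Apply the coalesced runs to a list standing in for state slots."""
    for s, d, n in runs:
        slots[d:d + n] = slots[s:s + n]


runs = coalesce_copies([(5, 9), (1, 10), (2, 11), (3, 12)])
slots = list("abcdefghijklm")
apply_runs(slots, runs)
```

In this example four single-slot copies collapse into two runs: slots 1..3 to 10..12, and slot 5 to 9.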
- `prefix_cache_decode_state_interval=0` as “disable decode checkpoint saves only”; prefill/chunk checkpoint saves may still work.
- `prefix_cache_state_budget` represent extra checkpoint capacity; budget `0` may still borrow idle runtime state slots.
- `StateCacheEngine.copy_caches()` support for sorted/coalesced offset copies.
Routed Expert Replay
Metrics And Debuggability
Misc
Notes / Limitations