Refactor prefix caching by grimoire · Pull Request #4618 · InternLM/lmdeploy

grimoire · 2026-05-24T05:59:41Z

Summary

This PR extends PyTorch prefix caching so it is correct for multimodal/VLM requests and SSM/state-cache models, and adds best-effort routed-expert replay for prefix-cache hits.

The main goals are:

- prevent false VLM prefix-cache hits when identical placeholder tokens represent different media
- enable SSM prefix caching only when an exact recurrent-state checkpoint is available
- keep prefix-cache behavior observable through hit-rate metrics
- preserve existing text-only prefix-cache behavior and eviction/refcount semantics

Key Changes

Multimodal Prefix-Cache Identity

- Add prefix-cache multimodal metadata for scheduler sequences:
  - span start / end
  - modality
  - deterministic `content_hash`
- Hash multimodal content only when PyTorch scheduler prefix caching is actually in effect and can use the hash.
- Extend `BlockTrie` keys from token-only matching to token + multimodal extra hashes.
- Clamp prefix-cache matches so they never stop inside a multimodal span.
- Limit VLM prefix caching to the supported PyTorch new-preprocess `HistoryMultiModals` paths.
- Keep prefix caching disabled for unsupported paths, TurboMind VLM paths, legacy VLM paths, and sliding-window attention.
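
The key extension and the clamp rule above can be illustrated with a minimal sketch. This is not the PR's `BlockTrie` code: `MMSpan`, `hash_content`, `block_key`, and `clamp_match` are hypothetical names, the span is assumed to be a half-open token range, spans are assumed not to overlap, and SHA-256 is only a stand-in for whatever deterministic hash the PR uses.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class MMSpan:
    """Hypothetical multimodal span covering tokens [start, end)."""
    start: int
    end: int
    modality: str
    content_hash: str


def hash_content(data: bytes) -> str:
    """Deterministic content hash (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).hexdigest()


def block_key(tokens, block_start, spans):
    """Build a block key from its tokens plus every span overlapping the block."""
    block_end = block_start + len(tokens)
    extra = tuple(
        (s.modality, s.content_hash, s.start - block_start)
        for s in spans
        if s.start < block_end and s.end > block_start
    )
    return (tuple(tokens), extra)


def clamp_match(num_matched, spans):
    """Pull a match length back so it never ends strictly inside a span."""
    for s in spans:
        if s.start < num_matched < s.end:
            return s.start
    return num_matched
```

With this shape, two requests whose placeholder tokens are identical but whose images differ produce different keys, and a match of 4 tokens against a span covering tokens 2 to 6 is clamped back to 2.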

SSM / State-Cache Prefix Caching

- Add SSM prefix-cache support using exact, ready recurrent-state checkpoints.
- Add state checkpoint reserve/commit/restore/release handling through the scheduler, trie, state manager, and cache engine paths.
- Keep `prefix_cache_decode_state_interval=0` meaning "disable decode checkpoint saves only"; prefill/chunk checkpoint saves may still happen.
- Make `prefix_cache_state_budget` represent extra checkpoint capacity; a budget of 0 may still borrow idle runtime state slots.
- Avoid CUDA synchronization in model-agent state-copy offset handling.
- Add backend-free `StateCacheEngine.copy_caches()` support for sorted/coalesced offset copies.
- Handle stale/invalid checkpoint candidates carefully:
  - request-local token/hash mismatches do not release global checkpoints
  - stale sparse-index entries are unindexed
  - detached/unready unpinned checkpoints are released
  - pinned stale checkpoints are preserved for the restore owner to release
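
As a rough illustration of what "sorted/coalesced offset copies" could mean, the sketch below merges individual slot copies into contiguous runs. It is not the actual `StateCacheEngine.copy_caches()` implementation: `coalesce_copies` is a hypothetical helper, and the duplicate-destination check is an assumed form of validation.

```python
def coalesce_copies(pairs):
    """Merge (src, dst) slot copies into (src_start, dst_start, length) runs.

    Assumed validation: two copies must not write the same destination slot.
    """
    pairs = list(pairs)
    dsts = [dst for _, dst in pairs]
    if len(set(dsts)) != len(dsts):
        raise ValueError("duplicate destination slot in state copy")

    runs = []
    for src, dst in sorted(pairs):
        if runs:
            run_src, run_dst, length = runs[-1]
            # Extend the current run when both offsets continue it.
            if src == run_src + length and dst == run_dst + length:
                runs[-1] = (run_src, run_dst, length + 1)
                continue
        runs.append((src, dst, 1))
    return runs
```

Each resulting run could then be issued as one slice copy instead of one copy per slot, which is the usual motivation for sorting and coalescing.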

Routed Expert Replay

- Store optional per-block routed expert history on trie nodes.
- Replay cached routed experts on a prefix-cache hit when:
  - the target request asked for routed experts
  - all matched nodes in the restored range have expert data
  - sequence history alignment is still valid
- Allow trie nodes created without expert data to be enriched later by equivalent recomputed requests.
- Keep routed experts out of prefix-cache identity.
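
The replay conditions above can be read as a simple all-or-nothing gate. The sketch below is illustrative only: `Node`, `replay_experts`, and `enrich` are hypothetical names, and "history alignment" is modeled here as the sequence's existing expert history ending exactly where the restored range begins, which is an assumption rather than something the PR description states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical trie node; `experts` is None when no history was stored."""
    tokens: tuple
    experts: Optional[list] = None


def replay_experts(nodes, wants_experts, seq_expert_len, restore_start):
    """Return cached expert history for the matched nodes, or None to skip."""
    if not wants_experts:
        return None
    # Assumed alignment rule: existing history must end where restore begins.
    if seq_expert_len != restore_start:
        return None
    if any(node.experts is None for node in nodes):
        return None
    replayed = []
    for node in nodes:
        replayed.extend(node.experts)
    return replayed


def enrich(node, experts):
    """Attach expert data to a node created without it; never overwrite."""
    if node.experts is None:
        node.experts = list(experts)
```

Returning `None` only skips expert replay; in this sketch it has no effect on whether the KV/state prefix is reused, and node identity (`tokens`) never includes `experts`.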

Metrics And Debuggability

- Export existing prefix-cache hit rate through schedule metrics / Prometheus.
- Add focused prefix-cache debug logs for important match/save/restore paths.
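
For readers unfamiliar with how a hit rate is typically tracked, here is a minimal sketch. It assumes a token-level definition (hit tokens divided by queried tokens) and includes a rollback for tentative matches that are later undone; `PrefixCacheStats` and its fields are hypothetical and are not the metrics classes touched by this PR.

```python
class PrefixCacheStats:
    """Token-level prefix-cache counters (definition assumed for this sketch)."""

    def __init__(self):
        self.num_query_tokens = 0
        self.num_hit_tokens = 0

    def record(self, query_tokens, hit_tokens):
        """Account for one match attempt."""
        self.num_query_tokens += query_tokens
        self.num_hit_tokens += hit_tokens

    def rollback(self, query_tokens, hit_tokens):
        """Undo a previously recorded tentative match."""
        self.num_query_tokens -= query_tokens
        self.num_hit_tokens -= hit_tokens

    @property
    def hit_rate(self):
        """Fraction of queried tokens served from the prefix cache."""
        if self.num_query_tokens == 0:
            return 0.0
        return self.num_hit_tokens / self.num_query_tokens
```

A value like `hit_rate` is naturally exported as a gauge, since it can move in both directions as traffic changes.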

Misc

- Add LongBenchV2 eval config/script wiring for benchmark use.

Notes / Limitations

- Sliding-window prefix caching remains out of scope.
- SSM prefix-cache hits require an exact ready recurrent-state checkpoint; KV-only SSM reuse is intentionally not allowed.
- Routed expert replay is best-effort. Missing expert slices do not block KV/state prefix-cache reuse.
- VLM prefix caching is intentionally narrow for this PR and only enabled on the supported PyTorch multimodal metadata path.
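
The "no KV-only SSM reuse" rule can be pictured as truncating a KV match to the longest prefix that also has a ready state checkpoint. The sketch below makes two assumptions that are not stated in the PR description: checkpoints sit on block boundaries, and they are looked up by prefix length in tokens. `usable_ssm_match` is a hypothetical helper.

```python
def usable_ssm_match(kv_matched_blocks, ready_checkpoints, block_size):
    """Longest matched length (in tokens) that ends at a ready checkpoint.

    `ready_checkpoints` is a set of prefix lengths with a ready recurrent
    state. Returns 0 when no such prefix exists, meaning a full recompute.
    """
    for num_blocks in range(kv_matched_blocks, 0, -1):
        length = num_blocks * block_size
        if length in ready_checkpoints:
            return length
    return 0
```

For example, with five matched 64-token KV blocks and ready checkpoints only at 128 and 192 tokens, the usable hit is 192 tokens; the remaining matched KV blocks are not reused because no recurrent state exists for them.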

Copilot

Pull request overview

This PR refactors PyTorch prefix caching to support multimodal identity, SSM/state-cache checkpoints, routed-expert replay, and observability while keeping unsupported VL/spec/sliding-window paths disabled.

Changes:

- Adds multimodal-aware prefix-cache keys and content hashing for supported PyTorch VL paths.
- Adds SSM prefix-cache checkpoint lifecycle across scheduler, trie, state manager, input creation, model forward, and engine loop.
- Adds routed-expert replay, prefix-cache metrics, CLI/config options, and LongBenchV2 eval wiring.

Reviewed changes

Copilot reviewed 38 out of 38 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `lmdeploy/pytorch/paging/block_trie.py` | Adds multimodal keys, SSM checkpoint indexing/lifecycle, routed-expert replay, and stats rollback support. |
| `lmdeploy/pytorch/paging/scheduler.py` | Coordinates tentative prefix matches, rollback, SSM state availability, and migration match fix. |
| `lmdeploy/pytorch/paging/state_manager.py` | Splits shared state slots into runtime and checkpoint ownership. |
| `lmdeploy/pytorch/engine/inputs_maker.py` | Builds compact SSM restore/save offsets and adjusts long-context multimodal chunking. |
| `lmdeploy/pytorch/engine/model_agent/agent.py` | Copies SSM checkpoint state before/after model forward. |
| `lmdeploy/pytorch/engine/engine_loop.py` | Publishes checkpoint saves/restores around forward prefetching. |
| `lmdeploy/pytorch/engine/cache_engine.py` | Adds validated state-cache slot copy support. |
| `lmdeploy/pytorch/messages.py` | Adds prefix-cache metadata/state and multimodal clamp/hash helpers. |
| `lmdeploy/pytorch/multimodal/data_type.py` | Adds deterministic multimodal content hashing. |
| `lmdeploy/pytorch/model_inputs.py` | Carries compact SSM restore/save offsets through model inputs. |
| `lmdeploy/pytorch/config.py` | Adds prefix-cache SSM budget and decode checkpoint interval config. |
| `lmdeploy/messages.py` | Exposes new PyTorch engine config fields. |
| `lmdeploy/pytorch/engine/executor/base.py` | Keeps SSM prefix cache enabled and disables prefix cache for spec decoding. |
| `lmdeploy/pytorch/engine/config_builder.py` | Propagates new prefix-cache config fields. |
| `lmdeploy/pytorch/engine/engine.py` | Precomputes multimodal hashes when prefix caching is enabled. |
| `lmdeploy/pytorch/block.py` | Removes prefix-cache node state from logical block container. |
| `lmdeploy/pytorch/paging/seq_states/states.py` | Releases/discards prefix-cache state on sequence free. |
| `lmdeploy/pytorch/paging/eviction_helper/recompute_eviction_helper.py` | Evicts SSM checkpoints to recover runtime state slots. |
| `lmdeploy/pytorch/strategies/ar/model_inputs.py` | Preserves SSM save offsets during decoding reindex. |
| `lmdeploy/pytorch/strategies/ar/step_inputs.py` | Threads SSM save offsets through AR input reindexing. |
| `lmdeploy/pytorch/strategies/ar_spec/step_inputs.py` | Threads SSM save offsets through AR-spec reindexing. |
| `lmdeploy/pytorch/strategies/dllm/step_inputs.py` | Threads SSM save offsets through DLLM reindexing. |
| `lmdeploy/vl/model/preprocess_utils.py` | Expands single/per-video video multimodal items. |
| `lmdeploy/serve/core/vl_async_engine.py` | Enables VL prefix caching only for supported PyTorch new-preprocess paths. |
| `lmdeploy/pytorch/models/qwen3_vl.py` | Normalizes multimodal metadata construction. |
| `lmdeploy/metrics/loggers.py` | Exports prefix-cache hit rate to Prometheus/log metrics. |
| `lmdeploy/cli/utils.py` | Adds CLI arguments for SSM prefix-cache state settings. |
| `lmdeploy/cli/serve.py` | Wires new prefix-cache CLI settings into API server config. |
| `lmdeploy/cli/cli.py` | Adds new prefix-cache CLI settings to chat parser. |
| `eval/eval.py` | Adds LongBenchV2 selection and optional judger handling. |
| `eval/config.py` | Adds LongBenchV2 config and tunable OpenCompass parameters. |
| `autotest/utils/run_restful_chat.py` | Normalizes an MCQ docstring character. |
| `tests/pytorch/paging/test_block_trie.py` | Adds coverage for multimodal keys, routed experts, and SSM checkpoint lifecycle. |
| `tests/pytorch/paging/test_scheduler.py` | Adds SSM state scheduling, rollback, migration, and long-context prefix-cache tests. |
| `tests/pytorch/engine/test_inputs_maker.py` | Adds long-context and compact SSM offset tests. |
| `tests/pytorch/engine/test_executor_base.py` | Adds executor/config coverage for prefix-cache state settings. |
| `tests/pytorch/engine/test_cache_engine.py` | Adds state-cache copy validation tests. |
| `tests/test_lmdeploy/test_vl/test_preprocess_utils.py` | Adds video multimodal expansion tests. |

grimoire added 20 commits May 23, 2026 18:34

- finish vlm (bfd25d5)
- add context hash (ec8e772)
- early hash (a86ae37)
- ssm prefix caching prefill (6d13d65)
- decoding ssm (9ca3a3b)
- refactor sequence (96f6db4)
- add comment (1aa8773)
- enable when prefix_cache_decode_state_interval=0 (3866b42)
- optimize copy state (9da79a7)
- better copy cache (64991ff)
- easy engine loop func (a6c2dde)
- more fix (baef92a)
- fix end states (90ddc95)
- update block trie (79838a4)
- refactor block trie (22167ae)
- add hit rate metrics (e93959b)
- add longbenchv2 (235b43d)
- fix (efd2ffc)
- Merge branch 'main' into refactor-prefix-caching (8c8a0ba)
- add check and raise (ec728f7)

grimoire changed the title ~~[WIP] Refactor prefix caching~~ Refactor prefix caching May 27, 2026

grimoire marked this pull request as ready for review May 27, 2026 04:44

Copilot AI review requested due to automatic review settings May 27, 2026 04:44

Copilot started reviewing on behalf of grimoire May 27, 2026 04:44

Copilot AI reviewed May 27, 2026

grimoire wants to merge 20 commits into InternLM:main from grimoire:refactor-prefix-caching
