Refactor prefix caching #4618

Open
grimoire wants to merge 20 commits into InternLM:main from grimoire:refactor-prefix-caching

Collaborator

@grimoire grimoire commented May 24, 2026

Summary

This PR extends PyTorch prefix caching so it is correct for multimodal/VLM requests and SSM/state-cache models, and adds best-effort routed-expert replay for prefix-cache hits.

The main goals are:

  • prevent false VLM prefix-cache hits when identical placeholder tokens represent different media
  • enable SSM prefix caching only when an exact recurrent-state checkpoint is available
  • keep prefix-cache behavior observable through hit-rate metrics
  • preserve existing text-only prefix-cache behavior and eviction/refcount semantics

Key Changes

Multimodal Prefix-Cache Identity

  • Add prefix-cache multimodal metadata for scheduler sequences:
    • span start / end
    • modality
    • deterministic content_hash
  • Hash multimodal content only when PyTorch scheduler prefix caching is effectively enabled and can use the hash.
  • Extend BlockTrie keys from token-only matching to token + multimodal extra hashes.
  • Clamp prefix-cache matches so they never stop inside a multimodal span.
  • Keep VLM prefix caching limited to supported PyTorch new-preprocess HistoryMultiModals paths.
  • Keep prefix caching disabled for unsupported paths, TurboMind VLM paths, legacy VLM paths, and sliding-window attention.
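
A rough illustration of the two ideas above (multimodal-aware block keys and span clamping), not the actual BlockTrie code. `MMSpan`, `block_key`, and `clamp_match` are hypothetical names invented for this sketch, and the real implementation may also align matches to block boundaries, which is omitted here.

```python
import hashlib
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MMSpan:
    """Hypothetical stand-in for the per-sequence multimodal metadata."""
    start: int         # first placeholder token index (inclusive)
    end: int           # one past the last placeholder token index
    modality: str
    content_hash: str  # deterministic hash of the media content


def block_key(parent_key: str, tokens: Sequence[int], block_start: int,
              spans: Sequence[MMSpan]) -> str:
    """Key for one block: parent key + tokens + hashes of overlapping spans."""
    block_end = block_start + len(tokens)
    extra = [(s.modality, s.content_hash) for s in spans
             if s.start < block_end and s.end > block_start]
    payload = repr((parent_key, tuple(tokens), tuple(extra))).encode()
    return hashlib.sha256(payload).hexdigest()


def clamp_match(matched: int, spans: Sequence[MMSpan]) -> int:
    """Pull a match length back so it never ends strictly inside a span."""
    for s in spans:
        if s.start < matched < s.end:
            return s.start
    return matched
```

With this shape, two requests that share the same placeholder tokens but carry different media produce different keys for any block that overlaps the media span, while text-only blocks keep matching on tokens alone.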

SSM / State-Cache Prefix Caching

  • Add SSM prefix-cache support using exact ready recurrent-state checkpoints.
  • Add state checkpoint reserve/commit/restore/release handling through scheduler, trie, state manager, and cache engine paths.
  • Keep prefix_cache_decode_state_interval=0 meaning “disable decode checkpoint saves only”; prefill/chunk checkpoint saves may still occur.
  • Make prefix_cache_state_budget represent extra checkpoint capacity; budget 0 may still borrow idle runtime state slots.
  • Avoid CUDA synchronization in model-agent state-copy offset handling.
  • Add backend-free StateCacheEngine.copy_caches() support for sorted/coalesced offset copies.
  • Handle stale/invalid checkpoint candidates case by case:
    • request-local token/hash mismatches do not release global checkpoints
    • stale sparse-index entries are unindexed
    • detached/unready unpinned checkpoints are released
    • pinned stale checkpoints are preserved for the restore owner to release
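
For the sorted/coalesced offset copies mentioned above, a minimal sketch of the grouping step, assuming offsets arrive as plain Python `(src, dst)` slot pairs. `coalesce_copies` is a hypothetical helper, not the signature of `StateCacheEngine.copy_caches()`.

```python
from typing import Iterable, List, Tuple


def coalesce_copies(pairs: Iterable[Tuple[int, int]]) -> List[Tuple[int, int, int]]:
    """Merge (src, dst) slot pairs into (src_start, dst_start, length) runs."""
    runs: List[Tuple[int, int, int]] = []
    for src, dst in sorted(pairs):
        if runs:
            s0, d0, n = runs[-1]
            # Extend the current run only if both sides stay contiguous.
            if src == s0 + n and dst == d0 + n:
                runs[-1] = (s0, d0, n + 1)
                continue
        runs.append((src, dst, 1))
    return runs
```

Each run can then be issued as one slice copy instead of one copy per slot. Keeping the offsets on the host as plain integers is one way to avoid reading values back from the device when planning the copies.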

Routed Expert Replay

  • Store optional per-block routed expert history on trie nodes.
  • Replay cached routed experts on prefix-cache hit when:
    • the target request asked for routed experts
    • all matched nodes in the restored range have expert data
    • sequence history alignment is still valid
  • Allow trie nodes created without expert data to be enriched later by equivalent recomputed requests.
  • Keep routed experts out of prefix-cache identity.
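
The replay conditions above could be sketched roughly as follows. `Node`, `replay_experts`, and `enrich` are hypothetical names, and treating "history alignment" as a simple token-count check is an assumption made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    """Hypothetical trie node: one KV block plus optional expert history."""
    num_tokens: int
    experts: Optional[List[int]] = None  # one entry per token, or None


def replay_experts(nodes: Sequence[Node], want_experts: bool,
                   history_len: int) -> Optional[List[int]]:
    """Return cached expert history for the matched range, or None to skip."""
    if not want_experts:
        return None
    if any(n.experts is None for n in nodes):
        return None  # a missing slice disables replay, not the cache hit
    if sum(n.num_tokens for n in nodes) != history_len:
        return None  # history no longer lines up with the matched blocks
    return [e for n in nodes for e in n.experts]


def enrich(node: Node, experts: List[int]) -> None:
    """Fill in expert data later; never overwrite existing data."""
    if node.experts is None and len(experts) == node.num_tokens:
        node.experts = list(experts)
```

Returning `None` only skips the replay. The KV/state match itself is decided elsewhere, which mirrors the point that routed experts stay out of prefix-cache identity.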

Metrics And Debuggability

  • Export existing prefix-cache hit rate through schedule metrics / Prometheus.
  • Add focused prefix-cache debug logs for important match/save/restore paths.
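
As a point of reference for the hit-rate metric, a minimal sketch assuming it is defined as matched tokens over queried tokens (the PR does not state the exact definition). `PrefixCacheStats` is a hypothetical class; the rollback method reflects the tentative-match rollback mentioned in the file summary.

```python
class PrefixCacheStats:
    """Hypothetical counter; hit rate = matched tokens / queried tokens."""

    def __init__(self) -> None:
        self.num_query_tokens = 0
        self.num_hit_tokens = 0

    def record(self, query_tokens: int, hit_tokens: int) -> None:
        """Count one prefix-cache lookup."""
        self.num_query_tokens += query_tokens
        self.num_hit_tokens += hit_tokens

    def rollback(self, query_tokens: int, hit_tokens: int) -> None:
        """Undo a tentative lookup that was not scheduled after all."""
        self.num_query_tokens -= query_tokens
        self.num_hit_tokens -= hit_tokens

    def hit_rate(self) -> float:
        """Fraction of queried tokens served from the prefix cache."""
        if self.num_query_tokens == 0:
            return 0.0
        return self.num_hit_tokens / self.num_query_tokens
```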

Misc

  • Add LongBenchV2 eval config/script wiring for benchmark use.

Notes / Limitations

  • Sliding-window prefix caching remains out of scope.
  • SSM prefix-cache hits require an exact ready recurrent-state checkpoint; KV-only SSM reuse is intentionally not allowed.
  • Routed expert replay is best-effort. Missing expert slices do not block KV/state prefix-cache reuse.
  • VLM prefix caching is intentionally narrow for this PR and only enabled on the supported PyTorch multimodal metadata path.

@grimoire grimoire changed the title from "[WIP] Refactor prefix caching" to "Refactor prefix caching" May 27, 2026
@grimoire grimoire marked this pull request as ready for review May 27, 2026 04:44
Copilot AI review requested due to automatic review settings May 27, 2026 04:44
Contributor

Copilot AI left a comment

Pull request overview

This PR refactors PyTorch prefix caching to support multimodal identity, SSM/state-cache checkpoints, routed-expert replay, and observability while keeping unsupported VL/spec/sliding-window paths disabled.

Changes:

  • Adds multimodal-aware prefix-cache keys and content hashing for supported PyTorch VL paths.
  • Adds SSM prefix-cache checkpoint lifecycle across scheduler, trie, state manager, input creation, model forward, and engine loop.
  • Adds routed-expert replay, prefix-cache metrics, CLI/config options, and LongBenchV2 eval wiring.

Reviewed changes

Copilot reviewed 38 out of 38 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| lmdeploy/pytorch/paging/block_trie.py | Adds multimodal keys, SSM checkpoint indexing/lifecycle, routed-expert replay, and stats rollback support. |
| lmdeploy/pytorch/paging/scheduler.py | Coordinates tentative prefix matches, rollback, SSM state availability, and migration match fix. |
| lmdeploy/pytorch/paging/state_manager.py | Splits shared state slots into runtime and checkpoint ownership. |
| lmdeploy/pytorch/engine/inputs_maker.py | Builds compact SSM restore/save offsets and adjusts long-context multimodal chunking. |
| lmdeploy/pytorch/engine/model_agent/agent.py | Copies SSM checkpoint state before/after model forward. |
| lmdeploy/pytorch/engine/engine_loop.py | Publishes checkpoint saves/restores around forward prefetching. |
| lmdeploy/pytorch/engine/cache_engine.py | Adds validated state-cache slot copy support. |
| lmdeploy/pytorch/messages.py | Adds prefix-cache metadata/state and multimodal clamp/hash helpers. |
| lmdeploy/pytorch/multimodal/data_type.py | Adds deterministic multimodal content hashing. |
| lmdeploy/pytorch/model_inputs.py | Carries compact SSM restore/save offsets through model inputs. |
| lmdeploy/pytorch/config.py | Adds prefix-cache SSM budget and decode checkpoint interval config. |
| lmdeploy/messages.py | Exposes new PyTorch engine config fields. |
| lmdeploy/pytorch/engine/executor/base.py | Keeps SSM prefix cache enabled and disables prefix cache for spec decoding. |
| lmdeploy/pytorch/engine/config_builder.py | Propagates new prefix-cache config fields. |
| lmdeploy/pytorch/engine/engine.py | Precomputes multimodal hashes when prefix caching is enabled. |
| lmdeploy/pytorch/block.py | Removes prefix-cache node state from logical block container. |
| lmdeploy/pytorch/paging/seq_states/states.py | Releases/discards prefix-cache state on sequence free. |
| lmdeploy/pytorch/paging/eviction_helper/recompute_eviction_helper.py | Evicts SSM checkpoints to recover runtime state slots. |
| lmdeploy/pytorch/strategies/ar/model_inputs.py | Preserves SSM save offsets during decoding reindex. |
| lmdeploy/pytorch/strategies/ar/step_inputs.py | Threads SSM save offsets through AR input reindexing. |
| lmdeploy/pytorch/strategies/ar_spec/step_inputs.py | Threads SSM save offsets through AR-spec reindexing. |
| lmdeploy/pytorch/strategies/dllm/step_inputs.py | Threads SSM save offsets through DLLM reindexing. |
| lmdeploy/vl/model/preprocess_utils.py | Expands single/per-video video multimodal items. |
| lmdeploy/serve/core/vl_async_engine.py | Enables VL prefix caching only for supported PyTorch new-preprocess paths. |
| lmdeploy/pytorch/models/qwen3_vl.py | Normalizes multimodal metadata construction. |
| lmdeploy/metrics/loggers.py | Exports prefix-cache hit rate to Prometheus/log metrics. |
| lmdeploy/cli/utils.py | Adds CLI arguments for SSM prefix-cache state settings. |
| lmdeploy/cli/serve.py | Wires new prefix-cache CLI settings into API server config. |
| lmdeploy/cli/cli.py | Adds new prefix-cache CLI settings to chat parser. |
| eval/eval.py | Adds LongBenchV2 selection and optional judger handling. |
| eval/config.py | Adds LongBenchV2 config and tunable OpenCompass parameters. |
| autotest/utils/run_restful_chat.py | Normalizes an MCQ docstring character. |
| tests/pytorch/paging/test_block_trie.py | Adds coverage for multimodal keys, routed experts, and SSM checkpoint lifecycle. |
| tests/pytorch/paging/test_scheduler.py | Adds SSM state scheduling, rollback, migration, and long-context prefix-cache tests. |
| tests/pytorch/engine/test_inputs_maker.py | Adds long-context and compact SSM offset tests. |
| tests/pytorch/engine/test_executor_base.py | Adds executor/config coverage for prefix-cache state settings. |
| tests/pytorch/engine/test_cache_engine.py | Adds state-cache copy validation tests. |
| tests/test_lmdeploy/test_vl/test_preprocess_utils.py | Adds video multimodal expansion tests. |
