Skip to content

Fix #1273: bug: Thread leak in searcher._retrieve_paths causes container thread exhaustion#1969

Open
Memtensor-AI wants to merge 1 commit into
dev-20260624-v2.0.22from
bugfix/autodev-1273
Open

Fix #1273: bug: Thread leak in searcher._retrieve_paths causes container thread exhaustion#1969
Memtensor-AI wants to merge 1 commit into
dev-20260624-v2.0.22from
bugfix/autodev-1273

Conversation

@Memtensor-AI

Copy link
Copy Markdown
Collaborator

Description

Fixed unbounded thread accumulation in the tree-text search pipeline (issue #1273) by replacing four per-request with ContextThreadPoolExecutor(...) blocks in src/memos/memories/textual/tree_text_memory/retrieve/searcher.py with two shared, class-level executors created once in Searcher.__init__: _search_executor (max_workers=10, prefix search) for the outer A–F retrieval paths and _search_subtask_executor (max_workers=10, prefix search-sub) for the nested sub-paths in _retrieve_from_long_term_and_user, _retrieve_from_tool_memory, and _deduplicate_rawfile_results. The two-pool split eliminates nested-submission deadlock; total worker threads are now bounded at 20 per Searcher regardless of request volume or downstream latency.

Every Future.result(...) and the as_completed(...) iterator in the four affected methods now carry timeout=SEARCH_TASK_TIMEOUT_SECONDS (30 s default). On concurrent.futures.TimeoutError the call site logs a warning identifying the path and skips that contribution, so a single stuck Neo4j/embedding/HTTP call can no longer block the caller indefinitely while threads accumulate.

Tests: added 12 regression tests in tests/memories/textual/test_tree_searcher_thread_pool.py covering executor singleton identity, pool sizing/prefixes, no-per-request-construction guarantee, bounded thread count under 16 sequential searches, timeout propagation to Future.result, hanging-subtask recovery, and per-method use of the correct pool. Real pytest output: 12 new + 4 pre-existing searcher tests pass; the wider tests/memories/textual suite is green (70 passed); all other observed failures across the repo are pre-existing on dev-20260624-v2.0.22 (confirmed via git stash). Ruff lint and format both clean.

Related Issue (Required): Fixes #1273

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (does not change functionality, e.g. code style improvements, linting)
  • Documentation update

How Has This Been Tested?

Automated tests are pending.

  • Unit Test
  • Test Script Or Test Steps (please provide)
  • Pipeline Automated API Test (please provide)

Checklist

  • I have performed a self-review of my own code
  • I have commented my code in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • I have created related documentation issue/PR in MemOS-Docs (if applicable)
  • I have linked the issue to this PR (if applicable)
  • I have mentioned the person who will review this PR

@MatthewZhuang, @CarltonXiang, @syzsunshine219, @World-controller please review this PR.

Reviewer Checklist

Replace per-request ContextThreadPoolExecutor instantiation across the
tree-text search pipeline with two shared class-level executors plus a
30 s per-future timeout. The previous code opened a fresh pool inside
`with` blocks at four call sites (`_retrieve_paths`,
`_retrieve_from_long_term_and_user`, `_retrieve_from_tool_memory`,
`_deduplicate_rawfile_results`); whenever a worker hung on a slow
Neo4j / embedding / HTTP call, `shutdown(wait=True)` never returned
and the worker threads accumulated unboundedly across requests, eventually
exhausting the container's pthread limit (reporter observed 8744
threads; /search returned HTTP 200 with empty results, /chat HTTP 503,
even docker exec failing).

Searcher now owns `_search_executor` (max_workers=10, prefix `search`)
for outer A-F retrieval paths and `_search_subtask_executor`
(max_workers=10, prefix `search-sub`) for the nested sub-paths and the
rawfile-dedup. The two-pool split prevents nested-submission deadlock;
total threads are bounded at 20 per Searcher regardless of request
volume or downstream latency. Every `Future.result(...)` and the
`as_completed(...)` iterator now carry `timeout=SEARCH_TASK_TIMEOUT_SECONDS`
(30 s default); on TimeoutError the call site logs a warning and skips
that contribution so the caller can still return.

Closes #1273
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-generated bug Something isn't working | 功能异常

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants