Run cost-usage corpus scans off the Swift cooperative thread pool by ProspectOre · Pull Request #1402 · steipete/CodexBar

ProspectOre · 2026-06-10T20:21:35Z

Summary

Run cost-usage corpus scans and persisted-cache decoding on a dedicated serial utility queue (CostUsageScanExecutor) instead of inline on Swift's cooperative thread pool, with task cancellation bridged into the scanner-level cancellation checks.

Context

This targets the systemic freeze mechanism behind the #1387 reports (and the environment of #1392). CostUsageScanner / PiSessionCostScanner scans are synchronous and can run for minutes on large session archives. CostUsageFetcher.loadTokenSnapshot executed them inline inside cooperative-pool tasks, which violates the pool's forward-progress contract: while a scan runs, it owns a pool thread for minutes, and overlapping provider scans (startup force-refresh racing the hourly token timer) own several. Every other async continuation in the process — menu rebuild tasks, refresh completions — queues behind them, so the UI freezes while the main thread itself samples idle.

Two amplifiers make this a release-day event for heavy users:

Cache schema bumps (e.g. the claude artifact moving v2 → v4 in 0.32.6-dev) invalidate the whole per-file scan cache, so the first launch after such a release rescans the full corpus.
The reporter machine for this PR holds a 2.5 GB corpus (903 MB ~/.claude/projects, 1.6 GB / 316 files ~/.codex/sessions); the post-bump rescan ran 7+ minutes at full tilt.

Field sample of that storm (v0.32.6-dev pre-fix, 1 ms interval, 5 s window, redacted to frame counts): 2,437 samples in the claude scan and 1,807 in the codex scan concurrently on two cooperative-pool threads, sustained across every capture window for ~3 minutes, while menu opens visibly froze — with the main thread parked in mach_msg the whole time.
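To illustrate why a schema bump turns an incremental tick into a full-corpus rescan, here is a minimal sketch of a per-file cache check keyed on modification time and size. The real cache layout is not shown in this thread, so `FileStamp`, `ScanCache`, and `filesNeedingRescan` are hypothetical names and the fields are assumptions.

```swift
import Foundation

/// Hypothetical per-file record; the real cache's fields are not shown in this thread.
struct FileStamp: Equatable {
    let modifiedAt: Date
    let size: Int
}

/// Hypothetical persisted cache: a schema version plus one stamp per scanned file.
struct ScanCache {
    var schemaVersion: Int
    var stamps: [String: FileStamp]
}

/// Returns the sorted paths that must be rescanned.
/// A schema mismatch discards every cached entry, so the whole corpus is returned.
func filesNeedingRescan(
    current: [String: FileStamp],
    cache: ScanCache,
    expectedSchema: Int
) -> [String] {
    let usable = cache.schemaVersion == expectedSchema ? cache.stamps : [:]
    // A file is rescanned when it has no usable entry or its mtime/size changed.
    return current.filter { usable[$0.key] != $0.value }.keys.sorted()
}
```

With a warm cache only changed files come back. With a cache written under an older schema version every file does, which is the release-day storm described above.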

Change

New CostUsageScanExecutor: one serial DispatchQueue (com.steipete.codexbar.cost-usage-scan, utility QoS). run bridges Swift task cancellation into a @Sendable check handed to the scanners, and work cancelled while still queued resumes immediately with CancellationError instead of waiting behind an in-flight scan.
CostUsageFetcher.loadTokenSnapshot moves the scan + vertex-fallback + pi-merge block onto the executor; loadCachedCodexTokenSnapshot moves its multi-megabyte cache JSON decode there too (it previously ran on the pool via Task.detached).
Serialization also removes concurrent provider scans hammering the same disk.
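The general shape of such an executor can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's CostUsageScanExecutor: `ScanExecutorSketch`, `CancelFlag`, `sumUnlessCancelled`, and the queue label `example.cost-usage-scan` are hypothetical. Unlike the executor described above, this version does not resume queued work immediately on cancellation; a cancelled item fails only once the serial queue reaches it. Immediate resumption would need extra locked state to guarantee the continuation is resumed exactly once.

```swift
import Dispatch
import Foundation

/// Hypothetical sketch of a serial off-pool executor with cancellation bridging.
final class ScanExecutorSketch: @unchecked Sendable {
    /// DispatchQueue(label:qos:) creates a serial queue by default.
    private let queue = DispatchQueue(label: "example.cost-usage-scan", qos: .utility)

    /// Lock-protected flag flipped by the task-cancellation handler.
    private final class CancelFlag: @unchecked Sendable {
        private let lock = NSLock()
        private var cancelled = false
        func cancel() { lock.lock(); cancelled = true; lock.unlock() }
        var isCancelled: Bool { lock.lock(); defer { lock.unlock() }; return cancelled }
    }

    /// Runs synchronous `work` on the serial queue and hands it a cancellation check.
    func run<T: Sendable>(
        _ work: @escaping @Sendable (_ isCancelled: @escaping @Sendable () -> Bool) throws -> T
    ) async throws -> T {
        let flag = CancelFlag()
        let queue = self.queue
        return try await withTaskCancellationHandler {
            try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<T, Error>) in
                queue.async {
                    // Work cancelled while it was still queued never starts.
                    if flag.isCancelled {
                        continuation.resume(throwing: CancellationError())
                        return
                    }
                    do {
                        let value = try work({ flag.isCancelled })
                        continuation.resume(returning: value)
                    } catch {
                        continuation.resume(throwing: error)
                    }
                }
            }
        } onCancel: {
            // Bridge Swift task cancellation into the flag the work polls.
            flag.cancel()
        }
    }
}

/// Hypothetical scanner-style loop that polls the bridged check between units of work.
func sumUnlessCancelled(_ values: [Int], isCancelled: () -> Bool) throws -> Int {
    var total = 0
    for value in values {
        if isCancelled() { throw CancellationError() }
        total += value
    }
    return total
}
```

A caller awaits `run` from any task. The awaiting task suspends rather than blocking a cooperative-pool thread, and the synchronous loop runs on the dedicated queue, polling the check it was handed.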

Validation

swift test --filter CostUsageScanExecutorTests — 5 new tests: labeled-queue execution, error propagation, overlap serialization, in-flight cancellation through the bridged check, queued-work cancellation.
swift test --filter CostUsageScanExecutorLinuxTests — 2 Linux-compatible executor tests pass.
swift test --filter CostUsage — 183 tests in 18 suites pass.
make check — 0 violations; git diff --check clean.
Structured autoreview — no accepted/actionable findings.

Runtime Proof

A/B on the same machine, same protocol per phase: quit app → delete ~/Library/Caches/CodexBar/cost-usage → relaunch → sample CodexBar 5 1 at T+30s during the full-corpus rescan. Thread placement of the scan frames:

# BEFORE (current main scan code)
3515 Thread_1092763   DispatchQueue_12: com.apple.root.utility-qos.cooperative  (concurrent)
     ... CostUsageScanner.loadClaudeDaily / CostUsageJsonl.scan ...

# AFTER (this branch)
3191 Thread_1094843   DispatchQueue_267: com.steipete.codexbar.cost-usage-scan  (serial)
     ... CostUsageScanner.loadClaudeDaily / CostUsageJsonl.scan ...
# zero scan frames on cooperative-pool threads

The scan can no longer occupy cooperative-pool threads at all, so async-runtime starvation by scans is structurally impossible rather than merely less likely. The post-fix full rescan completed and rewrote the per-provider caches normally (cost data intact).

Honesty / scope notes

This makes the app responsive during scans; it does not make scans cheaper. Steady-state ticks stay incremental via the existing per-file mtime/size cache; the cost of a full rescan after a schema bump is the refresh-policy territory of #1392 (Optimize Codex cost refresh policy for large history windows).
Serializing providers can lengthen total storm wall-time on multi-provider setups (scans no longer overlap), and the 10-minute per-provider timeout now includes time queued behind another provider's scan. On very large corpora that could surface as a timeout that previously squeaked through; the next tick resumes against the warm per-file cache.
The 1.6s AX click-roundtrip measured in both phases is automation overhead, not app latency; the responsiveness claim rests on the structural thread-placement proof plus the documented forward-progress contract of the cooperative pool.

ProspectOre · 2026-06-10T21:16:12Z

@clawsweeper re-review

clawsweeper · 2026-06-10T21:16:16Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

ProspectOre · 2026-06-10T21:52:36Z

@clawsweeper re-run

clawsweeper · 2026-06-10T21:52:39Z

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

Yuxin-Qiao · 2026-06-11T02:58:24Z

Local validation on 136b226469b9343a80cb78cb9aa43f1bdbf23a26:

git diff --check HEAD^ HEAD
HOME=$PWD/.home CLANG_MODULE_CACHE_PATH=$PWD/.module-cache swift test --filter CostUsageScanExecutorTests
HOME=$PWD/.home CLANG_MODULE_CACHE_PATH=$PWD/.module-cache swift test --filter CostUsage

Result:

git diff --check: passed
CostUsageScanExecutorTests: passed, 5 tests in 1 suite
CostUsage: passed, 181 tests in 17 suites
Environment: /Applications/Xcode.app/Contents/Developer, Apple Swift 6.3.2

The executor-focused tests covered the core behavior this PR is adding: dedicated queue execution, overlapping scan serialization, error propagation, queued-work cancellation, and in-flight cancellation through the bridged cancellation check. The wider CostUsage filter also passed, so the scan executor change did not break the surrounding cost usage cache/scanner/fetcher coverage in this local run.

Scope note: I did not reproduce the author's large-corpus runtime sample locally; this is focused local regression validation plus the broader CostUsage suite.

Yuxin-Qiao · 2026-06-11T03:05:33Z

Checked the current merge conflict against origin/main (02c94032). After deepening the shallow checkout, git merge --no-commit --no-ff origin/main reports only one unresolved file: CHANGELOG.md.

The conflict is limited to the 0.32.6 — Unreleased / Fixed list: this PR’s cost-usage scan changelog line needs to be kept alongside the newer main cost-history entry from #1370. I did not see source-level unresolved conflicts in this merge check.

So the rebase looks low-risk: keep both changelog items under 0.32.6 — Unreleased / Fixed, then rerun the focused CostUsage tests/CI. I reset the local merge state after inspection.

CostUsageScanner and PiSessionCostScanner scans execute synchronously for minutes on large session archives. Running them inline on cooperative-pool task threads starves every other async task in the process: menus freeze while the main thread sits idle, and overlapping provider scans multiply the pressure. Field samples on a 2.5GB corpus showed both provider scans saturating pool threads for 7+ minutes after a cache schema bump while menu opens stalled. All corpus scans and persisted-cache decoding now run on one dedicated serial utility queue (CostUsageScanExecutor), with task cancellation bridged into the scanner-level cancellation checks. Serialization also removes concurrent provider scans racing the same disk.

clawsweeper · 2026-06-11T03:47:31Z

Codex review: needs maintainer review before merge. Reviewed June 10, 2026, 11:57 PM ET / 03:57 UTC.

Summary
The PR moves synchronous cost-usage corpus scans, provider fallback/merge work, and persisted-cache decoding onto a dedicated serial utility queue with task-cancellation bridging and focused macOS/Linux tests.

Reproducibility: yes. The PR documents a concrete cache-clear, relaunch, and process-sampling path on a real 2.5 GB corpus, with before/after stack placement demonstrating the current-main starvation mechanism and its removal.

Review metrics: 3 noteworthy metrics.

Patch surface: 5 files; +365/-39. The change is bounded to cost-usage execution, focused tests, and release context.
Focused tests: 7 added. Five macOS and two Linux tests directly exercise scheduling, serialization, errors, and cancellation.
Runtime corpus: 2.5 GB sampled. The contributor validated the actual starvation path on a workload large enough to sustain multi-minute scans.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

Confirm whether serial queue wait should count against each provider's existing timeout budget.

Risk before merge

[P2] One global serial queue prevents concurrent disk-heavy provider scans, but a later provider can wait behind an earlier multi-minute scan; if an outer ten-minute timeout includes that queue wait, an unusually large multi-provider setup may time out where concurrent scans previously completed.
[P1] The patch deliberately improves responsiveness rather than total scan cost, so full rescans after cache invalidation can still run for minutes and remain dependent on the separate refresh-policy work tracked in #1392 (Optimize Codex cost refresh policy for large history windows).

Maintainer options:

Accept the documented timeout edge case (recommended)
Merge after explicitly accepting that extreme multi-provider full rescans may time out and recover on a later refresh using warmed per-file caches.
Preserve each provider's scan budget
Move timeout accounting around active executor work rather than queue wait and add focused coverage for overlapping long provider scans.
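The difference between the two options comes down to where the timeout clock starts. The sketch below models that choice with plain numbers. `ScanTiming`, `TimeoutPolicy`, and `isTimedOut` are hypothetical names, and the 7-minute queue wait is an illustrative value; only the 10-minute per-provider timeout comes from the PR description.

```swift
/// Hypothetical timestamps for one provider scan, in seconds since an arbitrary origin.
struct ScanTiming {
    let enqueuedAt: Double   // when the scan was submitted to the serial queue
    let startedAt: Double    // when the queue actually began running it
}

/// Hypothetical policy switch for where the timeout clock starts.
enum TimeoutPolicy {
    case includesQueueWait   // clock starts at enqueue (the behavior the review flags)
    case activeWorkOnly      // clock starts when the scan begins (the second option)
}

/// Returns true when the elapsed time under the chosen policy exceeds the timeout.
func isTimedOut(
    _ timing: ScanTiming,
    now: Double,
    timeout: Double,
    policy: TimeoutPolicy
) -> Bool {
    let origin = policy == .includesQueueWait ? timing.enqueuedAt : timing.startedAt
    return now - origin > timeout
}

// A provider that waited 7 minutes behind another scan and has scanned for 4 minutes.
let timing = ScanTiming(enqueuedAt: 0, startedAt: 7 * 60)
let now = 11.0 * 60
let tenMinutes = 10.0 * 60
let withWait = isTimedOut(timing, now: now, timeout: tenMinutes, policy: .includesQueueWait)
let activeOnly = isTimedOut(timing, now: now, timeout: tenMinutes, policy: .activeWorkOnly)
print(withWait, activeOnly) // prints "true false"
```

Under the first policy the provider times out after only 4 minutes of active scanning. Under the second it still has 6 minutes of budget left.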

Next step before merge

[P2] The code appears correct and proof is sufficient; the next action is maintainer acceptance or refinement of the disclosed timeout compatibility behavior, not an automated repair.

Security
Cleared: The internal scheduling patch adds no dependencies, downloads, workflow permissions, credential handling, package-resolution changes, or other concrete security or supply-chain concerns.

Review details

Best possible solution:

Keep the dedicated off-pool executor, while explicitly accepting the extreme-corpus timeout behavior or preserving each provider's active-scan timeout budget by excluding serial-queue wait from it.

Do we have a high-confidence way to reproduce the issue?

Yes. The PR documents a concrete cache-clear, relaunch, and process-sampling path on a real 2.5 GB corpus, with before/after stack placement demonstrating the current-main starvation mechanism and its removal.

Is this the best way to solve the issue?

Yes. Moving long synchronous scans to a dedicated serial utility queue is the narrow structural solution to cooperative-pool starvation; the only unresolved choice is whether to accept or adjust the changed timeout accounting for queued providers.

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against febf562741e5.

Label changes

Label justifications:

P2: This addresses severe but workload-dependent menu responsiveness degradation with a focused fix and limited blast radius.
merge-risk: 🚨 compatibility: Serializing all provider scans can reduce the effective timeout budget for a later provider in unusually large existing setups.
rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (logs): The PR contains after-fix real-app process samples showing scan frames on the dedicated serial queue, none on cooperative-pool workers, and successful cache completion on the same large corpus.
proof: sufficient: Contributor real behavior proof is sufficient. The PR contains after-fix real-app process samples showing scan frames on the dedicated serial queue, none on cooperative-pool workers, and successful cache completion on the same large corpus.

Evidence reviewed

What I checked:

Current-main problem path: Current main performs the cancellable cost scanners synchronously inside the async token-snapshot path, so a large corpus scan can occupy a Swift cooperative-pool worker for its full duration. (Sources/CodexBarCore/CostUsageFetcher.swift:146, 920997c6a365)
Dedicated executor: The branch introduces a serial utility DispatchQueue and a locked lifecycle state machine covering pre-install cancellation, queued cancellation, running cancellation checks, and one-time completion. (Sources/CodexBarCore/CostUsageScanExecutor.swift:9, e7aa65290b69)
Fetcher integration: The synchronous scanner, Vertex fallback, Pi report merge, and multi-megabyte persisted-cache decode move behind the executor while preserving the existing report-selection and merge behavior. (Sources/CodexBarCore/CostUsageFetcher.swift:158, e7aa65290b69)
Cancellation and serialization coverage: Five macOS tests cover queue identity, error propagation, overlap serialization, in-flight cancellation, and immediate queued cancellation; two Linux tests cover execution and cancellation. (Tests/CodexBarTests/CostUsageScanExecutorTests.swift:5, e7aa65290b69)
Independent validation: A contributor independently reported the five executor tests and the broader CostUsage suite passing on Apple Swift 6.3.2, while noting that the author's large-corpus runtime sample was not reproduced locally. (136b226469b9)
Real behavior proof: The PR body reports before/after samples from the same real 2.5 GB corpus: scan frames moved from concurrent cooperative utility workers to the named serial queue, with zero post-fix scan frames on the cooperative pool and normal cache rewrites. (Sources/CodexBarCore/CostUsageScanExecutor.swift:10, e7aa65290b69)

Likely related people:

steipete: He is the dominant contributor to the cost fetcher/scanner history, introduced the current released path, and authored this PR's cancellation and test-isolation refinements. (role: feature owner and recent area contributor; confidence: high; commits: 920997c6a365, 01350b23cc6f, e7aa65290b69; files: Sources/CodexBarCore/CostUsageFetcher.swift, Sources/CodexBarCore/CostUsageScanner.swift, Sources/CodexBarCore/CostUsageScanExecutor.swift)
Yuxin-Qiao: They ran the executor-focused and broader CostUsage test suites and inspected the earlier merge conflict, providing useful independent review context. (role: independent validator; confidence: medium; commits: 136b226469b9; files: Sources/CodexBarCore/CostUsageScanExecutor.swift, Tests/CodexBarTests/CostUsageScanExecutorTests.swift)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

clawsweeper · 2026-06-11T03:50:46Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Superseded
Detail: A newer re-review for this item started before this run finished, so GitHub cancelled this older run. Check the latest ClawSweeper run for the current result.
Run: https://github.com/openclaw/clawsweeper/actions/runs/27322540466
Updated: 2026-06-11T03:53:29.271Z

steipete · 2026-06-11T04:41:52Z

Exact-head CI proof completed after merge:

macOS lint-build-test: passed
Linux x64 build/test/smoke: passed
Linux arm64 build/test/smoke: passed
GitGuardian: passed

Run: https://github.com/steipete/CodexBar/actions/runs/27322622815
Merged as 4c964ccc91e0f40bb713b14b58b8e8f901a44c68.

ProspectOre mentioned this pull request Jun 10, 2026

Resolve codex priority turns incrementally per refresh #1404

Merged

ProspectOre mentioned this pull request Jun 10, 2026

Merge Icons mode causes system-wide input freezes / beachballs on macOS 26 — WindowServer event-buffer overflow evidence (not an in-process hang) #1399

Closed

ProspectOre and others added 2 commits June 11, 2026 04:40

fix: make cost scan cancellation immediate

01350b2

steipete force-pushed the perf/cost-scan-off-cooperative-pool branch from 136b226 to 01350b2 Compare June 11, 2026 03:46

test: isolate cost scan executor queues

e7aa652

steipete merged commit 4c964cc into steipete:main Jun 11, 2026
4 checks passed

ProspectOre mentioned this pull request Jun 11, 2026

Persist the codex priority-turns memo across launches #1421

Closed


steipete merged 3 commits into steipete:main from ProspectOre:perf/cost-scan-off-cooperative-pool
