Run cost-usage corpus scans off the Swift cooperative thread pool #1402
@clawsweeper re-review
🦞🧹 I asked ClawSweeper to review this item again. |
@clawsweeper re-run
🦞👀 Command router queued. I will update this comment with the next step. |
Local validation on Result:
The executor-focused tests covered the core behavior this PR is adding: dedicated queue execution, overlapping scan serialization, error propagation, queued-work cancellation, and in-flight cancellation through the bridged cancellation check. The wider
Scope note: I did not reproduce the author's large-corpus runtime sample locally; this is focused local regression validation plus the broader CostUsage suite.
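As an illustration of the overlap-serialization property listed above, here is a minimal sketch of one way such a check could be written. It does not reproduce the PR's tests: a plain serial `DispatchQueue` stands in for the executor, and `OverlapProbe` and `peakOverlap` are hypothetical names invented for this example.

```swift
import Foundation

/// Counts how many simulated scans run at the same time and records the peak.
/// Hypothetical helper for this sketch only.
final class OverlapProbe: @unchecked Sendable {
    private let lock = NSLock()
    private var active = 0
    private var peakValue = 0

    func enter() {
        lock.lock()
        active += 1
        peakValue = max(peakValue, active)
        lock.unlock()
    }

    func leave() {
        lock.lock()
        active -= 1
        lock.unlock()
    }

    var peak: Int {
        lock.lock(); defer { lock.unlock() }
        return peakValue
    }
}

/// Submits `jobs` short simulated scans to `queue` and returns the highest
/// number that were observed running concurrently.
func peakOverlap(on queue: DispatchQueue, jobs: Int) -> Int {
    let probe = OverlapProbe()
    let group = DispatchGroup()
    for _ in 0..<jobs {
        queue.async(group: group) {
            probe.enter()
            Thread.sleep(forTimeInterval: 0.01) // stand-in for scan work
            probe.leave()
        }
    }
    group.wait() // block until every simulated scan has finished
    return probe.peak
}
```

On a serial queue the returned peak should be 1 regardless of how many jobs are submitted; on a concurrent queue it may be higher, which is the situation the serialization is meant to rule out.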
Checked the current merge conflict against The conflict is limited to the So the rebase looks low-risk: keep both changelog items under
CostUsageScanner and PiSessionCostScanner scans execute synchronously for minutes on large session archives. Running them inline on cooperative-pool task threads starves every other async task in the process: menus freeze while the main thread sits idle, and overlapping provider scans multiply the pressure. Field samples on a 2.5GB corpus showed both provider scans saturating pool threads for 7+ minutes after a cache schema bump while menu opens stalled. All corpus scans and persisted-cache decoding now run on one dedicated serial utility queue (CostUsageScanExecutor), with task cancellation bridged into the scanner-level cancellation checks. Serialization also removes concurrent provider scans racing the same disk.
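The mechanism described above (one serial utility queue, task cancellation bridged into a check the scanner can poll, queued work failing fast on cancellation) can be sketched roughly as follows. This is not the PR's `CostUsageScanExecutor` source: `ScanExecutorSketch`, `ScanState`, and the queue label are assumptions made for illustration, and the real implementation may differ in structure.

```swift
import Foundation

/// One-shot state shared by the cancellation handler and the queued block.
/// Hypothetical helper; not taken from the PR.
private final class ScanState<T: Sendable>: @unchecked Sendable {
    private let lock = NSLock()
    private var continuation: CheckedContinuation<T, Error>?
    private var cancelled = false
    private var started = false

    var isCancelled: Bool {
        lock.lock(); defer { lock.unlock() }
        return cancelled
    }

    /// Stores the continuation, or fails it at once if cancellation already happened.
    func install(_ c: CheckedContinuation<T, Error>) {
        lock.lock()
        let alreadyCancelled = cancelled
        if !alreadyCancelled { continuation = c }
        lock.unlock()
        if alreadyCancelled { c.resume(throwing: CancellationError()) }
    }

    /// Marks cancellation; work that has not started yet is resumed immediately.
    func cancel() {
        lock.lock()
        cancelled = true
        let pending = started ? nil : continuation
        if !started { continuation = nil }
        lock.unlock()
        pending?.resume(throwing: CancellationError())
    }

    /// Called on the queue just before the work runs; false means it was cancelled while queued.
    func begin() -> Bool {
        lock.lock(); defer { lock.unlock() }
        if cancelled { return false }
        started = true
        return true
    }

    /// Delivers the work's result to the awaiting task.
    func finish(_ result: Result<T, Error>) {
        lock.lock()
        let c = continuation
        continuation = nil
        lock.unlock()
        c?.resume(with: result)
    }
}

enum ScanExecutorSketch {
    /// Serial, utility-QoS queue; the label is a placeholder, not the PR's label.
    static let queue = DispatchQueue(label: "example.cost-usage-scan", qos: .utility)

    /// Runs synchronous `work` on the serial queue, off the cooperative pool.
    /// `work` receives a check it can poll to observe task cancellation.
    static func run<T: Sendable>(
        _ work: @escaping @Sendable (_ isCancelled: @Sendable () -> Bool) throws -> T
    ) async throws -> T {
        let state = ScanState<T>()
        return try await withTaskCancellationHandler {
            try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<T, Error>) in
                state.install(continuation)
                queue.async {
                    guard state.begin() else { return } // already resumed with CancellationError
                    state.finish(Result<T, Error>(catching: { try work { state.isCancelled } }))
                }
            }
        } onCancel: {
            state.cancel()
        }
    }
}
```

A caller would wrap its synchronous scan in `try await ScanExecutorSketch.run { isCancelled in ... }` and poll `isCancelled()` between files. The awaiting task suspends instead of occupying a pool thread, which is the property the change relies on.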
136b226 to
01350b2
|
Codex review: needs maintainer review before merge.
Reviewed June 10, 2026, 11:57 PM ET / 03:57 UTC.
Summary
Reproducibility: yes. The PR documents a concrete cache-clear, relaunch, and process-sampling path on a real 2.5 GB corpus, with before/after stack placement demonstrating the current-main starvation mechanism and its removal.
Review metrics: 3 noteworthy metrics.
Merge readiness
Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.
Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security
Review details
Best possible solution: Keep the dedicated off-pool executor, while explicitly accepting the extreme-corpus timeout behavior or preserving each provider's active-scan timeout budget by excluding serial-queue wait from it.
Do we have a high-confidence way to reproduce the issue? Yes. The PR documents a concrete cache-clear, relaunch, and process-sampling path on a real 2.5 GB corpus, with before/after stack placement demonstrating the current-main starvation mechanism and its removal.
Is this the best way to solve the issue? Yes. Moving long synchronous scans to a dedicated serial utility queue is the narrow structural solution to cooperative-pool starvation; the only unresolved choice is whether to accept or adjust the changed timeout accounting for queued providers.
AGENTS.md: found and applied where relevant.
Codex review notes: model internal, reasoning high; reviewed against febf562741e5.
Label changes
Label justifications:
Evidence reviewed
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.
How this review workflow works
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
Exact-head CI proof completed after merge:
Run: https://github.com/steipete/CodexBar/actions/runs/27322622815
Summary
Run cost-usage corpus scans and persisted-cache decoding on a dedicated serial utility queue (CostUsageScanExecutor) instead of inline on Swift's cooperative thread pool, with task cancellation bridged into the scanner-level cancellation checks.
Context
This targets the systemic freeze mechanism behind the #1387 reports (and the environment of #1392).
CostUsageScanner/PiSessionCostScanner scans are synchronous and can run for minutes on large session archives. CostUsageFetcher.loadTokenSnapshot executed them inline inside cooperative-pool tasks, which violates the pool's forward-progress contract: while a scan runs, it owns a pool thread for minutes, and overlapping provider scans (startup force-refresh racing the hourly token timer) own several. Every other async continuation in the process (menu rebuild tasks, refresh completions) queues behind them, so the UI freezes while the main thread itself samples idle.
Two amplifiers make this a release-day event for heavy users:
~/.claude/projects, 1.6 GB / 316 files ~/.codex/sessions); the post-bump rescan ran 7+ minutes at full tilt.
Field sample of that storm (v0.32.6-dev pre-fix, 1 ms interval, 5 s window, redacted to frame counts): 2,437 samples in the claude scan and 1,807 in the codex scan concurrently on two cooperative-pool threads, sustained across every capture window for ~3 minutes, while menu opens visibly froze, with the main thread parked in mach_msg the whole time.
Change
CostUsageScanExecutor: one serial DispatchQueue (com.steipete.codexbar.cost-usage-scan, utility QoS). run bridges Swift task cancellation into a @Sendable check handed to the scanners, and work cancelled while still queued resumes immediately with CancellationError instead of waiting behind an in-flight scan.
CostUsageFetcher.loadTokenSnapshot moves the scan + vertex-fallback + pi-merge block onto the executor; loadCachedCodexTokenSnapshot moves its multi-megabyte cache JSON decode there too (it previously ran on the pool via Task.detached).
Validation
swift test --filter CostUsageScanExecutorTests: 5 new tests (labeled-queue execution, error propagation, overlap serialization, in-flight cancellation through the bridged check, queued-work cancellation).
swift test --filter CostUsageScanExecutorLinuxTests: 2 Linux-compatible executor tests pass.
swift test --filter CostUsage: 183 tests in 18 suites pass.
make check: 0 violations; git diff --check clean.
Runtime Proof
A/B on the same machine, same protocol per phase: quit app → delete ~/Library/Caches/CodexBar/cost-usage → relaunch → sample CodexBar 5 1 at T+30s during the full-corpus rescan. Thread placement of the scan frames:
The scan can no longer occupy cooperative-pool threads at all, so async-runtime starvation by scans is structurally impossible rather than merely less likely. The post-fix full rescan completed and rewrote the per-provider caches normally (cost data intact).
Honesty / scope notes