Run cost-usage corpus scans off the Swift cooperative thread pool #1402

Merged
steipete merged 3 commits into steipete:main from ProspectOre:perf/cost-scan-off-cooperative-pool
Jun 11, 2026

Conversation

@ProspectOre

@ProspectOre ProspectOre commented Jun 10, 2026

Contributor

Summary

Run cost-usage corpus scans and persisted-cache decoding on a dedicated serial utility queue (CostUsageScanExecutor) instead of inline on Swift's cooperative thread pool, with task cancellation bridged into the scanner-level cancellation checks.

Context

This targets the systemic freeze mechanism behind the #1387 reports (and the environment of #1392). CostUsageScanner / PiSessionCostScanner scans are synchronous and can run for minutes on large session archives. CostUsageFetcher.loadTokenSnapshot executed them inline inside cooperative-pool tasks, which violates the pool's forward-progress contract: while a scan runs, it owns a pool thread for minutes, and overlapping provider scans (startup force-refresh racing the hourly token timer) own several. Every other async continuation in the process — menu rebuild tasks, refresh completions — queues behind them, so the UI freezes while the main thread itself samples idle.

Two amplifiers make this a release-day event for heavy users:

  • Cache schema bumps (e.g. the claude artifact moving v2 → v4 in 0.32.6-dev) invalidate the whole per-file scan cache, so the first launch after such a release rescans the full corpus.
  • The reporter machine for this PR holds a 2.5 GB corpus (903 MB ~/.claude/projects, 1.6 GB / 316 files ~/.codex/sessions); the post-bump rescan ran 7+ minutes at full tilt.

Field sample of that storm (v0.32.6-dev pre-fix, 1 ms interval, 5 s window, redacted to frame counts): 2,437 samples in the claude scan and 1,807 in the codex scan concurrently on two cooperative-pool threads, sustained across every capture window for ~3 minutes, while menu opens visibly froze — with the main thread parked in mach_msg the whole time.

Change

  • New CostUsageScanExecutor: one serial DispatchQueue (com.steipete.codexbar.cost-usage-scan, utility QoS). run bridges Swift task cancellation into a @Sendable check handed to the scanners, and work cancelled while still queued resumes immediately with CancellationError instead of waiting behind an in-flight scan.
  • CostUsageFetcher.loadTokenSnapshot moves the scan + vertex-fallback + pi-merge block onto the executor; loadCachedCodexTokenSnapshot moves its multi-megabyte cache JSON decode there too (it previously ran on the pool via Task.detached).
  • Serialization also removes concurrent provider scans hammering the same disk.
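
For readers unfamiliar with the pattern, the bridge from an async caller to a serial queue can be sketched roughly as below. This is a simplified illustration, not the PR's code: ScanExecutorSketch, CancelFlag, and the example.cost-usage-scan label are hypothetical names. Unlike the executor described above, this version only notices cancellation of queued work once the block is dequeued; resuming immediately would need extra state so that the continuation is resumed exactly once.

```swift
import Dispatch
import Foundation

/// Lock-protected flag shared between the cancellation handler and the queue block.
/// (Hypothetical helper for this sketch.)
final class CancelFlag: @unchecked Sendable {
    private let lock = NSLock()
    private var value = false

    var isSet: Bool {
        lock.lock(); defer { lock.unlock() }
        return value
    }

    func set() {
        lock.lock(); defer { lock.unlock() }
        value = true
    }
}

/// Hypothetical stand-in for a scan executor: one serial utility queue,
/// with Swift task cancellation bridged into a pollable check.
final class ScanExecutorSketch: @unchecked Sendable {
    private let queue = DispatchQueue(label: "example.cost-usage-scan", qos: .utility)

    /// Runs synchronous `work` on the serial queue and suspends the caller
    /// (without blocking a cooperative-pool thread) until it finishes.
    func run<T: Sendable>(
        _ work: @escaping @Sendable (_ isCancelled: @Sendable () -> Bool) throws -> T
    ) async throws -> T {
        let queue = self.queue
        let flag = CancelFlag()
        return try await withTaskCancellationHandler {
            try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<T, Error>) in
                queue.async {
                    // Work cancelled while queued is skipped once it reaches the front.
                    if flag.isSet {
                        continuation.resume(throwing: CancellationError())
                        return
                    }
                    // The scanner polls the check between units of work.
                    let result = Result { try work({ flag.isSet }) }
                    continuation.resume(with: result)
                }
            }
        } onCancel: {
            flag.set()
        }
    }
}
```

A caller would then write something like `try await executor.run { isCancelled in ... }` and poll `isCancelled()` between files, throwing CancellationError when it returns true.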

Validation

  • swift test --filter CostUsageScanExecutorTests — 5 new tests: labeled-queue execution, error propagation, overlap serialization, in-flight cancellation through the bridged check, queued-work cancellation.
  • swift test --filter CostUsageScanExecutorLinuxTests — 2 Linux-compatible executor tests pass.
  • swift test --filter CostUsage — 183 tests in 18 suites pass.
  • make check — 0 violations; git diff --check clean.
  • Structured autoreview — no accepted/actionable findings.
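
The overlap-serialization property can be checked with a probe along these lines. This is a minimal sketch against a plain DispatchQueue; OverlapProbe and peakOverlap are hypothetical helpers, not the names used in CostUsageScanExecutorTests.

```swift
import Dispatch
import Foundation

/// Counts how many jobs are running at once and remembers the peak.
final class OverlapProbe: @unchecked Sendable {
    private let lock = NSLock()
    private var active = 0
    private var peakValue = 0

    func enter() {
        lock.lock(); defer { lock.unlock() }
        active += 1
        peakValue = max(peakValue, active)
    }

    func leave() {
        lock.lock(); defer { lock.unlock() }
        active -= 1
    }

    var peak: Int {
        lock.lock(); defer { lock.unlock() }
        return peakValue
    }
}

/// Submits `jobs` short blocking jobs and reports the highest concurrency seen.
func peakOverlap(on queue: DispatchQueue, jobs: Int) -> Int {
    let probe = OverlapProbe()
    let group = DispatchGroup()
    for _ in 0..<jobs {
        queue.async(group: group) {
            probe.enter()
            Thread.sleep(forTimeInterval: 0.01) // stand-in for a synchronous scan
            probe.leave()
        }
    }
    group.wait()
    return probe.peak
}
```

On a serial queue the peak should never exceed 1; on a concurrent queue it may be higher, which is the overlap the executor is meant to rule out.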

Runtime Proof

A/B on the same machine, same protocol per phase: quit app → delete ~/Library/Caches/CodexBar/cost-usage → relaunch → sample CodexBar 5 1 at T+30s during the full-corpus rescan. Thread placement of the scan frames:

# BEFORE (current main scan code)
3515 Thread_1092763   DispatchQueue_12: com.apple.root.utility-qos.cooperative  (concurrent)
     ... CostUsageScanner.loadClaudeDaily / CostUsageJsonl.scan ...

# AFTER (this branch)
3191 Thread_1094843   DispatchQueue_267: com.steipete.codexbar.cost-usage-scan  (serial)
     ... CostUsageScanner.loadClaudeDaily / CostUsageJsonl.scan ...
# zero scan frames on cooperative-pool threads

The scan can no longer occupy cooperative-pool threads at all, so async-runtime starvation by scans is structurally impossible rather than merely less likely. The post-fix full rescan completed and rewrote the per-provider caches normally (cost data intact).
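
Queue placement can also be asserted from inside the process, complementing the external sample output. The sketch below is an illustration only and assumes nothing about how the PR's labeled-queue test is written; TaggedQueue is a hypothetical wrapper around the real DispatchQueue.setSpecific / getSpecific APIs.

```swift
import Dispatch

/// Hypothetical wrapper: a serial utility queue tagged with a queue-specific key,
/// so code can ask whether it is currently executing on that queue.
final class TaggedQueue {
    private let key = DispatchSpecificKey<Bool>()
    let queue: DispatchQueue

    init(label: String) {
        queue = DispatchQueue(label: label, qos: .utility)
        queue.setSpecific(key: key, value: true)
    }

    /// True only when called from a block running on `queue`.
    var isCurrent: Bool {
        DispatchQueue.getSpecific(key: key) == true
    }
}
```

Where a hard failure is preferable to a Boolean, `dispatchPrecondition(condition: .onQueue(queue))` traps when the calling code is not on the expected queue.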

Honesty / scope notes

  • This makes the app responsive during scans; it does not make scans cheaper. Steady-state ticks stay incremental via the existing per-file mtime/size cache; the cost of a full rescan after a schema bump is refresh-policy territory, tracked in #1392 (Optimize Codex cost refresh policy for large history windows).
  • Serializing providers can lengthen total storm wall-time on multi-provider setups (scans no longer overlap), and the 10-minute per-provider timeout now includes time queued behind another provider's scan. On very large corpora that could surface as a timeout that previously squeaked through; the next tick resumes against the warm per-file cache.
  • The 1.6s AX click-roundtrip measured in both phases is automation overhead, not app latency; the responsiveness claim rests on the structural thread-placement proof plus the documented forward-progress contract of the cooperative pool.
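
To make the incremental-versus-full-rescan distinction concrete, here is a rough sketch of how a per-file mtime/size cache with a schema version might decide what to rescan. The types, field names, and the version constant are assumptions for illustration; the actual cache layout in CodexBar is not shown in this PR.

```swift
import Foundation

/// Hypothetical per-file fingerprint: a file is rescanned when either field changes.
struct FileStamp: Codable, Equatable {
    var modifiedAt: Date
    var size: Int64
}

/// Hypothetical cache envelope with a schema version.
struct ScanCacheSketch: Codable {
    static let currentVersion = 4 // illustrative value only

    var version: Int
    var files: [String: FileStamp]

    /// Returns the paths that need rescanning, sorted for stable output.
    func stalePaths(current: [String: FileStamp]) -> [String] {
        // A schema bump invalidates everything: the whole corpus is stale.
        guard version == Self.currentVersion else { return current.keys.sorted() }
        // Otherwise only files whose mtime or size changed (or that are new) are stale.
        return current.filter { files[$0.key] != $0.value }.keys.sorted()
    }
}
```

With a warm cache only changed files come back as stale, which is why steady-state ticks stay cheap, while a version mismatch returns every path and produces the first-launch rescan described above.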

@ProspectOre

Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 10, 2026


🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

@ProspectOre

Contributor Author

@clawsweeper re-run

@clawsweeper

clawsweeper Bot commented Jun 10, 2026


🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

Contributor

Local validation on 136b226469b9343a80cb78cb9aa43f1bdbf23a26:

git diff --check HEAD^ HEAD
HOME=$PWD/.home CLANG_MODULE_CACHE_PATH=$PWD/.module-cache swift test --filter CostUsageScanExecutorTests
HOME=$PWD/.home CLANG_MODULE_CACHE_PATH=$PWD/.module-cache swift test --filter CostUsage

Result:

  • git diff --check: passed
  • CostUsageScanExecutorTests: passed, 5 tests in 1 suite
  • CostUsage: passed, 181 tests in 17 suites
  • Environment: /Applications/Xcode.app/Contents/Developer, Apple Swift 6.3.2

The executor-focused tests covered the core behavior this PR is adding: dedicated queue execution, overlapping scan serialization, error propagation, queued-work cancellation, and in-flight cancellation through the bridged cancellation check. The wider CostUsage filter also passed, so the scan executor change did not break the surrounding cost usage cache/scanner/fetcher coverage in this local run.

Scope note: I did not reproduce the author's large-corpus runtime sample locally; this is focused local regression validation plus the broader CostUsage suite.

@Yuxin-Qiao

Contributor

Checked the current merge conflict against origin/main (02c94032). After deepening the shallow checkout, git merge --no-commit --no-ff origin/main reports only one unresolved file: CHANGELOG.md.

The conflict is limited to the 0.32.6 — Unreleased / Fixed list: this PR’s cost-usage scan changelog line needs to be kept alongside the newer main cost-history entry from #1370. I did not see source-level unresolved conflicts in this merge check.

So the rebase looks low-risk: keep both changelog items under 0.32.6 — Unreleased / Fixed, then rerun the focused CostUsage tests/CI. I reset the local merge state after inspection.

ProspectOre and others added 2 commits June 11, 2026 04:40
CostUsageScanner and PiSessionCostScanner scans execute synchronously
for minutes on large session archives. Running them inline on
cooperative-pool task threads starves every other async task in the
process: menus freeze while the main thread sits idle, and overlapping
provider scans multiply the pressure. Field samples on a 2.5GB corpus
showed both provider scans saturating pool threads for 7+ minutes after
a cache schema bump while menu opens stalled.

All corpus scans and persisted-cache decoding now run on one dedicated
serial utility queue (CostUsageScanExecutor), with task cancellation
bridged into the scanner-level cancellation checks. Serialization also
removes concurrent provider scans racing the same disk.
@steipete steipete force-pushed the perf/cost-scan-off-cooperative-pool branch from 136b226 to 01350b2 Compare June 11, 2026 03:46
@clawsweeper

clawsweeper Bot commented Jun 11, 2026


Codex review: needs maintainer review before merge. Reviewed June 10, 2026, 11:57 PM ET / 03:57 UTC.

Summary
The PR moves synchronous cost-usage corpus scans, provider fallback/merge work, and persisted-cache decoding onto a dedicated serial utility queue with task-cancellation bridging and focused macOS/Linux tests.

Reproducibility: yes. The PR documents a concrete cache-clear, relaunch, and process-sampling path on a real 2.5 GB corpus, with before/after stack placement demonstrating the current-main starvation mechanism and its removal.

Review metrics: 3 noteworthy metrics.

  • Patch surface: 5 files; +365/-39. The change is bounded to cost-usage execution, focused tests, and release context.
  • Focused tests: 7 added. Five macOS and two Linux tests directly exercise scheduling, serialization, errors, and cancellation.
  • Runtime corpus: 2.5 GB sampled. The contributor validated the actual starvation path on a workload large enough to sustain multi-minute scans.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Confirm whether serial queue wait should count against each provider's existing timeout budget.

Risk before merge

  • [P2] One global serial queue prevents concurrent disk-heavy provider scans, but a later provider can wait behind an earlier multi-minute scan; if an outer ten-minute timeout includes that queue wait, an unusually large multi-provider setup may time out where concurrent scans previously completed.
  • [P1] The patch deliberately improves responsiveness rather than total scan cost, so full rescans after cache invalidation can still run for minutes and remain dependent on the separate refresh-policy work tracked in #1392 (Optimize Codex cost refresh policy for large history windows).

Maintainer options:

  1. Accept the documented timeout edge case (recommended)
    Merge after explicitly accepting that extreme multi-provider full rescans may time out and recover on a later refresh using warmed per-file caches.
  2. Preserve each provider's scan budget
    Move timeout accounting around active executor work rather than queue wait and add focused coverage for overlapping long provider scans.
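
The difference between the two options comes down to which interval is charged against the budget. A small sketch of the arithmetic, with hypothetical type and function names and illustrative durations (only the ten-minute budget comes from the PR discussion):

```swift
import Foundation

/// How long a provider's scan waited in the serial queue and how long it actually ran.
struct ScanTiming {
    var queuedFor: TimeInterval
    var activeFor: TimeInterval
}

/// The two accounting choices discussed above (names are hypothetical).
enum TimeoutPolicy {
    case includesQueueWait // option 1: timeout covers wait plus work
    case activeWorkOnly    // option 2: timeout covers only executing work
}

/// Returns true when the scan would exceed its budget under the given policy.
func timesOut(_ timing: ScanTiming, budget: TimeInterval, policy: TimeoutPolicy) -> Bool {
    switch policy {
    case .includesQueueWait:
        return timing.queuedFor + timing.activeFor > budget
    case .activeWorkOnly:
        return timing.activeFor > budget
    }
}
```

For example, a provider that waits 7 minutes behind another scan and then runs for 4 minutes exceeds a 10-minute budget under the first policy (11 minutes charged) but not under the second (4 minutes charged).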

Next step before merge

  • [P2] The code appears correct and proof is sufficient; the next action is maintainer acceptance or refinement of the disclosed timeout compatibility behavior, not an automated repair.

Security
Cleared: The internal scheduling patch adds no dependencies, downloads, workflow permissions, credential handling, package-resolution changes, or other concrete security or supply-chain concerns.

Review details

Best possible solution:

Keep the dedicated off-pool executor, while explicitly accepting the extreme-corpus timeout behavior or preserving each provider's active-scan timeout budget by excluding serial-queue wait from it.

Do we have a high-confidence way to reproduce the issue?

Yes. The PR documents a concrete cache-clear, relaunch, and process-sampling path on a real 2.5 GB corpus, with before/after stack placement demonstrating the current-main starvation mechanism and its removal.

Is this the best way to solve the issue?

Yes. Moving long synchronous scans to a dedicated serial utility queue is the narrow structural solution to cooperative-pool starvation; the only unresolved choice is whether to accept or adjust the changed timeout accounting for queued providers.

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against febf562741e5.

Label changes

Label justifications:

  • P2: This addresses severe but workload-dependent menu responsiveness degradation with a focused fix and limited blast radius.
  • merge-risk: 🚨 compatibility: Serializing all provider scans can reduce the effective timeout budget for a later provider in unusually large existing setups.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (logs): The PR contains after-fix real-app process samples showing scan frames on the dedicated serial queue, none on cooperative-pool workers, and successful cache completion on the same large corpus.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR contains after-fix real-app process samples showing scan frames on the dedicated serial queue, none on cooperative-pool workers, and successful cache completion on the same large corpus.
Evidence reviewed

What I checked:

  • Current-main problem path: Current main performs the cancellable cost scanners synchronously inside the async token-snapshot path, so a large corpus scan can occupy a Swift cooperative-pool worker for its full duration. (Sources/CodexBarCore/CostUsageFetcher.swift:146, 920997c6a365)
  • Dedicated executor: The branch introduces a serial utility DispatchQueue and a locked lifecycle state machine covering pre-install cancellation, queued cancellation, running cancellation checks, and one-time completion. (Sources/CodexBarCore/CostUsageScanExecutor.swift:9, e7aa65290b69)
  • Fetcher integration: The synchronous scanner, Vertex fallback, Pi report merge, and multi-megabyte persisted-cache decode move behind the executor while preserving the existing report-selection and merge behavior. (Sources/CodexBarCore/CostUsageFetcher.swift:158, e7aa65290b69)
  • Cancellation and serialization coverage: Five macOS tests cover queue identity, error propagation, overlap serialization, in-flight cancellation, and immediate queued cancellation; two Linux tests cover execution and cancellation. (Tests/CodexBarTests/CostUsageScanExecutorTests.swift:5, e7aa65290b69)
  • Independent validation: A contributor independently reported the five executor tests and the broader CostUsage suite passing on Apple Swift 6.3.2, while noting that the author's large-corpus runtime sample was not reproduced locally. (136b226469b9)
  • Real behavior proof: The PR body reports before/after samples from the same real 2.5 GB corpus: scan frames moved from concurrent cooperative utility workers to the named serial queue, with zero post-fix scan frames on the cooperative pool and normal cache rewrites. (Sources/CodexBarCore/CostUsageScanExecutor.swift:10, e7aa65290b69)

Likely related people:

  • steipete: He is the dominant contributor to the cost fetcher/scanner history, introduced the current released path, and authored this PR's cancellation and test-isolation refinements. (role: feature owner and recent area contributor; confidence: high; commits: 920997c6a365, 01350b23cc6f, e7aa65290b69; files: Sources/CodexBarCore/CostUsageFetcher.swift, Sources/CodexBarCore/CostUsageScanner.swift, Sources/CodexBarCore/CostUsageScanExecutor.swift)
  • Yuxin-Qiao: They ran the executor-focused and broader CostUsage test suites and inspected the earlier merge conflict, providing useful independent review context. (role: independent validator; confidence: medium; commits: 136b226469b9; files: Sources/CodexBarCore/CostUsageScanExecutor.swift, Tests/CodexBarTests/CostUsageScanExecutorTests.swift)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added labels on Jun 11, 2026:
  • proof: sufficient (Contributor real behavior proof is sufficient.)
  • rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
  • status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)
  • P2 (Normal priority bug or improvement with limited blast radius.)
  • merge-risk: 🚨 compatibility (🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades.)
@clawsweeper

clawsweeper Bot commented Jun 11, 2026


🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@steipete steipete merged commit 4c964cc into steipete:main Jun 11, 2026
4 checks passed
@steipete

Owner

Exact-head CI proof completed after merge:

  • macOS lint-build-test: passed
  • Linux x64 build/test/smoke: passed
  • Linux arm64 build/test/smoke: passed
  • GitGuardian: passed

Run: https://github.com/steipete/CodexBar/actions/runs/27322622815
Merged as 4c964ccc91e0f40bb713b14b58b8e8f901a44c68.


Labels

  • merge-risk: 🚨 compatibility (🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades.)
  • P2 (Normal priority bug or improvement with limited blast radius.)
  • proof: sufficient (Contributor real behavior proof is sufficient.)
  • rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
  • status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)


3 participants