Add performance gates for incremental Codex cost scans #1434

Merged: steipete merged 1 commit into main from test/codex-cost-performance-gates on Jun 12, 2026

@steipete steipete commented Jun 11, 2026 (Owner)

Summary

  • add a regression gate proving unchanged Codex session corpora reuse the warm cache
  • add a regression gate proving priority-turn refreshes scan appended SQLite rows only
  • make both gates deterministic by asserting cache and cursor behavior instead of wall-clock ratios
  • keep the test-only change separate from production scanner behavior
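A deterministic cache-reuse gate of the kind described above can be sketched as follows. This is a minimal illustration, not the PR's test code: `CacheKey`, `SessionCostCache`, and the newline-separated integer "file format" are hypothetical stand-ins, and the sketch assumes the cache is keyed on path, size, and modification time. The gate rewrites the content while keeping that metadata identical, so a scan that honors the cache must return the old total without parsing again.

```swift
import Foundation

// Hypothetical stand-in for a per-file parse cache; not the real scanner types.
struct CacheKey: Hashable {
    let path: String
    let size: Int
    let modifiedAt: Date
}

final class SessionCostCache {
    private var totals: [CacheKey: Int] = [:]
    private(set) var parseCount = 0

    /// Returns the token total for a file, parsing only on a cache miss.
    func total(path: String, size: Int, modifiedAt: Date, contents: String) -> Int {
        let key = CacheKey(path: path, size: size, modifiedAt: modifiedAt)
        if let cached = totals[key] { return cached }
        parseCount += 1
        // Toy parser: one integer token count per line.
        let parsed = contents
            .split(separator: "\n")
            .compactMap { Int($0) }
            .reduce(0, +)
        totals[key] = parsed
        return parsed
    }
}

let cache = SessionCostCache()
let stamp = Date(timeIntervalSince1970: 1_000)

// Cold scan: the file is parsed once.
let cold = cache.total(path: "a.jsonl", size: 5, modifiedAt: stamp, contents: "10\n20")

// Same size and timestamp, different bytes: a warm scan that reuses the cache
// returns the old total and does not parse again.
let warm = cache.total(path: "a.jsonl", size: 5, modifiedAt: stamp, contents: "99\n99")
```

Asserting on `warm` and `parseCount` instead of elapsed time keeps the gate independent of machine speed: a regression that re-parses the file changes the returned total, not just the duration.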

Supersedes #1423 because its fork branch does not allow maintainer edits.

Proof

  • swift test --filter CostUsagePerformanceGateTests twice: 2 tests passed on both runs
  • make check: clean
  • git diff --check: clean
  • structured autoreview: clean, no accepted/actionable findings (0.86 confidence)

@clawsweeper clawsweeper Bot commented Jun 11, 2026

Codex review: needs maintainer review before merge. Reviewed June 11, 2026, 6:18 PM ET / 22:18 UTC.

Summary
Review failed before ClawSweeper could summarize the requested change.

Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path.

Review metrics: none identified.

Merge readiness
Overall: 🌊 off-meta tidepool
Proof: 🌊 off-meta tidepool
Patch quality: 🌊 off-meta tidepool
Result: rating does not apply to this item.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

  • [P1] No close action taken because the review did not complete.

Maintainer options:

  1. Decide the mitigation before merge
    Retry the Codex review after fixing the execution failure.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

  • Review did not complete, so no work-lane recommendation was made.

Review details

Best possible solution:

Retry the Codex review after fixing the execution failure.

Do we have a high-confidence way to reproduce the issue?

Unclear. The review failed before ClawSweeper could establish a reproduction path.

Is this the best way to solve the issue?

Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction.

AGENTS.md: unclear because the file could not be read completely.

Codex review notes: model internal, reasoning high; reviewed against 1912f75f4962.

Label changes

Label changes:

  • remove P3: Current review triage priority is none.
  • remove merge-risk: 🚨 automation: Current PR review selected no merge-risk labels.

Label justifications:

  • rating: 🌊 off-meta tidepool: Overall readiness is 🌊 off-meta tidepool; proof is 🌊 off-meta tidepool and patch quality is 🌊 off-meta tidepool.

Evidence reviewed

What I checked:

  • failure reason: retryable codex transport failure.
  • codex failure detail: Codex review failed for this PR with exit 1.
  • codex stderr (escaped newlines expanded):
      record.
      - The suite is .serialized so the two timed gates don't contend with each other.

      ## Runtime Proof (local run, Apple Silicon)

        PERF-GATE codex-session-corpus: cold=3004ms warm=17ms ratio=173x
        ✔ Test "warm codex refresh over an unchanged session corpus must not re-parse it" passed after 3.196 seconds.
        PERF-GATE priority-turns: full=25ms incremental=0ms ratio=95x
        ✔ Test "priority turns refresh must scan only appended trace rows" passed after 0.201 seconds.
        ✔ Test run with 2 tests in 1 suite passed after 3.398 seconds.

      Negative test: the gate fires when the protection is removed. Forcing state = nil per refresh (the pre-memo full-scan behavior) and rerunning:

        PERF-GATE priority-turns: full=25ms incremental=24ms ratio=1x
        Test "priority turns refresh must scan only appended trace rows" recorded an issue:
        Expectation failed: (refreshDuration * 5 → 0.1189...) < (fullDuration → 0.0254...)
        Suite CostUsagePerformanceGateTests failed after 0.186 seconds with 1 issue.

      Reverting the sabotage restores ratio=95x and the gate passes.

      ## Validation

      - swift test --filter CostUsagePerformanceGate: 2/2 pass.
      - make check: 0 violations.

      No CHANGELOG edit per current review guidance (release-owned).
  • codex stdout: No stdout captured.

Likely related people:

  • unknown: Codex failed before it could trace repository history. (role: review did not complete; confidence: low)

What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added labels on Jun 11, 2026:
  • rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
  • status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)
  • P3 (Low-risk cleanup, docs, polish, ergonomics, or speculative feature.)
  • merge-risk: 🚨 automation (🚨 Merging this PR could break CI, automerge, proof capture, label sync, or automation.)
@steipete steipete force-pushed the test/codex-cost-performance-gates branch from 82f5852 to a9acafa on June 11, 2026, 14:20
@steipete steipete commented (Owner, Author)

Rebased onto current main, fixed the timing-gate failure exposed by local validation, and revalidated exact head a9acafaf.

The original 200k-row fixture produced only a 4x full/incremental ratio on this machine. The fixture now uses 500k rows while retaining the 5x requirement; two consecutive runs passed with roughly 70 ms full scans and sub-millisecond incremental scans.
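The arithmetic behind a 5x timing gate can be sketched as below. This is an illustration only: `passesRatioGate` is a hypothetical helper, and the durations are made-up values shaped like the ones reported in this thread, not measurements. The comparison form `incremental * 5 < full` matches the expectation quoted in the captured review output.

```swift
/// True when the incremental refresh is more than `requiredSpeedup` times
/// faster than the full scan. Hypothetical helper for illustration.
func passesRatioGate(
    fullSeconds: Double,
    incrementalSeconds: Double,
    requiredSpeedup: Double = 5
) -> Bool {
    incrementalSeconds * requiredSpeedup < fullSeconds
}

// Illustrative numbers: a 4x speedup misses a 5x gate even though
// the incremental path itself has not regressed.
let fourTimes = passesRatioGate(fullSeconds: 0.040, incrementalSeconds: 0.010)

// Shape of the larger fixture: about 70 ms full, sub-millisecond incremental.
let largeFixture = passesRatioGate(fullSeconds: 0.070, incrementalSeconds: 0.0005)
```

Because the outcome depends on how large the full scan happens to be on a given machine, a gate like this needs a fixture big enough to keep the ratio well clear of the threshold.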

Proof:

  • swift test --filter CostUsagePerformanceGateTests — 2 tests passed twice
  • make check — clean
  • autoreview --mode branch --base origin/main — clean, no actionable findings

@clawsweeper clawsweeper Bot changed labels on Jun 11, 2026:
  • added rating: 🌊 off-meta tidepool (PR readiness rating does not apply to this item.)
  • removed rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
  • removed status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)
@steipete steipete force-pushed the test/codex-cost-performance-gates branch from a9acafa to 16ae060 on June 11, 2026, 14:59
@steipete steipete commented (Owner, Author)

Updated the PR head to 16ae06071814e1cdbce7a90b31e7732dc68c732c.

  • rebased onto current main
  • swift test --filter CostUsagePerformanceGateTests: 2 passed
  • measured unchanged-session cold/warm ratio: 433x
  • measured priority-turn full/incremental ratio: 116x
  • make check: clean
  • structured autoreview: clean, no actionable findings (0.86 confidence)

Fresh exact-head CI is now running.

@clawsweeper clawsweeper Bot changed labels on Jun 11, 2026:
  • added rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
  • added status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)
  • removed rating: 🌊 off-meta tidepool (PR readiness rating does not apply to this item.)
@steipete steipete force-pushed the test/codex-cost-performance-gates branch from 16ae060 to da0c456 on June 11, 2026, 15:56
@steipete steipete commented (Owner, Author)

Rebased on current main (88c43eeb) and pushed head da0c4562.

Autoreview found a CI-flakiness blocker, which I fixed: the gates no longer assert wall-clock ratios. They now prove cache reuse by changing old content while preserving cache metadata, and prove row-cursor behavior by changing an old SQLite row before appending a new one.
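The row-cursor check can be sketched in the same spirit. This is a hedged illustration, not the PR's test: `TraceRow` and `PriorityTurnScanner` are hypothetical names, an in-memory array stands in for the SQLite table, and rows are assumed to arrive in ascending row-ID order. Mutating a historical row before appending a new one means a full replay and a cursor-based refresh produce different totals.

```swift
// Hypothetical in-memory stand-in for the SQLite trace table.
struct TraceRow {
    let rowID: Int
    var tokens: Int
}

struct PriorityTurnScanner {
    private(set) var lastRowID = 0
    private(set) var totalTokens = 0

    /// Folds in only rows whose ID is past the stored cursor.
    /// Assumes `rows` is sorted by ascending row ID.
    mutating func refresh(from rows: [TraceRow]) {
        for row in rows where row.rowID > lastRowID {
            totalTokens += row.tokens
            lastRowID = row.rowID
        }
    }
}

var table = [TraceRow(rowID: 1, tokens: 100), TraceRow(rowID: 2, tokens: 200)]
var scanner = PriorityTurnScanner()
scanner.refresh(from: table)            // first pass reads rows 1 and 2

table[0].tokens = 9_999                 // mutate a historical row
table.append(TraceRow(rowID: 3, tokens: 50))
scanner.refresh(from: table)            // cursor-based pass reads row 3 only
```

If the refresh replayed the whole table, the mutated row would leak into the total, so asserting on the total detects a lost cursor without measuring time.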

Proof:

  • swift test --filter CostUsagePerformanceGateTests: 2 tests passed
  • make check: passed, 0 lint violations
  • final autoreview: clean, confidence 0.84
  • diff check: clean

@clawsweeper clawsweeper Bot changed labels on Jun 11, 2026:
  • added rating: 🌊 off-meta tidepool (PR readiness rating does not apply to this item.)
  • removed rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
  • removed status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)
@steipete steipete force-pushed the test/codex-cost-performance-gates branch from da0c456 to 69b2818 on June 11, 2026, 16:43
@steipete steipete commented (Owner, Author)

Rebased onto current main and replaced timing-ratio assertions with deterministic behavior gates.

Proof on 69b28189:

  • swift test --filter CostUsagePerformanceGateTests twice: 2 tests passed on both runs
  • make check: clean
  • git diff --check: clean
  • structured autoreview: no actionable findings (0.86 confidence)

Exact-head CI is now running.

@steipete steipete force-pushed the test/codex-cost-performance-gates branch from 69b2818 to 3df376e on June 11, 2026, 20:58
Co-authored-by: pickaxe <54486432+ProspectOre@users.noreply.github.com>
@steipete steipete force-pushed the test/codex-cost-performance-gates branch from 3df376e to c70bb0d on June 11, 2026, 22:15
@steipete steipete merged commit ff6c42d into main Jun 12, 2026
4 of 7 checks passed
@steipete steipete commented (Owner, Author)

Validated exact head c70bb0d367c45e09198c34c889cdf7488dc199a2.
Landed as ff6c42d47967ca0d4057fe5bbbea33ff0e26e5fc.

  • swift test --filter CostUsagePerformanceGateTests twice (2 tests each run)
  • make check (SwiftFormat and SwiftLint clean)
  • git diff --check origin/main...HEAD
  • released/fictitious model-name gate clean
  • structured autoreview clean with no actionable findings (0.94 confidence)
  • exact-head macOS, Linux x64, Linux arm64, and GitGuardian checks green
  • current-main merge-tree clean

The new deterministic gates prove that unchanged Codex session files retain cached parse results and that priority-turn refreshes process appended SQLite rows without replaying mutated historical rows.

Labels

  • merge-risk: 🚨 automation (🚨 Merging this PR could break CI, automerge, proof capture, label sync, or automation.)
  • P3 (Low-risk cleanup, docs, polish, ergonomics, or speculative feature.)
  • rating: 🌊 off-meta tidepool (PR readiness rating does not apply to this item.)
