feat(ops): forward-progress watchdog - catch "busy but not progressing" #457

Open
hadamrd wants to merge 1 commit into trunk from feat/progress-watchdog

@hadamrd hadamrd commented Jun 26, 2026

Owner

What

Adds scripts/progress-watchdog.sh, a repo-agnostic supervisor that watches forward progress (merges landing) rather than liveness alone.

Why

Event watchers (PR_OPEN / STALL / DEAD) are blind to a loop that is busy but not progressing: for example, one that spins for hours re-reviewing a single approved-but-blocked PR while every liveness check stays green. That blind spot once cost about 9 hours of wasted work. The invariant that actually matters is merges landing.

Tripwires (exit non-zero so a cron/systemd/operator/agent can react)

  • 10 loop entrypoint absent (DEAD). The check matches the real bin/forge-loop, not the string "forge-loop run" (that string also matches an interactive shell whose own command line contains it)
  • 11 stop-file present
  • 12 two-factor stall: no merge in 90 min AND the event log quiet for 30 min. A fresh event log means the loop is legitimately building or repairing a risk-gated PR that parks for human review. That counts as progress, not a stall, so requiring both conditions avoids false alarms
  • 13 an open PR piling up comments past a cap (the busy-but-stuck signature)
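
The two-factor rule behind tripwire 12 can be sketched as a pure function. This is an illustration only, not the code in scripts/progress-watchdog.sh: the function name `is_stalled`, its argument order, and the use of `>=` at the thresholds are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from progress-watchdog.sh.
# Args: now, last_merge_epoch, last_event_epoch, stall_secs, quiet_secs
# Returns 0 (stalled) only when BOTH factors hold:
#   1. no merge for at least stall_secs
#   2. the event log has been quiet for at least quiet_secs
is_stalled() {
  local now=$1 last_merge=$2 last_event=$3 stall_secs=$4 quiet_secs=$5
  local merge_age=$(( now - last_merge ))
  local event_age=$(( now - last_event ))
  (( merge_age >= stall_secs && event_age >= quiet_secs ))
}

# Example: last merge 100 min ago, but the event log was written 10 min ago.
# Only the first factor holds, so this is not reported as a stall.
now=$(date +%s)
if is_stalled "$now" "$(( now - 6000 ))" "$(( now - 600 ))" 5400 1800; then
  echo "stall"
else
  echo "progressing"
fi
```

Requiring both factors means a loop that is still writing events while a risk-gated PR waits for human review is never flagged on merge age alone.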

Usage

REPO_PATH=/path/to/checkout GH_REPO=owner/name scripts/progress-watchdog.sh
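
A supervisor could translate the exit status into an action along these lines. This is a minimal sketch: `describe_tripwire` is a hypothetical helper, it assumes the numbers 10 to 13 in the tripwire list are the script's exit codes, and the message strings are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical mapping from exit status to a human-readable reason.
# Assumes the tripwire numbers 10-13 are the script's exit codes.
describe_tripwire() {
  case "$1" in
    10) echo "loop entrypoint absent" ;;
    11) echo "stop-file present" ;;
    12) echo "two-factor stall" ;;
    13) echo "comment pile-up on an open PR" ;;
    *)  echo "unrecognized exit code $1" ;;
  esac
}

# Intended use (left commented so the block runs without the script present):
#   REPO_PATH=/path/to/checkout GH_REPO=owner/name scripts/progress-watchdog.sh
#   describe_tripwire "$?"
```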

All thresholds are env-overridable (STALL_SECS / QUIET_SECS / POLL_SECS / COMMENT_CAP). Merge polling uses git ls-remote, which needs no API token, so GH_TOKEN does not have to be set (setting it would otherwise break the gh CLI).
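
For readers unfamiliar with the pattern, env-overridable thresholds and token-free polling might look roughly like this. The STALL_SECS and QUIET_SECS defaults are inferred from the 90 min and 30 min figures above; the POLL_SECS and COMMENT_CAP values and the `remote_head` helper are placeholders invented for this sketch, not values from the script.

```shell
#!/usr/bin/env bash
# Each threshold keeps a value from the environment, else falls back to a default.
STALL_SECS="${STALL_SECS:-5400}"    # 90 min, inferred from the description
QUIET_SECS="${QUIET_SECS:-1800}"    # 30 min, inferred from the description
POLL_SECS="${POLL_SECS:-60}"        # placeholder, not from the script
COMMENT_CAP="${COMMENT_CAP:-20}"    # placeholder, not from the script

# Hypothetical helper: print the commit SHA at the tip of a remote branch.
# git ls-remote talks to the git remote directly, so no GitHub API token is
# involved (a private remote still needs normal git credentials).
remote_head() {
  local remote=$1 branch=$2
  git ls-remote "$remote" "refs/heads/$branch" | awk 'NR==1 {print $1}'
}

# Intended use: poll every POLL_SECS and treat a changed SHA as a merge landing.
#   before=$(remote_head origin trunk); sleep "$POLL_SECS"
#   after=$(remote_head origin trunk)
#   [ "$before" != "$after" ] && echo "merge landed"
```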

Test

  • bash -n syntax-clean
  • The same logic has been running live, supervising the pulsar loop, where it correctly caught a dead loop and a real risk-gated-PR review checkpoint. The two-factor change eliminated the false 'stall' alarms that the naive version produced

🤖 Generated with Claude Code

Event watchers (PR_OPEN/STALL/DEAD) are blind to a loop that is busy but not
making forward progress — e.g. spinning for hours re-reviewing one
approved-but-blocked PR while every liveness check stays green (cost ~9h once).
The invariant that matters is merges landing, not liveness.

scripts/progress-watchdog.sh watches forward progress and exits non-zero on a
tripwire so a supervisor/operator/agent can react:
- 10 loop entrypoint absent (matches the real `bin/forge-loop`, not the
  self-matching string "forge-loop run")
- 11 stop-file present
- 12 two-factor stall: no merge in 90min AND event log quiet 30min (a fresh
  event log = the loop is legitimately building/repairing a risk-gated PR that
  parks for human review, which IS progress)
- 13 an open PR piling up comments past a cap (busy-but-stuck signature)

Repo-agnostic via REPO_PATH/GH_REPO env; polls merges with token-free ls-remote.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>