feat(classify): skip_label input + gh retry resilience; non-fatal annotate#95

Merged
topcoder1 merged 1 commit into main from fix/classify-resilience-and-skip-label on Jun 10, 2026
@topcoder1

Owner

Root cause this responds to

On 2026-06-10, 15:20–16:21Z, GitHub had a platform incident: "sporadic authentication failures, impacting approximately 15% of API traffic. Erroneous 401 responses" (githubstatus.com). Valid GITHUB_TOKENs were intermittently treated as anonymous. In whois-api-llc/disposable-email-domains (the first blocked-path PR since the codex-review install), this turned classify / Classify PR Risk, a required check there, red on two consecutive pushes (runs 27288936851, 27289393205, 27289393136). Within the same job and with the same token, gh pr edit succeeded at 16:02:59 and a gh write returned 401 at 16:03:01. A rerun 20h later passed with no changes, so there was no token-plumbing bug (the suspected missing GH_TOKEN/secrets: inherit was a red herring; github.token was set at every step).

The incident exposed two real structural weaknesses, fixed here:

Changes

  1. retry on every gh call that gates the job (3 attempts, 5s/10s backoff). Stdout is buffered per attempt so command substitutions never capture error bodies from failed attempts. Assuming failures are independent across attempts, this turns a 15% per-call failure rate into roughly 0.3% (0.15³ ≈ 0.34%).
  2. Annotate step is now truly non-fatal — its own comment has always promised "this step does NOT exit non-zero", but a gh failure violated that on 2026-06-10. Comment-write failures downgrade to ::warning:: (it's a policy notice; gating lives in the label + claude-author-automerge regex).
  3. New skip_label input — fleet repos run this reusable from TWO workflows (standalone PR Classify + embedded in PR Codex Review), racing to write the same label and sticky comment, and producing two check contexts both named classify / Classify PR Risk (polluting required-check evaluation). Embedded callers will pass skip_label: true and just consume the risk_class output. Follow-up: fleet caller-template rollout renames the embedded job to codex-classify so the contexts stop colliding.
  4. Fix stale 'exits 1' comment in this repo's own codex caller (exit-1 dropped in fix(classify): drop false-red exit-1 on blocked class #21). NOTE: this repo's own caller does not get skip_label: true — it's the only labeler here (no standalone classify caller in ci-workflows).
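
For item 2, a minimal sketch of the "downgrade to ::warning::" pattern, not the code in this PR. The names `annotate_nonfatal` and `write_comment` are hypothetical, and the real step presumably calls gh where the stand-in is used here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper: run a comment-writing command, but never fail the step.
annotate_nonfatal() {
  if "$@"; then
    return 0
  fi
  # GitHub Actions workflow command: shows up as a warning annotation on the run.
  echo "::warning::annotate: could not write the policy-notice comment (non-fatal)"
  return 0
}

# Stand-in for a gh comment write that hits an erroneous 401 (assumption for the demo).
write_comment() {
  echo "HTTP 401: Bad credentials" >&2
  return 1
}

out="$(annotate_nonfatal write_comment 2>/dev/null)"
status=$?
echo "status=${status}"
echo "out=${out}"
```

The wrapper always returns 0, so a failed comment write cannot turn the job red; gating stays with the label and the automerge regex, as described above.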

Backward compatible: skip_label defaults to false; existing callers see retry hardening only.
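
A rough sketch of how a `skip_label` gate could look inside the step's shell, assuming the input reaches the script as a plain `true`/`false` string. The function name `emit_and_maybe_label` is hypothetical, and the `echo` lines stand in for the real output and label writes.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical step body: always emit risk_class, write the label only when not skipped.
emit_and_maybe_label() {
  local skip_label="$1" risk_class="$2"
  echo "risk_class=${risk_class}"        # a real step would append this to "$GITHUB_OUTPUT"
  if [[ "$skip_label" == "true" ]]; then
    echo "label: skipped"                # embedded caller: leave label + sticky comment alone
  else
    echo "label: risk:${risk_class}"     # standalone caller: the single writer of the label
  fi
}

embedded="$(emit_and_maybe_label true blocked)"
standalone="$(emit_and_maybe_label false blocked)"
printf '%s\n---\n%s\n' "$embedded" "$standalone"
```

Both callers still receive the same `risk_class` value; only the write side differs, which is what keeps existing callers (default `false`) unaffected.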

Verification

  • actionlint (incl. shellcheck on run blocks): PASS
  • retry helper unit-tested locally: flaky-cmd (fail×2 with garbage stdout, then success) → substitution captured only clean output, 3 attempts; always-fail in $(… || true) → empty capture, no script abort
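
The local test described above could be reproduced along these lines. This is a sketch under assumptions, not the helper shipped in this PR: `retry`, `RETRY_DELAYS`, and `flaky_cmd` are illustrative names, and the delays are overridden to zero so the demo runs instantly.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical retry helper: runs "$@" up to 3 times, waiting 5s then 10s between
# attempts. RETRY_DELAYS (an assumption) overrides the waits, e.g. "0 0" for tests.
retry() {
  local -a delays
  read -r -a delays <<<"${RETRY_DELAYS:-5 10}"
  local max=$(( ${#delays[@]} + 1 )) attempt=1 rc=0 buf
  buf="$(mktemp)"
  while true; do
    # Buffer stdout per attempt so output from a failed attempt never reaches the caller.
    if "$@" >"$buf"; then
      cat "$buf"
      rm -f "$buf"
      return 0
    else
      rc=$?
    fi
    if (( attempt >= max )); then
      rm -f "$buf"
      echo "retry: giving up after ${attempt} attempts: $*" >&2
      return "$rc"
    fi
    echo "retry: attempt ${attempt} failed (exit ${rc}); sleeping ${delays[attempt-1]}s" >&2
    sleep "${delays[attempt-1]}"
    attempt=$(( attempt + 1 ))
  done
}

# Stand-in for a flaky gh call: fails twice with garbage on stdout, then succeeds.
state="$(mktemp)"
echo 0 >"$state"
flaky_cmd() {
  local n
  n=$(( $(cat "$state") + 1 ))
  echo "$n" >"$state"
  if (( n < 3 )); then
    echo '{"message":"Bad credentials"}'
    return 1
  fi
  echo "clean-output"
}

RETRY_DELAYS="0 0"
captured="$(retry flaky_cmd)"      # only the successful attempt's stdout is captured
attempts="$(cat "$state")"
empty="$(retry false || true)"     # always-fail case: empty capture, script keeps going
echo "captured=${captured} attempts=${attempts} empty='${empty}'"
```

Writing each attempt to a temp file and printing it only on success is one way to get the "buffered stdout" property; progress messages go to stderr so they never end up in a command substitution.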

Codex pre-review: PASS — no P1. Two P2s: (1) gh pr view in label step un-retried → fixed in this PR (wrapped in retry); (2) pre-existing word-splitting if a risk label name contained spaces → declined, label names come from the classifier's fixed enum (no spaces possible).
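
To illustrate the declined P2, here is a small demonstration of shell word-splitting. The label `risk:needs review` is invented for the demo and is not part of the classifier's enum; `count_args` is a hypothetical helper.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Report how many arguments the shell actually passed.
count_args() { echo "$#"; }

enum_label="risk:blocked"            # enum-style label, no spaces
spaced_label="risk:needs review"     # hypothetical label with a space (not in the enum)

# Unquoted expansions are deliberate here, to show the splitting behaviour.
unquoted_enum="$(count_args $enum_label)"
unquoted_spaced="$(count_args $spaced_label)"
quoted_spaced="$(count_args "$spaced_label")"

echo "unquoted enum: ${unquoted_enum}, unquoted spaced: ${unquoted_spaced}, quoted spaced: ${quoted_spaced}"
```

An unquoted expansion only misbehaves when the value contains whitespace (or glob characters), which is why a fixed, space-free enum makes the finding harmless in practice; quoting would still be the more defensive choice.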

Auto-merge rationale: manual click-merge — .github/workflows/** is on the high-risk list, and this is the fleet-wide reusable: a bad merge breaks classify on every PR in 40+ repos until reverted.

🤖 Generated with Claude Code

…otate

The 2026-06-10 GitHub incident ('erroneous 401 responses', ~15% of API
traffic) turned classify red fleet-wide: any unguarded gh call that lost
the coin flip failed the job, including the policy-notice comment step
that documents itself as never exiting non-zero.

- retry (3 attempts, 5s/10s backoff) on every gh call that gates the job;
  buffered stdout so command substitutions never capture error bodies
  from failed attempts
- annotate step now downgrades gh failures to ::warning:: (it is a
  policy notice, not a gate — the label and automerge regex gate)
- new skip_label input: embedded classify jobs (pr-codex-review callers)
  only need the risk_class output; the standalone PR Classify caller owns
  the label + sticky comment, ending duplicate writes and write races
- fix stale 'exits 1' comment in our own codex caller (exit-1 removed
  in #21)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions


Coverage Floor — mode: enforce

| metric | value |
| --- | --- |
| measured | 100.0% |
| floor (current) | 99.0% |
| target | 100.0% |
| last bumped | 2026-05-12 |

@claude

claude Bot commented Jun 10, 2026


No issues found. Retry logic, skip_label gating, and non-fatal annotate are all correctly implemented; backwards compatibility is preserved.

@topcoder1 topcoder1 merged commit 1efb9b6 into main Jun 10, 2026
11 checks passed
@topcoder1 topcoder1 deleted the fix/classify-resilience-and-skip-label branch June 10, 2026 17:20
topcoder1 added a commit to topcoder1/nanoclaw that referenced this pull request Jun 11, 2026
…86)

Refreshes the pr-codex-review.yml caller to the current template
(topcoder1/ci-workflows#95 companion).

**What changes:**
- Embedded classify job renamed `classify` → `codex-classify`: its check
context no longer collides with the standalone PR Classify workflow's
`classify / Classify PR Risk`, which is a required status check in some
rulesets. The 2026-06-10 GitHub 401 incident reddened the duplicate
context and polluted required-check evaluation.
- Passes `skip_label: true`: this embedded run only computes the
`risk_class` output. The standalone PR Classify workflow remains the
single writer of the `risk:*` label and the blocked-PR sticky comment
(no more write races).
- Adds `reopened` to trigger types (parity with the standalone classify
caller).

**Behavior unchanged:** Codex review still runs on risk:sensitive and
risk:blocked PRs; automerge gating (label + path regex) is untouched.

**Auto-merge rationale:** manual click-merge — touches
`.github/workflows/**` (fleet policy).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
topcoder1 added a commit to whois-api-llc/ProfessionalServices that referenced this pull request Jun 11, 2026
…14)

Refreshes the pr-codex-review.yml caller to the current template
(topcoder1/ci-workflows#95 companion).

**What changes:**
- Embedded classify job renamed `classify` → `codex-classify`: its check
context no longer collides with the standalone PR Classify workflow's
`classify / Classify PR Risk`, which is a required status check in some
rulesets. The 2026-06-10 GitHub 401 incident reddened the duplicate
context and polluted required-check evaluation.
- Passes `skip_label: true`: this embedded run only computes the
`risk_class` output. The standalone PR Classify workflow remains the
single writer of the `risk:*` label and the blocked-PR sticky comment
(no more write races).
- Adds `reopened` to trigger types (parity with the standalone classify
caller).

**Behavior unchanged:** Codex review still runs on risk:sensitive and
risk:blocked PRs; automerge gating (label + path regex) is untouched.

**Auto-merge rationale:** manual click-merge — touches
`.github/workflows/**` (fleet policy).

🤖 Generated with [Claude Code](https://claude.com/claude-code)