
fix(gastown): distinguish null causes in PR status polling (#3149) #3154

Closed
jrf0110 wants to merge 1 commit into gastown-staging from convoy/gastown-observability-3149-fixes-re-slin/acf7c452/gt/maple/c412265a


jrf0110 commented May 9, 2026

Summary

Replaces the PRStatusResult | null return type of checkPRStatus with a discriminated PRStatusOutcome union, so that each cause previously collapsed into null now surfaces as a structured PRStatusError with an actionable failure message. Fixes #3149.

Key changes:

  • resolveGitHubToken now returns GitHubTokenResolution ({ ok: true, token, source } | { ok: false, tried }) instead of string | null, capturing the full resolution chain for the no-token error message.
  • checkPRStatus now returns PRStatusOutcome ({ ok: true, result } | { ok: false, error }) instead of PRStatusResult | null, with five discriminated error kinds: no_token, http_error, invalid_response, unrecognized_url, host_mismatch.
  • Error-specific failure thresholds: no_token and non-transient HTTP errors (401/403/404) fail immediately (1 strike); invalid_response/unrecognized_url/host_mismatch fail after 3 consecutive strikes; transient HTTP errors (5xx/429) keep the existing 10-strike behavior.
  • poll_null_count resets to 0 on successful poll at both call sites.
  • failureKind persisted to bead metadata for analytics; AE event pr.poll_failed emitted on terminal failure.
  • Updated all callers: ApplyActionContext.checkPRStatus type, Town.do.ts wrapper, refresh-git-token.handler.ts, and the three other resolveGitHubToken callers in town-scm.ts.
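
For readers unfamiliar with the pattern, the two return types described above might look roughly like the following. This is a sketch inferred from the description, not the code in this PR: only `ok`, `token`, `source`, `tried`, `result`, `error`, and the five `kind` values come from the text, while `httpStatus`, the `state` field, and `describeOutcome` are hypothetical.

```typescript
// Sketch only: shapes inferred from the PR description.
type GitHubTokenResolution =
  | { ok: true; token: string; source: string }
  | { ok: false; tried: string[] };

// `httpStatus` is an assumed field name; the five kinds are from the PR text.
type PRStatusError =
  | { kind: 'no_token'; tried: string[] }
  | { kind: 'http_error'; httpStatus: number }
  | { kind: 'invalid_response' }
  | { kind: 'unrecognized_url' }
  | { kind: 'host_mismatch' };

// Placeholder for the real result type, whose fields are not shown in the PR.
type PRStatusResult = { state: string };

type PRStatusOutcome =
  | { ok: true; result: PRStatusResult }
  | { ok: false; error: PRStatusError };

// Hypothetical caller: the compiler forces every error kind to be handled.
function describeOutcome(outcome: PRStatusOutcome): string {
  if (outcome.ok) return `ok: ${outcome.result.state}`;
  const error = outcome.error;
  switch (error.kind) {
    case 'no_token':
      return `no token (tried: ${error.tried.join(', ')})`;
    case 'http_error':
      return `HTTP ${error.httpStatus}`;
    case 'invalid_response':
      return 'invalid response';
    case 'unrecognized_url':
      return 'unrecognized URL';
    case 'host_mismatch':
      return 'host mismatch';
  }
}
```

The benefit over `PRStatusResult | null` is that a caller can no longer treat all failures alike without writing that decision down explicitly.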

Verification

  • Unit tests for checkPRStatus cover all five error kinds, transient/non-transient HTTP status discrimination, sampleKeys capture, and host mismatch.
  • Unit tests for resolveGitHubToken cover the resolution chain with and without tokens.
  • Unit tests for failureMessageFor, shouldFailImmediately, shouldCountAsTransient cover all error kinds and threshold decisions.
  • Integration test for no_token immediate-fail path through the DO alarm cycle.
  • pnpm --filter cloudflare-gastown typecheck passes cleanly.

Visual Changes

N/A

Reviewer Notes

The biggest change is in actions.ts: the poll_pr handler's error branch was rewritten from a single null-counting block into three discriminated paths (immediate-fail, transient threshold, non-transient threshold). AE events are emitted through ctx.emitEvent rather than by calling writeEvent directly, because ApplyActionContext doesn't expose env. The rig_id for the AE event is looked up from the bead's SQL row, since poll_pr actions don't carry it.
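
As a rough illustration of how the three paths could be selected, here is a minimal sketch. The names `shouldFailImmediately`, `shouldCountAsTransient`, and `PR_POLL_NON_TRANSIENT_THRESHOLD` appear in this PR, but their signatures here are guesses. `PollError`, `httpStatus`, `PR_POLL_TRANSIENT_THRESHOLD`, and `strikeLimitFor` are hypothetical. The PR does not say how other HTTP statuses (for example 400) are treated; this sketch assumes they fall into the 3-strike path.

```typescript
// Sketch only: a simplified error shape, not the PR's PRStatusError.
type PollError =
  | { kind: 'no_token' }
  | { kind: 'http_error'; httpStatus: number }
  | { kind: 'invalid_response' | 'unrecognized_url' | 'host_mismatch' };

const PR_POLL_TRANSIENT_THRESHOLD = 10; // hypothetical name; value from the PR text
const PR_POLL_NON_TRANSIENT_THRESHOLD = 3;

// no_token and 401/403/404 fail on the first strike.
function shouldFailImmediately(error: PollError): boolean {
  if (error.kind === 'no_token') return true;
  if (error.kind !== 'http_error') return false;
  return error.httpStatus === 401 || error.httpStatus === 403 || error.httpStatus === 404;
}

// 5xx and 429 are treated as transient.
function shouldCountAsTransient(error: PollError): boolean {
  if (error.kind !== 'http_error') return false;
  return error.httpStatus === 429 || (error.httpStatus >= 500 && error.httpStatus <= 599);
}

// Hypothetical helper: how many strikes an error kind is allowed.
function strikeLimitFor(error: PollError): number {
  if (shouldFailImmediately(error)) return 1;
  return shouldCountAsTransient(error)
    ? PR_POLL_TRANSIENT_THRESHOLD
    : PR_POLL_NON_TRANSIENT_THRESHOLD;
}
```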

Replace PRStatusResult | null return type with discriminated PRStatusOutcome
union in checkPRStatus. Each null cause (no token, HTTP error, invalid
response, unrecognized URL, host mismatch) now surfaces a structured
PRStatusError with actionable failure messages.

- resolveGitHubToken returns GitHubTokenResolution with resolution chain
- no_token and non-transient HTTP errors (401/403/404) fail immediately
- invalid_response/unrecognized_url/host_mismatch fail after 3 strikes
- Transient HTTP errors (5xx/429) keep existing 10-strike behavior
- poll_null_count resets to 0 on successful poll at both call sites
- failureKind persisted to bead metadata for analytics
- AE event pr.poll_failed emitted on terminal failure
- Unit tests for checkPRStatus, resolveGitHubToken, failureMessageFor,
  and threshold logic
- Integration test for no_token immediate-fail path

  switch (error.kind) {
    case 'no_token':
      return (
        `No GitHub token resolved for this town. Tried (in order): ` +

WARNING: The no_token failure message hardcodes "GitHub token" but this function is also called for GitLab no_token errors (when error.provider === 'gitlab'). A user polling a GitLab MR with no gitlab_token configured would see a misleading "No GitHub token resolved for this town" message.

Suggested change

  - `No GitHub token resolved for this town. Tried (in order): ` +
  + `No ${error.provider === 'gitlab' ? 'GitLab' : 'GitHub'} token resolved for this town. Tried (in order): ` +
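
If more provider-specific wording is needed later, the ternary could be pulled into a small helper. The sketch below assumes a `provider` field typed as `'github' | 'gitlab'` and joins the `tried` entries with commas. The real message tail is not shown in this diff, and `ScmProvider`, `NoTokenError`, `providerLabel`, and `noTokenMessage` are hypothetical names.

```typescript
// Sketch only: a hypothetical no_token error shape.
type ScmProvider = 'github' | 'gitlab';
type NoTokenError = { kind: 'no_token'; provider: ScmProvider; tried: string[] };

// Human-readable provider name for failure messages.
function providerLabel(provider: ScmProvider): string {
  return provider === 'gitlab' ? 'GitLab' : 'GitHub';
}

// Assumed formatting of the resolution chain: comma-separated, in order.
function noTokenMessage(error: NoTokenError): string {
  return (
    `No ${providerLabel(error.provider)} token resolved for this town. Tried (in order): ` +
    error.tried.join(', ')
  );
}
```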

err
);
}
} else if (!integrationId) {

WARNING: Missing tried entry when integrationId is set but GIT_TOKEN_SERVICE binding is absent. The condition else if (!integrationId) only fires when no integration ID is configured at all. When an integrationId exists but ctx.env.GIT_TOKEN_SERVICE is falsy, sourceLabel is never added to tried, so the resolution chain shown in the no_token error message silently omits a source that was nominally available. Add a third branch:

  } else if (integrationId && !ctx.env.GIT_TOKEN_SERVICE) {
    tried.push(`${sourceLabel} (GIT_TOKEN_SERVICE binding missing)`);
  } else if (!integrationId) {
    tried.push('platform integration (none configured)');
  }

),
];
const nullCount = Number(rows[0]?.null_count ?? 0);
if (nullCount >= PR_POLL_NON_TRANSIENT_THRESHOLD) {

WARNING: Both the transient (10-strike) path and this non-transient (3-strike) path share the same $.poll_null_count counter. This means 2 transient errors followed by 1 non-transient error will produce a poll_null_count of 3, which trips this threshold even though only 1 non-transient error has occurred. The word "consecutive" in the log message (line 1383) is also inaccurate — the counter is never reset between error-kind switches.

Consider using a separate counter key (e.g. $.poll_non_transient_count) for the non-transient path so the two thresholds are independently tracked, or document this shared-counter behavior explicitly if it is intentional.
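
A minimal sketch of the separate-counter option, assuming the counters are held in a plain object rather than in bead metadata. `PollCounters`, `PollEvent`, `applyPoll`, and `PR_POLL_TRANSIENT_THRESHOLD` are hypothetical names. In this sketch both counters reset only on a successful poll, so they count strikes since the last success rather than strictly consecutive ones.

```typescript
// Sketch only: independent strike counters for the two error classes.
type PollCounters = { transient: number; nonTransient: number };
type PollEvent = 'success' | 'transient_error' | 'non_transient_error';

const PR_POLL_TRANSIENT_THRESHOLD = 10; // hypothetical name; value from the PR text
const PR_POLL_NON_TRANSIENT_THRESHOLD = 3;

function applyPoll(
  counters: PollCounters,
  pollEvent: PollEvent,
): { counters: PollCounters; failed: boolean } {
  // A successful poll clears both counters.
  if (pollEvent === 'success') {
    return { counters: { transient: 0, nonTransient: 0 }, failed: false };
  }
  // Each error class increments only its own counter.
  const next: PollCounters =
    pollEvent === 'transient_error'
      ? { ...counters, transient: counters.transient + 1 }
      : { ...counters, nonTransient: counters.nonTransient + 1 };
  const failed =
    next.transient >= PR_POLL_TRANSIENT_THRESHOLD ||
    next.nonTransient >= PR_POLL_NON_TRANSIENT_THRESHOLD;
  return { counters: next, failed };
}
```

With this split, two transient errors followed by one non-transient error leave the counters at 2 and 1, and neither threshold trips.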


kilo-code-bot Bot commented May 9, 2026

Code Review Summary

Status: 3 Issues Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 3 |
| SUGGESTION | 0 |

Issue Details

WARNING

| File | Line | Issue |
| --- | --- | --- |
| services/gastown/src/dos/town/actions.ts | 343 | failureMessageFor hardcodes "No GitHub token" even for GitLab no_token errors (error.provider === 'gitlab') |
| services/gastown/src/dos/town/town-scm.ts | 62 | Missing tried entry in resolution chain when integrationId is set but GIT_TOKEN_SERVICE binding is absent |
| services/gastown/src/dos/town/actions.ts | 1381 | Transient and non-transient error paths share the same $.poll_null_count counter, allowing mixed-kind errors to falsely trip the 3-strike non-transient threshold |

Other Observations (not in diff)

No additional issues in unchanged code.

Files Reviewed (6 files)
  • services/gastown/src/dos/town/actions.ts — 2 issues
  • services/gastown/src/dos/town/town-scm.ts — 1 issue
  • services/gastown/src/handlers/refresh-git-token.handler.ts — no issues
  • services/gastown/test/integration/pr-poll-errors.test.ts — no issues
  • services/gastown/test/unit/pr-poll-errors.test.ts — no issues
  • services/gastown/test/unit/pr-poll-thresholds.test.ts — no issues



Reviewed by claude-4.6-sonnet-20260217 · 1,483,161 tokens


jrf0110 commented May 10, 2026

Superseded by #3160. Closing as duplicate from convoy retry loop.

jrf0110 closed this May 10, 2026