Skip to content

fix(gastown): distinguish null causes in PR status polling (#3149)#3157

Closed
jrf0110 wants to merge 3 commits intomainfrom
convoy/gastown-observability-3149-fixes-re-slin/acf7c452/gt/toast/c412265a
Closed

fix(gastown): distinguish null causes in PR status polling (#3149)#3157
jrf0110 wants to merge 3 commits intomainfrom
convoy/gastown-observability-3149-fixes-re-slin/acf7c452/gt/toast/c412265a

Conversation

@jrf0110
Copy link
Copy Markdown
Contributor

@jrf0110 jrf0110 commented May 9, 2026

Summary

Replace checkPRStatus's PRStatusResult | null return with a discriminated PRStatusOutcome union ({ ok: true, result } | { ok: false, error }) so that each failure mode surfaces a specific, actionable error instead of collapsing into a generic "null" count.

Key changes:

  • Discriminated union: PRStatusError variants: no_token, unrecognized_url, http_error, invalid_response, host_mismatch — each with structured fields (resolution chain, HTTP status, transient flag, sample keys).
  • resolveGitHubToken returns GitHubTokenResolution ({ ok: true, token, source } | { ok: false, tried }) capturing which credential sources were attempted, so no_token errors can enumerate exactly what the user should configure.
  • Three-bucket error handling: no_token and non-transient HTTP errors (401/403/404) fail immediately (1 strike); invalid_response/unrecognized_url/host_mismatch fail after 3 strikes; transient HTTP 5xx/429 keep the existing 10-strike behavior.
  • Specific failure messages per error kind (e.g., "Town's GitHub token is invalid or expired (HTTP 401)" vs generic null message).
  • poll_null_count reset on successful poll.
  • AE telemetry: pr.poll_failed event emitted on terminal failure with failureKind, provider, and HTTP status.
  • failureKind persisted to bead metadata for analytics grouping.

Fixes #3149.

Verification

Manual verification:

  • Confirmed all 40 PR-poll unit tests pass (error discrimination, failure messages, threshold logic).
  • Confirmed the integration test for no_token end-to-end through the DO alarm passes.
  • Full unit test suite: 237 passed, 2 pre-existing failures unrelated to this change (container/plugin/client.test.ts JWT assertion).

Visual Changes

N/A

Reviewer Notes

  • The platformIntegrationId gap in Town.do.ts's checkPRStatus context is pre-existing (not introduced by this change) — the SCMContext passed to scm.checkPRStatus doesn't include it, which means rig-level platform integrations aren't tried during PR polling. This is out of scope for this fix.
  • The 2 pre-existing test failures in container/plugin/client.test.ts are unrelated — they expect Bearer test-jwt-token but the client now generates a real JWT.

jrf0110 and others added 2 commits May 9, 2026 16:21
…ayor tools on prewarm

Three independent fixes for the startAgentInContainer timeout
regression introduced by #2974, plus a tighter container-instance cap.

1. Hydration gate (control-server.ts, process-manager.ts)
   The control server starts accepting requests immediately at boot,
   while bootHydration runs concurrently and serialises every registry
   agent + the mayor prewarm through the global sdkServerLock. Fresh
   /agents/start, /refresh-token, and PATCH /agents/:id/model requests
   queued behind that work and the DO-side AbortSignal.timeout(60s)
   fired before they ever got the lock — surfacing as
   "TimeoutError: aborted due to timeout" and "timeout after 6000ms:
   ensureSDKServer for <agentId>". A new awaitHydration() promise is
   awaited at the top of those handlers (before any process.env
   mutation in the model PATCH path) so they don't compound the queue.

2. Prewarm config matches /agents/start (Town.do.ts, gastown.worker.ts,
   process-manager.ts)
   buildPrewarmEnv was constructing KILO_CONFIG_CONTENT from hardcoded
   defaults (anthropic/claude-sonnet-4.6 / claude-haiku-4.5), so the
   real /agents/start with the user's actual model triggered
   ensureSDKServer's "config mismatch, evicting prewarmed server" path
   on every warm restart — doubling lock-holding time on the critical
   path the prewarm was supposed to speed up. The /api/towns/:id/mayor-id
   endpoint now returns the full prewarm context (model, smallModel,
   kilocodeToken, organizationId) resolved the same way _ensureMayor
   resolves it, and the container builds the prewarm KILO_CONFIG_CONTENT
   to match. Falls back gracefully to a skip when the worker hasn't
   deployed the richer endpoint yet.

3. Mayor workdir + plugin env (agent-runner.ts, process-manager.ts)
   prewarmMayorSDK called mayorWorkdirForTown (which only returns a
   string) and went straight to ensureSDKServer's process.chdir,
   throwing ENOENT on cold containers because createMayorWorkspace
   only ran from runAgent. Exported ensureMayorWorkspaceForTown so
   prewarm materialises the workspace first.

   More critically, buildPrewarmEnv was missing GASTOWN_AGENT_ROLE,
   GASTOWN_AGENT_ID, and GASTOWN_TOWN_ID — env vars the kilo serve
   plugin (plugin/index.ts) reads at spawn to decide whether to
   register mayor tools. Without them the prewarmed server booted with
   NO mayor tools, and the cache hit on the next /agents/start handed
   that defective instance back to the user. Now mirrors the mayor-
   shaped subset of buildAgentEnv. Added an end-to-end test that
   intercepts createKilo and asserts the env at spawn time.

4. wrangler.jsonc: lower TownContainerDO max_instances from 800 to 500.

Verified with pnpm --filter gastown-container test (67/67 pass),
pnpm --filter cloudflare-gastown typecheck, oxlint, and pnpm format.
Replace PRStatusResult | null return type with discriminated PRStatusOutcome
union in checkPRStatus. Each null cause (no token, HTTP error, invalid
response, unrecognized URL, host mismatch) now surfaces a structured
PRStatusError with actionable failure messages.

- resolveGitHubToken returns GitHubTokenResolution with resolution chain
- no_token and non-transient HTTP errors (401/403/404) fail immediately
- invalid_response/unrecognized_url/host_mismatch fail after 3 strikes
- Transient HTTP errors (5xx/429) keep existing 10-strike behavior
- poll_null_count resets to 0 on successful poll at both call sites
- failureKind persisted to bead metadata for analytics
- AE event pr.poll_failed emitted on terminal failure
- Unit tests for checkPRStatus, resolveGitHubToken, failureMessageFor,
  and threshold logic
- Integration test for no_token immediate-fail path
statusCode: error.kind === 'http_error' ? error.status : undefined,
});
}
} else {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: poll_null_count is shared between the transient-error bucket (10-strike) and the non-transient-non-immediate bucket (3-strike), so accumulation in one bucket contaminates the other.

Scenario: 2 transient 5xx errors set poll_null_count = 2. The next poll returns invalid_response. The 3-strike path reads poll_null_count = 3 (already at threshold) and immediately fails the bead, even though there has only been 1 invalid_response error.

Conversely, 2 unrecognized_url errors (poll_null_count = 2) followed by 8 transient 5xx errors would reach the 10-strike threshold after only 8 transient errors, not 10.

Consider using separate counter keys, e.g. $.poll_transient_count and $.poll_soft_error_count, so each bucket's threshold is applied against its own count.

*/
export async function bootHydration(): Promise<void> {
const LOG = '[boot-hydration]';
_hydrationComplete = new Promise<void>(resolve => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: bootHydration is not re-entrant-safe. If called a second time while the first invocation is still in flight, _hydrationComplete is replaced with a new promise and _resolveHydration is overwritten. The first invocation's finally block will then call the second invocation's resolver, resolving the second promise prematurely while the first call is still running. Any awaitHydration() callers that were waiting on the first promise are now waiting on a promise that will never be resolved.

In practice bootHydration is called once from main.ts, so this is low-risk today, but the module-level mutable state makes it fragile. A guard like if (_hydrationComplete !== Promise.resolve()) return _hydrationComplete; (or a simpler let _booted = false flag) would prevent the hazard.

if (ctx) {
return c.json({ success: true, ...ctx });
}
const agentId = await town.getMayorAgentId();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SUGGESTION: When getMayorPrewarmContext() returns null because the mayor exists but has no kilocode token, the fallback path calls getMayorAgentId() — a second RPC into the same DO. Since getMayorPrewarmContext already fetches the mayor agent internally, you could return { agentId } in that case rather than null, saving the extra round-trip.

Not a bug — the contract comment above explains the intent — but worth noting for efficiency.

@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented May 9, 2026

Code Review Summary

Status: 2 Issues Found | Recommendation: Address before merge

Overview

Severity Count
CRITICAL 0
WARNING 2
SUGGESTION 1
Issue Details (click to expand)

WARNING

File Line Issue
services/gastown/src/dos/town/actions.ts 1350 Shared poll_null_count counter contaminates both the 10-strike transient bucket and 3-strike soft-error bucket. 2 transient 5xx errors + 1 invalid_response immediately triggers the 3-strike threshold.
services/gastown/container/src/process-manager.ts 2822 bootHydration re-entrancy hazard: concurrent or sequential calls overwrite _hydrationComplete/_resolveHydration, leaving awaiters from the first call stranded on an unresolvable promise.

SUGGESTION

File Line Issue
services/gastown/src/gastown.worker.ts 688 When getMayorPrewarmContext returns null (mayor exists but no token), a second RPC to getMayorAgentId fires unnecessarily.
Other Observations (not in diff)

The failureMessageFor function at actions.ts:339 uses error.provider in the http_error generic case (line 360) but hardcodes "GitHub token" in the 401/403/404 specific messages. If GitLab ever returns 401/403/404 these messages would be misleading ("Town's GitHub token is invalid" for a GitLab error). Low-risk for now since GitLab errors go through a different code path with a pre-check, but worth noting for future maintainability.

Files Reviewed (13 files)
  • services/gastown/container/src/agent-runner.ts — no issues
  • services/gastown/container/src/control-server.ts — no issues
  • services/gastown/container/src/process-manager.test.ts — no issues
  • services/gastown/container/src/process-manager.ts — 1 warning
  • services/gastown/src/dos/Town.do.ts — no issues
  • services/gastown/src/dos/town/actions.ts — 1 warning, 1 observation (formatting only in latest commit)
  • services/gastown/src/dos/town/town-scm.ts — no issues (formatting only in latest commit)
  • services/gastown/src/gastown.worker.ts — 1 suggestion
  • services/gastown/src/handlers/refresh-git-token.handler.ts — no issues
  • services/gastown/test/integration/pr-poll-errors.test.ts — no issues (formatting only in latest commit)
  • services/gastown/test/unit/pr-poll-errors.test.ts — no issues (formatting only in latest commit)
  • services/gastown/test/unit/pr-poll-thresholds.test.ts — no issues
  • services/gastown/wrangler.jsonc — no issues

Fix these issues in Kilo Cloud


Reviewed by claude-4.6-sonnet-20260217 · 287,135 tokens

@jrf0110
Copy link
Copy Markdown
Contributor Author

jrf0110 commented May 10, 2026

Superseded by #3160. Closing as duplicate from convoy retry loop. Branch contaminated with unrelated changes from another bead — do not reuse.

@jrf0110 jrf0110 closed this May 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Gastown] Misleading "GitHub API returned null" error when town has no GitHub token

1 participant