fix(gastown): distinguish null causes in PR status polling (#3149)#3157
Conversation
…ayor tools on prewarm

Three independent fixes for the startAgentInContainer timeout regression introduced by #2974, plus a tighter container-instance cap.

1. Hydration gate (control-server.ts, process-manager.ts)

The control server starts accepting requests immediately at boot, while bootHydration runs concurrently and serialises every registry agent + the mayor prewarm through the global sdkServerLock. Fresh /agents/start, /refresh-token, and PATCH /agents/:id/model requests queued behind that work, and the DO-side AbortSignal.timeout(60s) fired before they ever got the lock — surfacing as "TimeoutError: aborted due to timeout" and "timeout after 6000ms: ensureSDKServer for <agentId>". A new awaitHydration() promise is awaited at the top of those handlers (before any process.env mutation in the model PATCH path) so they don't compound the queue.

2. Prewarm config matches /agents/start (Town.do.ts, gastown.worker.ts, process-manager.ts)

buildPrewarmEnv was constructing KILO_CONFIG_CONTENT from hardcoded defaults (anthropic/claude-sonnet-4.6 / claude-haiku-4.5), so the real /agents/start with the user's actual model triggered ensureSDKServer's "config mismatch, evicting prewarmed server" path on every warm restart — doubling lock-holding time on the critical path the prewarm was supposed to speed up. The /api/towns/:id/mayor-id endpoint now returns the full prewarm context (model, smallModel, kilocodeToken, organizationId) resolved the same way _ensureMayor resolves it, and the container builds the prewarm KILO_CONFIG_CONTENT to match. Falls back gracefully to a skip when the worker hasn't deployed the richer endpoint yet.

3. Mayor workdir + plugin env (agent-runner.ts, process-manager.ts)

prewarmMayorSDK called mayorWorkdirForTown (which only returns a string) and went straight to ensureSDKServer's process.chdir, throwing ENOENT on cold containers because createMayorWorkspace only ran from runAgent. Exported ensureMayorWorkspaceForTown so prewarm materialises the workspace first.

More critically, buildPrewarmEnv was missing GASTOWN_AGENT_ROLE, GASTOWN_AGENT_ID, and GASTOWN_TOWN_ID — env vars the kilo serve plugin (plugin/index.ts) reads at spawn to decide whether to register mayor tools. Without them the prewarmed server booted with NO mayor tools, and the cache hit on the next /agents/start handed that defective instance back to the user. Now mirrors the mayor-shaped subset of buildAgentEnv. Added an end-to-end test that intercepts createKilo and asserts the env at spawn time.

4. wrangler.jsonc

Lower TownContainerDO max_instances from 800 to 500.

Verified with pnpm --filter gastown-container test (67/67 pass), pnpm --filter cloudflare-gastown typecheck, oxlint, and pnpm format.
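The hydration gate in fix 1 can be sketched roughly as follows. Only the names bootHydration and awaitHydration come from the description above; the module shape and the hydrate callback are assumptions for illustration.

```typescript
// Minimal sketch of the hydration gate (illustrative; the real module shape
// in control-server.ts / process-manager.ts is assumed, not quoted).
let resolveHydration: () => void = () => {};
const hydrationComplete: Promise<void> = new Promise<void>(resolve => {
  resolveHydration = resolve;
});

// Request handlers await this at the top so they queue behind hydration
// instead of compounding the sdkServerLock queue mid-flight.
export function awaitHydration(): Promise<void> {
  return hydrationComplete;
}

export async function bootHydration(hydrate: () => Promise<void>): Promise<void> {
  try {
    await hydrate(); // serialise registry agents + mayor prewarm here
  } finally {
    resolveHydration(); // always release waiters, even if hydration failed
  }
}
```

The point of the `finally` is that a failed hydration still unblocks /agents/start and friends rather than leaving them to hit the DO-side abort timeout.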
Replace the PRStatusResult | null return type with a discriminated PRStatusOutcome union in checkPRStatus. Each null cause (no token, HTTP error, invalid response, unrecognized URL, host mismatch) now surfaces a structured PRStatusError with actionable failure messages.

- resolveGitHubToken returns GitHubTokenResolution with resolution chain
- no_token and non-transient HTTP errors (401/403/404) fail immediately
- invalid_response/unrecognized_url/host_mismatch fail after 3 strikes
- Transient HTTP errors (5xx/429) keep existing 10-strike behavior
- poll_null_count resets to 0 on successful poll at both call sites
- failureKind persisted to bead metadata for analytics
- AE event pr.poll_failed emitted on terminal failure
- Unit tests for checkPRStatus, resolveGitHubToken, failureMessageFor, and threshold logic
- Integration test for no_token immediate-fail path
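A sketch of what the PRStatusOutcome union might look like. The variant names and strike behavior come from the list above; the exact field names and the failureMessageFor wording are assumptions.

```typescript
// Hypothetical shape of the discriminated union described in the PR.
type PRStatusError =
  | { kind: 'no_token'; tried: string[] }
  | { kind: 'http_error'; status: number; transient: boolean }
  | { kind: 'invalid_response'; sampleKeys: string[] }
  | { kind: 'unrecognized_url' }
  | { kind: 'host_mismatch' };

type PRStatusOutcome =
  | { ok: true; result: { state: string } } // PRStatusResult stands in as { state }
  | { ok: false; error: PRStatusError };

// Callers can now branch on the cause instead of counting a bare null.
function failureMessageFor(error: PRStatusError): string {
  switch (error.kind) {
    case 'no_token':
      return `No GitHub token found (tried: ${error.tried.join(', ')})`;
    case 'http_error':
      return `GitHub returned HTTP ${error.status}${error.transient ? ' (transient)' : ''}`;
    case 'invalid_response':
      return 'GitHub response did not match the expected PR shape';
    case 'unrecognized_url':
      return 'PR URL did not match any supported provider pattern';
    case 'host_mismatch':
      return 'PR host does not match the configured SCM host';
  }
}
```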
```ts
      statusCode: error.kind === 'http_error' ? error.status : undefined,
    });
  }
} else {
```
WARNING: poll_null_count is shared between the transient-error bucket (10-strike) and the non-transient-non-immediate bucket (3-strike), so accumulation in one bucket contaminates the other.
Scenario: 2 transient 5xx errors set poll_null_count = 2. The next poll returns invalid_response. The 3-strike path reads poll_null_count = 3 (already at threshold) and immediately fails the bead, even though there has only been 1 invalid_response error.
Conversely, 2 unrecognized_url errors (poll_null_count = 2) followed by 8 transient 5xx errors would reach the 10-strike threshold after only 8 transient errors, not 10.
Consider using separate counter keys, e.g. $.poll_transient_count and $.poll_soft_error_count, so each bucket's threshold is applied against its own count.
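Under the suggested split, threshold accounting could look like the following sketch. The counter key names come from the comment above and the limits from the PR description; the recordFailure helper is hypothetical.

```typescript
// Per-bucket strike counters instead of one shared poll_null_count.
const TRANSIENT_LIMIT = 10; // transient HTTP 5xx / 429
const SOFT_ERROR_LIMIT = 3; // invalid_response / unrecognized_url / host_mismatch

interface PollCounters {
  poll_transient_count: number;
  poll_soft_error_count: number;
}

// Records one failure in its own bucket and reports whether that bucket's
// threshold has been reached; strikes in one bucket never leak into the other.
function recordFailure(
  counters: PollCounters,
  bucket: 'transient' | 'soft',
): { counters: PollCounters; terminal: boolean } {
  const next = { ...counters };
  if (bucket === 'transient') {
    next.poll_transient_count += 1;
    return { counters: next, terminal: next.poll_transient_count >= TRANSIENT_LIMIT };
  }
  next.poll_soft_error_count += 1;
  return { counters: next, terminal: next.poll_soft_error_count >= SOFT_ERROR_LIMIT };
}
```

With this shape, the contamination scenarios above disappear: two transient strikes followed by one invalid_response leave the soft bucket at 1 of 3.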
```ts
 */
export async function bootHydration(): Promise<void> {
  const LOG = '[boot-hydration]';
  _hydrationComplete = new Promise<void>(resolve => {
```
WARNING: bootHydration is not re-entrant-safe. If called a second time while the first invocation is still in flight, _hydrationComplete is replaced with a new promise and _resolveHydration is overwritten. The first invocation's finally block will then call the second invocation's resolver, resolving the second promise prematurely while the first call is still running. Any awaitHydration() callers that were waiting on the first promise are now waiting on a promise that will never be resolved.
In practice bootHydration is called once from main.ts, so this is low-risk today, but the module-level mutable state makes it fragile. A guard that returns the existing _hydrationComplete when a boot is already in flight (or a simpler let _booted = false flag) would prevent the hazard.
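The guard can be sketched as caching the in-flight promise; the bootHydrationOnce name and the hydrate parameter are hypothetical, standing in for the real module's internals.

```typescript
// Idempotent boot guard: the second and later calls reuse the first
// invocation's promise instead of replacing the module-level state.
let hydrationInFlight: Promise<void> | null = null;

export function bootHydrationOnce(hydrate: () => Promise<void>): Promise<void> {
  if (hydrationInFlight) return hydrationInFlight;
  hydrationInFlight = hydrate();
  return hydrationInFlight;
}
```

This keeps every awaitHydration() caller attached to the one promise that actually tracks the running hydration.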
```ts
if (ctx) {
  return c.json({ success: true, ...ctx });
}
const agentId = await town.getMayorAgentId();
```
SUGGESTION: When getMayorPrewarmContext() returns null because the mayor exists but has no kilocode token, the fallback path calls getMayorAgentId() — a second RPC into the same DO. Since getMayorPrewarmContext already fetches the mayor agent internally, you could return { agentId } in that case rather than null, saving the extra round-trip.
Not a bug — the contract comment above explains the intent — but worth noting for efficiency.
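The suggested shape could look like this sketch, where getMayorPrewarmContext carries agentId even when the token is missing. The MayorPrewarmContext interface and toResponse helper are hypothetical illustrations, not the real DO API.

```typescript
// Hypothetical partial context: agentId is always present, the rest only
// when a kilocode token was resolved.
interface MayorPrewarmContext {
  agentId: string;
  model?: string;
  smallModel?: string;
  kilocodeToken?: string;
  organizationId?: string;
}

function toResponse(ctx: MayorPrewarmContext) {
  // Full context -> prewarm payload; token missing -> agentId-only fallback,
  // with no second getMayorAgentId() round-trip into the DO.
  return ctx.kilocodeToken
    ? { success: true, ...ctx }
    : { success: true, agentId: ctx.agentId };
}
```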
Code Review Summary

Status: 2 Issues Found | Recommendation: Address before merge

Issue Details

- WARNING: shared poll_null_count strike counter (detailed above)
- SUGGESTION: redundant getMayorAgentId() round-trip (detailed above)

Other Observations (not in diff)

Files Reviewed (13 files)

Reviewed by claude-4.6-sonnet-20260217 · 287,135 tokens
Superseded by #3160. Closing as duplicate from convoy retry loop. Branch contaminated with unrelated changes from another bead — do not reuse. |
Summary
Replace `checkPRStatus`'s `PRStatusResult | null` return with a discriminated `PRStatusOutcome` union (`{ ok: true, result }` | `{ ok: false, error }`) so that each failure mode surfaces a specific, actionable error instead of collapsing into a generic "null" count.

Key changes:

- `PRStatusError` variants: `no_token`, `unrecognized_url`, `http_error`, `invalid_response`, `host_mismatch` — each with structured fields (resolution chain, HTTP status, transient flag, sample keys).
- `resolveGitHubToken` returns `GitHubTokenResolution` (`{ ok: true, token, source }` | `{ ok: false, tried }`) capturing which credential sources were attempted, so `no_token` errors can enumerate exactly what the user should configure.
- `no_token` and non-transient HTTP errors (401/403/404) fail immediately (1 strike); `invalid_response`/`unrecognized_url`/`host_mismatch` fail after 3 strikes; transient HTTP 5xx/429 keep the existing 10-strike behavior.
- `poll_null_count` reset on successful poll.
- `pr.poll_failed` event emitted on terminal failure with `failureKind`, provider, and HTTP status.
- `failureKind` persisted to bead metadata for analytics grouping.

Fixes #3149.
Verification
Manual verification:

- `no_token` end-to-end through the DO alarm passes.
- The failing `container/plugin/client.test.ts` JWT assertion is pre-existing (see Reviewer Notes).

Visual Changes
N/A
Reviewer Notes
- The `platformIntegrationId` gap in Town.do.ts's `checkPRStatus` context is pre-existing (not introduced by this change) — the SCMContext passed to `scm.checkPRStatus` doesn't include it, which means rig-level platform integrations aren't tried during PR polling. This is out of scope for this fix.
- The failures in `container/plugin/client.test.ts` are unrelated — they expect `Bearer test-jwt-token` but the client now generates a real JWT.