fix: prevent NEXRAD ingester from hanging permanently on stalled S3/redis operations #9

Merged
cwdaniel merged 1 commit into main from multi-level-nexrad-recovery on Apr 12, 2026

@cwdaniel

Owner

Summary

  • Fix NEXRAD ingester process hanging permanently after days of uptime, causing all station data to disappear
  • Add three layers of timeout protection so no single operation can ever block the poll loop again
  • Add observability: every poll cycle logs its results, and timeouts and failures are logged at warn level

Changes

Root cause: hung process

The NEXRAD ingester's main poll loop (src/nexrad/main.ts) has multiple await points that can block forever:

  1. await fileResp.arrayBuffer() — AbortSignal.timeout on fetch() only covers the connection/headers. If S3 sends headers then stalls the body transfer, the body read hangs indefinitely.
  2. await redis.hset(...) / await redis.expire(...) — ioredis has no default commandTimeout, so a stuck Redis command hangs forever.
  3. Promise.allSettled in the batch — if one fetchLatest in a batch of 10 hangs, the entire batch and main loop hang permanently.

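With 159 stations polled every 60 seconds (~187K HTTP+Redis operations over 3 days), a stalled operation is all but inevitable over a multi-day run. Docker's restart: unless-stopped policy only restarts a container whose process has exited, so it never recovers a process that is still alive but stuck on a pending await.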
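
The failure mode in item 3 can be reproduced in a few lines. The sketch below is illustrative only, not the code from this PR: the TimeoutError class and this withTimeout body are assumptions, and the 50 ms limit is arbitrary. It shows that one promise that never settles would hold up Promise.allSettled forever, and that racing each task against a timer bounds the wait.

```typescript
// Hypothetical error type so callers can tell timeouts from other failures.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`operation timed out after ${ms} ms`);
    this.name = "TimeoutError";
  }
}

// Rejects if `work` has not settled within `ms`. The underlying work is NOT
// cancelled; the caller merely stops waiting for it.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  // Clear the timer either way so it cannot keep the process alive.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

// Stand-in for a stalled body read: this promise never settles.
const stalled = new Promise<boolean>(() => {});
const healthy = Promise.resolve(true);

async function demo(): Promise<PromiseSettledResult<boolean>[]> {
  // Without the wrapper, allSettled would wait on `stalled` indefinitely.
  return Promise.allSettled([
    withTimeout(healthy, 50),
    withTimeout(stalled, 50),
  ]);
}
```

Because the wrapper only abandons the wait, it works best as an outer safety net on top of per-operation limits (for example, ioredis accepts a commandTimeout option) rather than as the only defense.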

File: three layers of timeout protection (src/nexrad/main.ts)

File: scan key self-healing (src/nexrad/main.ts, src/nexrad/redis-scan-store.ts)

  • refreshScanTTL now returns boolean — whether the scan key still exists in Redis
  • When the "data unchanged" path detects an expired key, it clears latestVolume for that station and forces a re-download on the same cycle
  • Previously, redis.expire() on a non-existent key silently returned 0, and the ingester never rewrote the data
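
A minimal sketch of the self-healing path described above, under stated assumptions: the ExpireClient interface, the key layout, the TTL value, the needsRedownload helper, and the shape of latestVolume are all hypothetical and stand in for the real code in src/nexrad/redis-scan-store.ts. The one fact relied on is that Redis EXPIRE reports 1 when the TTL was set and 0 when the key does not exist.

```typescript
// Minimal slice of a Redis client. With ioredis, `expire` resolves to 1 when
// the TTL was set and 0 when the key does not exist.
interface ExpireClient {
  expire(key: string, seconds: number): Promise<number>;
}

// Hypothetical key layout and TTL, chosen only for this sketch.
const SCAN_TTL_SECONDS = 600;
const scanKey = (stationId: string): string => `nexrad:scan:${stationId}`;

// Returns true if the scan key still exists (and its TTL was refreshed).
async function refreshScanTTL(redis: ExpireClient, stationId: string): Promise<boolean> {
  const updated = await redis.expire(scanKey(stationId), SCAN_TTL_SECONDS);
  return updated === 1;
}

// Last volume identifier seen per station (assumed shape).
const latestVolume = new Map<string, string>();

// "Data unchanged" path: resolves to true when the caller must re-download.
async function needsRedownload(
  redis: ExpireClient,
  stationId: string,
  volume: string,
): Promise<boolean> {
  if (latestVolume.get(stationId) !== volume) return true; // new data upstream
  if (await refreshScanTTL(redis, stationId)) return false; // key still alive
  latestVolume.delete(stationId); // key expired: forget the cached volume
  return true; // force a re-download in the same cycle
}
```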

Observability (src/nexrad/main.ts)

  • Every poll cycle now logs updated, timedOut, failed, and total counts (previously only logged when updated > 0, so a stuck ingester was indistinguishable from a quiet one)
  • Timeouts from the withTimeout wrapper are logged at warn level with the station ID
  • Timeouts from AbortSignal/Redis inside fetchLatest are logged at warn level
  • Transient S3 errors remain at debug level
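
One way the per-cycle counts could be derived from the settled batch results is sketched below. This is an assumption-laden illustration, not the PR's code: the CycleSummary type and summarizeCycle function are hypothetical, and timeouts are recognized by an error name of "TimeoutError", which is only a convention chosen for this sketch.

```typescript
// Hypothetical shape of the per-cycle log payload.
interface CycleSummary {
  updated: number;
  timedOut: number;
  failed: number;
  total: number;
}

// Folds one batch of settled results into counters. A fulfilled `true`
// means the station was updated; `false` means nothing changed.
function summarizeCycle(results: PromiseSettledResult<boolean>[]): CycleSummary {
  const summary: CycleSummary = { updated: 0, timedOut: 0, failed: 0, total: results.length };
  for (const result of results) {
    if (result.status === "fulfilled") {
      if (result.value) summary.updated += 1;
    } else if (result.reason instanceof Error && result.reason.name === "TimeoutError") {
      summary.timedOut += 1; // assumption: timeouts carry this error name
    } else {
      summary.failed += 1;
    }
  }
  return summary;
}
```

Logging this object on every cycle, even when all counters are zero, is what makes a quiet ingester distinguishable from a stuck one.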

Testing

  • npx tsc --noEmit passes
  • All 216 unit tests pass
  • Docker builds successfully
  • Verified NEXRAD stations recover after deploy
  • Monitored logs for timeout/failure counts over 24+ hours

@cwdaniel

Owner Author

This PR addressed #9

cwdaniel merged commit f32eaee into main on Apr 12, 2026
1 check passed
cwdaniel added the bug (Something isn't working) label on Apr 12, 2026

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e152d45617

Comment thread src/nexrad/main.ts
Comment on lines +147 to +149
    const results = await Promise.allSettled(
      batch.map(id => withTimeout(fetchLatest(redis, id), FETCH_LATEST_TIMEOUT_MS)),
    );

P1: Cancel timed-out station work before starting next poll

withTimeout only rejects the wrapper promise after 45s; it does not stop the underlying fetchLatest task. If a station run is slow (not dead) and crosses 45s, pollAllStations treats it as finished and future cycles can launch another fetchLatest for the same station while the first one is still writing Redis/status state. That overlap can let an older in-flight run finish later and overwrite newer station data, and repeated stalls can accumulate orphaned in-flight operations.
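
One possible mitigation for the overlap described here is a per-station in-flight guard, sketched below. Everything in it is hypothetical (the inFlight set, the runExclusive helper, and its result shape); it is not part of this PR, and true cancellation would additionally require threading an AbortSignal through fetchLatest.

```typescript
// Stations whose previous run has not settled yet.
const inFlight = new Set<string>();

type ExclusiveResult<T> = { ran: true; value: T } | { ran: false };

// Runs `task` only if no earlier run for the same station is still pending.
// The guard is released when the task itself settles, not when an outer
// timeout wrapper gives up on it.
async function runExclusive<T>(
  stationId: string,
  task: () => Promise<T>,
): Promise<ExclusiveResult<T>> {
  if (inFlight.has(stationId)) return { ran: false };
  inFlight.add(stationId);
  try {
    return { ran: true, value: await task() };
  } finally {
    inFlight.delete(stationId);
  }
}
```

The trade-off is that a task which never settles would leave its station skipped indefinitely, so this guard only makes sense when every inner operation carries its own timeout.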

Comment thread src/nexrad/main.ts
Comment on lines +124 to 127
        logger.debug({ err, stationId }, 'Failed to fetch station');
      }
      return false;
    }

P2: Surface fetch errors to keep failure counters accurate

The new cycle summary increments failed only for rejected promises, but this catch block converts non-timeout exceptions into false and resolves normally. In practice, S3/Redis/parser failures are now hidden from the failed metric (except wrapper timeouts), so poll-cycle logs can report failed: 0 even when many stations are erroring.
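
A sketch of one way to keep errors visible without letting them reject the batch: have the per-station call resolve to a tagged outcome instead of collapsing failures into false. The FetchOutcome type and toOutcome helper are hypothetical names for illustration and are not part of this PR.

```typescript
// Hypothetical tagged result so "nothing changed" and "failed" stay distinct.
type FetchOutcome =
  | { kind: "updated" }
  | { kind: "unchanged" }
  | { kind: "failed"; error: unknown };

// Wraps a boolean-returning fetch and preserves any thrown error in the result.
async function toOutcome(work: () => Promise<boolean>): Promise<FetchOutcome> {
  try {
    return (await work()) ? { kind: "updated" } : { kind: "unchanged" };
  } catch (error) {
    return { kind: "failed", error };
  }
}

// Counts failures across a batch of outcomes for the cycle summary.
function countFailed(outcomes: FetchOutcome[]): number {
  return outcomes.filter((outcome) => outcome.kind === "failed").length;
}
```

The simpler alternative is to log and rethrow from the catch block, so that Promise.allSettled sees a rejection and the existing failed counter picks it up.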

cwdaniel linked an issue on Apr 12, 2026 that may be closed by this pull request
cwdaniel deleted the multi-level-nexrad-recovery branch on April 12, 2026 at 22:41

Development

Successfully merging this pull request may close these issues.

[Bug]: NEXRAD ingester hangs when a single item fails to download.