fix: prevent NEXRAD ingester from hanging permanently on stalled S3/redis operations by cwdaniel · Pull Request #9 · cwdaniel/RadrView

cwdaniel · 2026-04-12T22:34:15Z

Summary

Fix NEXRAD ingester process hanging permanently after days of uptime, causing all station data to disappear
Add three layers of timeout protection so no single operation can ever block the poll loop again
Add observability: every poll cycle logs its results, and timeouts and failures are logged at warn level

Changes

Root cause: hung process

The NEXRAD ingester's main poll loop (src/nexrad/main.ts) has multiple await points that can block forever:

await fileResp.arrayBuffer() — AbortSignal.timeout on fetch() only covers the connection/headers. If S3 sends headers then stalls the body transfer, the body read hangs indefinitely.
await redis.hset(...) / await redis.expire(...) — ioredis has no default commandTimeout, so a stuck Redis command hangs forever.
Promise.allSettled in the batch — if one fetchLatest in a batch of 10 hangs, the entire batch and main loop hang permanently.

With 159 stations polled every 60 seconds (~187K HTTP+Redis operations over 3 days), one stalled operation is statistically inevitable. restart: unless-stopped only catches exits, not hung event loops.
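
To illustrate the batch-level guard, here is a minimal sketch of a withTimeout helper matching the withTimeout(promise, ms) call shape that appears in the review diff. The body and the TimeoutError name are assumptions for illustration, not the code merged in this PR.

```typescript
// Hypothetical error type so callers can tell timeouts from other failures.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`operation timed out after ${ms} ms`);
    this.name = "TimeoutError";
  }
}

// Rejects with TimeoutError if `promise` has not settled within `ms`.
// This only stops waiting; it does not cancel the underlying work.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
    promise.then(
      (value) => {
        clearTimeout(timer); // do not keep the event loop alive
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

With a wrapper like this around each station in the batch, a promise that never settles turns into a rejection that Promise.allSettled can report, so the poll loop keeps moving.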

File: three layers of timeout protection (src/nexrad/main.ts)

File: scan key self-healing (src/nexrad/main.ts, src/nexrad/redis-scan-store.ts)

refreshScanTTL now returns boolean — whether the scan key still exists in Redis
When the "data unchanged" path detects an expired key, it clears latestVolume for that station and forces a re-download on the same cycle
Previously, redis.expire() on a non-existent key silently returned 0, and the ingester never rewrote the data
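
The self-healing behavior above could be sketched roughly as follows. The refreshScanTTL and latestVolume names come from the description; the key layout, the TTL value, the ExpireClient type, and the shouldRedownload helper are hypothetical and stand in for details the PR page does not show.

```typescript
// Minimal structural type for the one Redis call used here. In ioredis,
// `expire` resolves to 1 when the TTL was set and 0 when the key is missing.
interface ExpireClient {
  expire(key: string, seconds: number): Promise<number>;
}

// Hypothetical key layout and TTL; the real values are not shown in the PR.
const SCAN_TTL_SECONDS = 600;
const scanKey = (stationId: string): string => `nexrad:scan:${stationId}`;

// Returns true if the scan key still exists (TTL refreshed), false if it is gone.
async function refreshScanTTL(redis: ExpireClient, stationId: string): Promise<boolean> {
  const result = await redis.expire(scanKey(stationId), SCAN_TTL_SECONDS);
  return result === 1;
}

// "Data unchanged" path: when the key has expired, forget the cached volume
// so the caller re-downloads the station during the same cycle.
async function shouldRedownload(
  redis: ExpireClient,
  stationId: string,
  latestVolume: Map<string, string>,
): Promise<boolean> {
  const alive = await refreshScanTTL(redis, stationId);
  if (!alive) latestVolume.delete(stationId);
  return !alive;
}
```

Returning the boolean instead of discarding the result of expire is what lets the caller notice the missing key at all.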

Observability (src/nexrad/main.ts)

Every poll cycle now logs updated, timedOut, failed, and total counts (previously only logged when updated > 0, so a stuck ingester was indistinguishable from a quiet one)
Timeouts from withTimeout wrapper logged at warn level with station ID
Timeouts from AbortSignal/Redis inside fetchLatest logged at warn level
Transient S3 errors remain at debug level
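
A per-cycle summary of this kind might be computed from the settled batch results as in the sketch below. It assumes that fetchLatest resolves to true when a station was updated and that timeouts are rejections whose name is "TimeoutError"; both are assumptions, as are the Settled and CycleSummary types and the summarizeCycle name.

```typescript
// Local mirror of the shape produced by Promise.allSettled.
type Settled<T> =
  | { status: "fulfilled"; value: T }
  | { status: "rejected"; reason: unknown };

interface CycleSummary {
  updated: number;
  timedOut: number;
  failed: number;
  total: number;
}

// Counts outcomes for one poll cycle. Stations that resolved to false
// (nothing new) count only toward `total`.
function summarizeCycle(results: Settled<boolean>[]): CycleSummary {
  const summary: CycleSummary = { updated: 0, timedOut: 0, failed: 0, total: results.length };
  for (const r of results) {
    if (r.status === "fulfilled") {
      if (r.value) summary.updated += 1;
    } else if (r.reason instanceof Error && r.reason.name === "TimeoutError") {
      summary.timedOut += 1; // assumed marker for wrapper timeouts
    } else {
      summary.failed += 1;
    }
  }
  return summary;
}
```

Logging this object on every cycle, including cycles where all counts except total are zero, is what makes a stuck ingester distinguishable from a quiet one.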

Testing

npx tsc --noEmit passes
All 216 unit tests pass
Docker builds successfully
Verified NEXRAD stations recover after deploy
Monitored logs for timeout/failure counts over 24+ hours


cwdaniel · 2026-04-12T22:36:18Z

This PR addressed #9

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e152d45617

chatgpt-codex-connector · 2026-04-12T22:37:41Z

+    const results = await Promise.allSettled(
+      batch.map(id => withTimeout(fetchLatest(redis, id), FETCH_LATEST_TIMEOUT_MS)),
+    );


Cancel timed-out station work before starting next poll

withTimeout only rejects the wrapper promise after 45s; it does not stop the underlying fetchLatest task. If a station run is slow (not dead) and crosses 45s, pollAllStations treats it as finished and future cycles can launch another fetchLatest for the same station while the first one is still writing Redis/status state. That overlap can let an older in-flight run finish later and overwrite newer station data, and repeated stalls can accumulate orphaned in-flight operations.
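
One way to address this concern is sketched below: track in-flight stations so a new cycle skips any station whose previous run is still pending, and abort a signal on timeout so a cooperative task can stop early. This is an illustration only. The runExclusive name, the StationTask shape, and the "skipped" result are hypothetical, and the fetchLatest(redis, id) call in the diff does not take a signal.

```typescript
// Hypothetical task shape that accepts an AbortSignal.
type StationTask = (stationId: string, signal: AbortSignal) => Promise<boolean>;

const inFlight = new Set<string>();

// Runs `task` for a station unless an earlier run has not settled yet.
// On timeout the signal is aborted and the returned promise rejects, but the
// station is released only when the underlying task actually settles.
function runExclusive(
  stationId: string,
  task: StationTask,
  timeoutMs: number,
): Promise<boolean | "skipped"> {
  if (inFlight.has(stationId)) return Promise.resolve<"skipped">("skipped");
  inFlight.add(stationId);

  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const run = task(stationId, controller.signal);

  const release = () => {
    clearTimeout(timer);
    inFlight.delete(stationId);
  };
  run.then(release, release);

  const timedOut = new Promise<never>((_, reject) => {
    controller.signal.addEventListener("abort", () =>
      reject(new Error(`station ${stationId} timed out`)),
    );
  });
  return Promise.race([run, timedOut]);
}
```

The trade-off is that a task which ignores the signal and never settles keeps its station marked as in flight, so this approach only helps if the task passes the signal on to the operations that can stall.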

chatgpt-codex-connector · 2026-04-12T22:37:41Z

+      logger.debug({ err, stationId }, 'Failed to fetch station');
+    }
    return false;
  }


Surface fetch errors to keep failure counters accurate

The new cycle summary increments failed only for rejected promises, but this catch block converts non-timeout exceptions into false and resolves normally. In practice, S3/Redis/parser failures are now hidden from the failed metric (except wrapper timeouts), so poll-cycle logs can report failed: 0 even when many stations are erroring.
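
A possible shape for the fix suggested here is to log the error and then rethrow it, so the batch sees a rejection and the station is counted as failed. The sketch below is an assumption about how that could look, not the code in the PR; fetchOrThrow, the download parameter, and the DebugLogger type are hypothetical, with the logger call modeled on the one in the diff.

```typescript
// Hypothetical logger shape, modeled on the logger.debug call in the diff.
interface DebugLogger {
  debug(fields: Record<string, unknown>, msg: string): void;
}

// Wraps a station download: logs the error at debug level, then rethrows so
// Promise.allSettled reports a rejection instead of a resolved `false`.
async function fetchOrThrow(
  stationId: string,
  download: (stationId: string) => Promise<boolean>,
  logger: DebugLogger,
): Promise<boolean> {
  try {
    return await download(stationId);
  } catch (err) {
    logger.debug({ err, stationId }, "Failed to fetch station");
    throw err; // keep the failure visible to the cycle summary
  }
}
```

Under this approach a resolved false means "nothing new for this station", while any S3, Redis, or parser error shows up as a rejected result.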

fix: prevent NEXRAD ingester from hanging permanently on stalled S3/redis operations (e152d45)

cwdaniel merged commit f32eaee into main Apr 12, 2026
1 check passed

cwdaniel added the bug Something isn't working label Apr 12, 2026

chatgpt-codex-connector Bot reviewed Apr 12, 2026

cwdaniel linked an issue Apr 12, 2026 that may be closed by this pull request

[Bug]: NEXRAD ingester hangs when a single item fails to download. #10

Closed

cwdaniel deleted the multi-level-nexrad-recovery branch April 12, 2026 22:41

cwdaniel merged 1 commit into main from multi-level-nexrad-recovery
