fix: prevent NEXRAD ingester from hanging permanently on stalled S3/redis operations #9
This PR Addressed #9
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e152d45617
const results = await Promise.allSettled(
  batch.map(id => withTimeout(fetchLatest(redis, id), FETCH_LATEST_TIMEOUT_MS)),
);
Cancel timed-out station work before starting next poll
withTimeout only rejects the wrapper promise after 45s; it does not stop the underlying fetchLatest task. If a station run is slow (not dead) and crosses 45s, pollAllStations treats it as finished and future cycles can launch another fetchLatest for the same station while the first one is still writing Redis/status state. That overlap can let an older in-flight run finish later and overwrite newer station data, and repeated stalls can accumulate orphaned in-flight operations.
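One way to address this, sketched under assumptions: hand the task an AbortSignal so it can stop at its next checkpoint, and hold a per-station slot until the underlying work actually settles, so the next cycle cannot start a second run for the same station. `runExclusiveWithTimeout`, `Task`, and `inFlight` are hypothetical names, not from this PR, and the sketch assumes fetchLatest can be changed to accept and honor a signal.

```typescript
// Hypothetical sketch; none of these names exist in the PR.
type Task<T> = (signal: AbortSignal) => Promise<T>;

// Stations whose underlying work has not settled yet.
const inFlight = new Set<string>();

async function runExclusiveWithTimeout<T>(
  stationId: string,
  task: Task<T>,
  ms: number,
): Promise<T> {
  if (inFlight.has(stationId)) {
    throw new Error(`${stationId}: previous run still in flight`);
  }
  inFlight.add(stationId);
  const controller = new AbortController();

  // Release the slot when the real work settles, not when the timer fires.
  // Promise.resolve().then(...) also covers a task that throws synchronously.
  const work = Promise.resolve()
    .then(() => task(controller.signal))
    .finally(() => inFlight.delete(stationId));

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // ask the task to stop; it must check the signal
      reject(new Error(`${stationId}: timed out after ${ms}ms`));
    }, ms);
  });

  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

With this shape, a slow station is reported as timed out for the current cycle, and later cycles skip it (the call rejects immediately) until the earlier run has finished or aborted, so an older run cannot overwrite newer data from an overlapping one.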
    logger.debug({ err, stationId }, 'Failed to fetch station');
  }
  return false;
}
Surface fetch errors to keep failure counters accurate
The new cycle summary increments failed only for rejected promises, but this catch block converts non-timeout exceptions into false and resolves normally. In practice, S3/Redis/parser failures are now hidden from the failed metric (except wrapper timeouts), so poll-cycle logs can report failed: 0 even when many stations are erroring.
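A minimal sketch of one possible fix: log inside the catch block but rethrow, so Promise.allSettled sees a rejection and the summary can count it. `fetchStation`, `summarize`, and the `Summary` shape are hypothetical names for illustration, and the sketch assumes a resolved `true` means new data was stored and `false` means nothing changed, which the PR does not state.

```typescript
// Hypothetical sketch; names and the meaning of true/false are assumptions.
type Summary = { updated: number; unchanged: number; failed: number };

// Logs the error, then rethrows so the caller still sees a rejection.
async function fetchStation(
  stationId: string,
  fetchOnce: (id: string) => Promise<boolean>,
  log: (msg: string) => void,
): Promise<boolean> {
  try {
    return await fetchOnce(stationId);
  } catch (err) {
    log(`Failed to fetch station ${stationId}: ${String(err)}`);
    throw err;
  }
}

// Counts every rejection as failed, whether it was a timeout or a fetch error.
function summarize(results: PromiseSettledResult<boolean>[]): Summary {
  const summary: Summary = { updated: 0, unchanged: 0, failed: 0 };
  for (const result of results) {
    if (result.status === 'rejected') summary.failed += 1;
    else if (result.value) summary.updated += 1;
    else summary.unchanged += 1;
  }
  return summary;
}
```

An alternative that keeps the catch block resolving normally would be to return a tagged result (for example a status string) instead of a bare boolean, so that "no new data" and "error" are distinguishable in the summary.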
Summary
Changes
Root cause: hung process
The NEXRAD ingester's main poll loop (src/nexrad/main.ts) has multiple await points that can block forever:
With 159 stations polled every 60 seconds (~187K HTTP+Redis operations over 3 days), one stalled operation is statistically inevitable. The restart: unless-stopped policy only reacts to process exits, so it never restarts a process that is hung but still running.
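Because the restart policy only helps once the process exits, one common pattern is a heartbeat watchdog that turns a stall into an exit. The sketch below is illustrative only and is not necessarily what this PR implements; `createWatchdog`, `beat`, and `check` are hypothetical names. It relies on the fact that an await that never settles does not block timers, so a periodic check can still run.

```typescript
// Hypothetical sketch; not taken from src/nexrad/main.ts.
type Watchdog = { beat: () => void; check: () => boolean };

function createWatchdog(
  maxSilenceMs: number,
  onStall: () => void,
  now: () => number = Date.now, // injectable clock for testing
): Watchdog {
  let lastBeat = now();
  return {
    // Call at the end of every completed poll cycle.
    beat() {
      lastBeat = now();
    },
    // Call periodically; fires onStall when no cycle finished in time.
    check() {
      if (now() - lastBeat > maxSilenceMs) {
        onStall();
        return true;
      }
      return false;
    },
  };
}

// Possible wiring (assumed values):
//   const watchdog = createWatchdog(5 * 60_000, () => process.exit(1));
//   setInterval(() => watchdog.check(), 10_000).unref();
//   ...and watchdog.beat() after each poll cycle completes.
```

Exiting with a non-zero code hands recovery back to the restart policy, which is the part of the system that already handles exits.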
File: three layers of timeout protection (src/nexrad/main.ts)
File: scan key self-healing (src/nexrad/main.ts, src/nexrad/redis-scan-store.ts)
Observability (src/nexrad/main.ts)
Testing