fix(cold): replace read-arm TaskTracker backpressure with Semaphore#56
Conversation
The cold storage task's run loop wedges once `MAX_CONCURRENT_READERS` (64) read handlers are in flight. The read arm's backpressure waits on `TaskTracker::wait()`, but that future only resolves when the tracker is both *closed* and *empty*. The read arm never closes the tracker (only the write arm does, during drain-before-write), so the inner `wait()` never completes, the outer `select!` never re-polls, and no further reads *or* writes are dispatched.

Replace the TaskTracker-based backpressure with a `Semaphore`:

- Each reader acquires one permit before being spawned; the permit is released when the spawned task completes or panics.
- Writes acquire all `MAX_CONCURRENT_READERS` permits via `acquire_many_owned`, which unblocks only after every in-flight reader has released its permit — preserving the existing drain-before-write invariant with no close/reopen dance.

`TaskTracker` is kept solely for graceful shutdown, which is where its close+wait semantics are actually wanted.

Add regression tests that spawn 256 concurrent reads (4× the in-flight cap) and verify both reads and an interleaved write complete within a 15 s guard. Before this change, both tests hit the guard; after, they pass in milliseconds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code review: No issues found. Checked for bugs and CLAUDE.md compliance. 🤖 Generated with Claude Code
gonna have to sit down with the laptop to review this; feel free to patch node to this branch in the meantime
this does seem to fix things when patched locally and built into an image
[Claude Code] @Fraser999 — could I get a second pair of eyes on this design spec? It extends the semaphore fix in this PR with a broader refactor. Landing this design would replace the current diff.

Cold storage read/write permit refactor

Context

PR #56 … The semaphore fix is correct and resolves the observed crash. During …
This spec describes a refactor that resolves all four in one change.

Goals
Non-goals
Design

Architecture

Replace the single-run-loop model with two concurrently-running tasks:
Both tasks share one …

Permit-attached-to-message
```rust
pub async fn get_header(&self, spec: HeaderSpecifier)
    -> ColdResult<Option<SealedHeader>>
{
    let permit = self.read_sem.clone().acquire_owned().await
        .map_err(|_| ColdStorageError::Cancelled)?;
    let (tx, rx) = oneshot::channel();
    self.read_sender
        .send(PermittedReadRequest {
            permit,
            req: ColdReadRequest::GetHeader { spec, resp: tx },
        })
        .await
        .map_err(|_| ColdStorageError::Cancelled)?;
    rx.await.map_err(|_| ColdStorageError::Cancelled)?
}
```

The permit travels in the message, is moved into the spawned handler …

Channel sizing:
[Claude Code] Spec is directionally right — the permit-attached-to-message, split dispatcher/writer, cancellable drain, and …

Core issue: tokio-level …
Adds two accessors to the `ColdStorageBackend` trait:
```rust
fn read_timeout(&self) -> Option<Duration> { None }
fn write_timeout(&self) -> Option<Duration> { None }
```
Wired through `MdbxColdBackend`, `SqlColdBackend`, and `EitherCold`.
`MemColdBackend` returns `None` (already-documented test exemption).
Two behaviour changes use these:
1. The advisory write-SLO WARN moves from the MDBX backend
(`warn_on_overrun` per-method) to `ColdStorage::spawn_write`. Timing
is now captured before `write_sem` acquisition, so the elapsed
value covers the queue wait, the read drain, and the commit
end-to-end. The failure shape that wedged production at #56 — slow
readers gating writes — now surfaces as a write-SLO violation
rather than as a sub-threshold backend timing.
2. `stream_logs`'s setup `get_latest_block` is wrapped in
`tokio::time::timeout(backend.read_timeout(), ...)`. Without this,
a stuck point lookup (cold MDBX page) or a saturated PG pool
parking on `acquire_timeout` could pin N concurrent setup callers
indefinitely with no permit cap. The setup read still bypasses
`read_sem` and the drain barrier by design.
Also drops the now-unused `tracing` dep from `signet-cold-mdbx` and
updates the type docs to point at the handle's new WARN path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(cold): ColdStorageWrite takes &self; all backends updated in lockstep

* refactor(cold): unify handle around Arc<Inner>; remove channels and dispatcher

* fix(cold): stream permit acquired in handle; streams do not hold a read permit

* feat(cold): drain barrier moves to handle write path

* feat(cold): shutdown coordinator closes semaphores on cancel

* refactor(cold-mdbx): spawn_blocking reads, block_in_place writes, in-body iterator deadline

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cold-sql): mandatory statement_timeout; read_timeout and write_timeout builders

* feat(cold): metrics and tracing spans across all operations

  Adds a `metrics` module under `crates/cold/src/metrics.rs` with const metric names, help strings, a `LazyLock` describe block, and `pub(crate)` helper functions for recording:

  - `cold.reads_in_flight`, `cold.writes_in_flight`, `cold.streams_active` (gauges)
  - `cold.op_duration_us` (histogram, labeled by op)
  - `cold.permit_wait_us` (histogram, labeled by sem: read/write/drain/stream)
  - `cold.op_errors_total` (counter, labeled by op and error kind)
  - `cold.stream_lifetime_ms` (histogram)

  Wires the helpers into every `ColdStorage<B>` handle method: `spawn_read` and `spawn_write` time permit acquisition, bump in-flight, measure op duration, record errors, and dec in-flight after the backend call. Cache hits in `get_header`/`get_transaction`/`get_receipt` record op duration only (no permit wait, no in-flight). `stream_logs` instruments stream permit wait and records stream lifetime + gauge in the spawned producer. Adds `ColdStorageError::kind()` for the error metric label.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(cold): trait impl guide documents mandatory timeouts

* test(cold): concurrency suite covers new architecture

* fix(cold): shutdown coordinator holds Weak<Inner>, not Arc

  The coordinator task previously moved Arc<Inner<B>> into its body and awaited the user's cancel token. If callers dropped all ColdStorage clones without firing cancel, Inner (and the backend's file/DB handles) stayed pinned until process exit. Switch the coordinator to Weak<Inner>, and put a DropGuard on Inner that fires a child cancel token. shutdown now fires on either user-side cancel OR Inner drop; in the drop case upgrade() returns None and the coordinator exits without pinning anything.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold-mdbx): preserve TooManyLogs via From impl, not backend wrapper

  ColdStorageError::backend unconditionally wraps as Backend(Box<_>), which hid MdbxColdError::TooManyLogs behind the generic backend variant and broke the conformance suite's max_logs assertion. The From<MdbxColdError> for ColdStorageError impl already translates TooManyLogs correctly and wraps the rest. Route all spawn_blocking result conversions through ::from so the translation runs.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold): address review on permits, gauges, errors, cache

  - stream_logs resolves `to` (and get_latest_block fallback) before acquiring stream_sem, so a stuck backend no longer pins all 8 permits across setup I/O.
  - In-flight gauges are now maintained by an InFlightGuard RAII wrapper so the decrement survives a panic in the spawned body; previously a panic left cold.reads_in_flight / writes_in_flight / streams_active drifting up and poisoning the Prometheus alert signal.
  - Promote timeout to a first-class ColdStorageError::DeadlineExceeded variant. MDBX Timeout now routes through it (not Backend), and downstream callers can match without downcasting. Fixes stale Backpressure references in the cold and storage READMEs and the signet-storage skill doc.
  - ColdCache switches from tokio::sync::Mutex to parking_lot::Mutex. The cache only ever holds the lock across synchronous LRU ops, so the async mutex's yield-on-lock was pure overhead.
  - MemColdBackend now explicitly documents its exemption from the trait's mandatory-timeout contract.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold-mdbx): spawn_blocking writes, per-item deadlines, docs

  - Writes (append_block, append_blocks, truncate_above, drain_above) now use tokio::task::spawn_blocking instead of block_in_place. block_in_place panics on a current_thread runtime, so any consumer wiring MdbxColdBackend into a single-threaded Tokio would hit the first write and crash. Added writes_work_on_current_thread_runtime regression test.
  - Overrun WARN fires only on successful writes. A failed write that took > 2 s already surfaces Backend(...) to the caller; a spurious advisory-write-timeout WARN on the error path would poison any SLO alert built on that signal.
  - Iterator reads gained inner-loop deadline checks: per-receipt in get_logs_inner, per-event in collect_signet_events_in_range, per-receipt + per-log in produce_log_stream_blocking. A block with many matching logs (or a slow stream consumer) can no longer run unbounded past the configured deadline.
  - MdbxColdError::Timeout now maps to ColdStorageError::DeadlineExceeded (new variant) instead of Backend. Updated the existing timeout test to match on the variant directly.
  - Documented the point-lookup timeout exemption: MDBX page I/O on cold pages can stall arbitrarily, and the handle does not wrap point lookups in a tokio::time::timeout, so a stuck lookup ties up a spawn_blocking worker AND a read_sem permit. Callers that need fail-fast behavior should wrap at the call site.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(storage): align unified::drain_above doc with silent-swallow impl

  The PR-#58 doc rewrite advertised a `Cold` error path, but the impl collapses every cold error into `Vec::new()`. Update the doc + comment to admit silent-swallow behaviour. ENG-2210 tracks the propagation decision.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold-sql): map PG statement_timeout to DeadlineExceeded

  SQLSTATE 57014 (query_canceled, emitted on `statement_timeout` expiry) was wrapped as `Sqlx(...)` and surfaced to the handle as `Backend(...)`, breaking symmetry with the MDBX backend (which routes its `Timeout` to `ColdStorageError::DeadlineExceeded`). The metric `cold.op_errors_total{error="backend"}` therefore conflated "query too slow" with "backend down". `From<sqlx::Error> for SqlColdError` now detects 57014 and produces a dedicated `Timeout` variant; `From<SqlColdError> for ColdStorageError` maps it to `DeadlineExceeded`. The configured deadline is not threaded to this conversion boundary so the surfaced duration is `ZERO`; threading the real value is a separate refactor (left for a follow-up once the call sites are confirmed to need it). The `pg_statement_timeout` test is rewritten to match on the typed variant rather than a substring of the error message — a future refactor that drops 57014 detection now fails the test instead of silently passing.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cold): hoist write SLO and stream-setup timeout into the handle

  Adds two accessors to the `ColdStorageBackend` trait:

      fn read_timeout(&self) -> Option<Duration> { None }
      fn write_timeout(&self) -> Option<Duration> { None }

  Wired through `MdbxColdBackend`, `SqlColdBackend`, and `EitherCold`. `MemColdBackend` returns `None` (already-documented test exemption).

  Two behaviour changes use these:

  1. The advisory write-SLO WARN moves from the MDBX backend (`warn_on_overrun` per-method) to `ColdStorage::spawn_write`. Timing is now captured before `write_sem` acquisition, so the elapsed value covers the queue wait, the read drain, and the commit end-to-end. The failure shape that wedged production at #56 — slow readers gating writes — now surfaces as a write-SLO violation rather than as a sub-threshold backend timing.
  2. `stream_logs`'s setup `get_latest_block` is wrapped in `tokio::time::timeout(backend.read_timeout(), ...)`. Without this, a stuck point lookup (cold MDBX page) or a saturated PG pool parking on `acquire_timeout` could pin N concurrent setup callers indefinitely with no permit cap. The setup read still bypasses `read_sem` and the drain barrier by design.

  Also drops the now-unused `tracing` dep from `signet-cold-mdbx` and updates the type docs to point at the handle's new WARN path.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold): reject zero timeouts; log JoinError panics; misc nits

  Builders for `read_timeout` / `write_timeout` on both the MDBX and SQL backends and connectors now panic on zero. Postgres treats `statement_timeout = 0` as "no timeout", so a caller passing `Duration::ZERO` (or computing one from a config that defaults to zero) would silently disable the trait-level mandatory-timeout contract. MDBX accepts the same assert for symmetry — zero there is a useless config rather than a silent disable, but the trait says non-zero and the assert keeps the surface honest.

  `spawn_read` / `spawn_write` now log spawned-task `JoinError`s before mapping to `TaskTerminated`. A backend panic was previously indistinguishable from graceful shutdown for the on-call: panics fire ERROR with the panic message, cancellations fire DEBUG. The error variant still collapses to `TaskTerminated` per design.

  `MdbxColdBackend::get_logs_inner` now checks the deadline inside the inner per-log loop, mirroring the streaming path. Previously a single receipt with thousands of matching logs would iterate unchecked past the configured `read_timeout`. The two `std::time::Instant::now()` sites in `produce_log_stream_blocking` are also folded into the already-imported `Instant`.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(cold): stream_logs setup fails fast on hung get_latest_block

  Pins the new setup-timeout behaviour against regression. The test parks `GatedBackend::get_latest_block` indefinitely and asserts that `stream_logs` (with no `to_block` on the filter, forcing the "resolve to=latest" path) returns `DeadlineExceeded` within the configured 50 ms `read_timeout` rather than hanging. Adds `GatedBackend::with_read_timeout` so tests can advertise a custom read timeout to the handle.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fixes a deadlock in `signet_cold::ColdStorageTask` that wedges the entire cold-storage task the moment 64 concurrent reads are in flight. Once wedged, no reads or writes are dispatched; upstream RPCs stall, the write mpsc fills (~256 chain advances ≈ 51 min at 12 s/block), `try_send` in `dispatch_append_blocks` returns `Backpressure`, and the node crashes via `?` at `signet-node/src/node.rs:285`.

Observed in production on signet-sidecar running in the dev `mainnet` namespace: every ~50 min the pod crashed with `cold storage backpressure: channel full`; all cold-path RPCs (`eth_getBlockByNumber`, `eth_getLogs`, `eth_gasPrice`, `eth_feeHistory`) hung at the 60 s client timeout while hot-only RPCs (`eth_getBalance`, `eth_blockNumber`, `eth_chainId`) stayed sub-millisecond. Aurora sat idle at its min-pool 5 connections with zero `ReadIOPS`.

The bug
In `crates/cold/src/task/runner.rs`, the read arm's backpressure (before this change) awaited `tokio_util::TaskTracker::wait()`, whose documentation states that the returned future resolves only once the tracker is both *closed* and *empty*.

The read arm never calls `close()` — only the write arm does, as part of its drain-before-write (`close(); wait(); reopen();`). So the instant the 65th concurrent read arrives, the inner `wait()` future has no path to completion. An `.await` inside a `tokio::select!` arm does not cause the outer `select!` to re-poll — the outer future is still "inside" the read arm's body. The whole run loop is pinned there forever.

Observable fallout, all matching telemetry:
- The `read_receiver` mpsc backs up; `send().await` on the RPC side blocks; clients see exactly-60 s hangs when they time out.
- Aurora shows zero `ReadIOPS` and a 100 % buffer cache ratio because nothing new queries it.
- The `write_receiver` (256 slots) fills. Chain advances every ~12 s ⇒ 256 × 12 s ≈ 51 min, after which `try_send(AppendBlocks)` returns `TrySendError::Full` → `ColdStorageError::Backpressure`, which bubbles through `self.storage.append_blocks(...)?` in `signet-node` and kills the process.
- `handle_new_head`/`commit` spans look fast in tracing because they measure the node-side `try_send`, which is fire-and-forget. They don't indicate the cold task is actually making progress.

The fix
Replace the `TaskTracker`-based backpressure with a `Semaphore` sized to `MAX_CONCURRENT_READERS`:

Key properties:

- `acquire_owned()` wakes immediately when any sibling reader releases a permit on completion. No closed-tracker invariant required.
- `acquire_many_owned(64)` only resolves once every reader has released its permit, which is exactly the old `close() + wait()` semantics — minus the footgun.
- `TaskTracker` is retained solely for graceful shutdown, where `close() + wait()` semantics are what we want (the shutdown path explicitly closes the tracker after the main loop exits).
- The `JoinHandle`'s future (on handler panic) drops `_permit`, so the semaphore count is never leaked.

Tests
Added `crates/cold/tests/concurrency.rs` with two regression tests:

- `reads_above_concurrency_cap_do_not_deadlock` — issues 256 concurrent reads (4× the in-flight cap) and asserts all 256 complete within a 15 s guard.
- `write_after_saturating_reads_makes_progress` — interleaves a write into the same 256-reader flood and asserts it completes within the same guard.
Test plan
cargo test -p signet-cold(existing conformance + new concurrency tests)cargo +nightly fmt -- --checkcargo clippy --workspace --all-targets -- -D warningsRUSTDOCFLAGS=\"-D warnings\" cargo doc --workspace --no-depsorigin/main'srunner.rsand pass with this change.signet-cold+signet-storage, bumpinit4tech/node-components, rebuildsignet-sidecar:latest, redeploy tomainnetdev, confirm the sidecar stops crashing every 51 min andeth_getBlockByNumberstops timing out.🤖 Generated with Claude Code