fix(cold): replace read-arm TaskTracker backpressure with Semaphore#56
Conversation
The cold storage task's run loop wedges once `MAX_CONCURRENT_READERS` (64) read handlers are in flight. The read arm's backpressure waits on `TaskTracker::wait()`, but that future only resolves when the tracker is both *closed* and *empty*. The read arm never closes the tracker (only the write arm does, during drain-before-write), so the inner `wait()` never completes, the outer `select!` never re-polls, and no further reads *or* writes are dispatched.

Replace the TaskTracker-based backpressure with a `Semaphore`:

- Each reader acquires one permit before being spawned; the permit is released when the spawned task completes or panics.
- Writes acquire all `MAX_CONCURRENT_READERS` permits via `acquire_many_owned`, which unblocks only after every in-flight reader has released its permit — preserving the existing drain-before-write invariant with no close/reopen dance.

`TaskTracker` is kept solely for graceful shutdown, which is where its close+wait semantics are actually wanted.

Add regression tests that spawn 256 concurrent reads (4× the in-flight cap) and verify both reads and an interleaved write complete within a 15 s guard. Before this change, both tests hit the guard; after, they pass in milliseconds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code review: No issues found. Checked for bugs and CLAUDE.md compliance. 🤖 Generated with Claude Code
gonna have to sit down with the laptop to review this; feel free to patch node to this branch in the meantime
this does seem to fix things when patched locally and built into an image
[Claude Code] @Fraser999 — could I get a second pair of eyes on this design spec? It extends the semaphore fix in this PR with a broader refactor. Landing this design would replace the current diff.

Cold storage read/write permit refactor

Context

PR #56 … The semaphore fix is correct and resolves the observed crash. During …
This spec describes a refactor that resolves all four in one change.

Goals
Non-goals
Design

Architecture

Replace the single-run-loop model with two concurrently-running tasks:
Both tasks share one …

Permit-attached-to-message
```rust
pub async fn get_header(&self, spec: HeaderSpecifier)
    -> ColdResult<Option<SealedHeader>>
{
    let permit = self.read_sem.clone().acquire_owned().await
        .map_err(|_| ColdStorageError::Cancelled)?;
    let (tx, rx) = oneshot::channel();
    self.read_sender
        .send(PermittedReadRequest {
            permit,
            req: ColdReadRequest::GetHeader { spec, resp: tx },
        })
        .await
        .map_err(|_| ColdStorageError::Cancelled)?;
    rx.await.map_err(|_| ColdStorageError::Cancelled)?
}
```

The permit travels in the message, is moved into the spawned handler …

Channel sizing:
[Claude Code] Spec is directionally right — the permit-attached-to-message, split dispatcher/writer, cancellable drain, and …

Core issue: tokio-level …
Adds two accessors to the `ColdStorageBackend` trait:
```rust
fn read_timeout(&self) -> Option<Duration> { None }
fn write_timeout(&self) -> Option<Duration> { None }
```
Wired through `MdbxColdBackend`, `SqlColdBackend`, and `EitherCold`.
`MemColdBackend` returns `None` (already-documented test exemption).
Two behaviour changes use these:
1. The advisory write-SLO WARN moves from the MDBX backend
(`warn_on_overrun` per-method) to `ColdStorage::spawn_write`. Timing
is now captured before `write_sem` acquisition, so the elapsed
value covers the queue wait, the read drain, and the commit
end-to-end. The failure shape that wedged production at #56 — slow
readers gating writes — now surfaces as a write-SLO violation
rather than as a sub-threshold backend timing.
2. `stream_logs`'s setup `get_latest_block` is wrapped in
`tokio::time::timeout(backend.read_timeout(), ...)`. Without this,
a stuck point lookup (cold MDBX page) or a saturated PG pool
parking on `acquire_timeout` could pin N concurrent setup callers
indefinitely with no permit cap. The setup read still bypasses
`read_sem` and the drain barrier by design.
Also drops the now-unused `tracing` dep from `signet-cold-mdbx` and
updates the type docs to point at the handle's new WARN path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(cold): ColdStorageWrite takes &self; all backends updated in lockstep

* refactor(cold): unify handle around Arc<Inner>; remove channels and dispatcher

* fix(cold): stream permit acquired in handle; streams do not hold a read permit

* feat(cold): drain barrier moves to handle write path

* feat(cold): shutdown coordinator closes semaphores on cancel

* refactor(cold-mdbx): spawn_blocking reads, block_in_place writes, in-body iterator deadline

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cold-sql): mandatory statement_timeout; read_timeout and write_timeout builders

* feat(cold): metrics and tracing spans across all operations

  Adds a `metrics` module under `crates/cold/src/metrics.rs` with const metric names, help strings, a `LazyLock` describe block, and `pub(crate)` helper functions for recording:

  - `cold.reads_in_flight`, `cold.writes_in_flight`, `cold.streams_active` (gauges)
  - `cold.op_duration_us` (histogram, labeled by op)
  - `cold.permit_wait_us` (histogram, labeled by sem: read/write/drain/stream)
  - `cold.op_errors_total` (counter, labeled by op and error kind)
  - `cold.stream_lifetime_ms` (histogram)

  Wires the helpers into every `ColdStorage<B>` handle method: `spawn_read` and `spawn_write` time permit acquisition, bump in-flight, measure op duration, record errors, and dec in-flight after the backend call. Cache hits in `get_header`/`get_transaction`/`get_receipt` record op duration only (no permit wait, no in-flight). `stream_logs` instruments stream permit wait and records stream lifetime + gauge in the spawned producer. Adds `ColdStorageError::kind()` for the error metric label.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(cold): trait impl guide documents mandatory timeouts

* test(cold): concurrency suite covers new architecture

* fix(cold): shutdown coordinator holds Weak<Inner>, not Arc

  The coordinator task previously moved Arc<Inner<B>> into its body and awaited the user's cancel token. If callers dropped all ColdStorage clones without firing cancel, Inner (and the backend's file/DB handles) stayed pinned until process exit. Switch the coordinator to Weak<Inner>, and put a DropGuard on Inner that fires a child cancel token. shutdown now fires on either user-side cancel OR Inner drop; in the drop case upgrade() returns None and the coordinator exits without pinning anything.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold-mdbx): preserve TooManyLogs via From impl, not backend wrapper

  ColdStorageError::backend unconditionally wraps as Backend(Box<_>), which hid MdbxColdError::TooManyLogs behind the generic backend variant and broke the conformance suite's max_logs assertion. The From<MdbxColdError> for ColdStorageError impl already translates TooManyLogs correctly and wraps the rest. Route all spawn_blocking result conversions through ::from so the translation runs.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold): address review on permits, gauges, errors, cache

  - stream_logs resolves `to` (and get_latest_block fallback) before acquiring stream_sem, so a stuck backend no longer pins all 8 permits across setup I/O.
  - In-flight gauges are now maintained by an InFlightGuard RAII wrapper so the decrement survives a panic in the spawned body; previously a panic left cold.reads_in_flight / writes_in_flight / streams_active drifting up and poisoning the Prometheus alert signal.
  - Promote timeout to a first-class ColdStorageError::DeadlineExceeded variant. MDBX Timeout now routes through it (not Backend), and downstream callers can match without downcasting. Fixes stale Backpressure references in the cold and storage READMEs and the signet-storage skill doc.
  - ColdCache switches from tokio::sync::Mutex to parking_lot::Mutex. The cache only ever holds the lock across synchronous LRU ops, so the async mutex's yield-on-lock was pure overhead.
  - MemColdBackend now explicitly documents its exemption from the trait's mandatory-timeout contract.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold-mdbx): spawn_blocking writes, per-item deadlines, docs

  - Writes (append_block, append_blocks, truncate_above, drain_above) now use tokio::task::spawn_blocking instead of block_in_place. block_in_place panics on a current_thread runtime, so any consumer wiring MdbxColdBackend into a single-threaded Tokio would hit the first write and crash. Added writes_work_on_current_thread_runtime regression test.
  - Overrun WARN fires only on successful writes. A failed write that took > 2 s already surfaces Backend(...) to the caller; a spurious advisory-write-timeout WARN on the error path would poison any SLO alert built on that signal.
  - Iterator reads gained inner-loop deadline checks: per-receipt in get_logs_inner, per-event in collect_signet_events_in_range, per-receipt + per-log in produce_log_stream_blocking. A block with many matching logs (or a slow stream consumer) can no longer run unbounded past the configured deadline.
  - MdbxColdError::Timeout now maps to ColdStorageError::DeadlineExceeded (new variant) instead of Backend. Updated the existing timeout test to match on the variant directly.
  - Documented the point-lookup timeout exemption: MDBX page I/O on cold pages can stall arbitrarily, and the handle does not wrap point lookups in a tokio::time::timeout, so a stuck lookup ties up a spawn_blocking worker AND a read_sem permit. Callers that need fail-fast behavior should wrap at the call site.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(storage): align unified::drain_above doc with silent-swallow impl

  The PR-#58 doc rewrite advertised a `Cold` error path, but the impl collapses every cold error into `Vec::new()`. Update the doc + comment to admit silent-swallow behaviour. ENG-2210 tracks the propagation decision.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold-sql): map PG statement_timeout to DeadlineExceeded

  SQLSTATE 57014 (query_canceled, emitted on `statement_timeout` expiry) was wrapped as `Sqlx(...)` and surfaced to the handle as `Backend(...)`, breaking symmetry with the MDBX backend (which routes its `Timeout` to `ColdStorageError::DeadlineExceeded`). The metric `cold.op_errors_total{error="backend"}` therefore conflated "query too slow" with "backend down". `From<sqlx::Error> for SqlColdError` now detects 57014 and produces a dedicated `Timeout` variant; `From<SqlColdError> for ColdStorageError` maps it to `DeadlineExceeded`. The configured deadline is not threaded to this conversion boundary so the surfaced duration is `ZERO`; threading the real value is a separate refactor (left for a follow-up once the call sites are confirmed to need it). The `pg_statement_timeout` test is rewritten to match on the typed variant rather than a substring of the error message — a future refactor that drops 57014 detection now fails the test instead of silently passing.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cold): hoist write SLO and stream-setup timeout into the handle

  Adds two accessors to the `ColdStorageBackend` trait:

      fn read_timeout(&self) -> Option<Duration> { None }
      fn write_timeout(&self) -> Option<Duration> { None }

  Wired through `MdbxColdBackend`, `SqlColdBackend`, and `EitherCold`. `MemColdBackend` returns `None` (already-documented test exemption).

  Two behaviour changes use these:

  1. The advisory write-SLO WARN moves from the MDBX backend (`warn_on_overrun` per-method) to `ColdStorage::spawn_write`. Timing is now captured before `write_sem` acquisition, so the elapsed value covers the queue wait, the read drain, and the commit end-to-end. The failure shape that wedged production at #56 — slow readers gating writes — now surfaces as a write-SLO violation rather than as a sub-threshold backend timing.
  2. `stream_logs`'s setup `get_latest_block` is wrapped in `tokio::time::timeout(backend.read_timeout(), ...)`. Without this, a stuck point lookup (cold MDBX page) or a saturated PG pool parking on `acquire_timeout` could pin N concurrent setup callers indefinitely with no permit cap. The setup read still bypasses `read_sem` and the drain barrier by design.

  Also drops the now-unused `tracing` dep from `signet-cold-mdbx` and updates the type docs to point at the handle's new WARN path.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold): reject zero timeouts; log JoinError panics; misc nits

  Builders for `read_timeout` / `write_timeout` on both the MDBX and SQL backends and connectors now panic on zero. Postgres treats `statement_timeout = 0` as "no timeout", so a caller passing `Duration::ZERO` (or computing one from a config that defaults to zero) would silently disable the trait-level mandatory-timeout contract. MDBX accepts the same assert for symmetry — zero there is a useless config rather than a silent disable, but the trait says non-zero and the assert keeps the surface honest.

  `spawn_read` / `spawn_write` now log spawned-task `JoinError`s before mapping to `TaskTerminated`. A backend panic was previously indistinguishable from graceful shutdown for the on-call: panics fire ERROR with the panic message, cancellations fire DEBUG. The error variant still collapses to `TaskTerminated` per design.

  `MdbxColdBackend::get_logs_inner` now checks the deadline inside the inner per-log loop, mirroring the streaming path. Previously a single receipt with thousands of matching logs would iterate unchecked past the configured `read_timeout`. The two `std::time::Instant::now()` sites in `produce_log_stream_blocking` are also folded into the already-imported `Instant`.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(cold): stream_logs setup fails fast on hung get_latest_block

  Pins the new setup-timeout behaviour against regression. The test parks `GatedBackend::get_latest_block` indefinitely and asserts that `stream_logs` (with no `to_block` on the filter, forcing the "resolve to=latest" path) returns `DeadlineExceeded` within the configured 50 ms `read_timeout` rather than hanging. Adds `GatedBackend::with_read_timeout` so tests can advertise a custom read timeout to the handle.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fixes a deadlock in `signet_cold::ColdStorageTask` that wedges the entire cold-storage task the moment 64 concurrent reads are in flight. Once wedged, no reads or writes are dispatched; upstream RPCs stall, the write mpsc fills (~256 chain advances ≈ 51 min at 12 s/block), `try_send` in `dispatch_append_blocks` returns `Backpressure`, and the node crashes via `?` at `signet-node/src/node.rs:285`.

Observed in production on signet-sidecar running in the dev `mainnet` namespace: every ~50 min the pod crashed with `cold storage backpressure: channel full`; all cold-path RPCs (`eth_getBlockByNumber`, `eth_getLogs`, `eth_gasPrice`, `eth_feeHistory`) hung at the 60 s client timeout while hot-only RPCs (`eth_getBalance`, `eth_blockNumber`, `eth_chainId`) stayed sub-millisecond. Aurora sat idle at its min-pool 5 connections with zero `ReadIOPS`.

The bug
In `crates/cold/src/task/runner.rs`, the read arm's backpressure (before this change) awaited `tokio_util::TaskTracker::wait()`, whose documentation states that the returned future resolves only once the tracker is both *closed* and *empty*.

The read arm never calls `close()` — only the write arm does, as part of its drain-before-write (`close(); wait(); reopen();`). So the instant the 65th concurrent read arrives, the inner `wait()` future has no path to completion. An `.await` inside a `tokio::select!` arm does not cause the outer `select!` to re-poll — the outer future is still "inside" the read arm's body. The whole run loop is pinned there forever.

Observable fallout, all matching telemetry:
- The `read_receiver` mpsc backs up; `send().await` on the RPC side blocks; clients see exactly-60 s hangs when they time out.
- Aurora shows zero `ReadIOPS` and a 100 % buffer cache ratio because nothing new queries it.
- The `write_receiver` (256 slots) fills. Chain advances every ~12 s ⇒ 256 × 12 s ≈ 51 min, after which `try_send(AppendBlocks)` returns `TrySendError::Full` → `ColdStorageError::Backpressure`, which bubbles through `self.storage.append_blocks(...)?` in `signet-node` and kills the process.
- `handle_new_head`/`commit` spans look fast in tracing because they measure the node-side `try_send`, which is fire-and-forget. They don't indicate the cold task is actually making progress.

The fix
Replace the `TaskTracker`-based backpressure with a `Semaphore` sized to `MAX_CONCURRENT_READERS`:

Key properties:

- `acquire_owned()` wakes immediately when any sibling reader releases a permit on completion. No closed-tracker invariant required.
- `acquire_many_owned(64)` only resolves once every reader has released its permit, which is exactly the old `close() + wait()` semantics — minus the footgun.
- `TaskTracker` is retained solely for graceful shutdown, where `close() + wait()` semantics are what we want (the shutdown path explicitly closes the tracker after the main loop exits).
- The `JoinHandle`'s future (on handler panic) drops `_permit`, so the semaphore count is never leaked.

Tests
Added `crates/cold/tests/concurrency.rs` with two regression tests:

- `reads_above_concurrency_cap_do_not_deadlock` — issues 256 concurrent reads (4× the in-flight cap) and asserts all 256 complete within a 15 s guard.
- `write_after_saturating_reads_makes_progress` — interleaves a write into the same 256-reader flood and asserts it completes within the same guard.
Test plan
cargo test -p signet-cold(existing conformance + new concurrency tests)cargo +nightly fmt -- --checkcargo clippy --workspace --all-targets -- -D warningsRUSTDOCFLAGS=\"-D warnings\" cargo doc --workspace --no-depsorigin/main'srunner.rsand pass with this change.signet-cold+signet-storage, bumpinit4tech/node-components, rebuildsignet-sidecar:latest, redeploy tomainnetdev, confirm the sidecar stops crashing every 51 min andeth_getBlockByNumberstops timing out.🤖 Generated with Claude Code