refactor(submitter): concurrent submitter #3287
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Adds cache APIs to reset and query total pending ranges; reworks pending selection to track contiguous in-flight claims and gaps; converts DA submission to asynchronous, callback-driven submission with centralized retry/backoff and a Close() lifecycle method.
Sequence Diagrams

sequenceDiagram
participant SubmitLoop as Submit loop
participant Cache as Cache (pending manager)
participant DASubmitter as DA Submitter
participant DA as DA Layer
SubmitLoop->>Cache: GetPendingHeaders()
Cache->>Cache: choose contiguous unclaimed range\nregister in-flight claim
Cache-->>SubmitLoop: return batch + (start,end)
SubmitLoop->>DASubmitter: SubmitHeaders(batch, onSuccess, onError)
DASubmitter-->>SubmitLoop: return (async)
DASubmitter->>DASubmitter: spawn goroutine -> submitWithRetry
DASubmitter->>DA: submit batch
alt DA success
DA-->>DASubmitter: success
DASubmitter->>Cache: apply post-submit updates (included markers, hints, set last-submitted)
DASubmitter->>SubmitLoop: call onSuccess()
else retryable failure
DA-->>DASubmitter: error
DASubmitter->>DASubmitter: backoff & retry
else terminal error
DA-->>DASubmitter: error
DASubmitter->>SubmitLoop: call onError(error)
SubmitLoop->>Cache: ResetInFlightHeaderRange(start,end)
end
sequenceDiagram
participant SubmitLoop as Submit loop
participant Cache as pendingBase
participant DASubmitter as DA Submitter
SubmitLoop->>Cache: GetPendingData()
Cache->>Cache: select first unclaimed contiguous range\nregister claim, remove overlapping gaps
Cache-->>SubmitLoop: pending items + (start,end)
SubmitLoop->>DASubmitter: SubmitData(batch, onSuccess, onError)
DASubmitter-->>SubmitLoop: return (async)
DASubmitter->>DASubmitter: submitWithRetry -> onSuccess/onError callbacks
alt onError called
SubmitLoop->>Cache: ResetInFlightDataRange(start,end)
Cache->>Cache: remove claim, reinsert failing portion as gap
end
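The claim/submit/reset lifecycle in the diagrams above can be sketched as a minimal, self-contained Go program. All names here (pending, claim, reset, submitAsync) are illustrative stand-ins for the PR's cache and submitter APIs, and single heights stand in for contiguous ranges:

```go
package main

import (
	"fmt"
	"sync"
)

// pending tracks unclaimed heights; a failed submission returns its
// height to the pool via reset so a later claim retries it.
type pending struct {
	mu   sync.Mutex
	next uint64   // first never-claimed height
	gaps []uint64 // heights returned by failed submissions
}

// claim hands out one height per call: gaps first, then fresh heights.
func (p *pending) claim() uint64 {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.gaps) > 0 {
		h := p.gaps[0]
		p.gaps = p.gaps[1:]
		return h
	}
	h := p.next
	p.next++
	return h
}

// reset reinserts a failed height as a gap.
func (p *pending) reset(h uint64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.gaps = append(p.gaps, h)
}

// submitAsync mimics the new contract: it returns immediately and
// reports the outcome through callbacks from a background goroutine.
func submitAsync(wg *sync.WaitGroup, h uint64, fail bool, onSuccess func(uint64), onError func(uint64, error)) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		if fail {
			onError(h, fmt.Errorf("da error at height %d", h))
			return
		}
		onSuccess(h)
	}()
}

func main() {
	p := &pending{next: 1}
	var wg sync.WaitGroup

	h1 := p.claim() // height 1
	h2 := p.claim() // height 2
	onSuccess := func(h uint64) {}
	onError := func(h uint64, err error) { p.reset(h) }
	submitAsync(&wg, h1, false, onSuccess, onError)
	submitAsync(&wg, h2, true, onSuccess, onError)
	wg.Wait()

	// the failed height is handed out again before any new height
	fmt.Println(p.claim()) // 2
	fmt.Println(p.claim()) // 3
}
```

The key property is that the submit loop never blocks on the DA layer: it only claims, enqueues, and (on the error callback) releases the claim.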
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks | ✅ 3 passed | ❌ 2 failed
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
Claude encountered an error (View job). I'll analyze this and get back to you.
Codecov Report

❌ Patch coverage is … Additional details and impacted files:

@@ Coverage Diff @@
## main #3287 +/- ##
==========================================
+ Coverage 60.52% 60.54% +0.02%
==========================================
Files 126 126
Lines 13721 13943 +222
==========================================
+ Hits 8304 8442 +138
- Misses 4513 4580 +67
- Partials 904 921 +17
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
block/internal/submitting/da_submitter_tracing.go (1)
33-61: ⚠️ Potential issue | 🟠 Major

The span now ends before the async submission does.

`inner.SubmitHeaders`/`SubmitData` now return right after scheduling background work, so `defer span.End()` closes the span before retries/callbacks run. Async failures routed through `onSubmitError` never reach the span, and the recorded latency is only the enqueue time. Wrap the callbacks and end/annotate the span from the actual completion path instead.

Also applies to: 64-92
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@block/internal/submitting/da_submitter_tracing.go` around lines 33 - 61: the current tracedDASubmitter starts a span in SubmitHeaders (and similarly in SubmitData) but defers span.End(), which closes the span before async work completes; instead remove the immediate defer span.End() and wrap the onSubmitSuccess and onSubmitError callbacks with closures that record errors/status on the span and call span.End() when the async completion path runs; i.e., in tracedDASubmitter.SubmitHeaders and SubmitData, create wrappedOnSuccess := func(){ span.SetStatus(codes.Ok, ""); onSubmitSuccess(); span.End() } and wrappedOnError := func(err error){ span.RecordError(err); span.SetStatus(codes.Error, err.Error()); onSubmitError(err); span.End() } (or equivalent wrappers matching original signatures) and pass those to t.inner.SubmitHeaders/SubmitData so the span lifetime and annotations reflect actual completion.

block/internal/cache/pending_base.go (1)
80-96: ⚠️ Potential issue | 🟠 Major

Take `lastHeight` and the in-flight ranges under one synchronization boundary.

`getPending()` reads `lastHeight` before cloning `inFlightClaims`/`gaps`, while `setLastSubmittedHeight()` updates `lastHeight` and trims those slices independently. With the new concurrent submitter, the interleaving "new `lastHeight`, claims already trimmed" can make `findAvailableRange()` hand out heights that were just acknowledged, causing duplicate DA submissions.

As per coding guidelines, "Be careful with concurrent access to shared state".

Also applies to: 175-189
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@block/internal/cache/pending_base.go` around lines 80 - 96, getPending() currently reads pb.lastHeight (pb.lastHeight.Load()) outside the pb.inFlightMu critical section and then clones pb.inFlightClaims and pb.gaps, which allows an interleaving with setLastSubmittedHeight() that trims those slices and updates lastHeight causing findAvailableRange() to return already-acknowledged heights; fix by moving the read of pb.lastHeight inside the same pb.inFlightMu.Lock()/Unlock() block where you clone inFlightClaims and gaps so lastHeight and the in-flight ranges are read atomically, and apply the same locking discipline to setLastSubmittedHeight() (acquire pb.inFlightMu while trimming inFlightClaims/gaps and updating pb.lastHeight) to prevent races when findAvailableRange, getPending, and setLastSubmittedHeight interact.
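A minimal sketch of the locking discipline this comment asks for, using hypothetical names (pendingBase, snapshot, ack) rather than the PR's actual fields: reading lastHeight and the claims inside one critical section rules out the interleaving described above.

```go
package main

import (
	"fmt"
	"sync"
)

// rng is a hypothetical in-flight claim covering [start, end].
type rng struct{ start, end uint64 }

type pendingBase struct {
	mu         sync.Mutex // guards lastHeight AND claims together
	lastHeight uint64
	claims     []rng
}

// snapshot reads lastHeight and the claims in one critical section,
// so a concurrent ack cannot interleave between the two reads.
func (pb *pendingBase) snapshot() (uint64, []rng) {
	pb.mu.Lock()
	defer pb.mu.Unlock()
	claims := make([]rng, len(pb.claims))
	copy(claims, pb.claims)
	return pb.lastHeight, claims
}

// ack advances lastHeight and trims acknowledged claims atomically.
func (pb *pendingBase) ack(h uint64) {
	pb.mu.Lock()
	defer pb.mu.Unlock()
	pb.lastHeight = h
	kept := pb.claims[:0]
	for _, c := range pb.claims {
		if c.end > h {
			kept = append(kept, c)
		}
	}
	pb.claims = kept
}

func main() {
	pb := &pendingBase{claims: []rng{{1, 5}, {6, 10}}}
	pb.ack(5)
	last, claims := pb.snapshot()
	fmt.Println(last, len(claims)) // 5 1
}
```

The trade-off is a plain mutex instead of an atomic for lastHeight, but it makes the pair of reads consistent by construction.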
🧹 Nitpick comments (3)
block/internal/cache/pending_data.go (1)
88-90: Document the new exported reset helper.
`ResetInFlightDataRange` is public, but it has no doc comment. That violates the repo's Go guideline for exported identifiers and will likely fail linting.

💡 Suggested fix

+// ResetInFlightDataRange clears the in-flight data claim state for heights in [start, end].
 func (pd *PendingData) ResetInFlightDataRange(start, end uint64) {

As per coding guidelines, document exported types and functions in Go code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@block/internal/cache/pending_data.go` around lines 88 - 90: add a Go doc comment for the exported method ResetInFlightDataRange on type PendingData: describe what the method does, its parameters (start, end uint64) and any important behavior or side-effects (it delegates to pd.base.resetInFlightRange to reset the in-flight data range). Place the comment immediately above the func declaration for ResetInFlightDataRange so it satisfies Go exported identifier documentation guidelines.

block/internal/cache/pending_headers.go (1)
83-85: Document the new exported reset helper.
`ResetInFlightHeaderRange` is public, but it has no doc comment. That violates the repo's Go guideline for exported identifiers and will likely fail linting.

💡 Suggested fix

+// ResetInFlightHeaderRange clears the in-flight header claim state for heights in [start, end].
 func (ph *PendingHeaders) ResetInFlightHeaderRange(start, end uint64) {

As per coding guidelines, document exported types and functions in Go code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@block/internal/cache/pending_headers.go` around lines 83 - 85, Add a Go doc comment for the exported method PendingHeaders.ResetInFlightHeaderRange describing its purpose and behavior: explain that it resets the in-flight header range tracked by the PendingHeaders instance between the inclusive start and end uint64 indices, mention any side effects (it delegates to ph.base.resetInFlightRange), and document the parameters (start, end) and any expectations (e.g., inclusive bounds or preconditions). Place the comment immediately above the ResetInFlightHeaderRange method.block/internal/submitting/submitter_test.go (1)
424-440: Exercise the new callbacks in the fake submitter.

This test double now accepts `onSubmitSuccess`/`onSubmitError`, but it still discards them. That means the loop test only verifies enqueueing, not the success/error lifecycle that now updates timestamps and resets in-flight cache state.

Consider invoking the callbacks when non-nil, or add a focused test that covers that contract.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@block/internal/submitting/submitter_test.go` around lines 424 - 440, The fakeDASubmitter currently drops the provided callbacks in SubmitHeaders and SubmitData; update these methods (SubmitHeaders and SubmitData on fakeDASubmitter) to call the supplied on-success and on-error callbacks when they are non-nil so the test exercises the full success/error lifecycle (e.g., invoke the success callback when you want the fake to simulate success, or invoke the error callback with a test error to simulate failure), while preserving the existing signaling to chHdr/chData; alternatively add a focused test that uses a fake submitter which invokes those callbacks to assert timestamps and in-flight cache resets.
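A hedged sketch of what a callback-invoking test double might look like. fakeSubmitter and its SubmitHeaders signature are simplified assumptions for illustration, not the repo's actual interface:

```go
package main

import (
	"errors"
	"fmt"
)

// fakeSubmitter is a test double for an async DA submitter.
// failNext makes the next call report failure through onError.
type fakeSubmitter struct {
	failNext bool
}

// SubmitHeaders mimics the async contract synchronously: it invokes the
// supplied callbacks (when non-nil) so tests exercise the full lifecycle.
func (f *fakeSubmitter) SubmitHeaders(batch []string, onSuccess func(int), onError func(error)) {
	if f.failNext {
		f.failNext = false
		if onError != nil {
			onError(errors.New("simulated DA failure"))
		}
		return
	}
	if onSuccess != nil {
		onSuccess(len(batch))
	}
}

func main() {
	f := &fakeSubmitter{failNext: true}
	var submitted int
	var lastErr error
	report := func(n int) { submitted = n }
	record := func(err error) { lastErr = err }

	f.SubmitHeaders([]string{"h1", "h2"}, report, record)
	fmt.Println(submitted, lastErr != nil) // 0 true

	f.SubmitHeaders([]string{"h1", "h2"}, report, record)
	fmt.Println(submitted) // 2
}
```

Invoking the callbacks synchronously keeps the test deterministic while still covering the reset-on-error contract.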
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@block/internal/submitting/da_submitter_integration_test.go`:
- Around line 101-110: The test currently calls daSubmitter.Close() only at the
end, risking resource leaks if earlier assertions fail; after creating the async
submitter (the daSubmitter variable), ensure cleanup is guaranteed by
registering a deferred close or t.Cleanup call—e.g., immediately after
daSubmitter is constructed call defer daSubmitter.Close() or t.Cleanup(func(){
daSubmitter.Close() }) so the Close() method on daSubmitter always runs even if
the test fails early.
In `@block/internal/submitting/da_submitter_test.go`:
- Around line 216-218: The test currently calls submitter.Close() after
assertions which can leak the submitter's async worker if an assertion fails;
change the teardown to run immediately after setup by invoking defer
submitter.Close() (or t.Cleanup(func(){ submitter.Close() })) right after the
submitter is created so Close() always runs even on test failures — update the
tests that call submitter.SubmitHeaders(...) and later submitter.Close() (e.g.,
the cases around SubmitHeaders and the other similar test) to use deferred
cleanup instead.
In `@block/internal/submitting/da_submitter.go`:
- Around line 388-398: The datalayer success branch uses res.SubmittedCount
directly which can be 0 or >len(marshaled) and cause infinite loops or panics;
in the datypes.StatusSuccess case (around res.SubmittedCount handling) validate
that submitted := int(res.SubmittedCount) is >0 and <= len(marshaled) before
calling onSuccess or advancing the window (marshaled = marshaled[submitted:]);
if submitted==0 treat as a reject/error (update rs.Attempt or return/log and do
not spin) and if submitted>len(marshaled) treat as malformed input (log/error
and reject) so only a validated count is passed to onSuccess and used to slice
marshaled.
In `@block/internal/submitting/submitter.go`:
- Around line 236-250: The code enqueues a batch as in-flight via
GetPendingHeaders/GetPendingData but if s.daSubmitter.SubmitHeaders or
SubmitData returns an immediate error the in-flight claim is never released;
update the error path in submitter.go around s.daSubmitter.SubmitHeaders and the
analogous SubmitData call so that before logging or returning on synchronous
error you call s.cache.ResetInFlightHeaderRange(headers[0].Height(),
headers[len(headers)-1].Height()) (and for data use the corresponding
ResetInFlightDataRange with the first/last data heights), then proceed to
log/handle the error (including the existing ErrOversizedItem handling) so the
claimed heights are retried.
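The release-on-synchronous-error pattern from the last inline comment can be sketched as a self-contained program; cache, claim, reset, and submit are hypothetical stand-ins for the PR's actual APIs:

```go
package main

import (
	"errors"
	"fmt"
)

// cache is a hypothetical stand-in for the pending-range cache.
type cache struct {
	inFlight map[[2]uint64]bool
}

func (c *cache) claim(start, end uint64) { c.inFlight[[2]uint64{start, end}] = true }
func (c *cache) reset(start, end uint64) { delete(c.inFlight, [2]uint64{start, end}) }

// submit simulates a submitter whose call can fail synchronously,
// before any goroutine is spawned to run the callbacks.
func submit(failNow bool) error {
	if failNow {
		return errors.New("synchronous enqueue failure")
	}
	return nil
}

// trySubmit claims a range, then releases the claim on a synchronous
// error so the heights are retried instead of staying stuck in flight.
func trySubmit(c *cache, start, end uint64, failNow bool) error {
	c.claim(start, end)
	if err := submit(failNow); err != nil {
		c.reset(start, end) // release the claim before returning
		return err
	}
	return nil
}

func main() {
	c := &cache{inFlight: map[[2]uint64]bool{}}
	_ = trySubmit(c, 1, 5, true)
	fmt.Println(len(c.inFlight)) // 0: claim released on sync error
}
```

Without the reset, a synchronous failure would leave [start, end] claimed forever and those heights would never be resubmitted.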
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 996c9001-1856-439f-b61e-aa2ea4669b4e
📒 Files selected for processing (16)
- block/internal/cache/manager.go
- block/internal/cache/manager_test.go
- block/internal/cache/pending_base.go
- block/internal/cache/pending_base_test.go
- block/internal/cache/pending_data.go
- block/internal/cache/pending_data_test.go
- block/internal/cache/pending_headers.go
- block/internal/cache/pending_headers_test.go
- block/internal/submitting/da_submitter.go
- block/internal/submitting/da_submitter_integration_test.go
- block/internal/submitting/da_submitter_mocks_test.go
- block/internal/submitting/da_submitter_test.go
- block/internal/submitting/da_submitter_tracing.go
- block/internal/submitting/da_submitter_tracing_test.go
- block/internal/submitting/submitter.go
- block/internal/submitting/submitter_test.go
♻️ Duplicate comments (1)
block/internal/submitting/da_submitter.go (1)
384-395: ⚠️ Potential issue | 🔴 Critical

Validate `SubmittedCount` before using it.

`res.SubmittedCount` from the DA client is used without validation. If `submitted == 0`, the loop continues with unchanged `marshaled` and reset backoff, potentially causing an infinite loop. If `submitted > len(marshaled)`, the slice operation at line 394 will panic.

🛡️ Proposed fix to validate SubmittedCount

 case datypes.StatusSuccess:
 	submitted := int(res.SubmittedCount)
+	if submitted <= 0 || submitted > len(marshaled) {
+		s.recordFailure(common.DASubmitterFailureReasonUnknown)
+		err := fmt.Errorf("invalid submitted count %d for batch size %d", submitted, len(marshaled))
+		s.logger.Error().Err(err).Str("itemType", itemType).Msg("DA layer returned invalid submitted count")
+		if onError != nil {
+			onError(err)
+		}
+		return
+	}
 	if onSuccess != nil {
 		onSuccess(submitted, res.Height)
 	}

As per coding guidelines: "Validate all inputs from external sources in Go code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@block/internal/submitting/da_submitter.go` around lines 384 - 395, The code uses res.SubmittedCount directly which can be 0 or >len(marshaled); validate it before slicing and advancing. In the datypes.StatusSuccess branch (symbols: res.SubmittedCount, marshaled, onSuccess, rs.Next, reasonSuccess, pol) ensure submitted := int(res.SubmittedCount) is checked: if submitted <= 0 treat as no progress (do not reset backoff — call rs.Fail or return to avoid infinite loop), if submitted > len(marshaled) cap it to len(marshaled) before calling onSuccess and slicing; only advance marshaled and call rs.Next when a positive, bounded submitted value was applied.
🧹 Nitpick comments (2)
block/internal/submitting/submitter.go (1)
249-260: Minor: Avoid logging when `err` is nil.

When `onError` is called with `nil` (on context cancellation or an empty batch from `submitWithRetry`), this logs an error with no actual error. Consider guarding the log statement.

♻️ Proposed improvement

 onError := func(err error) {
 	if len(headers) > 0 {
 		s.cache.ResetInFlightHeaderRange(headers[0].Height(), headers[len(headers)-1].Height())
 	}
 	if errors.Is(err, common.ErrOversizedItem) {
 		s.logger.Error().Err(err).
 			Msg("CRITICAL: Header exceeds DA blob size limit - halting to prevent live lock")
 		s.sendCriticalError(fmt.Errorf("unrecoverable DA submission error: %w", err))
 		return
 	}
-	s.logger.Error().Err(err).Msg("failed to submit headers")
+	if err != nil {
+		s.logger.Error().Err(err).Msg("failed to submit headers")
+	}
 }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@block/internal/submitting/submitter.go` around lines 249 - 260: the onError closure should not log when err is nil: after calling s.cache.ResetInFlightHeaderRange(headers[0].Height(), headers[len(headers)-1].Height()) (if headers present), add an early guard if err == nil { return } so you skip the error handling below; keep the oversized-item check (errors.Is(err, common.ErrOversizedItem)), s.logger.Error().Err(err).Msg("failed to submit headers"), and s.sendCriticalError(...) as-is for non-nil errors.

block/internal/submitting/da_submitter.go (1)
326-331: Consider documenting the `onError(nil)` contract.

Calling `onError(nil)` for empty batches and context cancellations signals completion without error, but the callback name suggests an error occurred. This works correctly with the caller's `errors.Is(err, common.ErrOversizedItem)` check, but the semantics could be clearer.

Consider renaming to `onComplete(error)`, or document that `nil` indicates graceful termination without retry failure.
Verify each finding against the current code and only fix it if needed. In `@block/internal/submitting/da_submitter.go` around lines 326 - 331, The callback parameter onError in da_submitter.go is used to signal both actual errors and graceful completion by calling onError(nil) for empty batches/cancellations; update the code to make this contract explicit by either renaming the parameter from onError to onComplete (and update all call sites and related tests) or by adding a clear doc comment on the onError parameter/signature (and any public type that exposes it) stating that a nil argument indicates graceful termination/no-retry, while a non-nil error indicates a retryable/failure condition; ensure you update comments and any place that inspects the value (e.g., callers that use errors.Is(err, common.ErrOversizedItem)) to reflect the new name or documented behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bbad2e4d-1de5-4835-bdcd-e5d352bbf063
📒 Files selected for processing (3)
- block/internal/cache/pending_base.go
- block/internal/submitting/da_submitter.go
- block/internal/submitting/submitter.go
✅ Files skipped from review due to trivial changes (1)
- block/internal/cache/pending_base.go
Force-pushed from 730acad to 9669b26
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
block/internal/submitting/submitter.go (1)
153-168: ⚠️ Potential issue | 🟠 Major

Wait for submitter workers to stop producing before closing the async DA submitter.

`Stop()` calls `s.daSubmitter.Close()` before waiting on `s.wg`, but the goroutines tracked by `s.wg` can still reach `SubmitHeaders`/`SubmitData` after `Close()` has started. That lets new DA goroutines escape the wait and survive shutdown. Reorder this so cancellation stops the producers first, then wait for `s.wg`, then `Close()` the DA submitter.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@block/internal/submitting/submitter.go` around lines 153 - 168: in Submitter.Stop(), currently s.daSubmitter.Close() is called before waiting for s.wg which allows goroutines to call SubmitHeaders/SubmitData after Close() starts; change the order so you first call s.cancel() (if non-nil) to stop producers, then wait for s.wg (with the existing timeout/done pattern), and only after the wait completes or times out call s.daSubmitter.Close(); ensure Submitter.Stop retains the timeout warning path and uses the same s.wg, s.cancel, and s.daSubmitter.Close() symbols.

block/internal/submitting/da_submitter_tracing.go (1)
33-61: ⚠️ Potential issue | 🟠 Major

These spans no longer cover the actual DA submission.

The wrapped calls now return after enqueueing work, so `defer span.End()` records only enqueue latency. It also never sees async failures, and the current `onSubmitSuccess` callback is not a terminal signal because it can fire on partial successes. If you want end-to-end submission traces, this wrapper needs a real completion callback (or separate enqueue-vs-submit spans) instead of ending the span here.

Also applies to: 64-97
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@block/internal/submitting/da_submitter_tracing.go` around lines 33 - 61, The current tracedDASubmitter.SubmitHeaders creates a span and defers span.End(), but the inner SubmitHeaders only enqueues work so the span ends too early and misses async failures; change the wrapper to either (a) create two spans (an immediate "enqueue" span that ends on return and a separate "submission" span that is ended by a real completion callback), or (b) wrap the provided onSubmitSuccess and onSubmitError with a new finalizing callback that records errors, sets span status, and ends the span when the overall submission is truly finished; apply the same pattern to the other traced methods mentioned (lines 64-97) so spans cover end-to-end submission rather than just enqueue latency, and reference tracedDASubmitter.SubmitHeaders, the inner SubmitHeaders call, and the onSubmitSuccess/onSubmitError callbacks when implementing the change.
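The wrapper pattern this comment describes can be sketched with a hypothetical span type standing in for the real OpenTelemetry trace.Span: the span is ended from the completion callback, not from the enqueueing call.

```go
package main

import "fmt"

// span is a hypothetical stand-in for a tracing span; real code would
// use an OpenTelemetry trace.Span with End/RecordError/SetStatus.
type span struct {
	ended bool
	err   error
}

func (s *span) RecordError(err error) { s.err = err }
func (s *span) End()                  { s.ended = true }

// submitAsync enqueues work and reports completion via callbacks.
// Here it completes immediately to keep the sketch deterministic.
func submitAsync(onSuccess func(), onError func(error)) {
	onSuccess()
}

// tracedSubmit wraps the callbacks so the span ends on actual
// completion, not when the enqueue call returns.
func tracedSubmit(sp *span, onSuccess func(), onError func(error)) {
	wrappedSuccess := func() {
		onSuccess()
		sp.End()
	}
	wrappedError := func(err error) {
		sp.RecordError(err)
		onError(err)
		sp.End()
	}
	submitAsync(wrappedSuccess, wrappedError)
	// note: no "defer sp.End()" here; the span outlives this function
}

func main() {
	sp := &span{}
	tracedSubmit(sp, func() {}, func(error) {})
	fmt.Println(sp.ended) // true
}
```

With partial successes, the comment's caveat applies: the success callback alone is not terminal, so a dedicated completion signal (or separate enqueue and submit spans) would be needed in the real implementation.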
♻️ Duplicate comments (2)
block/internal/submitting/da_submitter.go (1)
383-395: ⚠️ Potential issue | 🔴 Critical

Validate `SubmittedCount` before using it.

`SubmittedCount` comes from the DA client, but this branch uses it to drive both callbacks and slicing without checking bounds. `0` falsely reports success without advancing the window, and a value greater than `len(marshaled)` will panic here and in the post-submit callbacks.

Suggested fix

 case datypes.StatusSuccess:
 	submitted := int(res.SubmittedCount)
+	if submitted <= 0 || submitted > len(marshaled) {
+		err := fmt.Errorf("invalid submitted count %d for batch size %d", submitted, len(marshaled))
+		s.recordFailure(common.DASubmitterFailureReasonUnknown)
+		s.logger.Error().Err(err).Str("itemType", itemType).Msg("DA layer returned invalid submitted count")
+		if onError != nil {
+			onError(err)
+		}
+		return
+	}
 	if onSuccess != nil {
 		onSuccess(submitted, res.Height)
 	}

As per coding guidelines "Validate all inputs from external sources in Go code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@block/internal/submitting/da_submitter.go` around lines 383 - 395, Validate res.SubmittedCount from the DA client before using it: ensure it's within [0, len(marshaled)] and handle the zero-case by advancing the window (call rs.Next(reasonSuccess, pol)) instead of treating it as full success; compute a safe int value (e.g., capSubmitted := max(0, min(int(res.SubmittedCount), len(marshaled)))) then call onSuccess(capSubmitted, res.Height) and log using capSubmitted, slice marshaled = marshaled[capSubmitted:] only when capSubmitted > 0, and log or error when res.SubmittedCount is out-of-range to avoid panics in this block that contains res.SubmittedCount, onSuccess, marshaled, rs.Next, reasonSuccess, pol, and s.logger.Info().block/internal/submitting/da_submitter_test.go (1)
216-218: ⚠️ Potential issue | 🟡 Minor

Register `Close()` with `t.Cleanup` right after setup.

If one of the assertions fails before these lines, the async worker is never joined and can bleed into later tests. Move teardown next to `setupDASubmitterTest`.

Suggested fix

 submitter, st, cm, mockDA, gen := setupDASubmitterTest(t)
+t.Cleanup(submitter.Close)
 ...
-submitter.Close()

Also applies to: 331-333
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@block/internal/submitting/da_submitter_test.go` around lines 216 - 218, Register the submitter.Close cleanup immediately after setupDASubmitterTest returns the submitter so the async worker is always joined even if an assertion fails; i.e., right after obtaining the submitter variable in the test, call t.Cleanup(func(){ submitter.Close() }) (and remove or leave any later explicit submitter.Close() calls as desired) to ensure teardown; apply the same change to the other test case that creates a submitter (the block that currently calls submitter.Close() later).
🧹 Nitpick comments (1)
block/internal/cache/pending_data.go (1)
84-95: Add doc comments for the new exported `PendingData` methods.

`NumPendingDataTotal` and `ResetInFlightDataRange` are exported additions, so they should be documented like the other public cache APIs.

As per coding guidelines "Document exported types and functions in Go code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@block/internal/cache/pending_data.go` around lines 84 - 95, Add Go doc comments for the exported PendingData methods: NumPendingDataTotal, SetLastSubmittedDataHeight, and ResetInFlightDataRange. For NumPendingDataTotal and ResetInFlightDataRange add brief one-line comments describing what they return/affect (e.g., number of pending data entries and that ResetInFlightDataRange clears the in-flight range between start and end), and for SetLastSubmittedDataHeight document the parameter and its effect on PendingData state; follow the style and phrasing used by other public cache APIs in the package.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@block/internal/submitting/da_submitter.go`:
- Around line 227-235: The success callback passed into submitWithRetry (in the
s.wg.Go closures that call postSubmit and onSubmitSuccess) incorrectly always
slices from the start (headers[:submittedCount] /
signedDataList[:submittedCount]) and thus re-processes already-handled items on
partial successes; fix this by introducing and capturing a local cumulative
offset (e.g., submittedOffset) alongside the closure, have the callback use
headers[submittedOffset:submittedOffset+submittedCount] (and signedDataList
likewise), call postSubmit with that slice, then increment submittedOffset +=
submittedCount inside the callback so subsequent partial successes advance
correctly; apply the same change to the other closure that handles
signedDataList (the block around submitWithRetry at the second occurrence).
---
Outside diff comments:
In `@block/internal/submitting/da_submitter_tracing.go`:
- Around line 33-61: The current tracedDASubmitter.SubmitHeaders creates a span
and defers span.End(), but the inner SubmitHeaders only enqueues work so the
span ends too early and misses async failures; change the wrapper to either (a)
create two spans (an immediate "enqueue" span that ends on return and a separate
"submission" span that is ended by a real completion callback), or (b) wrap the
provided onSubmitSuccess and onSubmitError with a new finalizing callback that
records errors, sets span status, and ends the span when the overall submission
is truly finished; apply the same pattern to the other traced methods mentioned
(lines 64-97) so spans cover end-to-end submission rather than just enqueue
latency, and reference tracedDASubmitter.SubmitHeaders, the inner SubmitHeaders
call, and the onSubmitSuccess/onSubmitError callbacks when implementing the
change.
In `@block/internal/submitting/submitter.go`:
- Around line 153-168: In Submitter.Stop(), currently s.daSubmitter.Close() is
called before waiting for s.wg which allows goroutines to call
SubmitHeaders/SubmitData after Close() starts; change the order so you first
call s.cancel() (if non-nil) to stop producers, then wait for s.wg (with the
existing timeout/done pattern), and only after the wait completes or times out
call s.daSubmitter.Close(); ensure Submitter.Stop retains the timeout warning
path and uses the same s.wg, s.cancel, and s.daSubmitter.Close() symbols.
---
Duplicate comments:
In `@block/internal/submitting/da_submitter_test.go`:
- Around line 216-218: Register the submitter.Close cleanup immediately after
setupDASubmitterTest returns the submitter so the async worker is always joined
even if an assertion fails; i.e., right after obtaining the submitter variable
in the test, call t.Cleanup(func(){ submitter.Close() }) (and remove or leave
any later explicit submitter.Close() calls as desired) to ensure teardown; apply
the same change to the other test case that creates a submitter (the block that
currently calls submitter.Close() later).
In `@block/internal/submitting/da_submitter.go`:
- Around line 383-395: Validate res.SubmittedCount from the DA client before
using it: ensure it's within [0, len(marshaled)] and handle the zero-case by
advancing the window (call rs.Next(reasonSuccess, pol)) instead of treating it
as full success; compute a safe int value (e.g., capSubmitted := max(0,
min(int(res.SubmittedCount), len(marshaled)))) then call onSuccess(capSubmitted,
res.Height) and log using capSubmitted, slice marshaled =
marshaled[capSubmitted:] only when capSubmitted > 0, and log or error when
res.SubmittedCount is out-of-range to avoid panics in this block that contains
res.SubmittedCount, onSuccess, marshaled, rs.Next, reasonSuccess, pol, and
s.logger.Info().
---
Nitpick comments:
In `@block/internal/cache/pending_data.go`:
- Around line 84-95: Add Go doc comments for the exported PendingData methods:
NumPendingDataTotal, SetLastSubmittedDataHeight, and ResetInFlightDataRange. For
NumPendingDataTotal and ResetInFlightDataRange add brief one-line comments
describing what they return/affect (e.g., number of pending data entries and
that ResetInFlightDataRange clears the in-flight range between start and end),
and for SetLastSubmittedDataHeight document the parameter and its effect on
PendingData state; follow the style and phrasing used by other public cache APIs
in the package.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 75185388-8242-4eb1-8f32-3f7b1a2c29f7
⛔ Files ignored due to path filters (1)
`execution/grpc/go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (17)
- block/internal/cache/manager.go
- block/internal/cache/manager_test.go
- block/internal/cache/pending_base.go
- block/internal/cache/pending_base_test.go
- block/internal/cache/pending_data.go
- block/internal/cache/pending_data_test.go
- block/internal/cache/pending_headers.go
- block/internal/cache/pending_headers_test.go
- block/internal/executing/executor.go
- block/internal/submitting/da_submitter.go
- block/internal/submitting/da_submitter_integration_test.go
- block/internal/submitting/da_submitter_mocks_test.go
- block/internal/submitting/da_submitter_test.go
- block/internal/submitting/da_submitter_tracing.go
- block/internal/submitting/da_submitter_tracing_test.go
- block/internal/submitting/submitter.go
- block/internal/submitting/submitter_test.go
✅ Files skipped from review due to trivial changes (1)
- block/internal/cache/pending_base_test.go
🚧 Files skipped from review as they are similar to previous changes (8)
- block/internal/cache/manager_test.go
- block/internal/cache/pending_headers.go
- block/internal/submitting/da_submitter_integration_test.go
- block/internal/cache/pending_headers_test.go
- block/internal/executing/executor.go
- block/internal/submitting/da_submitter_mocks_test.go
- block/internal/cache/pending_base.go
- block/internal/submitting/da_submitter_tracing_test.go
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@block/internal/submitting/submitter.go`:
- Line 247: The time-based throttle currently updates s.lastHeaderSubmit and
s.lastDataSubmit only in the async onSuccess callbacks (used by
SubmitHeaders/SubmitData), causing ShouldSubmit to measure since DA
acknowledgement rather than enqueue/dispatch; change the code to store the
current time (e.g., call s.lastHeaderSubmit.Store(time.Now().UnixNano()) and
s.lastDataSubmit.Store(...)) at the moment you enqueue/dispatch the header/data
(right before calling SubmitHeaders/SubmitData) instead of only inside the async
onSuccess, and keep or remove the onSuccess update as needed so that submission
rate is throttled by dispatch time rather than DA response time.
- Around line 248-256: The onError callback currently calls
s.cache.ResetInFlightHeaderRange(...) before checking for
common.ErrOversizedItem, which re-queues an unrecoverable batch and allows
daSubmissionLoop to retry it; change the order so the oversized-item branch is
handled first: if errors.Is(err, common.ErrOversizedItem) then call
s.sendCriticalError(...) and return without resetting the in-flight range. Apply
the same ordering to the data path (the ResetInFlightDataRange call in the
analogous onError). Additionally, ensure sendCriticalError integrates with the
component's context cancellation (use context.Context cancellation instead of
relying on sendCriticalError alone) so daSubmissionLoop stops retrying after the
critical error.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0c5f260e-f32f-47b5-8bd1-96b5ddf139a5
📒 Files selected for processing (2)
- block/internal/submitting/da_submitter.go
- block/internal/submitting/submitter.go
✅ Files skipped from review due to trivial changes (1)
- block/internal/submitting/da_submitter.go
* fix(tools/talis): wait-for-chain + atomic keyring + one-command driver
Three race conditions surfaced repeatedly on a fresh AWS bring-up of
the Fibre throughput experiment. Each one had the same shape: a
talis subcommand "succeeded" at the CLI level (or returned the txhash
with --yes) before the chain had actually applied the work, leaving
downstream steps to fail in confusing ways. This commit makes each
step verify *outcome*, not just *invocation*, so the experiment can
go from a fresh `talis up` to a running loadgen without manual
intervention.
• setup-fibre script (fibre_setup.go) now:
- polls `celestia-appd status` for `latest_block_height>0`
before submitting any tx — fixes the silent-noop where
set-host + 100× deposit-to-escrow all bounced with
"celestia-app is not ready; please wait for first block";
- retries `set-host` in a loop until the validator's host
shows up in `query valaddr providers` — fixes the case
where --yes returns the txhash before block inclusion and
the tx silently lands in the mempool but never confirms;
- verifies fibre-0's escrow account is funded on-chain before
the tmux session exits — same silent-failure mode as
set-host, but on the deposit side.
The talis-CLI step also now cross-checks all validators are
registered from a single vantage point before returning, so a
concurrent set-host race surfaces as an error instead of a
half-empty provider list start-fibre would cache forever.
• fibre-bootstrap-evnode (fibre_bootstrap_evnode.go) now stages
the keyring scp into a tmp directory and `mv`s it atomically
into place. The previous direct `scp -r` to
/root/keyring-fibre/keyring-test created the directory before
transferring its contents — the evnode init script's
`[ -d keyring-test ]` poll passed mid-transfer, the daemon
launched with no fibre-0.info, and crashed with `keyring entry
"fibre-0" not found`.
• evnode_init.sh (genesis.go) now waits for the specific
keyring-test/fibre-0.info file rather than just the
keyring-test directory. Belt-and-braces: the bootstrap mv is
already atomic on the same filesystem, but the file-level
guard means a hand-pushed keyring (not via talis) can't trip
the same race.
• New `talis fibre-experiment` umbrella command runs
up → genesis → deploy → setup-fibre → start-fibre →
fibre-bootstrap-evnode in order. Each step uses the same
binary as a subprocess; failures in any step abort the chain.
Operator goes from a prepared root dir to a running loadgen
with one command, instead of remembering the sequence.
Verified by 5-min sustained loadgen against julien/fiber HEAD with
PR #3287 (concurrent submitter) merged: 47.65 MB/s @ 99.999 % ok,
up from the prior 24.57 MB/s baseline (the gap is PR #3287's
overlapping uploads — these talis fixes just stop the deploy from
silently breaking before throughput matters).
* fix(tools/talis): finalize fibre setup race fixes
Three follow-up bugs surfaced from the PR #3303 follow-up
verification run on a 3-validator AWS Fibre cluster:
- aws.go: CreateAWSInstances exited 0 even when individual
instance launches failed, so `talis up` lied about success
and downstream steps proceeded against a partial cluster.
Returns a joined error now so failure cascades stop early.
- download.go: sshExec used cmd.CombinedOutput, mixing SSH
warnings (the "Warning: Permanently added '...'..." chatter
on stderr) into bytes the caller hands to fmt.Sscanf("%d").
The CLI-side providers cross-check parsed those warnings
as 0 and looped until its 5-min deadline even though a
direct SSH query showed all 3 providers registered. Switch
to cmd.Output() (stdout only) and add `-q -o LogLevel=ERROR`
to silence the chatter for any caller that does combine
streams.
- fibre_setup.go: the per-validator escrow verification used
`celestia-appd query fibre escrow` which doesn't exist —
the actual subcommand is `escrow-account`. The query
errored on every retry, the grep for "amount" never
matched, and the script wedged on the 3-min deadline
reporting `FATAL: fibre-0 escrow not present`. Switch to
`escrow-account` and key on `"found":true` (the explicit
existence flag in the response). Also wrap the fibre-0
deposit-to-escrow itself in a retry loop matching set-host
— same `--yes`-returns-before-inclusion silent-failure
mode bit it. fibre-1..N stay best-effort.
* feat(evnode-txsim): keep-alive conn pool + pprof endpoint
Two diagnostic improvements for the load generator:
1. http.Transport.MaxIdleConnsPerHost defaults to 2 in stdlib.
With --concurrency=8 (or higher), 6+ goroutines per cycle had
to open fresh TCP+TLS sockets per request because the pool
couldn't hold their idle conns between requests. Bump
MaxIdleConns / MaxIdleConnsPerHost / MaxConnsPerHost to
2*concurrency so every active sender has a reusable keep-alive
socket, eliminating handshake churn from the hot path.
2. Always-on net/http/pprof on 127.0.0.1:6060. evnode-txsim is a
load tester, not a production daemon, so cost of always serving
profiling is acceptable; the payoff is being able to grab CPU
profiles under live load without re-deploying the binary —
`ssh -L 6060:127.0.0.1:6060 root@loadgen \
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.
A profile captured this way under c=8 traced the per-request hot
path: 25.5% in kernel write(2), 25% in net/http body marshaling.
That diagnostic surfaced that the c6in.2xlarge loadgen was the
binding constraint for the experiment at ~22 MB/s, not evnode or
DA — a finding we'd have spent another debug round chasing
without the in-process profiler.
Overview
Attempt to improve the submitter by doing concurrent sends: instead of waiting for each DA response, we submit the next batch immediately.
Useful when the throughput of blobs needs to be high.
Related to #3244: Fibre takes time to return, so concurrent submission is necessary there. This PR is mainly to investigate whether we can generalize that improvement to mainline evnode.