[DNM] feat(da): support fiber (not via c-node) #3244
julienrbrt wants to merge 66 commits into main from julien/fiber
Conversation

CodeRabbit: Review skipped (draft detected). Configuration used: defaults. Review profile: CHILL. Plan: Pro.
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
Claude finished @julienrbrt's task in 3m 40s. View job

## Code Review: [DNM] feat(da): support fiber (not via c-node)

### Summary

This is a substantial PoC introducing Fiber (Fibre) as a direct DA backend, bypassing celestia-node/apex. The architecture is sound. Several issues need attention before promotion beyond PoC, from a potential production panic down to typos.

### 🔴 Critical Issues

**1. Upload ignores the caller's context**

```go
result, err := c.fiber.Upload(context.Background(), nsID, flat)
```

should use the incoming context instead:

```go
result, err := c.fiber.Upload(ctx, nsID, flat)
```

**2.** The comment itself says "non-fiber callers should override via ldflag or local config" — that's not a safe assumption for a library.

**3.**

```go
if err := c.DA.Fiber.Validate(); err != nil {
	return fmt.Errorf("fiber DA config: %w", err)
}
```

**4. Subscriber removal index bug.** Example: 3 subscribers [A(0), B(1), C(2)]. A cancels → C moves to 0, but C's goroutine still holds its old index. A safe implementation stores the subscriber in a map keyed by a unique ID.

### 🟠 High Priority Issues

**5. Namespace slice panics on short namespaces**

```go
nsID := namespace[len(namespace)-10:]
```

This panics if `len(namespace) < 10`. Guard it first:

```go
if len(namespace) < 10 {
	return datypes.ResultSubmit{BaseResult: datypes.BaseResult{Code: datypes.StatusError, Message: "namespace too short"}}
}
```

**6.**

```go
base, err := da.NewFiberClient(da.FiberConfig{...})
if err != nil {
	panic(err)
}
```

Return the error to the caller instead of panicking:

```go
func NewFiberDAClient(...) (FullDAClient, error)
```

**7. Dangling comment for non-existent field**

```go
// UploadConcurrency limits the number of concurrent upload connections
```

### 🟡 Medium Priority Issues

**8.** `Height: 0, /* TODO */`

**10.**

```go
if c.KeyringPath == "" {
	return fmt.Errorf("%s is required when fiber DA is enabled", FlagDAFiberKeyName)
}
```

**12. Extra store instantiation in main**

```go
mainKV := store.NewEvNodeKVStore(datastore)
baseStore := store.New(mainKV)
// ... read latestState ...
// baseStore is then discarded
```

A throwaway store is created just to read `latestState`.

### 🔵 Low / Style Issues

**13. Typo: `heigth` should be `height`**

```go
heigth, err := c.fiber.Head(headCtx)
```

**15. Dropped events**

```go
default:
	// Channel full, drop event. Subscriber is too slow.
```

In tests, dropped events lead to flaky failures that are hard to debug. Consider incrementing a counter.

**17.** Security: unpinned Action in CI workflow.

### Positive Aspects
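For illustration, here is a minimal sketch of the map-keyed registry suggested in issue 4. Every identifier below is hypothetical and not the PR's actual code; it only shows the keyed-removal idea (no index shifting, so a cancel func can never remove the wrong subscriber).

```go
package fibermock // illustrative package name only

import "sync"

// BlobEvent is a stand-in for whatever event type the real code broadcasts.
type BlobEvent struct {
	Height uint64
	Data   []byte
}

type subscribers struct {
	mu   sync.Mutex
	next uint64
	subs map[uint64]chan BlobEvent
}

// add registers a subscriber channel and returns a cancel func that removes
// exactly this subscriber, regardless of later additions or removals.
func (s *subscribers) add(ch chan BlobEvent) (cancel func()) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.subs == nil {
		s.subs = make(map[uint64]chan BlobEvent)
	}
	id := s.next
	s.next++
	s.subs[id] = ch
	return func() {
		s.mu.Lock()
		defer s.mu.Unlock()
		delete(s.subs, id) // keyed removal: no index shifting, no stale slots
	}
}

// broadcast fans an event out to every live subscriber without blocking.
func (s *subscribers) broadcast(ev BlobEvent) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, ch := range s.subs {
		select {
		case ch <- ev:
		default: // drop if the subscriber is slow; see issue 15
		}
	}
}
```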
Codecov Report

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #3244      +/-   ##
==========================================
+ Coverage   62.33%   63.16%    +0.82%
==========================================
  Files         122      124        +2
  Lines       12873    13258      +385
==========================================
+ Hits         8024     8374      +350
- Misses       3968     3995       +27
- Partials      881      889        +8
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Adds a fibremock package with:

- DA interface (Upload/Download/Listen) matching the fibre gRPC service
- In-memory MockDA implementation with LRU eviction and configurable retention
- Tests covering all paths

Migrated from celestiaorg/x402-risotto#16 as-is for integration.
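To make the shape of that interface concrete, here is a hedged sketch. All names, fields, and signatures are assumptions for illustration; the real definitions live in block/internal/da/fiber/types.go and the fibremock package, and the fromHeight parameter is the one a later commit in this PR threads through Listen.

```go
// Hypothetical sketch of an Upload/Download/Listen DA surface of the kind
// described above. Not the package's actual code.
package fiber

import "context"

// BlobID identifies a blob stored with a Fibre storage provider.
type BlobID []byte

// BlobEvent is emitted by Listen when a blob settles on-chain.
type BlobEvent struct {
	ID       BlobID
	Height   uint64 // settlement height
	DataSize uint64 // original payload length
}

// DA is the minimal surface a Fibre backend (real or mock) would provide.
type DA interface {
	Upload(ctx context.Context, namespace []byte, data []byte) (BlobID, error)
	Download(ctx context.Context, id BlobID) ([]byte, error)
	// Listen streams settlement events; fromHeight == 0 follows the tip,
	// fromHeight > 0 replays matching blobs from that height forward.
	Listen(ctx context.Context, fromHeight uint64) (<-chan BlobEvent, error)
}
```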
Adds tools/celestia-node-fiber, a new Go sub-module that implements the ev-node fiber.DA interface by delegating Upload, Download and Listen to a celestia-node api/client.Client. Upload and Download run locally against a Celestia consensus node (gRPC) and Fibre Storage Providers (Fibre gRPC) — no bridge-node hop — using celestia-node's self-sufficient client (celestiaorg/celestia-node#4961). Listen subscribes to blob.Subscribe on a bridge node and forwards only share-version-2 blobs, which is how Fibre blobs settle on-chain via MsgPayForFibre. The package lives in its own go.mod, parallel to tools/local-fiber, so ev-node core does not inherit celestia-app / cosmos-sdk replace-directive soup. A FromModules constructor accepts the Fibre and Blob Module interfaces directly so callers can inject mocks or share an existing *api/client.Client. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#3280)

* test(celestia-node-fiber): showcase end-to-end Upload/Listen/Download

Adds tools/celestia-node-fiber/testing/, a single-validator in-process showcase that boots a fibre-tagged Celestia chain + in-process Fibre server + celestia-node bridge, registers the validator's FSP via valaddr (with the dns:/// URI scheme the client's gRPC resolver expects), funds an escrow account, and drives the full adapter surface. TestShowcase proves the round-trip: subscribe via Listen, Upload a blob, wait for the share-version-2 BlobEvent that lands after the async MsgPayForFibre commits, assert the BlobID from Listen matches Upload's return, Download and diff the payload bytes. The harness is intentionally single-validator — a 2-validator Docker Compose showcase is planned as a follow-up for exercising real quorum collection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(celestia-node-fiber): scale showcase to 10 blobs, document DataSize gap

Upload 10 distinct-payload blobs through adapter.Upload, collect BlobEvents via adapter.Listen until every BlobID is accounted for (order-insensitive, rejects duplicates), then round-trip each blob through adapter.Download to diff bytes. Catches routing bugs (wrong blob returned for a BlobID) and duplicate-event bugs that a single-blob test can't see.

Scaling the test also exposed a semantic issue: the v2 share carries only (fibre_blob_version + commitment), so b.DataLen() — what listen.go's fibreBlobToEvent reports today — is always 36, not the original payload length ev-node's fibermock conveys. The adapter can't derive the payload size from the subscription stream alone; surfacing it correctly needs an x/fibre PaymentPromise lookup (tracked as a TODO on fibreBlobToEvent). The test therefore asserts DataSize is non-zero rather than matching len(payload).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…3281) listen.go previously set BlobEvent.DataSize to b.DataLen(), which for a share-version-2 Fibre blob is always the fixed share-data layout (fibre_blob_version + commitment = 36 bytes) — not the original payload length. That diverges from ev-node's fibermock contract and misleads any consumer that uses DataSize to allocate buffers or report progress. The v2 share genuinely doesn't carry the original size, and x/fibre v8 has no chain query to derive it from the commitment. The only accurate path is to Download the blob and measure. Listen now does exactly that before forwarding each event. The cost is one FSP round-trip per v2 blob; can be made opt-out later if it hurts throughput-sensitive use cases. Tests: - Showcase restores the strict DataSize == len(payload) assertion across all 10 blobs. - Unit test TestListen_FiltersFibreOnlyAndEmitsEvent now stubs fakeFibre.Download to return a deterministic payload and asserts DataSize matches its length. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
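A minimal sketch of the download-and-measure approach this commit describes, reusing the hypothetical DA and BlobEvent types from the earlier sketch. It is illustrative only, not the adapter's actual listen.go.

```go
package fiber // continuing the hypothetical sketch above

import (
	"context"
	"fmt"
)

// enrichDataSize derives the true payload length by downloading the blob,
// since the v2 share only carries fibre_blob_version + commitment (36 bytes).
func enrichDataSize(ctx context.Context, backend DA, ev BlobEvent) (BlobEvent, error) {
	payload, err := backend.Download(ctx, ev.ID) // one FSP round-trip per v2 blob
	if err != nil {
		return ev, fmt.Errorf("download %x for size: %w", ev.ID, err)
	}
	ev.DataSize = uint64(len(payload))
	return ev, nil
}
```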
…ight subscriptions (#3283)

feat(celestia-node-fiber): Listen takes fromHeight for resume subscriptions

Threads a fromHeight parameter through the Fibre DA Listen path so a subscriber can rejoin the stream from a past block height without missing blobs. Consumes the matching celestia-node API change landed in celestiaorg/celestia-node#4962, which gave Blob.Subscribe a fromHeight argument backed by a WaitForHeight loop.

Changes:
- block/internal/da/fiber/types.go: DA.Listen signature now takes fromHeight uint64. fromHeight == 0 preserves "follow from tip" semantics, >0 replays from that block forward.
- block/internal/da/fibremock/mock.go: replay matching blobs with height >= fromHeight before attaching the live subscriber.
- block/internal/da/fiber_client.go: outer fiberDAClient.Subscribe does not yet expose a starting height (datypes.DA doesn't plumb one), so pass 0 and defer resume-from-height wiring to a future datypes.DA change.
- tools/celestia-node-fiber/listen.go: propagate fromHeight to client.Blob.Subscribe on the celestia-node API.
- tools/celestia-node-fiber/go.mod: bump celestia-node to the merged pseudo-version (v0.0.0-20260423143400-194cc74ce99c) carrying #4962.
- tools/celestia-node-fiber/adapter_test.go: fakeBlob.subscribeFn gets the new fromHeight arg; TestListen_FiltersFibreOnlyAndEmitsEvent asserts that fromHeight=0 is forwarded.
- tools/celestia-node-fiber/testing/showcase_test.go: existing TestShowcase passes fromHeight=0. New TestShowcaseResume uploads 3 blobs, discovers their settlement heights via a live Listen, then opens a fresh Listen with fromHeight at the first blob's height and verifies every historical blob is replayed with correct Height and DataSize.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d_yarn group across 1 directory (#3292) build(deps): Bump postcss Bumps the npm_and_yarn group with 1 update in the /docs directory: [postcss](https://github.com/postcss/postcss). Updates `postcss` from 8.5.8 to 8.5.12 - [Release notes](https://github.com/postcss/postcss/releases) - [Changelog](https://github.com/postcss/postcss/blob/main/CHANGELOG.md) - [Commits](postcss/postcss@8.5.8...8.5.12) --- updated-dependencies: - dependency-name: postcss dependency-version: 8.5.12 dependency-type: indirect dependency-group: npm_and_yarn ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Add a `changes` job using dorny/paths-filter to detect whether any non-documentation files were modified. All heavy jobs (lint, docker, test, docker-tests, proto) are gated behind this check and skipped when the PR only touches docs/** or markdown files. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: better code readability

* chore: restore yarn.lock to main

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(style): address PR review feedback

- Add `"type": "dark"` to ev-dark.json theme manifest
- Raise punctuation token contrast from #505050 to #767676 (WCAG AA)
- Align --vp-code-block-color CSS var with ev-dark default text (#dbd7ca)
- Use ThemeRegistration type instead of `as any` in config.ts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: ev-node high availability * docs: node placement * docs(ha): address PR review feedback Critical fixes: - Fix snapshot_threshold math: 5000 ÷ 10 = 500s ≈ 8.3 min (not 83s) - Fix trailing_logs math: 18000 ÷ 10 = 1800s = 30 min (not 5 min) Medium fixes: - Fix heartbeat_timeout description: it is a follower-side election trigger, not the interval at which the leader sends heartbeats - Add explicit restart instruction after Step 5 data copy in single-to-ha.md so the chain keeps producing blocks during preparation (Steps 6-8) - Replace priv_validator_key.json with signer.json in single-to-ha.md to match cluster-setup.md and the E2E tests Minor fixes: - Exclude self from raft.peers in all examples (cluster-setup.md node-1 yaml/CLI/systemd, single-to-ha.md node-1 and node-2) - Add "exclude local node" note to raft.peers description in overview.md - Fix P2P port in overview.md Interaction with P2P section (7676 → 26656) - Add text language tag to all bare fenced blocks (MD040): multiaddr example, RTT equations, and all log snippets Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(ha): absorb raft_production.md into ha/overview.md raft_production.md had no sidebar entry and its content was fully superseded by the new ha/ guides. Extract the three pieces that were unique to it — bootstrap flag docs, auto-detection startup mode explanation, and static-membership limitation note — into ha/overview.md, then delete the file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(ha): use EnvironmentFile for signer passphrase Passing --evnode.signer.passphrase inline exposes the secret in ps aux, journalctl, and shell history. - Add EnvironmentFile=/etc/ev-node/env (chmod 600) to the systemd unit in cluster-setup.md with setup instructions - Replace all inline <YOUR_PASSPHRASE> occurrences with $EV_SIGNER_PASSPHRASE sourced from /etc/ev-node/env in every evm start / evm init snippet across both guides Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(ha): explicit node-2 peers and action-based rolling restart - Replace "peers list is identical" stub in node-2 config with an explicit peers list that excludes node-2 itself, and add a note that each node must omit itself from raft.peers - Replace "Wait ~30 seconds" in rolling restart with journalctl one-liners that exit as soon as the node logs follower/leader state, giving a deterministic signal instead of an arbitrary timeout Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(ha): fix raft.peers self-inclusion startup bug The abbreviated node-2 snippet with "# peers list is identical" caused a startup failure: with raft_addr=0.0.0.0:5001 the bootstrap code's literal address comparison does not recognise node-2@10.0.0.2:5001 as self, so node-2 is appended twice and deduplicateServers returns "duplicate peers found in config". 
- Fix intro text: "only raft.node_id and raft_addr differ" → "raft.node_id is unique; raft.peers and p2p.peers must exclude self" - Expand node-2 snippet to a full evnode.yaml with the correct peers list (node-1, node-3, node-4, node-5 — no node-2) and an inline explanation of the wildcard address pitfall - Align overview.md trailing_logs example to 1 block/s (matching block_time: "1s" used throughout) and note the 10 block/s rate too Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(ha): fix passphrase flag and failover kill cardinality check Replace non-existent --evnode.signer.passphrase with the actual --evnode.signer.passphrase_file flag throughout cluster-setup and single-to-ha guides. Update passphrase setup to create a chmod 600 file at /etc/ev-node/passphrase referenced directly by the flag. Add mapfile-based cardinality check in the failover test fallback kill command to guard against killing the wrong process. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(ha): fix RPC endpoints, init ordering, and snap_count CLI flag Replace incorrect CometBFT RPC calls (port 26657/status) with the actual ev-node HTTP API (port 7331 /health/ready, /raft/node) and EVM execution layer (cast block latest) throughout both guides. Align single-to-ha Step 2 init ordering with cluster-setup: create passphrase file before evm init so the signer key is encrypted from the start, and pass --evnode.node.aggregator and passphrase_file flags. Fix Step 9a fallback kill in single-to-ha to use mapfile cardinality check, matching the pattern already applied in cluster-setup. Add --evnode.raft.snap_count=3 to the CLI start example to match the YAML config block. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Force-pushed: ee05d3c to 2c98c5d (compare).
This reverts commit a1a0861.
…nner (#3301)

Brings the celestia-app talis multi-cloud deploy tool into ev-node, plus a long-lived ev-node aggregator runner that wires the existing celestia-node-fiber adapter behind ev-node's DA client interface. Verified end-to-end on AWS — talis up → genesis → deploy → setup-fibre → start-fibre → fibre-bootstrap-evnode reaches 24.57 MB/s @ 99.7 % ok on a 60 s sustained loadgen (3 × c6in.4xlarge validators + c6in.2xlarge bridge + c6in.8xlarge ev-node + c6in.2xlarge load-gen, us-east-1).

What this adds:
• tools/talis/ — vendored from celestia-app's feat/fibre-payments. Provisions AWS / DO / GCP boxes for validators + bridge + ev-node + load-gen, deploys binaries + init scripts, drives the Fibre setup-fibre + start-fibre flow, and ships a fibre-bootstrap-evnode step that scp's the bridge JWT and Fibre payment keyring onto each ev-node before its init script starts the daemon.
• tools/celestia-node-fiber/cmd/evnode-fibre/ — the long-lived aggregator runner. Wires block.NewFiberDAClient on top of the celestia-node-fiber adapter that julien/fiber already ships, plus the in-memory executor + HTTP /tx ingress used by evnode-txsim. Distinct from the existing fiber-bench cmd.
• tools/talis/cmd/evnode-txsim/ — small Go load-gen that pumps the runner's HTTP /tx ingress for a fixed duration; deployed to load-gen boxes and prints a single TXSIM: line on completion.

Two small ev-node-side helpers the runner calls:
• block/public.go: SetMaxBlobSize(n) — overrides the per-blob byte cap so the runner can lift Celestia's 5 MiB default to Fibre's 120 MiB headroom.
• pkg/config/config.go: Config.ApplyFiberDefaults() — flips the DA config to Fibre-friendly settings (adaptive batching, 1 s DA.BlockTime, 50-deep pending-cache window) when the Fiber profile is enabled, so a runner can opt in with one call.

setup-fibre robustness fixes uncovered during the verified run:
• bash script for set-host now retries until the validator's host appears in `query valaddr providers`. The previous one-shot call relied on `--yes` returning the txhash before block inclusion; if the chain wasn't ready, the tx silently bounced. The Fibre client cached the partial set on startup and uploads cascaded to "host not found" → "voting power: collected 0".
• talis-CLI side polls `query valaddr providers` after the per-validator scripts finish and refuses to return until all validators are registered (5-minute deadline).

External dependency (documented in tools/talis/fibre.md):
• Sibling clone of celestia-app on a branch with feat/fibre-payments + sysrex/fibre_url_fix cherry-picked. Without the URL-parse fix the Fibre client rejects every host:port registration.

Tested:
- go build ./... — clean
- go test ./block/internal/submitting ./pkg/config (the two pre-existing test failures on julien/fiber — TestAddFlags and TestFiberClient_Submit_BlobTooLarge — are not introduced by this PR and reproduce on raw julien/fiber)
- End-to-end AWS deploy from this branch — 24.57 MB/s, 99.7 % ok
…log (#3307)

* feat(fibre): log per-Submit upload duration

The Fibre Submit path was opaque: failures showed up as DeadlineExceeded with no signal of how long the upload actually took, and successes only logged at debug level inside the upstream library. During load-test debugging this turned into a guessing game — was the cluster slow, the deadline too tight, or something stuck mid-RPC? Add a single info-level (warn-on-failure) log line in fiberDAClient.Submit covering the Upload call: duration, flat blob bytes, blob count. Cheap (one time.Since) and gives the operator concrete numbers — e.g. "17 blobs / 115 MiB / 1.5 s" — to reason about whether RPCTimeout, pending cap, or batch sizing is the right knob to turn next.

* fix(fibre): split DA Submit batches at Fibre's 128 MiB upload cap

Under sustained txsim load (~50 MiB/s) the DA submitter batched 10 block_data items into one Upload(), producing a flat payload of 144 MiB. Fibre's per-upload cap is hard at ~128 MiB ("blob size exceeds maximum allowed size: data size 144366912 exceeds maximum 134217723") and rejected every batched upload. With MaxPendingHeadersAndData=10 that took down 170 consecutive submissions before the node halted itself with "Data exceeds DA blob size limit".

Wrap the Upload call in a chunker that groups input blobs into ≤120 MiB chunks (8 MiB headroom under Fibre's cap for the per-blob length-prefix overhead added by flattenBlobs) and uploads each chunk separately. Aggregates submitted counts and BlobIDs across chunks; on first chunk failure, returns the error with the partially-submitted count so the submitter's retry/backoff logic sees a coherent state instead of all-or-nothing. Single oversized blobs (already validated against DefaultMaxBlobSize earlier in Submit) still land alone and fail server-side, but at least don't drag healthy peers into the same rejected batch.

* fix(evnode-fibre): cap per-block data at 100 MiB to fit a Fibre upload

Companion to the submitter chunking fix. The submitter can split a multi-blob batch into ≤120 MiB Fibre uploads, but a *single* block_data item that exceeds 128 MiB still ends up alone in its own chunk and fails server-side ("blob size exceeds maximum allowed size"). Lower the per-block cap to 100 MiB so under high-throughput txsim a single block can't grow past Fibre's hard limit, and update the comment to explain the relationship between this cap and Fibre's ~128 MiB upload reject threshold.
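A hedged sketch of such a chunker under the stated 120 MiB cap. The function name, the prefixOverhead parameter, and the surrounding types are assumptions for illustration; flattenBlobs' real overhead accounting and the submitter's actual code are not shown here.

```go
package da // illustrative package name only

const maxChunkBytes = 120 * 1024 * 1024 // headroom under Fibre's ~128 MiB cap

// chunkBlobs groups blobs into batches whose combined size (plus a per-blob
// length-prefix overhead) stays under maxChunkBytes. A single oversized blob
// still lands alone in its own chunk, mirroring the behaviour described above.
func chunkBlobs(blobs [][]byte, prefixOverhead int) [][][]byte {
	var (
		chunks  [][][]byte
		current [][]byte
		size    int
	)
	for _, b := range blobs {
		cost := len(b) + prefixOverhead
		if len(current) > 0 && size+cost > maxChunkBytes {
			chunks = append(chunks, current)
			current, size = nil, 0
		}
		current = append(current, b)
		size += cost
	}
	if len(current) > 0 {
		chunks = append(chunks, current)
	}
	return chunks
}
```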
* fix(tools/talis): wait-for-chain + atomic keyring + one-command driver
Three race conditions surfaced repeatedly on a fresh AWS bring-up of
the Fibre throughput experiment. Each one had the same shape: a
talis subcommand "succeeded" at the CLI level (or returned the txhash
with --yes) before the chain had actually applied the work, leaving
downstream steps to fail in confusing ways. This commit makes each
step verify *outcome*, not just *invocation*, so the experiment can
go from a fresh `talis up` to a running loadgen without manual
intervention.
• setup-fibre script (fibre_setup.go) now:
- polls `celestia-appd status` for `latest_block_height>0`
before submitting any tx — fixes the silent-noop where
set-host + 100× deposit-to-escrow all bounced with
"celestia-app is not ready; please wait for first block";
- retries `set-host` in a loop until the validator's host
shows up in `query valaddr providers` — fixes the case
where --yes returns the txhash before block inclusion and
the tx silently lands in the mempool but never confirms;
- verifies fibre-0's escrow account is funded on-chain before
the tmux session exits — same silent-failure mode as
set-host, but on the deposit side.
The talis-CLI step also now cross-checks all validators are
registered from a single vantage point before returning, so a
concurrent set-host race surfaces as an error instead of a
half-empty provider list start-fibre would cache forever.
• fibre-bootstrap-evnode (fibre_bootstrap_evnode.go) now stages
the keyring scp into a tmp directory and `mv`s it atomically
into place. The previous direct `scp -r` to
/root/keyring-fibre/keyring-test created the directory before
transferring its contents — the evnode init script's
`[ -d keyring-test ]` poll passed mid-transfer, the daemon
launched with no fibre-0.info, and crashed with `keyring entry
"fibre-0" not found`.
• evnode_init.sh (genesis.go) now waits for the specific
keyring-test/fibre-0.info file rather than just the
keyring-test directory. Belt-and-braces: the bootstrap mv is
already atomic on the same filesystem, but the file-level
guard means a hand-pushed keyring (not via talis) can't trip
the same race.
• New `talis fibre-experiment` umbrella command runs
up → genesis → deploy → setup-fibre → start-fibre →
fibre-bootstrap-evnode in order. Each step uses the same
binary as a subprocess; failures in any step abort the chain.
Operator goes from a prepared root dir to a running loadgen
with one command, instead of remembering the sequence.
Verified by 5-min sustained loadgen against julien/fiber HEAD with
PR #3287 (concurrent submitter) merged: 47.65 MB/s @ 99.999 % ok,
up from the prior 24.57 MB/s baseline (the gap is PR #3287's
overlapping uploads — these talis fixes just stop the deploy from
silently breaking before throughput matters).
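The "verify outcome, not just invocation" cross-check described above boils down to a poll-with-deadline loop. The sketch below is illustrative only; waitForProviders, its countProviders callback, and the 5-second poll interval are assumptions, not the actual talis code.

```go
package talis // illustrative package name only

import (
	"context"
	"fmt"
	"time"
)

// waitForProviders polls until countProviders (e.g. a wrapper around
// `query valaddr providers`) reports at least want registered providers,
// or the 5-minute deadline expires.
func waitForProviders(ctx context.Context, want int, countProviders func(context.Context) (int, error)) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		if n, err := countProviders(ctx); err == nil && n >= want {
			return nil // outcome verified, not just the tx submission
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("providers not registered before deadline: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}
```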
* fix(tools/talis): finalize fibre setup race fixes
Three follow-up bugs surfaced from the PR #3303 follow-up
verification run on a 3-validator AWS Fibre cluster:
- aws.go: CreateAWSInstances exited 0 even when individual
instance launches failed, so `talis up` lied about success
and downstream steps proceeded against a partial cluster.
Returns a joined error now so failure cascades stop early.
- download.go: sshExec used cmd.CombinedOutput, mixing SSH
warnings (the "Warning: Permanently added '...'..." chatter
on stderr) into bytes the caller hands to fmt.Sscanf("%d").
The CLI-side providers cross-check parsed those warnings
as 0 and looped until its 5-min deadline even though a
direct SSH query showed all 3 providers registered. Switch
to cmd.Output() (stdout only) and add `-q -o LogLevel=ERROR`
to silence the chatter for any caller that does combine
streams.
- fibre_setup.go: the per-validator escrow verification used
`celestia-appd query fibre escrow` which doesn't exist —
the actual subcommand is `escrow-account`. The query
errored on every retry, the grep for "amount" never
matched, and the script wedged on the 3-min deadline
reporting `FATAL: fibre-0 escrow not present`. Switch to
`escrow-account` and key on `"found":true` (the explicit
existence flag in the response). Also wrap the fibre-0
deposit-to-escrow itself in a retry loop matching set-host
— same `--yes`-returns-before-inclusion silent-failure
mode bit it. fibre-1..N stay best-effort.
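A sketch of the stdout-only sshExec change described above. The SSH flags mirror the commit text; the function shape itself is an illustrative assumption, not the download.go source.

```go
package talis // illustrative package name only

import "os/exec"

// sshExec runs a command on a remote host and returns stdout only, so SSH
// host-key warnings on stderr can no longer leak into output that callers
// parse numerically (e.g. via fmt.Sscanf).
func sshExec(host, remoteCmd string) ([]byte, error) {
	cmd := exec.Command("ssh", "-q", "-o", "LogLevel=ERROR", host, remoteCmd)
	return cmd.Output() // stdout only; CombinedOutput would merge the chatter back in
}
```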
* feat(evnode-txsim): keep-alive conn pool + pprof endpoint
Two diagnostic improvements for the load generator:
1. http.Transport.MaxIdleConnsPerHost defaults to 2 in stdlib.
With --concurrency=8 (or higher), 6+ goroutines per cycle had
to open fresh TCP+TLS sockets per request because the pool
couldn't hold their idle conns between requests. Bump
MaxIdleConns / MaxIdleConnsPerHost / MaxConnsPerHost to
2*concurrency so every active sender has a reusable keep-alive
socket, eliminating handshake churn from the hot path.
2. Always-on net/http/pprof on 127.0.0.1:6060. evnode-txsim is a
load tester, not a production daemon, so cost of always serving
profiling is acceptable; the payoff is being able to grab CPU
profiles under live load without re-deploying the binary —
`ssh -L 6060:127.0.0.1:6060 root@loadgen \
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.
A profile captured this way under c=8 traced the per-request hot
path: 25.5% in kernel write(2), 25% in net/http body marshaling.
That diagnostic surfaced that the c6in.2xlarge loadgen was the
binding constraint for the experiment at ~22 MB/s, not evnode or
DA — a finding we'd have spent another debug round chasing
without the in-process profiler.
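For reference, a minimal sketch of the two evnode-txsim changes described in points 1 and 2 above: transport limits scaled to the sender count, and always-on pprof bound to loopback. The wiring and values are illustrative assumptions, not the actual evnode-txsim source.

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
	"time"
)

// newClient returns an HTTP client whose connection pool can hold an idle
// keep-alive socket per concurrent sender (stdlib default is only 2 per host).
func newClient(concurrency int) *http.Client {
	transport := &http.Transport{
		MaxIdleConns:        2 * concurrency,
		MaxIdleConnsPerHost: 2 * concurrency,
		MaxConnsPerHost:     2 * concurrency,
	}
	return &http.Client{Transport: transport, Timeout: 30 * time.Second}
}

func main() {
	// Always-on pprof on loopback, as described above.
	go func() { _ = http.ListenAndServe("127.0.0.1:6060", nil) }()

	_ = newClient(8) // example concurrency; the real value comes from a flag
	select {}        // placeholder for the load-generation loop
}
```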
Overview
Support a Fiber client (based on https://github.com/celestiaorg/celestia-app/blob/63fbf31cca216fc4e067a9e1b3a3431115c7009b/fibre), but not via celestia-node or apex, for this PoC.
celestiaorg/celestia-node#4892