
[DNM] feat(da): support fiber (not via c-node) #3244

Draft
julienrbrt wants to merge 66 commits into main from julien/fiber

Conversation


@julienrbrt julienrbrt commented Apr 13, 2026

Overview

Support the Fiber client (based on https://github.com/celestiaorg/celestia-app/blob/63fbf31cca216fc4e067a9e1b3a3431115c7009b/fibre), but not via celestia-node or apex for this PoC.
celestiaorg/celestia-node#4892


coderabbitai Bot commented Apr 13, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d3fe2390-b52b-4f26-90f1-2cdc2f361060

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.




github-actions Bot commented Apr 13, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed
Format: ⏩ skipped
Lint: ✅ passed
Breaking: ✅ passed
Updated (UTC): May 3, 2026, 8:53 AM


claude Bot commented Apr 13, 2026

Claude finished @julienrbrt's task in 3m 40s


Code Review: [DNM] feat(da): support fiber (not via c-node)

  • Gather context (AGENTS.md, repo structure)
  • Review fiber client implementation (block/internal/da/fiber_client.go)
  • Review DA submitter changes (block/internal/submitting/da_submitter.go)
  • Review config changes (pkg/config/config.go)
  • Review cache/pending_base changes
  • Review public API changes (block/public.go)
  • Review tools/celestia-node-fiber package
  • Review concurrency, error handling, and security
  • Post comprehensive review

Summary

This is a substantial PoC introducing Fiber (Fibre) as a direct DA backend, bypassing celestia-node/apex. The architecture is sound — a clean fiber.DA interface, an in-memory mock for tests, and a tools/celestia-node-fiber adapter that wraps the actual celestia-node API. The overall structure is well-thought-out, testing coverage is reasonable, and the in-flight claim/gap system in pendingBase is impressively robust. The comments are thorough.

Several issues need attention before promotion beyond PoC, from a potential production panic down to typos.


🔴 Critical Issues

1. context.Background() discards the caller's context in Submit
block/internal/da/fiber_client.go:102

result, err := c.fiber.Upload(context.Background(), nsID, flat)

The ctx parameter passed to Submit is completely ignored for the upload call. If the caller passes a deadline-bearing or cancel-bearing context, the upload will continue running after the caller has given up, wasting resources and potentially causing ghost uploads. The parameter should be used directly (or wrapped with c.defaultTimeout if that protection is needed):

result, err := c.fiber.Upload(ctx, nsID, flat)



2. DefaultMaxBlobSize globally changed to Fiber's cap — breaks non-Fiber users
block/internal/common/consts.go:29
The default is now 134217723 (128 MiB − 5), which is Fibre's cap. The HACK comment is honest, but the consequence is that any non-Fiber node using this binary would submit blobs up to 128 MiB against a Celestia bridge that enforces a ~2 MiB per-blob cap, causing immediate and confusing failures. This needs to either:

  • Restore the old default and require callers to call block.SetMaxBlobSize() when enabling Fiber, or
  • Gate DefaultMaxBlobSize on config.DA.IsFiberEnabled() at startup (in ApplyFiberDefaults).

The comment itself says "non-fiber callers should override via ldflag or local config" — that's not a safe assumption for a library.
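The second option can be sketched as a startup-time resolution. The constants and the DAConfig shape below are illustrative stand-ins, not the actual ev-node types:

```go
package main

import "fmt"

// Illustrative caps; the real values live in block/internal/common/consts.go.
const (
	celestiaDefaultMaxBlobSize uint64 = 2097152   // assumed pre-Fiber default (~2 MiB)
	fiberMaxBlobSize           uint64 = 134217723 // 128 MiB - 5, Fibre's cap
)

// DAConfig is a stand-in for the relevant slice of pkg/config.
type DAConfig struct {
	FiberEnabled bool
}

// effectiveMaxBlobSize resolves the blob-size cap at startup instead of
// baking Fibre's cap into a package-level default that non-Fiber nodes inherit.
func effectiveMaxBlobSize(cfg DAConfig) uint64 {
	if cfg.FiberEnabled {
		return fiberMaxBlobSize
	}
	return celestiaDefaultMaxBlobSize
}

func main() {
	fmt.Println(effectiveMaxBlobSize(DAConfig{FiberEnabled: true})) // 134217723
	fmt.Println(effectiveMaxBlobSize(DAConfig{}))                   // 2097152
}
```

Resolving the cap once at startup keeps the library default safe for Celestia-bridge users while still letting the Fiber profile opt in.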


3. Config.Validate() never calls FiberDAConfig.Validate()
pkg/config/config.go:562
FiberDAConfig.Validate() is a well-written validator, but it is never invoked from Config.Validate(). A user who sets fiber.enabled=true but omits consensus_address or bridge_address will get a confusing nil-pointer or connection error at runtime, not a clean validation error. Add to Config.Validate():

if err := c.DA.Fiber.Validate(); err != nil {
    return fmt.Errorf("fiber DA config: %w", err)
}



4. Subscriber removal index bug in MockDA.Listen
block/internal/da/fibremock/mock.go:224
The cleanup goroutine uses a swap-with-last pattern but captures a stale idx. When a subscriber is removed, the subscriber previously at last is moved to idx. The goroutine for that moved subscriber still holds its original (now wrong) index. Subsequent cleanup will swap or truncate the wrong slot, leaving live subscribers silently dropped or zombie subscribers retained.

Example: 3 subscribers [A(0), B(1), C(2)]. A cancels → C moves to 0, but C's goroutine still has idx=2. C cancels: last=1, 2 <= 1 is false, so the slice is truncated to [:1]: B is lost and C is retained.

A safe implementation stores the subscriber in a map keyed by a unique ID, or uses sync.Map.

Also: m.subscribers = m.subscribers[:last] panics when len(m.subscribers) == 0 since last = -1.
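A map keyed by a unique ID sidesteps the index bookkeeping entirely. A minimal sketch with illustrative names (not the fibremock API):

```go
package main

import (
	"fmt"
	"sync"
)

// subscriberSet is a stand-in for MockDA's subscriber registry. Removal by
// unique ID avoids the stale-index swap bug and the [:last] panic.
type subscriberSet struct {
	mu     sync.Mutex
	nextID uint64
	subs   map[uint64]chan string
}

func newSubscriberSet() *subscriberSet {
	return &subscriberSet{subs: make(map[uint64]chan string)}
}

// add registers a channel and returns a removal func bound to its unique ID.
func (s *subscriberSet) add(ch chan string) (remove func()) {
	s.mu.Lock()
	defer s.mu.Unlock()
	id := s.nextID
	s.nextID++
	s.subs[id] = ch
	return func() {
		s.mu.Lock()
		defer s.mu.Unlock()
		delete(s.subs, id) // safe even if called twice; no index math
	}
}

func (s *subscriberSet) len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.subs)
}

func main() {
	set := newSubscriberSet()
	a := set.add(make(chan string, 1))
	b := set.add(make(chan string, 1))
	c := set.add(make(chan string, 1))
	a() // remove A, then C: B survives, unlike the swap-with-last version
	c()
	_ = b
	fmt.Println(set.len()) // 1
}
```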


🟠 High Priority Issues

5. Namespace slice panics on short namespaces
block/internal/da/fiber_client.go:95, also lines 202, 321

nsID := namespace[len(namespace)-10:]

This panics if namespace is fewer than 10 bytes. While the Celestia namespace specification guarantees a 29-byte namespace, no bounds check protects against misconfiguration. Add a guard:

if len(namespace) < 10 {
    return datypes.ResultSubmit{BaseResult: datypes.BaseResult{Code: datypes.StatusError, Message: "namespace too short"}}
}



6. NewFiberDAClient panics instead of returning an error
block/public.go:101

base, err := da.NewFiberClient(da.FiberConfig{...})
if err != nil {
    panic(err)
}

NewFiberClient only errors when cfg.Client == nil, which callers already guard against, but panicking in a library constructor is unacceptable. The function signature should propagate the error:

func NewFiberDAClient(...) (FullDAClient, error)



7. Dangling comment for non-existent field in FiberDAConfig
pkg/config/config.go:306

// UploadConcurrency limits the number of concurrent upload connections

This comment appears after KeyName but there is no UploadConcurrency field. The struct closes after this comment. Either add the field or remove the comment.


🟡 Medium Priority Issues

8. height is always 0 in Submit result — breaks DA tracking
block/internal/da/fiber_client.go:143-152

Height: 0, /* TODO */

The daSubmitter uses res.Height to call postSubmit(..., daHeight), which feeds cache.SetHeaderDAIncluded(hash, daHeight, ...). With daHeight = 0, the DA inclusion height is always wrong. The IsHeightDAIncluded checks in submitter.go will produce incorrect results. This isn't just cosmetic — it affects correctness of finality tracking for Fiber-backed nodes.


9. Subscribe has no retry on Listen failure
block/internal/da/fiber_client.go:311
When c.fiber.Listen returns an error, the goroutine closes out immediately with no backoff or retry. Any transient Fiber network hiccup will permanently kill the subscription, requiring the node to restart. Consider wrapping in an exponential-backoff retry loop, or at minimum log the error prominently and add a comment that the caller must detect channel closure and re-subscribe.


10. FiberDAConfig.KeyringPath not validated
pkg/config/config.go:311
KeyringPath is listed in the struct comment as required "for signing payment promises," but FiberDAConfig.Validate() does not check that it is non-empty. If left unset, the adapter will likely fail with an opaque keyring-not-found error. Add:

if c.KeyringPath == "" {
    return fmt.Errorf("%s is required when fiber DA is enabled", FlagDAFiberKeyName)
}

11. DefaultMaxBlobSize global mutated without synchronization
block/internal/common/consts.go:47 / block/public.go:26
SetMaxBlobSize(n) writes to DefaultMaxBlobSize with no mutex and is called from startup code. If any goroutine reads DefaultMaxBlobSize concurrently with a SetMaxBlobSize call (e.g., a batch-size check in the submitter goroutines), this is an unsynchronized write to a non-atomic. Use sync/atomic or call SetMaxBlobSize strictly before starting any goroutines.
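A minimal sketch of the sync/atomic variant; the real package uses a plain package-level var, so these names are illustrative:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// maxBlobSize replaces the plain package-level var so concurrent readers in
// submitter goroutines never race with a startup-time SetMaxBlobSize call.
var maxBlobSize atomic.Uint64

func SetMaxBlobSize(n uint64) { maxBlobSize.Store(n) }
func MaxBlobSize() uint64     { return maxBlobSize.Load() }

func main() {
	SetMaxBlobSize(134217723) // safe even if goroutines read concurrently
	fmt.Println(MaxBlobSize())
}
```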


12. Extra store instantiation in run_node.go
pkg/cmd/run_node.go:161

mainKV := store.NewEvNodeKVStore(datastore)
baseStore := store.New(mainKV)
// ... read latestState ...
// baseStore is then discarded

A throwaway store is created just to read latestState.DAHeight, then discarded. The node.NewNode call creates its own store from the same datastore. This is wasteful and potentially confusing. Consider passing latestDAHeight as a parameter, or using a leaner metadata read.


🔵 Low / Style Issues

13. Typo heigth in GetLatestDAHeight
block/internal/da/fiber_client.go:370

heigth, err := c.fiber.Head(headCtx)


14. GetProofs and Validate return stub errors / always-true results
block/internal/da/fiber_client.go:378-398
GetProofs returns fmt.Errorf("not implemented") and Validate returns all true. These are acceptable for a PoC, but should be marked with //nolint:errcheck or documented prominently so they don't silently satisfy interfaces in production wiring.

15. Dropped events in MockDA subscribers are silent
block/internal/da/fibremock/mock.go:132

default:
    // Channel full, drop event. Subscriber is too slow.

In tests, dropped events lead to flaky failures that are hard to debug. Consider incrementing a counter (DroppedEvents int64) on MockDA so tests can assert no events were dropped.
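A sketch of the suggested counter; this MockDA is a stand-in for the fibremock type, not its real definition:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// MockDA here only carries the proposed dropped-event counter.
type MockDA struct {
	DroppedEvents atomic.Int64
}

// publish keeps the non-blocking send but counts drops instead of hiding them.
func (m *MockDA) publish(ch chan string, event string) {
	select {
	case ch <- event:
	default:
		m.DroppedEvents.Add(1) // subscriber too slow; tests can assert == 0
	}
}

func main() {
	m := &MockDA{}
	ch := make(chan string, 1)
	m.publish(ch, "a") // buffered: delivered
	m.publish(ch, "b") // buffer full: dropped and counted
	fmt.Println(m.DroppedEvents.Load()) // 1
}
```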

16. ApplyFiberDefaults docs suggest it should be called from Validate
pkg/config/config.go:355
The function comment says "Intended to be invoked once at runner startup, after parsing the usual config," but run_node.go never calls it; no code path actually triggers it, so the BatchingStrategy and BlockTime defaults are never applied for Fiber users. Wire it up in StartNode before constructing the DA client, or update the comment.

17. Security: unpinned Action in CI workflow
The GitHub Advanced Security scan flagged dorny/paths-filter@v3 — pin to a commit SHA to prevent supply chain attacks.


Positive Aspects

  • The pendingBase in-flight claim/gap system is an elegant solution for concurrent DA submission with gap-recovery.
  • The fiber.DA interface design is minimal and clean — Head/Upload/Download/Listen is the right surface.
  • The chunkBlobsForFibre function correctly accounts for flattenBlobs framing overhead.
  • The FiberDAConfig.Validate() with WebSocket scheme check is a nice touch.
  • Parallel signing in signEnvelopesParallel is well-structured with proper goroutine lifecycle.
  • Test coverage for fiberDAClient is thorough; the fault-injector pattern in the test file is reusable.


codecov Bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 90.65657% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.16%. Comparing base (2865d6d) to head (4485d91).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
block/public.go 0.00% 12 Missing ⚠️
block/internal/da/fibremock/mock.go 90.90% 5 Missing and 5 partials ⚠️
block/internal/da/fiber_client.go 96.74% 5 Missing and 3 partials ⚠️
pkg/sequencers/solo/sequencer.go 61.53% 5 Missing ⚠️
pkg/config/config.go 75.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3244      +/-   ##
==========================================
+ Coverage   62.33%   63.16%   +0.82%     
==========================================
  Files         122      124       +2     
  Lines       12873    13258     +385     
==========================================
+ Hits         8024     8374     +350     
- Misses       3968     3995      +27     
- Partials      881      889       +8     
Flag Coverage Δ
combined 63.16% <90.65%> (+0.82%) ⬆️


julienrbrt and others added 7 commits April 14, 2026 15:12
Adds a fibremock package with:
- DA interface (Upload/Download/Listen) matching the fibre gRPC service
- In-memory MockDA implementation with LRU eviction and configurable retention
- Tests covering all paths

Migrated from celestiaorg/x402-risotto#16 as-is for integration.
@julienrbrt julienrbrt changed the title feat(da): support fiber (not via c-node) [DNM] feat(da): support fiber (not via c-node) Apr 20, 2026
julienrbrt and others added 15 commits April 20, 2026 14:46
Adds tools/celestia-node-fiber, a new Go sub-module that implements the
ev-node fiber.DA interface by delegating Upload, Download and Listen to a
celestia-node api/client.Client.

Upload and Download run locally against a Celestia consensus node (gRPC)
and Fibre Storage Providers (Fibre gRPC) — no bridge-node hop — using
celestia-node's self-sufficient client (celestiaorg/celestia-node#4961).
Listen subscribes to blob.Subscribe on a bridge node and forwards only
share-version-2 blobs, which is how Fibre blobs settle on-chain via
MsgPayForFibre.

The package lives in its own go.mod, parallel to tools/local-fiber, so
ev-node core does not inherit celestia-app / cosmos-sdk replace-directive
soup. A FromModules constructor accepts the Fibre and Blob Module
interfaces directly so callers can inject mocks or share an existing
*api/client.Client.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#3280)

* test(celestia-node-fiber): showcase end-to-end Upload/Listen/Download

Adds tools/celestia-node-fiber/testing/, a single-validator in-process
showcase that boots a fibre-tagged Celestia chain + in-process Fibre
server + celestia-node bridge, registers the validator's FSP via
valaddr (with the dns:/// URI scheme the client's gRPC resolver
expects), funds an escrow account, and drives the full adapter
surface.

TestShowcase proves the round-trip: subscribe via Listen, Upload a
blob, wait for the share-version-2 BlobEvent that lands after the
async MsgPayForFibre commits, assert the BlobID from Listen matches
Upload's return, Download and diff the payload bytes.

The harness is intentionally single-validator — a 2-validator
Docker Compose showcase is planned as a follow-up for exercising real
quorum collection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(celestia-node-fiber): scale showcase to 10 blobs, document DataSize gap

Upload 10 distinct-payload blobs through adapter.Upload, collect
BlobEvents via adapter.Listen until every BlobID is accounted for
(order-insensitive, rejects duplicates), then round-trip each blob
through adapter.Download to diff bytes. Catches routing bugs (wrong
blob returned for a BlobID) and duplicate-event bugs that a
single-blob test can't see.

Scaling the test also exposed a semantic issue: the v2 share carries
only (fibre_blob_version + commitment), so b.DataLen() — what
listen.go's fibreBlobToEvent reports today — is always 36, not the
original payload length ev-node's fibremock conveys. The adapter
can't derive the payload size from the subscription stream alone;
surfacing it correctly needs an x/fibre PaymentPromise lookup
(tracked as a TODO on fibreBlobToEvent). The test therefore asserts
DataSize is non-zero rather than matching len(payload).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…3281)

listen.go previously set BlobEvent.DataSize to b.DataLen(), which for
a share-version-2 Fibre blob is always the fixed share-data layout
(fibre_blob_version + commitment = 36 bytes) — not the original
payload length. That diverges from ev-node's fibermock contract and
misleads any consumer that uses DataSize to allocate buffers or
report progress.

The v2 share genuinely doesn't carry the original size, and x/fibre
v8 has no chain query to derive it from the commitment. The only
accurate path is to Download the blob and measure. Listen now does
exactly that before forwarding each event. The cost is one FSP
round-trip per v2 blob; can be made opt-out later if it hurts
throughput-sensitive use cases.

Tests:
- Showcase restores the strict DataSize == len(payload) assertion
  across all 10 blobs.
- Unit test TestListen_FiltersFibreOnlyAndEmitsEvent now stubs
  fakeFibre.Download to return a deterministic payload and asserts
  DataSize matches its length.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ight subscriptions (#3283)

feat(celestia-node-fiber): Listen takes fromHeight for resume subscriptions

Threads a fromHeight parameter through the Fibre DA Listen path so a
subscriber can rejoin the stream from a past block height without
missing blobs. Consumes the matching celestia-node API change landed
in celestiaorg/celestia-node#4962, which gave Blob.Subscribe a
fromHeight argument backed by a WaitForHeight loop.

Changes:

- block/internal/da/fiber/types.go: DA.Listen signature now takes
  fromHeight uint64. fromHeight == 0 preserves "follow from tip"
  semantics, >0 replays from that block forward.
- block/internal/da/fibremock/mock.go: replay matching blobs with
  height >= fromHeight before attaching the live subscriber.
- block/internal/da/fiber_client.go: outer fiberDAClient.Subscribe
  does not yet expose a starting height (datypes.DA doesn't plumb
  one), so pass 0 and defer resume-from-height wiring to a future
  datypes.DA change.
- tools/celestia-node-fiber/listen.go: propagate fromHeight to
  client.Blob.Subscribe on the celestia-node API.
- tools/celestia-node-fiber/go.mod: bump celestia-node to the merged
  pseudo-version (v0.0.0-20260423143400-194cc74ce99c) carrying #4962.
- tools/celestia-node-fiber/adapter_test.go: fakeBlob.subscribeFn
  gets the new fromHeight arg; TestListen_FiltersFibreOnlyAndEmitsEvent
  asserts that fromHeight=0 is forwarded.
- tools/celestia-node-fiber/testing/showcase_test.go: existing
  TestShowcase passes fromHeight=0. New TestShowcaseResume uploads 3
  blobs, discovers their settlement heights via a live Listen, then
  opens a fresh Listen with fromHeight at the first blob's height and
  verifies every historical blob is replayed with correct Height and
  DataSize.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
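The resume semantics above reduce to a replay-then-attach rule, sketched here with illustrative types rather than the real fibremock API:

```go
package main

import "fmt"

// blob is an illustrative stand-in for a stored mock blob.
type blob struct {
	Height uint64
	Data   string
}

// replay returns the historical blobs a new subscriber should receive before
// going live: fromHeight == 0 means follow-from-tip (no replay); > 0 replays
// every stored blob at height >= fromHeight.
func replay(stored []blob, fromHeight uint64) []blob {
	if fromHeight == 0 {
		return nil
	}
	var out []blob
	for _, b := range stored {
		if b.Height >= fromHeight {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	stored := []blob{{10, "a"}, {12, "b"}, {15, "c"}}
	fmt.Println(len(replay(stored, 0)))  // 0: tip semantics, nothing replayed
	fmt.Println(len(replay(stored, 12))) // 2: heights 12 and 15 replayed
}
```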
dependabot Bot and others added 5 commits April 29, 2026 13:45
…d_yarn group across 1 directory (#3292)

build(deps): Bump postcss

Bumps the npm_and_yarn group with 1 update in the /docs directory: [postcss](https://github.com/postcss/postcss).


Updates `postcss` from 8.5.8 to 8.5.12
- [Release notes](https://github.com/postcss/postcss/releases)
- [Changelog](https://github.com/postcss/postcss/blob/main/CHANGELOG.md)
- [Commits](postcss/postcss@8.5.8...8.5.12)

---
updated-dependencies:
- dependency-name: postcss
  dependency-version: 8.5.12
  dependency-type: indirect
  dependency-group: npm_and_yarn
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Add a `changes` job using dorny/paths-filter to detect whether any
non-documentation files were modified. All heavy jobs (lint, docker,
test, docker-tests, proto) are gated behind this check and skipped
when the PR only touches docs/** or markdown files.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: better code readability

* chore: restore yarn.lock to main

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(style): address PR review feedback

- Add `"type": "dark"` to ev-dark.json theme manifest
- Raise punctuation token contrast from #505050 to #767676 (WCAG AA)
- Align --vp-code-block-color CSS var with ev-dark default text (#dbd7ca)
- Use ThemeRegistration type instead of `as any` in config.ts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: ev-node high availability

* docs: node placement

* docs(ha): address PR review feedback

Critical fixes:
- Fix snapshot_threshold math: 5000 ÷ 10 = 500s ≈ 8.3 min (not 83s)
- Fix trailing_logs math: 18000 ÷ 10 = 1800s = 30 min (not 5 min)

Medium fixes:
- Fix heartbeat_timeout description: it is a follower-side election trigger,
  not the interval at which the leader sends heartbeats
- Add explicit restart instruction after Step 5 data copy in single-to-ha.md
  so the chain keeps producing blocks during preparation (Steps 6-8)
- Replace priv_validator_key.json with signer.json in single-to-ha.md
  to match cluster-setup.md and the E2E tests

Minor fixes:
- Exclude self from raft.peers in all examples (cluster-setup.md node-1
  yaml/CLI/systemd, single-to-ha.md node-1 and node-2)
- Add "exclude local node" note to raft.peers description in overview.md
- Fix P2P port in overview.md Interaction with P2P section (7676 → 26656)
- Add text language tag to all bare fenced blocks (MD040): multiaddr
  example, RTT equations, and all log snippets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(ha): absorb raft_production.md into ha/overview.md

raft_production.md had no sidebar entry and its content was fully
superseded by the new ha/ guides. Extract the three pieces that were
unique to it — bootstrap flag docs, auto-detection startup mode
explanation, and static-membership limitation note — into
ha/overview.md, then delete the file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(ha): use EnvironmentFile for signer passphrase

Passing --evnode.signer.passphrase inline exposes the secret in
ps aux, journalctl, and shell history.

- Add EnvironmentFile=/etc/ev-node/env (chmod 600) to the systemd
  unit in cluster-setup.md with setup instructions
- Replace all inline <YOUR_PASSPHRASE> occurrences with
  $EV_SIGNER_PASSPHRASE sourced from /etc/ev-node/env in every
  evm start / evm init snippet across both guides

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(ha): explicit node-2 peers and action-based rolling restart

- Replace "peers list is identical" stub in node-2 config with an
  explicit peers list that excludes node-2 itself, and add a note
  that each node must omit itself from raft.peers
- Replace "Wait ~30 seconds" in rolling restart with journalctl
  one-liners that exit as soon as the node logs follower/leader state,
  giving a deterministic signal instead of an arbitrary timeout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(ha): fix raft.peers self-inclusion startup bug

The abbreviated node-2 snippet with "# peers list is identical" caused
a startup failure: with raft_addr=0.0.0.0:5001 the bootstrap code's
literal address comparison does not recognise node-2@10.0.0.2:5001 as
self, so node-2 is appended twice and deduplicateServers returns
"duplicate peers found in config".

- Fix intro text: "only raft.node_id and raft_addr differ" →
  "raft.node_id is unique; raft.peers and p2p.peers must exclude self"
- Expand node-2 snippet to a full evnode.yaml with the correct peers
  list (node-1, node-3, node-4, node-5 — no node-2) and an inline
  explanation of the wildcard address pitfall
- Align overview.md trailing_logs example to 1 block/s (matching
  block_time: "1s" used throughout) and note the 10 block/s rate too

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(ha): fix passphrase flag and failover kill cardinality check

Replace non-existent --evnode.signer.passphrase with the actual
--evnode.signer.passphrase_file flag throughout cluster-setup and
single-to-ha guides. Update passphrase setup to create a chmod 600
file at /etc/ev-node/passphrase referenced directly by the flag.

Add mapfile-based cardinality check in the failover test fallback
kill command to guard against killing the wrong process.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(ha): fix RPC endpoints, init ordering, and snap_count CLI flag

Replace incorrect CometBFT RPC calls (port 26657/status) with the
actual ev-node HTTP API (port 7331 /health/ready, /raft/node) and
EVM execution layer (cast block latest) throughout both guides.

Align single-to-ha Step 2 init ordering with cluster-setup: create
passphrase file before evm init so the signer key is encrypted from
the start, and pass --evnode.node.aggregator and passphrase_file flags.

Fix Step 9a fallback kill in single-to-ha to use mapfile cardinality
check, matching the pattern already applied in cluster-setup.

Add --evnode.raft.snap_count=3 to the CLI start example to match
the YAML config block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
julienrbrt and others added 17 commits April 29, 2026 15:46
This reverts commit a1a0861.
…nner (#3301)

Brings the celestia-app talis multi-cloud deploy tool into ev-node,
plus a long-lived ev-node aggregator runner that wires the existing
celestia-node-fiber adapter behind ev-node's DA client interface.
Verified end-to-end on AWS — talis up → genesis → deploy →
setup-fibre → start-fibre → fibre-bootstrap-evnode reaches
24.57 MB/s @ 99.7 % ok on a 60 s sustained loadgen
(3 × c6in.4xlarge validators + c6in.2xlarge bridge +
c6in.8xlarge ev-node + c6in.2xlarge load-gen, us-east-1).

What this adds:

  • tools/talis/                — vendored from celestia-app's
    feat/fibre-payments. Provisions AWS / DO / GCP boxes for
    validators + bridge + ev-node + load-gen, deploys binaries +
    init scripts, drives the Fibre setup-fibre + start-fibre flow,
    and ships a fibre-bootstrap-evnode step that scp's the bridge
    JWT and Fibre payment keyring onto each ev-node before its
    init script starts the daemon.
  • tools/celestia-node-fiber/cmd/evnode-fibre/  — the long-lived
    aggregator runner. Wires block.NewFiberDAClient on top of the
    celestia-node-fiber adapter that julien/fiber already ships,
    plus the in-memory executor + HTTP /tx ingress used by
    evnode-txsim. Distinct from the existing fiber-bench cmd.
  • tools/talis/cmd/evnode-txsim/ — small Go load-gen that pumps
    the runner's HTTP /tx ingress for a fixed duration; deployed
    to load-gen boxes and prints a single TXSIM: line on completion.

Two small ev-node-side helpers the runner calls:

  • block/public.go: SetMaxBlobSize(n) — overrides the per-blob
    byte cap so the runner can lift Celestia's 5 MiB default to
    Fibre's 120 MiB headroom.
  • pkg/config/config.go: Config.ApplyFiberDefaults() — flips the
    DA config to Fibre-friendly settings (adaptive batching, 1 s
    DA.BlockTime, 50-deep pending-cache window) when the Fiber
    profile is enabled, so a runner can opt in with one call.

setup-fibre robustness fixes uncovered during the verified run:

  • bash script for set-host now retries until the validator's
    host appears in `query valaddr providers`. The previous one-
    shot call relied on `--yes` returning the txhash before block
    inclusion; if the chain wasn't ready, the tx silently bounced.
    The Fibre client cached the partial set on startup and uploads
    cascaded to "host not found" → "voting power: collected 0".
  • talis-CLI side polls `query valaddr providers` after the per-
    validator scripts finish and refuses to return until all
    validators are registered (5-minute deadline).

External dependency (documented in tools/talis/fibre.md):

  • Sibling clone of celestia-app on a branch with feat/fibre-payments
    + sysrex/fibre_url_fix cherry-picked. Without the URL-parse fix
    the Fibre client rejects every host:port registration.

Tested:
  - go build ./... — clean
  - go test ./block/internal/submitting ./pkg/config (the two
    pre-existing test failures on julien/fiber — TestAddFlags
    and TestFiberClient_Submit_BlobTooLarge — are not introduced
    by this PR and reproduce on raw julien/fiber)
  - End-to-end AWS deploy from this branch — 24.57 MB/s, 99.7 % ok
…log (#3307)

* feat(fibre): log per-Submit upload duration

The Fibre Submit path was opaque: failures showed up as
DeadlineExceeded with no signal of how long the upload
actually took, and successes only logged at debug level
inside the upstream library. During load-test debugging
this turned into a guessing game — was the cluster slow,
the deadline too tight, or something stuck mid-RPC?

Add a single info-level (warn-on-failure) log line in
fiberDAClient.Submit covering the Upload call: duration,
flat blob bytes, blob count. Cheap (one time.Since) and
gives the operator concrete numbers — e.g. "17 blobs / 115
MiB / 1.5 s" — to reason about whether RPCTimeout, pending
cap, or batch sizing is the right knob to turn next.

* fix(fibre): split DA Submit batches at Fibre's 128 MiB upload cap

Under sustained txsim load (~50 MiB/s) the DA submitter
batched 10 block_data items into one Upload(), producing a
flat payload of 144 MiB. Fibre's per-upload cap is hard at
~128 MiB ("blob size exceeds maximum allowed size: data
size 144366912 exceeds maximum 134217723") and rejected
every batched upload. With MaxPendingHeadersAndData=10
that took down 170 consecutive submissions before the
node halted itself with "Data exceeds DA blob size limit".

Wrap the Upload call in a chunker that groups input blobs
into ≤120 MiB chunks (8 MiB headroom under Fibre's cap for
the per-blob length-prefix overhead added by flattenBlobs)
and uploads each chunk separately. Aggregates submitted
counts and BlobIDs across chunks; on first chunk failure,
returns the error with the partially-submitted count so
the submitter's retry/backoff logic sees a coherent state
instead of all-or-nothing.

Single oversized blobs (already validated against
DefaultMaxBlobSize earlier in Submit) still land alone and
fail server-side, but at least don't drag healthy peers
into the same rejected batch.
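The chunker described above amounts to greedy packing under a framed-size cap. A sketch with illustrative names; the 8-byte per-blob length prefix is an assumption about flattenBlobs' framing, not its actual layout:

```go
package main

import "fmt"

const (
	chunkCap       = 120 * 1024 * 1024 // headroom under Fibre's ~128 MiB cap
	perBlobFraming = 8                 // assumed length-prefix bytes per blob
)

// chunkBlobs greedily packs blobs so each chunk's framed size stays <= cap.
// A single oversized blob still lands alone in its own chunk, mirroring the
// behavior described above: it fails server-side without dragging healthy
// blobs into the same rejected upload.
func chunkBlobs(blobs [][]byte) [][][]byte {
	var chunks [][][]byte
	var cur [][]byte
	curSize := 0
	for _, b := range blobs {
		framed := len(b) + perBlobFraming
		if len(cur) > 0 && curSize+framed > chunkCap {
			chunks = append(chunks, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, b)
		curSize += framed
	}
	if len(cur) > 0 {
		chunks = append(chunks, cur)
	}
	return chunks
}

func main() {
	// Three ~50 MiB blobs: the first two fit one 120 MiB chunk, the third spills.
	blob := make([]byte, 50*1024*1024)
	chunks := chunkBlobs([][]byte{blob, blob, blob})
	fmt.Println(len(chunks)) // 2
}
```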

* fix(evnode-fibre): cap per-block data at 100 MiB to fit a Fibre upload

Companion to the submitter chunking fix. The submitter can
split a multi-blob batch into ≤120 MiB Fibre uploads, but
a *single* block_data item that exceeds 128 MiB still ends
up alone in its own chunk and fails server-side ("blob size
exceeds maximum allowed size"). Lower the per-block cap to
100 MiB so under high-throughput txsim a single block can't
grow past Fibre's hard limit, and update the comment to
explain the relationship between this cap and Fibre's
~128 MiB upload reject threshold.
* fix(tools/talis): wait-for-chain + atomic keyring + one-command driver

Three race conditions surfaced repeatedly on a fresh AWS bring-up of
the Fibre throughput experiment. Each one had the same shape: a
talis subcommand "succeeded" at the CLI level (or returned the txhash
with --yes) before the chain had actually applied the work, leaving
downstream steps to fail in confusing ways. This commit makes each
step verify *outcome*, not just *invocation*, so the experiment can
go from a fresh `talis up` to a running loadgen without manual
intervention.

  • setup-fibre script (fibre_setup.go) now:
    - polls `celestia-appd status` for `latest_block_height>0`
      before submitting any tx — fixes the silent-noop where
      set-host + 100× deposit-to-escrow all bounced with
      "celestia-app is not ready; please wait for first block";
    - retries `set-host` in a loop until the validator's host
      shows up in `query valaddr providers` — fixes the case
      where --yes returns the txhash before block inclusion and
      the tx silently lands in the mempool but never confirms;
    - verifies fibre-0's escrow account is funded on-chain before
      the tmux session exits — same silent-failure mode as
      set-host, but on the deposit side.
    The talis-CLI step also now cross-checks that all validators
    are registered from a single vantage point before returning,
    so a concurrent set-host race surfaces as an error instead of
    a half-empty provider list that start-fibre would cache
    forever.

  • fibre-bootstrap-evnode (fibre_bootstrap_evnode.go) now stages
    the keyring scp into a tmp directory and `mv`s it atomically
    into place. The previous direct `scp -r` to
    /root/keyring-fibre/keyring-test created the directory before
    transferring its contents — the evnode init script's
    `[ -d keyring-test ]` poll passed mid-transfer, the daemon
    launched with no fibre-0.info, and crashed with `keyring entry
    "fibre-0" not found`.

  • evnode_init.sh (genesis.go) now waits for the specific
    keyring-test/fibre-0.info file rather than just the
    keyring-test directory. Belt-and-braces: the bootstrap mv is
    already atomic on the same filesystem, but the file-level
    guard means a hand-pushed keyring (not via talis) can't trip
    the same race.

  • New `talis fibre-experiment` umbrella command runs
    up → genesis → deploy → setup-fibre → start-fibre →
    fibre-bootstrap-evnode in order. Each step runs the same
    binary as a subprocess; a failure in any step aborts the
    remaining steps. The operator goes from a prepared root dir
    to a running loadgen with one command, instead of
    remembering and typing out the sequence by hand.

Verified by 5-min sustained loadgen against julien/fiber HEAD with
PR #3287 (concurrent submitter) merged: 47.65 MB/s @ 99.999 % ok,
up from the prior 24.57 MB/s baseline (the gap is PR #3287's
overlapping uploads — these talis fixes just stop the deploy from
silently breaking before throughput matters).

* fix(tools/talis): finalize fibre setup race fixes

Three follow-up bugs surfaced during the PR #3303
verification run on a 3-validator AWS Fibre cluster:

- aws.go: CreateAWSInstances exited 0 even when individual
  instance launches failed, so `talis up` lied about success
  and downstream steps proceeded against a partial cluster.
  It now returns a joined error so the failure cascade stops
  early.

- download.go: sshExec used cmd.CombinedOutput, mixing SSH
  warnings (the "Warning: Permanently added '...'..." chatter
  on stderr) into bytes the caller hands to fmt.Sscanf("%d").
  The CLI-side providers cross-check parsed those warnings
  as 0 and looped until its 5-min deadline even though a
  direct SSH query showed all 3 providers registered. Switch
  to cmd.Output() (stdout only) and add `-q -o LogLevel=ERROR`
  to silence the chatter for any caller that does combine
  streams.

- fibre_setup.go: the per-validator escrow verification used
  `celestia-appd query fibre escrow` which doesn't exist —
  the actual subcommand is `escrow-account`. The query
  errored on every retry, the grep for "amount" never
  matched, and the script wedged on the 3-min deadline
  reporting `FATAL: fibre-0 escrow not present`. Switch to
  `escrow-account` and key on `"found":true` (the explicit
  existence flag in the response). Also wrap the fibre-0
  deposit-to-escrow itself in a retry loop matching set-host
  — it was bitten by the same `--yes`-returns-before-inclusion
  silent-failure mode. fibre-1..N stay best-effort.

* feat(evnode-txsim): keep-alive conn pool + pprof endpoint

Two diagnostic improvements for the load generator:

1. http.Transport.MaxIdleConnsPerHost defaults to 2 in stdlib.
   With --concurrency=8 (or higher), 6+ goroutines per cycle had
   to open fresh TCP+TLS sockets per request because the pool
   couldn't hold their idle conns between requests. Bump
   MaxIdleConns / MaxIdleConnsPerHost / MaxConnsPerHost to
   2*concurrency so every active sender has a reusable keep-alive
   socket, eliminating handshake churn from the hot path.

2. Always-on net/http/pprof on 127.0.0.1:6060. evnode-txsim is a
   load tester, not a production daemon, so the cost of always serving
   profiling is acceptable; the payoff is being able to grab CPU
   profiles under live load without re-deploying the binary —
   `ssh -L 6060:127.0.0.1:6060 root@loadgen \
     go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.

A profile captured this way under c=8 traced the per-request hot
path: 25.5% in kernel write(2), 25% in net/http body marshaling.
That diagnostic surfaced that the c6in.2xlarge loadgen was the
binding constraint for the experiment at ~22 MB/s, not evnode or
DA — a finding we'd have spent another debug round chasing
down without the in-process profiler.