fuzz: cover deferred writing in chanmon_consistency by joostjager · Pull Request #4465 · lightningdevkit/rust-lightning

joostjager · 2026-03-06T11:07:37Z

Adds fuzz coverage for #4351

ldk-reviews-bot · 2026-03-06T11:07:40Z

👋 I see @wpaulino was un-assigned.

codecov · 2026-03-06T13:25:43Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.17%. Comparing base (964a84f) to head (a798d74).
⚠️ Report is 27 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4465      +/-   ##
==========================================
- Coverage   86.18%   86.17%   -0.01%     
==========================================
  Files         156      156              
  Lines      108528   108636     +108     
  Branches   108528   108636     +108     
==========================================
+ Hits        93532    93616      +84     
- Misses      12386    12405      +19     
- Partials     2610     2615       +5

| Flag | Coverage Δ |
| --- | --- |
| tests | `86.17% <ø> (-0.01%)` ⬇️ |

joostjager · 2026-03-19T07:52:16Z

Rebased

joostjager · 2026-03-19T11:01:31Z

Will hold off on this until pre-existing fuzz failures #4496 and #4472 are in

joostjager · 2026-03-20T17:53:19Z

Prerequisites are in, rebased.

TheBlueMatt · 2026-03-22T17:54:23Z

This LGTM, why is it draft?

joostjager · 2026-03-22T18:00:40Z

I first wanted to do a serious local run, but then it turned out there are so many pre-existing fuzz failures that it is hard to see what's new. I've bisected the failures to the various PRs that introduced them.

joostjager · 2026-03-22T20:01:05Z

Although for this PR I could just see if anything pops up that doesn't repro on main. Will do that.

joostjager · 2026-03-23T08:23:03Z

Several newly introduced fuzz failures to address:

7084a32219ffa61b1919192119197084a32219ff4400001d228012ffab
10fa0040801170704040198040111911191f1f1170bbff1170ffb3b2b6
10000150804211be707040198011191f1119401f401f0aa97210b6ff25
10000150804211707040198011191f11191f1a4040a91f7210b6ff25

joostjager · 2026-03-26T10:35:59Z

Zoomed in on one of those sequences, and it seems it is reproducible with another string without deferred mode too.

0270801109191109191f1f10b6ff

joostjager · 2026-04-15T08:40:24Z

Dependency: #4520

Add a deferred flag to TestChainMonitor that controls whether the underlying ChainMonitor queues operations instead of executing them immediately. The flag is derived from the first fuzz input byte so each of the three nodes can independently run in deferred or immediate mode, while the same byte also selects the channel type.

In deferred mode, watch_channel and update_channel always return InProgress. A new flush_and_update_latest_monitors method drains the queued operations and, when the persister reports Completed, promotes the pending shadow monitor snapshots to persisted state. This method is called before release_pending_monitor_events, at each point where the fuzzer completes pending monitor updates, and after watch_channel during node reload so the node starts from a consistent state.

The first-byte config layout now uses bits 0 to 2 for initial monitor status, bits 3 to 4 for channel type, and bits 5 to 7 for deferred monitor writes. This fixes the rebase conflict where deferred mode still referenced the pre-channel-type bit layout.

AI tools were used in preparing this commit.
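For readers following along, the first-byte layout described above can be sketched as a small decoder. This is only an illustration, not the code in fuzz/src/chanmon_consistency.rs: the names `FirstByteConfig` and `decode_first_byte` are hypothetical, the mapping of each bit to a specific node (lowest bit of each group = first node) is an assumption, and the meaning of a set monitor-status bit is not stated in this thread.

```rust
/// Hypothetical decoded view of the first fuzz input byte.
/// Field names and per-node bit ordering are assumptions for illustration.
#[derive(Debug, PartialEq)]
struct FirstByteConfig {
    /// Bits 0 to 2: one initial-monitor-status bit per node.
    monitor_status_bits: [bool; 3],
    /// Bits 3 to 4: channel type selector (a value in 0..=3).
    channel_type: u8,
    /// Bits 5 to 7: one deferred-monitor-writes bit per node.
    deferred: [bool; 3],
}

fn decode_first_byte(byte: u8) -> FirstByteConfig {
    // Returns true when bit `n` of the config byte is set.
    let bit = |n: u8| (byte >> n) & 1 == 1;
    FirstByteConfig {
        monitor_status_bits: [bit(0), bit(1), bit(2)],
        channel_type: (byte >> 3) & 0b11,
        deferred: [bit(5), bit(6), bit(7)],
    }
}

fn main() {
    // 0b101_01_001: deferred bits 1,0,1 (bits 7..5), channel type 01, status bits 0,0,1 (bits 2..0).
    let cfg = decode_first_byte(0b1010_1001);
    println!("{:?}", cfg);
}
```

Under this assumed layout, any first byte below 0x20 leaves all three deferred bits clear, so all nodes would run in immediate mode.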

joostjager · 2026-05-07T08:19:52Z

Interestingly the non-deferred mode reproducer 0270801109191109191f1f10b6ff doesn't fail on main anymore, with #4520 not yet merged. Bisected the fixing PR to #4529 (@tankyleo).

It seems the increased headroom made the bug unobservable to the fuzzer?

I ran the fuzzer on this branch overnight, and no failures were found.

joostjager · 2026-05-07T08:59:29Z

I'll try to add a new invariant to the fuzzer so it catches the problem in a more robust way.

joostjager · 2026-05-07T11:29:12Z

Invariant addition: #4601

joostjager · 2026-05-07T13:23:31Z

Will rebase this after #4571

ldk-claude-review-bot · 2026-05-07T13:39:44Z

+	/// This simulates the pattern of snapshotting the pending count, persisting the
+	/// `ChannelManager`, then flushing the queued monitor writes.


Nit: The docstring says this "simulates the pattern of snapshotting the pending count, persisting the ChannelManager, then flushing the queued monitor writes." However, in the actual fuzz loop, the flush happens during event processing (called from release_pending_monitor_events), while the ChannelManager is serialized at the end of each loop iteration (lines 3023-3031) — i.e., after the flush, not before.

This means the fuzzer always tests the scenario where the ChannelManager snapshot captures post-flush state. The more interesting crash scenario — serializing the ChannelManager while monitor writes are still queued, then crashing before the flush — is not directly exercised by this ordering. (The fuzzer does partially cover stale-monitor restarts through the use_old_mons selection in reload_node, but that's a different axis.)

Consider either updating the docstring to reflect what actually happens, or restructuring so the ChannelManager is serialized at the flush call-site to match the documented pattern.
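To make the ordering question concrete, here is a toy model of the documented pattern (snapshot the pending count, persist the manager, then flush only the snapshotted writes). It is a sketch under stated assumptions: `DeferredQueue` and all of its methods are hypothetical stand-ins, not the ChainMonitor API, and the "manager snapshot" is reduced to the list of update ids it depends on.

```rust
use std::collections::VecDeque;

/// Toy stand-in for a monitor layer that queues writes in deferred mode.
/// Hypothetical type; not part of rust-lightning.
struct DeferredQueue {
    /// Update ids waiting to be written.
    queued: VecDeque<u64>,
    /// Update ids that have reached "disk".
    written: Vec<u64>,
}

impl DeferredQueue {
    fn new() -> Self {
        DeferredQueue { queued: VecDeque::new(), written: Vec::new() }
    }

    fn queue(&mut self, id: u64) {
        self.queued.push_back(id);
    }

    fn pending_count(&self) -> usize {
        self.queued.len()
    }

    /// Write at most `count` queued updates, oldest first.
    fn flush(&mut self, count: usize) {
        for _ in 0..count {
            match self.queued.pop_front() {
                Some(id) => self.written.push(id),
                None => break,
            }
        }
    }
}

fn main() {
    let mut q = DeferredQueue::new();
    q.queue(1);
    q.queue(2);

    // 1. Snapshot how many writes are pending right now.
    let count = q.pending_count();
    // 2. "Persist the manager": here, just record the ids it depends on.
    let manager_snapshot: Vec<u64> = q.queued.iter().copied().collect();
    // A write queued after the snapshot must not be covered by this flush.
    q.queue(3);
    // 3. Flush only the writes that existed at snapshot time.
    q.flush(count);

    assert!(manager_snapshot.iter().all(|id| q.written.contains(id)));
    assert_eq!(q.pending_count(), 1);
}
```

A crash between steps 2 and 3 in this model would leave a persisted manager whose writes are still queued, which is the scenario the comment above describes as not directly exercised.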

ldk-claude-review-bot · 2026-05-07T13:39:57Z

+		if persister_res == chain::ChannelMonitorUpdateStatus::Completed {
+			for (_channel_id, state) in self.latest_monitors.lock().unwrap().iter_mut() {
+				if let Some((id, data)) = state.pending_monitors.drain(..).last() {
+					state.persisted_monitor_id = id;
+					state.persisted_monitor = data;
+				}
+			}
+		}


When persister_res == Completed, this drains pending_monitors for all channels, not just those whose operations were included in the current flush(count) call. This is correct today because update_ret is constant for the lifetime of a node (set at creation/reload, never mutated by the mon_style fuzz commands at 0x00-0x06). That invariant ensures:

- If update_ret == Completed: every prior flush also used Completed, so there are no "orphaned" pending entries from an earlier InProgress flush.
- If update_ret == InProgress: this branch is never taken.

But this is a subtle invariant — if a future change allows update_ret to be toggled mid-run (e.g., fuzz commands updating the live persister), this would silently promote monitors whose corresponding channel_monitor_updated calls were never made by the flush. Consider adding a comment documenting this assumption, or tightening the logic to only promote monitors that correspond to channels with operations in the flushed batch.
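One way the tightened variant could look is sketched below. This is a simplified model, not a proposed patch: `ShadowState`, `ChannelId`, and `promote_flushed` are hypothetical names, only the field names are borrowed from the quoted hunk, and how the set of flushed channels would be obtained from the real flush call is left open. It is also coarse, since it promotes every pending snapshot of a flushed channel rather than tracking individual update ids.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical channel identifier for this sketch.
type ChannelId = u32;

/// Simplified shadow state per channel, modeled on the quoted hunk.
#[derive(Default)]
struct ShadowState {
    persisted_monitor_id: u64,
    persisted_monitor: Vec<u8>,
    /// Snapshots queued but not yet acknowledged as persisted, oldest first.
    pending_monitors: Vec<(u64, Vec<u8>)>,
}

/// Promote pending snapshots only for channels that had operations in the
/// flushed batch, leaving every other channel's pending list untouched.
fn promote_flushed(
    latest_monitors: &mut HashMap<ChannelId, ShadowState>,
    flushed: &HashSet<ChannelId>,
) {
    for (channel_id, state) in latest_monitors.iter_mut() {
        if !flushed.contains(channel_id) {
            continue;
        }
        // Drain all pending entries and keep the newest as the persisted state.
        if let Some((id, data)) = state.pending_monitors.drain(..).last() {
            state.persisted_monitor_id = id;
            state.persisted_monitor = data;
        }
    }
}

fn main() {
    let mut latest: HashMap<ChannelId, ShadowState> = HashMap::new();
    latest.entry(1).or_default().pending_monitors.push((7, vec![0xaa]));
    latest.entry(2).or_default().pending_monitors.push((9, vec![0xbb]));

    // Only channel 1 had operations in this flush.
    promote_flushed(&mut latest, &HashSet::from([1]));

    assert_eq!(latest[&1].persisted_monitor_id, 7);
    assert_eq!(latest[&2].pending_monitors.len(), 1);
}
```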

ldk-claude-review-bot · 2026-05-07T13:40:30Z

Review Summary

After thorough analysis of all hunks in this diff, I found no bugs or security issues. The deferred monitor write integration into the fuzz target is well-designed and correctly handles the various combinations of deferred mode and mon_style (InProgress vs Completed).

Inline comments posted:

- fuzz/src/chanmon_consistency.rs:366-367: The docstring describes a "snapshot count → persist ChannelManager → flush" pattern, but the actual fuzz loop flushes during event processing and serializes the ChannelManager afterwards. The documented crash scenario (serialize before flush) is not directly exercised.
- fuzz/src/chanmon_consistency.rs:379-386: flush_and_update_latest_monitors promotes pending_monitors for all channels unconditionally when the persister returns Completed. This is correct today because update_ret is constant during a node's lifetime, but the invariant is subtle and undocumented. If a future change allows update_ret to be toggled mid-run, this would silently mistrack acknowledgments.

Cross-cutting observations:

- The code correctly handles both deferred=true + mon_style=Completed and deferred=true + mon_style=InProgress combinations. In the Completed case, flush() inside ChainMonitor calls channel_monitor_updated and our shadow promotes; in the InProgress case, explicit completion helpers (complete_all_pending_monitor_updates!, complete_monitor_update, complete_all_monitor_updates) handle acknowledgment.
- The initial channel setup in make_channel! correctly uses complete_all_pending_monitor_updates! (with >= comparison) to handle the deferred initial watch_channel, ensuring persisted_monitor is populated before the fuzz loop begins.
- The reload path correctly uses a temporary Completed persister for the initial flush, then switches to the node's mon_style for subsequent operations.

joostjager force-pushed the chain-mon-internal-deferred-writes-with-fuzz branch from 51afc25 to 0aebb10 Compare March 19, 2026 07:51

joostjager force-pushed the chain-mon-internal-deferred-writes-with-fuzz branch from 0aebb10 to 6274ba0 Compare March 20, 2026 17:52

joostjager force-pushed the chain-mon-internal-deferred-writes-with-fuzz branch 2 times, most recently from c6da16e to d4bf3e0 Compare April 15, 2026 04:36

joostjager self-assigned this Apr 16, 2026

joostjager added this to Weekly Goals Apr 16, 2026

joostjager force-pushed the chain-mon-internal-deferred-writes-with-fuzz branch from d4bf3e0 to a798d74 Compare May 6, 2026 11:42

joostjager marked this pull request as ready for review May 7, 2026 13:23

ldk-reviews-bot requested a review from wpaulino May 7, 2026 13:23

joostjager removed the request for review from wpaulino May 7, 2026 13:23

ldk-claude-review-bot reviewed May 7, 2026

4 participants
