fuzz: cover deferred writing in chanmon_consistency #4465
joostjager wants to merge 1 commit into lightningdevkit:main from
|
👋 I see @wpaulino was un-assigned. |
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #4465 +/- ##
==========================================
- Coverage 86.18% 86.17% -0.01%
==========================================
Files 156 156
Lines 108528 108636 +108
Branches 108528 108636 +108
==========================================
+ Hits 93532 93616 +84
- Misses 12386 12405 +19
- Partials 2610 2615 +5
|
51afc25 to
0aebb10
Compare
|
Rebased |
0aebb10 to
6274ba0
Compare
|
Prerequisites are in, rebased. |
|
This LGTM, why is it draft? |
|
I first wanted to do a serious local run, but then it turned out there are so many pre-existing fuzz failures that it is hard to see what's new. I've bisected the failures to the various PRs that introduced them. |
|
Although for this PR I could just see if anything pops up that doesn't repro on main. Will do that. |
|
Several newly introduced fuzz failures to address: |
|
Zoomed in on one of those sequences, and it seems it is reproducible with another string without deferred mode too. |
c6da16e to
d4bf3e0
Compare
|
Dependency: #4520 |
Add a deferred flag to TestChainMonitor that controls whether the underlying ChainMonitor queues operations instead of executing them immediately. The flag is derived from the first fuzz input byte so each of the three nodes can independently run in deferred or immediate mode, while the same byte also selects the channel type.

In deferred mode, watch_channel and update_channel always return InProgress. A new flush_and_update_latest_monitors method drains the queued operations and, when the persister reports Completed, promotes the pending shadow monitor snapshots to persisted state. This method is called before release_pending_monitor_events, at each point where the fuzzer completes pending monitor updates, and after watch_channel during node reload so the node starts from a consistent state.

The first-byte config layout now uses bits 0 to 2 for initial monitor status, bits 3 to 4 for channel type, and bits 5 to 7 for deferred monitor writes. This fixes the rebase conflict where deferred mode still referenced the pre-channel-type bit layout.

AI tools were used in preparing this commit.
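For readers who want to see the first-byte layout concretely, here is a minimal sketch of how such a byte could be decoded. The `FuzzConfig` struct and `decode_config` function are hypothetical names, not code from the PR, and the one-bit-per-node mapping inside bits 0 to 2 and bits 5 to 7 is an assumption based on the commit message.

```rust
/// Hypothetical decoded form of the first fuzz input byte (not the PR's code).
#[derive(Debug, PartialEq)]
struct FuzzConfig {
    /// Bits 0..=2: initial monitor status, assumed to be one bit per node.
    initial_status_bits: [bool; 3],
    /// Bits 3..=4: channel type selector, a 2-bit value.
    channel_type: u8,
    /// Bits 5..=7: deferred monitor writes, assumed to be one bit per node.
    deferred: [bool; 3],
}

fn decode_config(byte: u8) -> FuzzConfig {
    // Test whether bit `n` of the config byte is set.
    let bit = |n: u8| (byte >> n) & 1 == 1;
    FuzzConfig {
        initial_status_bits: [bit(0), bit(1), bit(2)],
        channel_type: (byte >> 3) & 0b11,
        deferred: [bit(5), bit(6), bit(7)],
    }
}

fn main() {
    // Bits 0, 3, 5 and 7 are set in this example byte.
    let cfg = decode_config(0b1010_1001);
    println!("{:?}", cfg);
}
```

Keeping the three fields in disjoint bit ranges means a single input byte can vary all three axes independently, which is what lets the fuzzer mix deferred and immediate nodes in one run.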
d4bf3e0 to
a798d74
Compare
|
Interestingly the non-deferred mode reproducer It seems the increased headroom made the bug unobservable to the fuzzer? I ran the fuzzer on this branch overnight, and no failures were found. |
|
I'll try to add a new invariant to the fuzzer so it catches the problem in a more robust way. |
|
Invariant addition: #4601 |
|
Will rebase this after #4571 |
/// This simulates the pattern of snapshotting the pending count, persisting the
/// `ChannelManager`, then flushing the queued monitor writes.
Nit: The docstring says this "simulates the pattern of snapshotting the pending count, persisting the ChannelManager, then flushing the queued monitor writes." However, in the actual fuzz loop, the flush happens during event processing (called from release_pending_monitor_events), while the ChannelManager is serialized at the end of each loop iteration (lines 3023-3031) — i.e., after the flush, not before.
This means the fuzzer always tests the scenario where the ChannelManager snapshot captures post-flush state. The more interesting crash scenario — serializing the ChannelManager while monitor writes are still queued, then crashing before the flush — is not directly exercised by this ordering. (The fuzzer does partially cover stale-monitor restarts through the use_old_mons selection in reload_node, but that's a different axis.)
Consider either updating the docstring to reflect what actually happens, or restructuring so the ChannelManager is serialized at the flush call-site to match the documented pattern.
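The difference between the two orderings can be sketched with a toy model. Everything here is hypothetical and heavily simplified: `Node`, its fields, and both scenario functions are illustrative stand-ins, not the fuzz target's types. Each function returns what would be on disk after a crash, as a pair of (manager snapshot update id, persisted monitor update id).

```rust
/// Toy node: the manager's latest update id, the persisted monitor's
/// update id, and a queue of deferred monitor writes (all hypothetical).
struct Node {
    manager_update_id: u64,
    persisted_monitor_id: u64,
    queued: Vec<u64>,
}

impl Node {
    fn new() -> Self {
        Node { manager_update_id: 0, persisted_monitor_id: 0, queued: Vec::new() }
    }
    /// In deferred mode an update is only queued, not written.
    fn update_channel(&mut self) {
        self.manager_update_id += 1;
        self.queued.push(self.manager_update_id);
    }
    /// Drain the queue and persist the newest queued monitor.
    fn flush(&mut self) {
        if let Some(id) = self.queued.drain(..).last() {
            self.persisted_monitor_id = id;
        }
    }
    fn serialize_manager(&self) -> u64 {
        self.manager_update_id
    }
}

/// Ordering the review says the fuzz loop currently exercises.
fn flush_then_serialize() -> (u64, u64) {
    let mut node = Node::new();
    node.update_channel();
    node.flush();
    let snapshot = node.serialize_manager();
    (snapshot, node.persisted_monitor_id)
}

/// Ordering described in the docstring, with a crash before the flush.
fn serialize_then_crash() -> (u64, u64) {
    let mut node = Node::new();
    node.update_channel();
    let snapshot = node.serialize_manager();
    // Crash here: the queued write never reaches disk.
    (snapshot, node.persisted_monitor_id)
}

fn main() {
    println!("{:?}", flush_then_serialize());
    println!("{:?}", serialize_then_crash());
}
```

In the second scenario the manager snapshot is ahead of the persisted monitor, which is the state the comment describes as not directly exercised.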
    if persister_res == chain::ChannelMonitorUpdateStatus::Completed {
        for (_channel_id, state) in self.latest_monitors.lock().unwrap().iter_mut() {
            if let Some((id, data)) = state.pending_monitors.drain(..).last() {
                state.persisted_monitor_id = id;
                state.persisted_monitor = data;
            }
        }
    }
When persister_res == Completed, this drains pending_monitors for all channels, not just those whose operations were included in the current flush(count) call. This is correct today because update_ret is constant for the lifetime of a node (set at creation/reload, never mutated by the mon_style fuzz commands at 0x00-0x06). That invariant ensures:
- If update_ret == Completed: every prior flush also used Completed, so there are no "orphaned" pending entries from an earlier InProgress flush.
- If update_ret == InProgress: this branch is never taken.
But this is a subtle invariant — if a future change allows update_ret to be toggled mid-run (e.g., fuzz commands updating the live persister), this would silently promote monitors whose corresponding channel_monitor_updated calls were never made by the flush. Consider adding a comment documenting this assumption, or tightening the logic to only promote monitors that correspond to channels with operations in the flushed batch.
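A rough sketch of the tightened variant, assuming the flush could report which channels had operations in the batch. The `MonitorState` struct below only mirrors the field names from the snippet above. The `u32` channel ids, the `flushed` set, and the `promote_flushed` and `demo` functions are hypothetical simplifications, not part of the PR.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in for the per-channel shadow state in the fuzz target.
struct MonitorState {
    persisted_monitor_id: u64,
    persisted_monitor: Vec<u8>,
    pending_monitors: Vec<(u64, Vec<u8>)>,
}

/// Promote pending snapshots only for channels named in `flushed`
/// (a hypothetical set of channel ids covered by the flushed batch).
fn promote_flushed(latest: &mut HashMap<u32, MonitorState>, flushed: &HashSet<u32>) {
    for (channel_id, state) in latest.iter_mut() {
        if !flushed.contains(channel_id) {
            continue; // Leave pending entries of untouched channels in place.
        }
        if let Some((id, data)) = state.pending_monitors.drain(..).last() {
            state.persisted_monitor_id = id;
            state.persisted_monitor = data;
        }
    }
}

/// Returns the persisted ids of channel 1 (flushed) and channel 2 (not flushed).
fn demo() -> (u64, u64) {
    let mut latest = HashMap::new();
    for channel_id in [1u32, 2u32] {
        latest.insert(channel_id, MonitorState {
            persisted_monitor_id: 0,
            persisted_monitor: Vec::new(),
            pending_monitors: vec![(4, vec![0xaa]), (5, vec![0xbb])],
        });
    }
    let flushed: HashSet<u32> = [1u32].into_iter().collect();
    promote_flushed(&mut latest, &flushed);
    (latest[&1].persisted_monitor_id, latest[&2].persisted_monitor_id)
}

fn main() {
    println!("{:?}", demo());
}
```

With this shape, a channel whose writes were not part of the batch keeps its pending entries, so the promotion no longer relies on update_ret staying constant for the lifetime of the node.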
Review Summary
After thorough analysis of all hunks in this diff, I found no bugs or security issues. The deferred monitor write integration into the fuzz target is well-designed and correctly handles the various combinations of
Inline comments posted:
Cross-cutting observations:
|
Adds fuzz coverage for #4351