bdev_nvme: stop failover hot-loop from adminq poller while resetting #1
Merged
bdev_nvme_poll_adminq re-called bdev_nvme_failover_ctrlr on every poll whenever spdk_nvme_ctrlr_process_admin_completions returned < 0 and no disconnected_cb was pending. When a reset/failover is already in flight (nvme_ctrlr->resetting), that call is a guaranteed no-op: it takes the mutex, hits the "already in progress" branch in bdev_nvme_failover_ctrlr_unsafe, logs a NOTICE, and returns -EBUSY.

If a reconnect stalls -- e.g. a remote replica target that accepts the TCP connection but never completes the admin handshake -- the adminq keeps failing every poll while resetting stays true, so the poller spins on this no-op. Observed in production (ma5-worker-5, v2 instance-manager) as ~231k "Unable to perform failover, already in progress." lines in ~15 min, ~1.1 GB of synchronous reactor-thread log writes, which saturated the SPDK reactor until the kubelet liveness probe timed out and SIGKILLed spdk_tgt.

Guard the re-drive with !nvme_ctrlr->resetting. While a reset/failover is in progress the reset state machine already owns recovery: reconnect attempts, ctrlr_loss_timeout, and advancing to the next trid on reset completion. The authoritative resetting check still happens under the mutex inside bdev_nvme_failover_ctrlr_unsafe; this is a fast-path guard that simply avoids the pathological per-poll re-enqueue + log flood.

Incident: longhorn-im-failover-incident-2026-06-09.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds test_failover_not_redriven_while_resetting, exercising the bdev_nvme_poll_adminq guard from the previous commit.

Case 1 (the regression): with a reset/failover already in progress (resetting == true, in_failover == false) and the admin queue disconnected so process_admin_completions returns < 0, poll_adminq must NOT re-drive failover. Observed via pending_failover: an un-guarded re-drive reaches bdev_nvme_failover_ctrlr_unsafe's "resetting && !in_failover" branch and sets pending_failover = true, so the assert pending_failover == false fails without the guard. This is the same no-op re-drive that, in production, spun the reactor with ~231k "already in progress" NOTICEs until liveness SIGKILLed spdk_tgt.

Case 2 (no over-blocking): with no reset in progress, the same admin-queue failure still initiates failover (resetting becomes true) and the reset reconnects and completes, proving the guard does not suppress legitimate recovery.

Setup/teardown mirror test_failover_ctrlr.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
A v2 instance-manager (ma5-worker-5, v1.11.1-linkpool.70) was SIGKILLed on 2026-06-09 04:36 UTC: its bdev_nvme initiator wedged in a failover hot-loop, logging bdev_nvme_failover_ctrlr_unsafe: "Unable to perform failover, already in progress." ~231k times in ~15 min (~1.1 GB of synchronous reactor-thread log writes), saturating the SPDK reactor until the kubelet liveness probe killed spdk_tgt.
Root cause
bdev_nvme_poll_adminq re-calls bdev_nvme_failover_ctrlr on every poll whenever spdk_nvme_ctrlr_process_admin_completions() returns < 0 with no disconnected_cb pending. When a reset/failover is already in flight (nvme_ctrlr->resetting), that call is a guaranteed no-op that hits the "already in progress" branch, logs a NOTICE, and returns -EBUSY. A stalled reconnect (a remote replica target that accepts TCP but never completes the admin handshake) keeps the adminq failing every poll while resetting stays true, so the poller spins on this no-op.
Upstream master has the same unguarded re-drive, so porting the upstream failover series does not fix this.
Fix
Guard the re-drive with !nvme_ctrlr->resetting. While a reset/failover is in progress the reset state machine already owns recovery (reconnect, ctrlr_loss_timeout, next-trid advance on completion). The authoritative resetting check still happens under the mutex inside bdev_nvme_failover_ctrlr_unsafe; this is a fast-path guard that avoids the per-poll re-enqueue + log flood.
Test
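test_failover_not_redriven_while_resetting in bdev_nvme_ut:
Case 1 (the regression): with a reset/failover already in progress and the admin queue failing, poll_adminq must not re-drive failover (observed via pending_failover).
Case 2 (no over-blocking): with no reset in progress, the same admin-queue failure still initiates failover and the reset completes.
Verified locally: full bdev_nvme_ut suite 54/54, 4126 asserts. Negative control (guard reverted) → the new test FAILS, confirming it catches the regression.
Incident: longhorn-im-failover-incident-2026-06-09.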
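For readers without the SPDK tree at hand, the guard described under Fix can be illustrated with a small standalone model. This is only a sketch, not the actual patch: struct model_ctrlr, model_failover, model_poll_adminq, and model_run are hypothetical names, locking is omitted, and the real bdev_nvme code paths are reduced to two flags and two counters.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the controller state; not the real SPDK struct. */
struct model_ctrlr {
	bool resetting;          /* a reset/failover is already in flight */
	bool adminq_failing;     /* admin completion processing returns < 0 */
	unsigned failover_calls; /* times the failover entry point was called */
	unsigned notices;        /* "already in progress" lines that would be logged */
};

/* Models the failover entry point: a logged no-op (-EBUSY) while resetting. */
static int model_failover(struct model_ctrlr *c)
{
	c->failover_calls++;
	if (c->resetting) {
		c->notices++;
		return -EBUSY;
	}
	c->resetting = true; /* the reset state machine now owns recovery */
	return 0;
}

/* Models one admin-queue poll; `guarded` selects post-fix behavior. */
static void model_poll_adminq(struct model_ctrlr *c, bool guarded)
{
	int rc = c->adminq_failing ? -1 : 0;

	/* Post-fix: skip the re-drive when a reset is already in progress. */
	if (rc < 0 && (!guarded || !c->resetting)) {
		model_failover(c);
	}
}

/* Polls a stalled reconnect `polls` times and returns the notice count. */
static unsigned model_run(bool guarded, unsigned polls)
{
	struct model_ctrlr c = { .resetting = true, .adminq_failing = true };

	for (unsigned i = 0; i < polls; i++) {
		model_poll_adminq(&c, guarded);
	}
	return c.notices;
}
```

In this model, model_run(false, n) produces one notice per poll, matching the hot-loop in the incident, while model_run(true, n) produces none. A guarded poll with resetting initially false still calls model_failover once, which corresponds to Case 2 of the test.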