bdev_nvme: stop failover hot-loop from adminq poller while resetting #1

Merged
jleeh merged 2 commits into rebase/v26.01 from fix/bdev-nvme-failover-hotloop
Jun 9, 2026

@jleeh jleeh commented Jun 9, 2026


Problem

A v2 instance-manager (ma5-worker-5, v1.11.1-linkpool.70) was SIGKILLed on 2026-06-09 04:36 UTC: its bdev_nvme initiator wedged in a failover hot-loop, logging bdev_nvme_failover_ctrlr_unsafe: "Unable to perform failover, already in progress." ~231k times in ~15 min (~1.1 GB of synchronous reactor-thread log writes), saturating the SPDK reactor until the kubelet liveness probe killed spdk_tgt.

Root cause

bdev_nvme_poll_adminq re-calls bdev_nvme_failover_ctrlr on every poll whenever spdk_nvme_ctrlr_process_admin_completions() returns < 0 with no disconnected_cb pending. When a reset/failover is already in flight (nvme_ctrlr->resetting), that call is a guaranteed no-op that hits the "already in progress" branch, logs a NOTICE, and returns -EBUSY. A stalled reconnect (e.g. a remote replica target that accepts the TCP connection but never completes the admin handshake) keeps the adminq failing on every poll while resetting stays true, so the poller spins on this no-op and emits one NOTICE per poll.
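
To make the spin concrete, here is a minimal stand-alone model of the unguarded path. It is a sketch, not the SPDK source: `struct model_ctrlr`, `model_failover`, `model_poll_adminq_unguarded`, and `model_spin` are hypothetical names, the failing admin queue is reduced to a boolean, and the NOTICE log write is reduced to a counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the controller state; not the real SPDK struct. */
struct model_ctrlr {
    bool resetting;         /* a reset/failover is already in flight */
    bool adminq_failing;    /* admin completion processing returns < 0 */
    unsigned long notices;  /* "already in progress" NOTICE lines emitted */
};

/* Simplified failover entry point: refuses while a reset is in flight. */
static int model_failover(struct model_ctrlr *c)
{
    if (c->resetting) {
        c->notices++;       /* stands in for the synchronous NOTICE log write */
        return -EBUSY;
    }
    c->resetting = true;    /* the reset state machine takes over from here */
    return 0;
}

/* Unguarded poller: re-drives failover on every failing poll. */
static void model_poll_adminq_unguarded(struct model_ctrlr *c)
{
    if (c->adminq_failing) {
        (void)model_failover(c);
    }
}

/* Run n polls against a stalled reconnect and return the NOTICE count. */
static unsigned long model_spin(unsigned long n)
{
    struct model_ctrlr c = { .resetting = true, .adminq_failing = true, .notices = 0 };

    for (unsigned long i = 0; i < n; i++) {
        model_poll_adminq_unguarded(&c);
    }
    return c.notices;
}
```

In this model every poll produces exactly one log line for as long as the reconnect stays stalled, which matches the shape of the flood seen in the incident.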

Upstream master has the same unguarded re-drive, so porting the upstream failover series does not fix this.

Fix

Guard the re-drive with !nvme_ctrlr->resetting. While a reset/failover is in progress the reset state machine already owns recovery (reconnect, ctrlr_loss_timeout, next-trid advance on completion). The authoritative resetting check still happens under the mutex inside bdev_nvme_failover_ctrlr_unsafe; this is a fast-path guard that avoids the per-poll re-enqueue + log flood.
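
The shape of the guard can be sketched with a small stand-alone model. This is an illustration under assumptions, not the actual diff: `struct guard_ctrlr`, `guard_failover`, `guard_poll_adminq`, and `guard_calls_after` are hypothetical names, and the mutex around the authoritative check is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the controller state; not the real SPDK struct. */
struct guard_ctrlr {
    bool resetting;               /* a reset/failover is already in flight */
    bool adminq_failing;          /* admin completion processing returns < 0 */
    unsigned long failover_calls; /* how often the failover entry point ran */
};

/* Simplified failover entry point; the real check runs under the mutex. */
static int guard_failover(struct guard_ctrlr *c)
{
    c->failover_calls++;
    if (c->resetting) {
        return -EBUSY;            /* authoritative "already in progress" check */
    }
    c->resetting = true;
    return 0;
}

/* Guarded poller: skip the re-drive while the reset state machine owns recovery. */
static void guard_poll_adminq(struct guard_ctrlr *c)
{
    if (c->adminq_failing && !c->resetting) {
        (void)guard_failover(c);
    }
}

/* Run a number of failing polls and return how many reached the entry point. */
static unsigned long guard_calls_after(bool resetting, unsigned long polls)
{
    struct guard_ctrlr c = { .resetting = resetting, .adminq_failing = true, .failover_calls = 0 };

    for (unsigned long i = 0; i < polls; i++) {
        guard_poll_adminq(&c);
    }
    return c.failover_calls;
}
```

With a reset already in flight, no poll reaches the failover entry point. With no reset in flight, the first failing poll starts failover and sets resetting, so later polls are suppressed by the same guard.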

Test

test_failover_not_redriven_while_resetting in bdev_nvme_ut:

  • Case 1: reset in progress + adminq failing → asserts no failover re-drive (observed via pending_failover).
  • Case 2: no reset in progress → asserts failover still initiates (no over-blocking).

Verified locally: the full bdev_nvme_ut suite passes (54/54 tests, 4126 asserts). As a negative control, reverting the guard makes the new test fail, confirming that it catches the regression.
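
The two cases and the negative control can be modelled outside the SPDK unit-test harness as follows. This is a simplified sketch, not the code in bdev_nvme_ut: the `ut_*` names are hypothetical, one failing poll is reduced to a single function call, and `guard_enabled = false` stands in for reverting the guard.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the controller state; not the real SPDK struct. */
struct ut_ctrlr {
    bool resetting;
    bool in_failover;
    bool pending_failover;
};

/* Simplified failover entry point, following the behaviour described above. */
static int ut_failover(struct ut_ctrlr *c)
{
    if (c->resetting) {
        if (!c->in_failover) {
            c->pending_failover = true;  /* observable trace of a re-drive */
        }
        return -EBUSY;
    }
    c->resetting = true;
    return 0;
}

/* One poll with a failing admin queue; guard_enabled = false models the revert. */
static void ut_poll_failing_adminq(struct ut_ctrlr *c, bool guard_enabled)
{
    if (guard_enabled && c->resetting) {
        return;
    }
    (void)ut_failover(c);
}

/* Case 1: reset in progress; passes only if no re-drive was observed. */
static bool ut_case1_passes(bool guard_enabled)
{
    struct ut_ctrlr c = { .resetting = true, .in_failover = false, .pending_failover = false };

    ut_poll_failing_adminq(&c, guard_enabled);
    return !c.pending_failover;
}

/* Case 2: no reset in progress; passes only if failover was initiated. */
static bool ut_case2_passes(bool guard_enabled)
{
    struct ut_ctrlr c = { .resetting = false, .in_failover = false, .pending_failover = false };

    ut_poll_failing_adminq(&c, guard_enabled);
    return c.resetting;
}
```

In this model Case 1 passes only with the guard enabled, while Case 2 passes either way, which is the property the negative control relies on.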

Incident: longhorn-im-failover-incident-2026-06-09.

jleeh and others added 2 commits June 9, 2026 08:32
bdev_nvme_poll_adminq re-called bdev_nvme_failover_ctrlr on every poll
whenever spdk_nvme_ctrlr_process_admin_completions returned < 0 and no
disconnected_cb was pending. When a reset/failover is already in flight
(nvme_ctrlr->resetting), that call is a guaranteed no-op: it takes the
mutex, hits the "already in progress" branch in
bdev_nvme_failover_ctrlr_unsafe, logs a NOTICE, and returns -EBUSY.

If a reconnect stalls -- e.g. a remote replica target that accepts the
TCP connection but never completes the admin handshake -- the adminq
keeps failing every poll while resetting stays true, so the poller spins
on this no-op. Observed in production (ma5-worker-5, v2 instance-manager)
as ~231k "Unable to perform failover, already in progress." lines in
~15 min, ~1.1 GB of synchronous reactor-thread log writes, which
saturated the SPDK reactor until the kubelet liveness probe timed out and
SIGKILLed spdk_tgt.

Guard the re-drive with !nvme_ctrlr->resetting. While a reset/failover is
in progress the reset state machine already owns recovery: reconnect
attempts, ctrlr_loss_timeout, and advancing to the next trid on reset
completion. The authoritative resetting check still happens under the
mutex inside bdev_nvme_failover_ctrlr_unsafe; this is a fast-path guard
that simply avoids the pathological per-poll re-enqueue + log flood.

Incident: longhorn-im-failover-incident-2026-06-09.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds test_failover_not_redriven_while_resetting, exercising the
bdev_nvme_poll_adminq guard from the previous commit.

Case 1 (the regression): with a reset/failover already in progress
(resetting == true, in_failover == false) and the admin queue
disconnected so process_admin_completions returns < 0, poll_adminq must
NOT re-drive failover. Observed via pending_failover: an un-guarded
re-drive reaches bdev_nvme_failover_ctrlr_unsafe's "resetting &&
!in_failover" branch and sets pending_failover = true, so the assert
pending_failover == false fails without the guard. This is the same
no-op re-drive that, in production, spun the reactor with ~231k "already
in progress" NOTICEs until liveness SIGKILLed spdk_tgt.

Case 2 (no over-blocking): with no reset in progress, the same admin-queue
failure still initiates failover (resetting becomes true) and the reset
reconnects and completes, proving the guard does not suppress legitimate
recovery.

Setup/teardown mirror test_failover_ctrlr.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jleeh jleeh merged commit 47df8e3 into rebase/v26.01 Jun 9, 2026
1 check passed