bdev_nvme: stop failover hot-loop from adminq poller while resetting by jleeh · Pull Request #1 · linkpoolio/spdk

jleeh · 2026-06-09T07:48:36Z

Problem

A v2 instance-manager (ma5-worker-5, v1.11.1-linkpool.70) was SIGKILLed on 2026-06-09 04:36 UTC. Its bdev_nvme initiator wedged in a failover hot-loop, logging bdev_nvme_failover_ctrlr_unsafe: "Unable to perform failover, already in progress." ~231k times in ~15 min (~1.1 GB of synchronous reactor-thread log writes). This saturated the SPDK reactor until the kubelet liveness probe timed out and killed spdk_tgt.

Root cause

bdev_nvme_poll_adminq re-calls bdev_nvme_failover_ctrlr on every poll whenever spdk_nvme_ctrlr_process_admin_completions() returns < 0 with no disconnected_cb pending. When a reset/failover is already in flight (nvme_ctrlr->resetting), that call is a guaranteed no-op that hits the "already in progress" branch, logs a NOTICE, and returns -EBUSY. A stalled reconnect (remote replica target that accepts TCP but never completes the admin handshake) keeps the adminq failing every poll while resetting stays true → the poller spins on this no-op.
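The spin described above can be reproduced in a toy model. This is a minimal sketch, not SPDK code: `struct model_ctrlr`, `model_failover`, `model_poll_adminq_unguarded` and `model_stalled_reconnect` are hypothetical names, the `adminq_failing` flag stands in for spdk_nvme_ctrlr_process_admin_completions() returning < 0, and a counter stands in for the NOTICE log line. It assumes the reconnect never completes, so `resetting` is never cleared.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the controller state named in the PR. */
struct model_ctrlr {
	bool resetting;        /* a reset/failover is already in flight */
	bool adminq_failing;   /* models process_admin_completions() returning < 0 */
	unsigned long notices; /* counts "already in progress" NOTICE lines */
};

/* Models the failover entry point: a no-op that logs and returns -EBUSY
 * when a reset/failover is already in progress. */
static int model_failover(struct model_ctrlr *c)
{
	if (c->resetting) {
		c->notices++;      /* stands in for the synchronous NOTICE log write */
		return -EBUSY;
	}
	c->resetting = true;   /* the first call starts the reset state machine */
	return 0;
}

/* Unguarded poller: every failed admin poll re-drives failover. */
static void model_poll_adminq_unguarded(struct model_ctrlr *c)
{
	assert(c != NULL);
	if (c->adminq_failing) {
		(void)model_failover(c);
	}
}

/* Poll `polls` times against a reconnect that never completes and
 * return how many NOTICE lines would have been written. */
static unsigned long model_stalled_reconnect(unsigned long polls)
{
	struct model_ctrlr c = { .resetting = false, .adminq_failing = true, .notices = 0 };

	for (unsigned long i = 0; i < polls; i++) {
		model_poll_adminq_unguarded(&c);
	}
	return c.notices;
}
```

In this model the first failed poll starts the failover and every later poll only logs, so calling `model_stalled_reconnect(n)` from a `main` yields `n - 1` log lines: the log volume grows linearly with the number of polls for as long as the reconnect stays stalled.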

Upstream master has the same unguarded re-drive, so porting the upstream failover series does not fix this.

Fix

Guard the re-drive with !nvme_ctrlr->resetting. While a reset/failover is in progress the reset state machine already owns recovery (reconnect, ctrlr_loss_timeout, next-trid advance on completion). The authoritative resetting check still happens under the mutex inside bdev_nvme_failover_ctrlr_unsafe; this is a fast-path guard that avoids the per-poll re-enqueue + log flood.
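The effect of the guard can be sketched in the same toy style. This is an illustrative model under stated assumptions, not the patched SPDK function: all `model_*` names are hypothetical, `adminq_failing` stands in for the admin completion call returning < 0, and the mutex that protects the authoritative check is omitted because the model is single-threaded.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the controller state named in the PR. */
struct model_ctrlr {
	bool resetting;               /* a reset/failover is already in flight */
	bool adminq_failing;          /* models process_admin_completions() < 0 */
	unsigned long failover_calls; /* how often the failover entry point ran */
	unsigned long notices;        /* "already in progress" NOTICE lines */
};

/* Authoritative check: still rejects a second failover while resetting. */
static int model_failover(struct model_ctrlr *c)
{
	c->failover_calls++;
	if (c->resetting) {
		c->notices++;
		return -EBUSY;
	}
	c->resetting = true;
	return 0;
}

/* Guarded poller: skip the re-drive while the reset state machine
 * already owns recovery. */
static void model_poll_adminq_guarded(struct model_ctrlr *c)
{
	assert(c != NULL);
	if (c->adminq_failing && !c->resetting) {
		(void)model_failover(c);
	}
}

/* Poll `polls` times with the admin queue failing and return the final state. */
static struct model_ctrlr model_run_guarded(bool resetting_at_start, unsigned long polls)
{
	struct model_ctrlr c = { .resetting = resetting_at_start, .adminq_failing = true };

	for (unsigned long i = 0; i < polls; i++) {
		model_poll_adminq_guarded(&c);
	}
	return c;
}
```

With `resetting_at_start` set, the model never reaches the failover entry point and writes no log lines. With it cleared, exactly one call starts the failover and later polls are skipped. The check inside `model_failover` is kept on purpose: the guard is only a fast path, and the inner check remains the authority.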

Test

test_failover_not_redriven_while_resetting in bdev_nvme_ut:

Case 1: reset in progress + adminq failing → asserts no failover re-drive (observed via pending_failover).
Case 2: no reset in progress → asserts failover still initiates (no over-blocking).

Verified locally: full bdev_nvme_ut suite 54/54, 4126 asserts. Negative control (guard reverted) → the new test FAILS, confirming it catches the regression.
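The two cases and the negative control can be illustrated with a small standalone model. This is a sketch only, not the unit test added in this PR: the `model_*` names and the `guard_enabled` switch are hypothetical, the deferral branch follows the description in the test commit message (resetting && !in_failover sets pending_failover), and the -EINPROGRESS return value for that branch is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the controller state named in the PR. */
struct model_ctrlr {
	bool resetting;        /* a reset/failover is already in flight */
	bool in_failover;      /* the in-flight operation is a failover */
	bool pending_failover; /* a failover was deferred until reset completes */
	bool adminq_failing;   /* models process_admin_completions() < 0 */
};

static int model_failover(struct model_ctrlr *c)
{
	if (c->resetting) {
		if (!c->in_failover) {
			c->pending_failover = true; /* side effect the test observes */
			return -EINPROGRESS;        /* assumed return value */
		}
		return -EBUSY;                  /* "already in progress" branch */
	}
	c->resetting = true;
	c->in_failover = true;
	return 0;
}

/* guard_enabled == false models the negative control (guard reverted). */
static void model_poll_adminq(struct model_ctrlr *c, bool guard_enabled)
{
	assert(c != NULL);
	if (!c->adminq_failing) {
		return;
	}
	if (guard_enabled && c->resetting) {
		return;
	}
	(void)model_failover(c);
}

/* Case 1: reset in progress + adminq failing. Returns pending_failover. */
static bool model_case1_pending_failover(bool guard_enabled)
{
	struct model_ctrlr c = { .resetting = true, .in_failover = false, .adminq_failing = true };

	model_poll_adminq(&c, guard_enabled);
	return c.pending_failover;
}

/* Case 2: no reset in progress. Returns whether failover was initiated. */
static bool model_case2_failover_started(bool guard_enabled)
{
	struct model_ctrlr c = { .adminq_failing = true };

	model_poll_adminq(&c, guard_enabled);
	return c.resetting;
}
```

In this model, `model_case1_pending_failover(true)` stays false while `model_case1_pending_failover(false)` becomes true, which is the difference the negative control relies on. `model_case2_failover_started(true)` is true, so the guard does not block a legitimate first failover.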

Incident: longhorn-im-failover-incident-2026-06-09.

bdev_nvme_poll_adminq re-called bdev_nvme_failover_ctrlr on every poll whenever spdk_nvme_ctrlr_process_admin_completions returned < 0 and no disconnected_cb was pending. When a reset/failover is already in flight (nvme_ctrlr->resetting), that call is a guaranteed no-op: it takes the mutex, hits the "already in progress" branch in bdev_nvme_failover_ctrlr_unsafe, logs a NOTICE, and returns -EBUSY.

If a reconnect stalls -- e.g. a remote replica target that accepts the TCP connection but never completes the admin handshake -- the adminq keeps failing every poll while resetting stays true, so the poller spins on this no-op. Observed in production (ma5-worker-5, v2 instance-manager) as ~231k "Unable to perform failover, already in progress." lines in ~15 min, ~1.1 GB of synchronous reactor-thread log writes, which saturated the SPDK reactor until the kubelet liveness probe timed out and SIGKILLed spdk_tgt.

Guard the re-drive with !nvme_ctrlr->resetting. While a reset/failover is in progress the reset state machine already owns recovery: reconnect attempts, ctrlr_loss_timeout, and advancing to the next trid on reset completion. The authoritative resetting check still happens under the mutex inside bdev_nvme_failover_ctrlr_unsafe; this is a fast-path guard that simply avoids the pathological per-poll re-enqueue + log flood.

Incident: longhorn-im-failover-incident-2026-06-09.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds test_failover_not_redriven_while_resetting, exercising the bdev_nvme_poll_adminq guard from the previous commit.

Case 1 (the regression): with a reset/failover already in progress (resetting == true, in_failover == false) and the admin queue disconnected so process_admin_completions returns < 0, poll_adminq must NOT re-drive failover. Observed via pending_failover: an un-guarded re-drive reaches bdev_nvme_failover_ctrlr_unsafe's "resetting && !in_failover" branch and sets pending_failover = true, so the assert pending_failover == false fails without the guard. This is the same no-op re-drive that, in production, spun the reactor with ~231k "already in progress" NOTICEs until liveness SIGKILLed spdk_tgt.

Case 2 (no over-blocking): with no reset in progress, the same admin-queue failure still initiates failover (resetting becomes true) and the reset reconnects and completes, proving the guard does not suppress legitimate recovery.

Setup/teardown mirror test_failover_ctrlr.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jleeh and others added 2 commits June 9, 2026 08:32

jleeh merged commit 47df8e3 into rebase/v26.01 Jun 9, 2026
1 check passed

jleeh merged 2 commits into rebase/v26.01 from fix/bdev-nvme-failover-hotloop
