Skip to content

[DNM] Service abort openshift rebased#487

Open
jacob-anders wants to merge 4 commits into
openshift:mainfrom
jacob-anders:service-abort-openshift-rebased
Open

[DNM] Service abort openshift rebased#487
jacob-anders wants to merge 4 commits into
openshift:mainfrom
jacob-anders:service-abort-openshift-rebased

Conversation

@jacob-anders

@jacob-anders jacob-anders commented May 28, 2026

Copy link
Copy Markdown

Summary by CodeRabbit

Release Notes

  • Improvements

    • Firmware servicing now automatically aborts and returns the host to operational status when firmware specifications are cleared.
    • Enhanced firmware settings validation and condition tracking for improved visibility into firmware service state.
  • Tests

    • Added comprehensive end-to-end test suite validating firmware settings operations across multiple scenarios including creation, updates, and servicing abort.

The setCondition function in the HostFirmwareSettings controller was
checking whether the condition changed by comparing against the old
info.hfs.Status.Conditions rather than the newly updated status. When
only ObservedGeneration changed but the condition status remained the
same, the function returned false (no change), causing the status
update to be skipped. This left ObservedGeneration permanently stale,
blocking the BMH controller's generation-based consistency checks.

Fix by using the return value of meta.SetStatusCondition, which
correctly detects ObservedGeneration changes. Also remove the unused
info parameter from setCondition since it is no longer needed.

Assisted-By: Claude Opus 4.6
Signed-off-by: Jacob Anders <jacob-anders-dev@proton.me>
When a user clears HFC/HFS specs to abort a failed servicing operation,
the BMH controller calls getHostFirmwareSettings/Components which check
that metadata.generation matches status.conditions.observedGeneration.
Since the HFS/HFC sub-controllers may not have reconciled yet after the
spec change, this generation mismatch returns an error that blocks the
abort path, leaving the host stuck in ServicingError with exponential
backoff.

Fix with two complementary changes:

1. Add a fast-path in doServiceIfNeeded: when in ServicingError state,
   check specs directly via lightweight r.Get() calls (which don't
   validate generation) before the generation-sensitive functions. If
   specs are cleared, abort immediately via prov.Service() and clear
   the error/operational status on success.

2. Watch HostFirmwareSettings and HostFirmwareComponents for generation
   changes (spec modifications), mapping events to the corresponding
   BMH. This ensures the BMH controller reconciles promptly when specs
   change, rather than waiting for error-state exponential backoff.

Assisted-By: Claude Opus 4.6
Signed-off-by: Jacob Anders <jacob-anders-dev@proton.me>
Extract hostFirmwareSpecsExist to handle Get errors safely, include legacy
Spec.Firmware, and gate recovery on active HFS/HFC policy paths. Add unit
tests and an e2e case for aborting servicing by clearing HFS spec.
@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown

Walkthrough

This PR implements firmware servicing recovery for BareMetalHost by detecting when firmware settings and components specs are cleared and aborting in-flight servicing. It refactors HostFirmwareSettings condition-setting logic and validates the feature across three end-to-end scenarios with Ironic and Redfish.

Changes

Firmware Servicing Recovery

Layer / File(s) Summary
BareMetalHost recovery and watch infrastructure
internal/controller/metal3.io/baremetalhost_controller.go, internal/controller/metal3.io/baremetalhost_controller_test.go
Adds hostFirmwareSpecsExist helper to detect when firmware specs are absent by checking firmware state differences and HFS/HFC resource content. Inserts a servicing-error fast-path in doServiceIfNeeded that aborts servicing and clears error status when specs are cleared. Sets up watches on HFS and HFC objects to enqueue BareMetalHost reconciliations on generation changes. Unit tests validate helper behavior across spec scenarios and client errors.
HostFirmwareSettings condition-setting refactoring
internal/controller/metal3.io/hostfirmwaresettings_controller.go, internal/controller/metal3.io/hostfirmwaresettings_test.go
Refactors updateStatus to consistently use the refactored setCondition helper, which now delegates change detection to meta.SetStatusCondition instead of manual comparison. Sets FirmwareSettingsValid and FirmwareSettingsChangeDetected conditions during Spec/Status mismatches and validation failures. New test validates condition change detection when observed generation increments.
End-to-end test suite for firmware servicing recovery
test/e2e/config/ironic.yaml, test/e2e/firmware_settings_test.go
Adds E2E configuration intervals for servicing and firmware settings waits. Implements full test suite with hardcoded firmware keys/values and three scenarios: pre-create HFS persistence and reconciliation behavior, provisioned-host patching with on-reboot updates and optional SSH verification, and servicing abort when HFS specs are cleared. Includes resource collection and optional cluster cleanup in teardown.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 12 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning AfterEach calls Cleanup() with potentially nil namespace when BeforeEach Skip() triggers before namespace creation, risking panic and violating setup/cleanup reliability guidelines. Add nil check: if !skipCleanup && namespace != nil { to guard against BeforeEach Skip scenarios, as proposed in the review comment.
Title check ❓ Inconclusive The title '[DNM] Service abort openshift rebased' is vague and uses non-descriptive terms; '[DNM]' typically means 'Do Not Merge' and provides no meaningful information about the changeset. The phrase 'openshift rebased' lacks specificity about actual changes. Replace the title with a clear, descriptive summary of the main change, e.g., 'Add firmware servicing abort capability with HFS/HFC change detection' to convey the primary purpose of the changes.
✅ Passed checks (12 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names are stable and deterministic; no dynamic values (namespaces, UUIDs, timestamps, node names) appear in test titles. Dynamic data stays in test bodies.
Microshift Test Compatibility ✅ Passed New e2e test uses only standard Kubernetes APIs and metal3.io community APIs; no OpenShift-specific APIs unavailable on MicroShift detected.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Tests operate on individual BareMetalHost resources and do not assume multiple nodes or HA clusters, making them compatible with Single Node OpenShift deployments.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only controller logic and tests; no deployment manifests or scheduling constraints are introduced. Check not applicable.
Ote Binary Stdout Contract ✅ Passed No OTE Binary Stdout Contract violations detected. All logging properly configured via zap/klog, no fmt.Print* calls to stdout in process-level code, test code uses Ginkgo framework blocks.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed Test contains no hardcoded IPv4 addresses, IPv4-only parsing, IPv4 CIDRs, or external internet requirements. SSH connectivity is conditionally gated and cluster-local.
No-Weak-Crypto ✅ Passed No weak cryptography found in modified files. Only crypto/sha256 (NIST-approved) used for schema hashing, not for security-critical purposes.
Container-Privileges ✅ Passed No container privilege escalation settings found in modified files. PR contains Go controllers, tests, and test configuration without K8s manifests.
No-Sensitive-Data-In-Logs ✅ Passed New logging in doServiceIfNeeded ("specs cleared while servicing error" and "successfully recovered") does not expose passwords, tokens, keys, PII, or sensitive data.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented May 28, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jacob-anders
Once this PR has been reviewed and has the lgtm label, please assign zaneb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/firmware_settings_test.go`:
- Around line 563-569: AfterEach currently calls Cleanup unconditionally which
can panic if BeforeEach called Skip and never set namespace/cancelWatches;
modify the AfterEach guard so Cleanup is only invoked when teardown is required
and resources exist — e.g., check skipCleanup is false AND namespace is
non-empty (and/or cancelWatches is non-nil) before calling Cleanup; update the
AfterEach block (referencing AfterEach, Cleanup, namespace, cancelWatches,
BeforeEach, Skip) to perform this conditional check so skipped specs do not
panic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cbe7c311-edb2-437f-a34d-b026727b7c30

📥 Commits

Reviewing files that changed from the base of the PR and between 7a00547 and 01d32fc.

📒 Files selected for processing (6)
  • internal/controller/metal3.io/baremetalhost_controller.go
  • internal/controller/metal3.io/baremetalhost_controller_test.go
  • internal/controller/metal3.io/hostfirmwaresettings_controller.go
  • internal/controller/metal3.io/hostfirmwaresettings_test.go
  • test/e2e/config/ironic.yaml
  • test/e2e/firmware_settings_test.go

Comment on lines +563 to +569
AfterEach(func() {
CollectSerialLogs(bmc.Name, path.Join(artifactFolder, specName))
DumpResources(ctx, e2eConfig, clusterProxy, path.Join(artifactFolder, specName))
if !skipCleanup {
isNamespaced := e2eConfig.GetBoolVariable("NAMESPACE_SCOPED")
Cleanup(ctx, clusterProxy, namespace, cancelWatches, isNamespaced, e2eConfig.GetIntervals("default", "wait-namespace-deleted")...)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard cleanup when BeforeEach skips before namespace creation.

namespace/cancelWatches can be unset when Skip(...) triggers in BeforeEach, but AfterEach still executes. Calling Cleanup unconditionally can panic and make skipped specs fail noisily.

Proposed fix
 	AfterEach(func() {
 		CollectSerialLogs(bmc.Name, path.Join(artifactFolder, specName))
 		DumpResources(ctx, e2eConfig, clusterProxy, path.Join(artifactFolder, specName))
-		if !skipCleanup {
+		if !skipCleanup && namespace != nil {
 			isNamespaced := e2eConfig.GetBoolVariable("NAMESPACE_SCOPED")
 			Cleanup(ctx, clusterProxy, namespace, cancelWatches, isNamespaced, e2eConfig.GetIntervals("default", "wait-namespace-deleted")...)
 		}
 	})

As per coding guidelines, tests must ensure setup and cleanup are reliable (“Setup and cleanup - tests should use BeforeEach/AfterEach for setup and cleanup; flag tests creating resources without cleanup...”).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
AfterEach(func() {
CollectSerialLogs(bmc.Name, path.Join(artifactFolder, specName))
DumpResources(ctx, e2eConfig, clusterProxy, path.Join(artifactFolder, specName))
if !skipCleanup {
isNamespaced := e2eConfig.GetBoolVariable("NAMESPACE_SCOPED")
Cleanup(ctx, clusterProxy, namespace, cancelWatches, isNamespaced, e2eConfig.GetIntervals("default", "wait-namespace-deleted")...)
}
AfterEach(func() {
CollectSerialLogs(bmc.Name, path.Join(artifactFolder, specName))
DumpResources(ctx, e2eConfig, clusterProxy, path.Join(artifactFolder, specName))
if !skipCleanup && namespace != nil {
isNamespaced := e2eConfig.GetBoolVariable("NAMESPACE_SCOPED")
Cleanup(ctx, clusterProxy, namespace, cancelWatches, isNamespaced, e2eConfig.GetIntervals("default", "wait-namespace-deleted")...)
}
})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/firmware_settings_test.go` around lines 563 - 569, AfterEach
currently calls Cleanup unconditionally which can panic if BeforeEach called
Skip and never set namespace/cancelWatches; modify the AfterEach guard so
Cleanup is only invoked when teardown is required and resources exist — e.g.,
check skipCleanup is false AND namespace is non-empty (and/or cancelWatches is
non-nil) before calling Cleanup; update the AfterEach block (referencing
AfterEach, Cleanup, namespace, cancelWatches, BeforeEach, Skip) to perform this
conditional check so skipped specs do not panic.

@openshift-ci

openshift-ci Bot commented May 28, 2026

Copy link
Copy Markdown

@jacob-anders: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ipi-serial-ipv4 01d32fc link true /test e2e-metal-ipi-serial-ipv4

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant