[DNM] Service abort openshift rebased#487
The setCondition function in the HostFirmwareSettings controller was checking whether the condition changed by comparing against the old info.hfs.Status.Conditions rather than the newly updated status. When only ObservedGeneration changed but the condition status remained the same, the function returned false (no change), causing the status update to be skipped. This left ObservedGeneration permanently stale, blocking the BMH controller's generation-based consistency checks.

Fix by using the return value of meta.SetStatusCondition, which correctly detects ObservedGeneration changes. Also remove the unused info parameter from setCondition since it is no longer needed.

Assisted-By: Claude Opus 4.6
Signed-off-by: Jacob Anders <jacob-anders-dev@proton.me>
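To illustrate the failure mode described in this commit message, here is a minimal, stdlib-only sketch. The `condition` type and both helper functions are hypothetical stand-ins written for this example; they are not the controller's code and only approximate how `meta.SetStatusCondition` reports changes.

```go
package main

import "fmt"

// condition is a simplified, hypothetical stand-in for metav1.Condition.
type condition struct {
	Type               string
	Status             string
	ObservedGeneration int64
}

// setStatusCondition approximates the relevant behaviour of
// meta.SetStatusCondition: it updates the stored condition in place and
// reports a change when either Status or ObservedGeneration differs.
func setStatusCondition(conds *[]condition, c condition) bool {
	for i := range *conds {
		existing := &(*conds)[i]
		if existing.Type != c.Type {
			continue
		}
		changed := false
		if existing.Status != c.Status {
			existing.Status = c.Status
			changed = true
		}
		if existing.ObservedGeneration != c.ObservedGeneration {
			existing.ObservedGeneration = c.ObservedGeneration
			changed = true
		}
		return changed
	}
	*conds = append(*conds, c)
	return true
}

// statusOnlyChanged models the kind of check the commit message calls out:
// it looks only at Status in the old conditions, so a bump of
// ObservedGeneration alone goes unnoticed.
func statusOnlyChanged(old []condition, c condition) bool {
	for _, o := range old {
		if o.Type == c.Type {
			return o.Status != c.Status
		}
	}
	return true
}

func main() {
	conds := []condition{{Type: "Valid", Status: "True", ObservedGeneration: 1}}
	next := condition{Type: "Valid", Status: "True", ObservedGeneration: 2}

	fmt.Println("status-only check reports change:", statusOnlyChanged(conds, next))
	fmt.Println("return value reports change:", setStatusCondition(&conds, next))
}
```

Under these assumptions, the status-only check would skip the status update, while relying on the setter's return value would trigger it.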
When a user clears HFC/HFS specs to abort a failed servicing operation, the BMH controller calls getHostFirmwareSettings/Components which check that metadata.generation matches status.conditions.observedGeneration. Since the HFS/HFC sub-controllers may not have reconciled yet after the spec change, this generation mismatch returns an error that blocks the abort path, leaving the host stuck in ServicingError with exponential backoff.

Fix with two complementary changes:

1. Add a fast-path in doServiceIfNeeded: when in ServicingError state, check specs directly via lightweight r.Get() calls (which don't validate generation) before the generation-sensitive functions. If specs are cleared, abort immediately via prov.Service() and clear the error/operational status on success.
2. Watch HostFirmwareSettings and HostFirmwareComponents for generation changes (spec modifications), mapping events to the corresponding BMH. This ensures the BMH controller reconciles promptly when specs change, rather than waiting for error-state exponential backoff.

Assisted-By: Claude Opus 4.6
Signed-off-by: Jacob Anders <jacob-anders-dev@proton.me>
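The two changes in this commit message can be sketched roughly as follows. This is a stdlib-only illustration, not the PR's code: every type and function name here is hypothetical, and `mapToHost` assumes the HFS/HFC object shares its namespace and name with the BareMetalHost. In a real controller the generation filter would more likely be controller-runtime's `predicate.GenerationChangedPredicate`.

```go
package main

import "fmt"

// objectMeta is a simplified, hypothetical stand-in for object metadata.
type objectMeta struct {
	Namespace  string
	Name       string
	Generation int64
}

// updateEvent is a simplified stand-in for a watch update event.
type updateEvent struct {
	Old, New objectMeta
}

// request identifies the BareMetalHost that should be reconciled.
type request struct {
	Namespace, Name string
}

// generationChanged passes only updates that modified the spec. Status-only
// updates leave metadata.generation untouched and are filtered out.
func generationChanged(e updateEvent) bool {
	return e.Old.Generation != e.New.Generation
}

// mapToHost assumes (for this sketch) that the firmware object and its host
// share the same namespace and name.
func mapToHost(m objectMeta) request {
	return request{Namespace: m.Namespace, Name: m.Name}
}

// shouldAbortServicing models the fast-path decision: abort only when the
// host is in the servicing error state and no firmware specs remain.
func shouldAbortServicing(inServicingError, specsExist bool) bool {
	return inServicingError && !specsExist
}

func main() {
	ev := updateEvent{
		Old: objectMeta{Namespace: "metal", Name: "host-0", Generation: 3},
		New: objectMeta{Namespace: "metal", Name: "host-0", Generation: 4},
	}
	if generationChanged(ev) {
		fmt.Printf("enqueue %+v\n", mapToHost(ev.New))
	}
	fmt.Println("abort:", shouldAbortServicing(true, false))
}
```

The watch shortens the delay before the next reconcile, and the fast path keeps that reconcile from failing on the generation mismatch, which is why the two changes are described as complementary.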
Extract hostFirmwareSpecsExist to handle Get errors safely, include legacy Spec.Firmware, and gate recovery on active HFS/HFC policy paths. Add unit tests and an e2e case for aborting servicing by clearing HFS spec.
Walkthrough
This PR implements firmware servicing recovery for BareMetalHost by detecting when firmware settings and components specs are cleared and aborting in-flight servicing. It refactors HostFirmwareSettings condition-setting logic and validates the feature across three end-to-end scenarios with Ironic and Redfish.
Changes
Firmware Servicing Recovery
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ 12 | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (12 passed)
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: jacob-anders
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/firmware_settings_test.go`:
- Around line 563-569: AfterEach currently calls Cleanup unconditionally which
can panic if BeforeEach called Skip and never set namespace/cancelWatches;
modify the AfterEach guard so Cleanup is only invoked when teardown is required
and resources exist — e.g., check skipCleanup is false AND namespace is
non-empty (and/or cancelWatches is non-nil) before calling Cleanup; update the
AfterEach block (referencing AfterEach, Cleanup, namespace, cancelWatches,
BeforeEach, Skip) to perform this conditional check so skipped specs do not
panic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cbe7c311-edb2-437f-a34d-b026727b7c30
📒 Files selected for processing (6)
internal/controller/metal3.io/baremetalhost_controller.go
internal/controller/metal3.io/baremetalhost_controller_test.go
internal/controller/metal3.io/hostfirmwaresettings_controller.go
internal/controller/metal3.io/hostfirmwaresettings_test.go
test/e2e/config/ironic.yaml
test/e2e/firmware_settings_test.go
    AfterEach(func() {
        CollectSerialLogs(bmc.Name, path.Join(artifactFolder, specName))
        DumpResources(ctx, e2eConfig, clusterProxy, path.Join(artifactFolder, specName))
        if !skipCleanup {
            isNamespaced := e2eConfig.GetBoolVariable("NAMESPACE_SCOPED")
            Cleanup(ctx, clusterProxy, namespace, cancelWatches, isNamespaced, e2eConfig.GetIntervals("default", "wait-namespace-deleted")...)
        }
Guard cleanup when BeforeEach skips before namespace creation.
namespace/cancelWatches can be unset when Skip(...) triggers in BeforeEach, but AfterEach still executes. Calling Cleanup unconditionally can panic and make skipped specs fail noisily.
Proposed fix
     AfterEach(func() {
         CollectSerialLogs(bmc.Name, path.Join(artifactFolder, specName))
         DumpResources(ctx, e2eConfig, clusterProxy, path.Join(artifactFolder, specName))
    -    if !skipCleanup {
    +    if !skipCleanup && namespace != nil {
             isNamespaced := e2eConfig.GetBoolVariable("NAMESPACE_SCOPED")
             Cleanup(ctx, clusterProxy, namespace, cancelWatches, isNamespaced, e2eConfig.GetIntervals("default", "wait-namespace-deleted")...)
         }
     })

As per coding guidelines, tests must ensure setup and cleanup are reliable (“Setup and cleanup - tests should use BeforeEach/AfterEach for setup and cleanup; flag tests creating resources without cleanup...”).
@jacob-anders: The following test failed, say
Summary by CodeRabbit
Release Notes
Improvements
Tests