Add state machine for image switch (milestone 4c)#50
|
@jlebon I skipped the e2e test since we still need to define how to test the upgrade. The state machine is only checked by the env tests for now |
|
There are still some issues with this PR... investigating. |
|
I think we should prioritize #57 and implement at least the happy path test for the reboot |
|
I haven't re-reviewed past the 2nd commit yet (tests). |
Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Alice Frosi <afrosi@redhat.com>
jlebon
left a comment
Full review this time.
This is getting quite close! Very cool to see it all tested in the e2e test.
Rewrite the reconciler to detect image mismatches between spec.desiredImage and the booted image and to stage the new image via bootc switch in a background goroutine. Once the image has been staged, the termination of the goroutine triggers the reconciliation loop again, which then detects that the system requires a reboot. Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Alice Frosi <afrosi@redhat.com>
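For readers skimming the thread, the decision this commit describes can be sketched as a pure function. This is a simplified illustration, not the code in the PR: the `action` type, `hostImages`, `nextAction`, and the `"Staged"` value of desiredImageState are assumptions made for the example; only `spec.desiredImage`, the booted/staged images, and the `Booted` state come from the discussion.

```go
package main

import "fmt"

// action is what the reconciler would do next (hypothetical names).
type action string

const (
	actionNone   action = "None"
	actionStage  action = "Stage"
	actionReboot action = "Reboot"
)

// hostImages is a simplified stand-in for the parts of `bootc status`
// that matter here; the real status carries more fields.
type hostImages struct {
	Booted string
	Staged string // empty when nothing is staged
}

// nextAction compares the desired image with the host state.
// desiredState mirrors spec.desiredImageState; "Booted" is taken from the
// PR description, any other value is treated as "do not reboot yet".
func nextAction(desiredImage, desiredState string, host hostImages) action {
	switch {
	case desiredImage == "" || desiredImage == host.Booted:
		return actionNone // nothing to do, already running the image
	case desiredImage != host.Staged:
		return actionStage // mismatch and not staged yet (or stale stage)
	case desiredState == "Booted":
		return actionReboot // staged and allowed to boot into it
	default:
		return actionNone // staged, waiting for permission to reboot
	}
}

func main() {
	host := hostImages{Booted: "example.com/os:v1"}
	fmt.Println(nextAction("example.com/os:v2", "Booted", host))

	host.Staged = "example.com/os:v2"
	fmt.Println(nextAction("example.com/os:v2", "Booted", host))
}
```

Keeping the decision free of side effects makes the rollback case easy to reason about: a changed desired image no longer matches the staged one, so the function falls back to staging.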
Replace raw JSON bytes with a bootc.Status struct in the test fake. Status() serializes the struct via json.Marshal, and Stage() mutates the status automatically (staging sets Staged). Reboot() records the call for test assertions. Add newBootcStatus() and newBootEntry() helpers to build test state without verbose JSON constants. Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Alice Frosi <afrosi@redhat.com>
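A rough sketch of what a struct-backed fake along these lines might look like. The types below are deliberately simplified and are not the real bootc.Status schema; the helper signatures, the `stageErr` field, and the `rebootCalls` counter are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bootEntry and bootcStatus are simplified stand-ins for the real types.
type bootEntry struct {
	Image string `json:"image"`
}

type bootcStatus struct {
	Booted *bootEntry `json:"booted,omitempty"`
	Staged *bootEntry `json:"staged,omitempty"`
}

// newBootEntry and newBootcStatus build test state without JSON constants
// (signatures are assumed, only the names appear in the commit message).
func newBootEntry(image string) *bootEntry { return &bootEntry{Image: image} }

func newBootcStatus(booted, staged string) bootcStatus {
	st := bootcStatus{Booted: newBootEntry(booted)}
	if staged != "" {
		st.Staged = newBootEntry(staged)
	}
	return st
}

// fakeBootc keeps its state as a struct and serializes it on demand.
type fakeBootc struct {
	status      bootcStatus
	stageErr    error // injected failure for error-path tests
	rebootCalls int   // recorded for assertions
}

// Status returns the current state as JSON, like a real status call would.
func (f *fakeBootc) Status() ([]byte, error) { return json.Marshal(f.status) }

// Stage records the image as staged unless a failure was injected.
func (f *fakeBootc) Stage(image string) error {
	if f.stageErr != nil {
		return f.stageErr
	}
	f.status.Staged = newBootEntry(image)
	return nil
}

// Reboot only counts the call; the fake never actually reboots.
func (f *fakeBootc) Reboot() error {
	f.rebootCalls++
	return nil
}

func main() {
	f := &fakeBootc{status: newBootcStatus("example.com/os:v1", "")}
	if err := f.Stage("example.com/os:v2"); err != nil {
		panic(err)
	}
	out, _ := f.Status()
	fmt.Println(string(out))
}
```

Because Stage() mutates the same struct that Status() serializes, a test can drive the reconciler through stage and status calls without hand-editing JSON between steps.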
Replace the bootcStatusFull JSON constant with newBootcStatus() struct construction. Tighten the error assertion to match the exact error chain. Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Alice Frosi <afrosi@redhat.com>
Add envtest cases for the daemon reconciler state machine:
- TestStagingTriggered: image mismatch triggers bootc stage
- TestStagingError: stage failure sets Degraded condition
- TestAlreadyStaged: skip stage when image already staged
- TestRebootingSet: reboot triggered when desiredImageState is Booted
- TestRollback: restage when desired image changes
- TestCancelInflightStage: spec change cancels in-flight stage
Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Alice Frosi <afrosi@redhat.com>
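The behaviour that TestCancelInflightStage exercises (a spec change cancelling an in-flight stage) could be sketched roughly as below. The `stager` type, its `Done` channel, and `fakeStage` are all hypothetical; the actual daemon may wire cancellation and the re-triggered reconcile differently.

```go
package main

import (
	"context"
	"fmt"
	"strings"
	"sync"
)

// stageResult is sent when a staging goroutine terminates.
type stageResult struct {
	Image string
	Err   error
}

// stager runs at most one stage operation at a time (hypothetical type).
type stager struct {
	stage func(ctx context.Context, image string) error
	Done  chan stageResult // a reconciler would re-run on each receive

	mu     sync.Mutex
	image  string             // image being staged, "" when idle
	cancel context.CancelFunc // cancels the in-flight stage
	gen    int                // distinguishes successive attempts
}

func newStager(stage func(context.Context, string) error) *stager {
	return &stager{stage: stage, Done: make(chan stageResult, 8)}
}

// Ensure starts staging image unless it is already in flight, cancelling
// any in-flight stage for a different image first.
func (s *stager) Ensure(image string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.cancel != nil {
		if s.image == image {
			return
		}
		s.cancel()
	}
	ctx, cancel := context.WithCancel(context.Background())
	s.gen++
	gen := s.gen
	s.image, s.cancel = image, cancel
	go func() {
		err := s.stage(ctx, image)
		cancel() // release context resources
		s.mu.Lock()
		if s.gen == gen { // still the current attempt: mark idle
			s.image, s.cancel = "", nil
		}
		s.mu.Unlock()
		s.Done <- stageResult{Image: image, Err: err}
	}()
}

// fakeStage blocks until cancelled for images tagged ":slow" and
// succeeds immediately for everything else.
func fakeStage(ctx context.Context, image string) error {
	if strings.HasSuffix(image, ":slow") {
		<-ctx.Done()
		return ctx.Err()
	}
	return nil
}

func main() {
	s := newStager(fakeStage)
	s.Ensure("example.com/os:slow")
	s.Ensure("example.com/os:v3") // cancels the slow stage
	for i := 0; i < 2; i++ {
		r := <-s.Done
		fmt.Printf("%s finished, err=%v\n", r.Image, r.Err)
	}
}
```

The generation counter guards against a cancelled goroutine clearing the state of its replacement when it finally exits; the two results can arrive in either order.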
Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Alice Frosi <afrosi@redhat.com>
TestUpdateReboot verifies that the upgrade to the new image is performed successfully. It starts by patching desiredImage to a new image. After the reboot, the node should be Idle with the new booted image and schedulable, proving that the uncordon was successful. Additionally, the test verifies that the image is the one we built for upgrades by checking for the existence of the file /usr/share/update-marker. We don't assert on intermediate states (Staging, Rebooting) because they are too transient to catch reliably with polling: the daemon restarts after the reboot, so the Rebooting window is brief and a spurious reconcile during shutdown can overwrite it before the next poll. Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Alice Frosi <afrosi@redhat.com>
The daemon runs bootc switch, which downloads the image. This operation requires a higher memory limit; otherwise the daemon gets OOM killed. Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Alice Frosi <afrosi@redhat.com>
TestUpdateReboot fails because the node briefly enters a Degraded state during the reboot window, but the controller only logged the state name, not the daemon's error message. The daemon restarts after the reboot so its previous logs are lost. Log the Degraded condition's Message field so the next failure reveals the cause. Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Alice Frosi <afrosi@redhat.com>
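As an illustration of the logging change described here, the sketch below builds a log line from the Degraded condition's Reason and Message. The local `condition` struct only mirrors the relevant fields of metav1.Condition, and `degradedSummary` plus the sample reason and message are made up for the example; real controller code would more likely look the condition up with `meta.FindStatusCondition` from k8s.io/apimachinery.

```go
package main

import "fmt"

// condition mirrors the metav1.Condition fields used in this sketch.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// degradedSummary returns a log-friendly description of the Degraded
// condition, and false when the node is not degraded.
func degradedSummary(conds []condition) (string, bool) {
	for _, c := range conds {
		if c.Type == "Degraded" && c.Status == "True" {
			return fmt.Sprintf("node degraded: reason=%q message=%q", c.Reason, c.Message), true
		}
	}
	return "", false
}

func main() {
	conds := []condition{
		{Type: "Idle", Status: "False"},
		// Reason and message below are invented sample values.
		{Type: "Degraded", Status: "True", Reason: "StatusFailed", Message: "bootc status: exit status 1"},
	}
	if msg, ok := degradedSummary(conds); ok {
		fmt.Println(msg)
	}
}
```

Including the message in the controller log means the cause survives even when the daemon's own logs are lost across the reboot.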
|
There have been some flaky test runs. I have removed the check for the rebooting reason because the bootcnode is apparently sometimes set to degraded if bootc status doesn't report the status correctly during the reboot. This was hard to detect because the daemon restarts and we lose its logs, while the controller was reporting only degraded without the reason and the message. The code is correct, though: even if the bootcnode is set to degraded for some time, it then recovers, since the failing test was reporting a healthy condition. An example run can be found here. The e2e test now verifies that the end state is the desired one and that the upgrade succeeded, without checking intermediate conditions. |
The test testcontrollermembership fails because the daemon is unable to pull from the registry. This has already been reported in bink in bootc-dev/bink#59 and fixed by bootc-dev/bink#60. Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Alice Frosi <afrosi@redhat.com>
Nice. Yeah this would be fixed by #50 (comment). |
Implements the daemon-side state machine that detects image mismatches between spec.desiredImage and the booted image,
stages updates via bootc switch in a background goroutine, and triggers reboot when desiredImageState == Booted.
one path