KVM: fix UEFI disk-only instance snapshot NVRAM handling#13020
KVM: fix UEFI disk-only instance snapshot NVRAM handling#13020Kunalbehbud wants to merge 5 commits intoapache:4.22from
Conversation
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## 4.22 #13020 +/- ##
============================================
+ Coverage 17.68% 17.73% +0.05%
- Complexity 15793 15855 +62
============================================
Files 5922 5922
Lines 533096 533514 +418
Branches 65209 65251 +42
============================================
+ Hits 94275 94620 +345
- Misses 428181 428196 +15
- Partials 10640 10698 +58
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17485 |
|
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch. |
f657823 to
2bc9051
Compare
|
Rebased this branch on the latest Local verification passed again on top of the updated base: The new GitHub Actions runs for this fork push are currently in |
There was a problem hiding this comment.
Pull request overview
Fixes KVM disk-only instance snapshots for UEFI VMs by capturing/restoring the active NVRAM state via an NVRAM “sidecar” file, and gating these flows on explicit host capabilities to prevent unsafe reverts.
Changes:
- Plumbs UEFI/NVRAM metadata through management ↔ agent commands (create/revert/delete) and persists NVRAM sidecar path in snapshot details.
- Implements NVRAM sidecar copy on snapshot creation and atomic restore on revert; adds sidecar cleanup on delete/merge flows.
- Adds host capability advertising + reconnect-time capability syncing/cleanup, plus targeted unit tests for the new behavior.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| plugins/hypervisors/kvm/src/test/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtDiskOnlyVMSnapshotCommandWrapperTest.java | New unit tests covering NVRAM sidecar backup/restore/cleanup and freeze/suspend sequencing/warnings. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtCreateDiskOnlyVMSnapshotCommandWrapper.java | Copies active NVRAM into a per-snapshot sidecar; adds freeze/suspend + post-snapshot recovery warnings. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtRevertDiskOnlyVMSnapshotCommandWrapper.java | Validates presence of NVRAM sidecar for UEFI reverts and atomically restores active NVRAM from the sidecar. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtDeleteDiskOnlyVMSnapshotCommandWrapper.java | Deletes NVRAM sidecar when present, using either provided primary datastore or root snapshot lookup. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtReadyCommandWrapper.java | Advertises the new host capability for NVRAM-aware disk-only snapshots. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/LibvirtComputingResource.java | Adds host detail advertising at startup and provides getUefiNvramPath(vmUuid) helper. |
| engine/storage/snapshot/src/test/java/org/apache/cloudstack/storage/vmsnapshot/KvmFileBasedStorageVmSnapshotStrategyTest.java | New tests validating NVRAM path persistence, agent command plumbing, and capability gating. |
| engine/storage/snapshot/src/main/java/org/apache/cloudstack/storage/vmsnapshot/KvmFileBasedStorageVmSnapshotStrategy.java | Persists NVRAM sidecar metadata, gates create/revert/delete/merge cleanup on capabilities, sends alert on recovery warnings. |
| engine/orchestration/src/test/java/com/cloud/agent/manager/AgentManagerImplTest.java | Adds test ensuring stale NVRAM capability is cleared on reconnect when no longer advertised. |
| engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java | Syncs boolean host capabilities from ReadyAnswer details, including removal of stale DB details when not advertised. |
| core/src/main/java/com/cloud/agent/api/storage/CreateDiskOnlyVmSnapshotCommand.java | Adds vmUuid and uefiEnabled fields for agent-side NVRAM handling decisions. |
| core/src/main/java/com/cloud/agent/api/storage/CreateDiskOnlyVmSnapshotAnswer.java | Adds nvramSnapshotPath to return sidecar path back to management. |
| core/src/main/java/com/cloud/agent/api/storage/RevertDiskOnlyVmSnapshotCommand.java | Adds vmUuid, uefiEnabled, and nvramSnapshotPath to support validated/atomic NVRAM restores. |
| core/src/main/java/com/cloud/agent/api/storage/DeleteDiskOnlyVmSnapshotCommand.java | Adds optional nvramSnapshotPath and primaryDataStore to support sidecar-only cleanup. |
| api/src/main/java/com/cloud/host/Host.java | Introduces HOST_KVM_DISK_ONLY_VM_SNAPSHOT_NVRAM host capability key. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtConvertInstanceCommandWrapper.java | Fixes a typo in an error message (“calll” → “call”). |
| agent/conf/agent.properties | Removes a trailing whitespace-only line. |
| PendingReleaseNotes | Adds release note about UEFI disk-only snapshots now capturing NVRAM and legacy snapshot revert rejection. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…spend Address Copilot's review comment on apache#13020. `JsonParser.parse(...)`, `.get("return")` and `.getAsString()` in `verifyVmFilesystemsFrozen` can throw unchecked exceptions (`JsonSyntaxException`, `IllegalStateException`, `NullPointerException`) on malformed or unexpected QEMU guest-agent responses, bypassing the surrounding `catch (LibvirtException | IOException)` and potentially leaving the guest frozen until the `finally` recovery runs. Parse defensively, validate that `return` is a JSON string, and explicitly treat `{"error": {...}}` guest-agent responses as an IO failure so the outer handler thaws/resumes the guest and surfaces a proper failure answer. Also document in PendingReleaseNotes that taking a disk-only instance snapshot for KVM UEFI VMs now briefly suspends the guest while the NVRAM sidecar is copied. Non-UEFI VMs are unaffected. Add regression tests covering the guest-agent error response and the malformed-JSON path, including verification that the thaw recovery is invoked when freeze verification fails. Signed-off-by: kunal.behbudzade <kunal.behbudzade@btsgrp.com>
|
Addressed the Copilot comment in 142d116. Added two regression tests ( Also updated Local verification on top of 142d116 passed: |
Add the command payload, answer metadata, and host capability plumbing required for KVM disk-only instance snapshots to carry UEFI NVRAM state between management and the KVM agent. Also synchronize host capability booleans on reconnect so stale UEFI/NVRAM support details are removed when an older agent reconnects.
Copy the active UEFI NVRAM file as a sidecar during disk-only instance snapshot creation, restore it on revert, and clean it up during delete and merge flows. Also tighten host capability checks, preserve successful snapshot metadata when post-snapshot thaw or resume fails, and reject reverting UEFI disk-only snapshots that do not contain NVRAM state.
Cover host capability synchronization, UEFI NVRAM sidecar handling across create/revert/delete flows, and the running-VM recovery paths for disk-only instance snapshots.
…spend Address Copilot's review comment on apache#13020. `JsonParser.parse(...)`, `.get("return")` and `.getAsString()` in `verifyVmFilesystemsFrozen` can throw unchecked exceptions (`JsonSyntaxException`, `IllegalStateException`, `NullPointerException`) on malformed or unexpected QEMU guest-agent responses, bypassing the surrounding `catch (LibvirtException | IOException)` and potentially leaving the guest frozen until the `finally` recovery runs. Parse defensively, validate that `return` is a JSON string, and explicitly treat `{"error": {...}}` guest-agent responses as an IO failure so the outer handler thaws/resumes the guest and surfaces a proper failure answer. Also document in PendingReleaseNotes that taking a disk-only instance snapshot for KVM UEFI VMs now briefly suspends the guest while the NVRAM sidecar is copied. Non-UEFI VMs are unaffected. Add regression tests covering the guest-agent error response and the malformed-JSON path, including verification that the thaw recovery is invoked when freeze verification fails. Signed-off-by: kunal.behbudzade <kunal.behbudzade@btsgrp.com>
|
Replaced the GitHub |
5f30788 to
75bf215
Compare
|
Local verification passed again on the current rebased head: No GitHub checks have attached to the current head yet. Could a maintainer please re-run or approve the workflows for this PR head? |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| String result = getResultOfQemuCommand(FreezeThawVMCommand.FREEZE, domain); | ||
| if (StringUtils.isBlank(result) || result.startsWith("error")) { | ||
| throw new IOException(String.format("Failed to freeze VM [%s] filesystems before taking the disk-only VM snapshot. Result: %s", vmName, result)); | ||
| } |
There was a problem hiding this comment.
freezeVmFilesystems() treats QEMU guest agent failures as only null/blank or result.startsWith("error"). qemuAgentCommand typically returns JSON like { "error": ... } on failure, which will not be caught here and may allow the snapshot flow to continue as if the freeze succeeded. Consider parsing the JSON response and failing when it contains an error object (or reuse the existing FreezeThawVM logic of checking guest-fsfreeze-status after the freeze).
| } | |
| } | |
| try { | |
| JsonElement resultElement = new JsonParser().parse(result); | |
| if (resultElement.isJsonObject()) { | |
| JsonObject resultObject = resultElement.getAsJsonObject(); | |
| if (resultObject.has("error")) { | |
| throw new IOException(String.format("Failed to freeze VM [%s] filesystems before taking the disk-only VM snapshot. Result: %s", vmName, result)); | |
| } | |
| } | |
| } catch (IOException e) { | |
| throw e; | |
| } catch (RuntimeException e) { | |
| // Keep existing behavior for non-JSON successful responses while still failing on JSON error payloads. | |
| } |
| protected boolean thawVmFilesystemsIfNeeded(Domain domain, String vmName) { | ||
| try { | ||
| String result = getResultOfQemuCommand(FreezeThawVMCommand.THAW, domain); | ||
| if (StringUtils.isBlank(result) || result.startsWith("error")) { | ||
| logger.warn("Failed to thaw VM [{}] filesystems after taking the disk-only VM snapshot. Result: {}", vmName, result); | ||
| return false; | ||
| } | ||
| return true; | ||
| } catch (LibvirtException e) { |
There was a problem hiding this comment.
thawVmFilesystemsIfNeeded() uses the same result.startsWith("error") check, so a JSON QEMU error response (e.g. { "error": ... }) could be treated as success. That can incorrectly suppress the warning/alert path and leave guest filesystems frozen. Parse the JSON and treat responses containing an error field (or lacking an expected return) as a thaw failure.
| } | ||
|
|
||
| String subject = String.format("Disk-only VM snapshot [%s] completed with guest recovery warnings", vmSnapshot.getUuid()); | ||
| String message = String.format("Disk-only VM snapshot [%s] for UEFI VM [%s] completed, but post-snapshot guest recovery reported: %s", |
There was a problem hiding this comment.
notifyGuestRecoveryIssueIfNeeded() hard-codes "for UEFI VM" in the alert message, but the method is invoked unconditionally after snapshot creation. If answer.getDetails() is ever populated for a non-UEFI snapshot path, the alert text will be misleading. Consider either gating this method on isUefiVm(userVm) or making the alert text generic (or include boot type conditionally).
| String message = String.format("Disk-only VM snapshot [%s] for UEFI VM [%s] completed, but post-snapshot guest recovery reported: %s", | |
| String message = String.format("Disk-only VM snapshot [%s] for VM [%s] completed, but post-snapshot guest recovery reported: %s", |
Description
KVM disk-only instance snapshots do not capture the active UEFI NVRAM state, which makes revert unsafe for UEFI guests.
This PR fixes the KVM disk-only instance snapshot flow to:
quiescevm=trueOlder UEFI disk-only snapshots created without an NVRAM sidecar are intentionally rejected on revert.
This PR only covers
disk-onlyinstance snapshots for KVM UEFI VMs.snapshotMemory=true/ internal snapshots remain out of scope.Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
N/A
How Has This Been Tested?
mvn -pl engine/orchestration,engine/storage/snapshot,plugins/hypervisors/kvm -am -Dtest=AgentManagerImplTest,KvmFileBasedStorageVmSnapshotStrategyTest,LibvirtDiskOnlyVMSnapshotCommandWrapperTest -Dsurefire.failIfNoSpecifiedTests=false testHow did you try to break this feature and the system with this change?