Add dry-run failover commands #410
Implements issue RamenDR#404 - Add commands to manage dry-run failover testing.

This adds two commands:
- `ramenctl failover dry-run`: Test failover to secondary cluster without affecting the primary application
- `ramenctl failover dry-run --abort`: Revert the dry-run and return to original state

The implementation:
- Uses Ramen's dryRun field in DRPlacementControlSpec
- Relies on last-action label for safe state restoration
- Supports all three pre-dry-run states: Deployed, FailedOver, Relocated
- Includes comprehensive error handling and user feedback

IMPORTANT: This code requires Ramen PR #2416 to be merged first. The DryRun field is not yet available in the current Ramen API version.

Reference: RamenDR/ramen#2416
Closes: RamenDR#404

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
8837059 to
e5dbb8d
Compare
nirs
left a comment
I looked at the documentation - before we have a clear design of the user flow, there is no point in submitting code changes.
The first step that we must complete is the design. See the questions I asked in the issue.
| > [!IMPORTANT]
| > This command requires Ramen PR [#2416](https://github.com/RamenDR/ramen/pull/2416)
| > to be merged. The `DryRun` field is not yet available in the current Ramen API version.
This note should not be in the docs.
Note that ramenctl must work with any ramen version. The command should check if the installed ramen version supports this feature and fail gracefully if not.
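A minimal sketch of such a check, assuming the command can read the DRPC CRD's `openAPIV3Schema` decoded into generic maps and that support is detected by the presence of a `dryRun` property under `spec`. The function names and the sentinel error are hypothetical, and this is not necessarily how ramenctl should detect the feature:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDryRunUnsupported is a hypothetical sentinel error returned when the
// installed Ramen CRD does not declare the dryRun field.
var ErrDryRunUnsupported = errors.New("installed ramen does not support dry-run failover")

// hasSpecField reports whether an openAPIV3Schema (decoded into generic maps)
// declares the named property under spec. Missing or malformed levels are
// treated as "field not present".
func hasSpecField(schema map[string]any, field string) bool {
	props, ok := schema["properties"].(map[string]any)
	if !ok {
		return false
	}
	spec, ok := props["spec"].(map[string]any)
	if !ok {
		return false
	}
	specProps, ok := spec["properties"].(map[string]any)
	if !ok {
		return false
	}
	_, found := specProps[field]
	return found
}

// checkDryRunSupport returns nil when the schema declares spec.dryRun and
// ErrDryRunUnsupported otherwise, so the caller can fail with a clear message
// before modifying anything.
func checkDryRunSupport(schema map[string]any) error {
	if !hasSpecField(schema, "dryRun") {
		return ErrDryRunUnsupported
	}
	return nil
}

func main() {
	// Schema of an older CRD that has spec.action but no spec.dryRun.
	oldSchema := map[string]any{
		"properties": map[string]any{
			"spec": map[string]any{
				"properties": map[string]any{
					"action": map[string]any{"type": "string"},
				},
			},
		},
	}
	if err := checkDryRunSupport(oldSchema); err != nil {
		fmt.Println("cannot run dry-run failover:", err)
	}
}
```

Checking the schema instead of comparing version strings keeps the command independent of how Ramen is packaged, but that is a design assumption to settle in the issue.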
| ramenctl failover [command]
|
| Available Commands:
| dry-run Test failover without affecting the primary application (DRY-RUN mode)
`ramenctl failover dry-run` and `ramenctl failover dry-run --abort` do not make sense.
I think this should be a separate command like `failover-test`, or an option for the failover command.
| The failover dry-run command tests failover to the secondary cluster without
| affecting the primary application. This allows you to verify DR readiness without
| risk to production workloads.
This is not true - using this command risks your production workload. If you need to do a real failover, the failover will require a full sync. The document should have a big warning about this.
| - Can be safely aborted and reverted
|
| This is achieved by setting `dryRun: true` in the DRPC spec along with
| `action: Failover`.
The implementation suggests that the right user interface is:
ramenctl failover --dry-run ...
| ✅ DRY-RUN failover test passed
|
| 💡 To abort this dry-run: ramenctl failover dry-run --abort --name appset-deploy-rbd --namespace argocd
This is not consistent with other commands - we expect to see:
- validating config step
- failover dry run step
The failover step has sub steps to show the progress
There are no extra info lines.
| ```console
| $ ramenctl failover dry-run --abort --name my-app --namespace argocd
| ```
This should not be an error - if we are already in dry-run mode we should wait for completion and report success.
For example you may start the operation and then lose connection to the cluster. You should be able to run it again to complete the operation.
The command should be idempotent.
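One way to get this behavior is to decide what to do from the current DRPC state instead of assuming a fresh start. A rough sketch, where `drpcState`, its `Completed` field, and `planDryRun` are hypothetical names for illustration and only `action: Failover` and `dryRun` come from the docs under review:

```go
package main

import "fmt"

// drpcState is a hypothetical, simplified view of the DRPC for this sketch.
type drpcState struct {
	Action    string // spec.action, "Failover" when a failover was requested
	DryRun    bool   // spec.dryRun
	Completed bool   // assumption: true once the requested operation finished
}

// step is what the command should do next.
type step int

const (
	stepStart step = iota // request the dry-run, then wait for completion
	stepWait              // dry-run already requested: only wait for completion
	stepDone              // dry-run already completed: report success
)

// planDryRun makes the command safe to re-run: a second invocation after a
// lost connection resumes waiting instead of failing with an error.
func planDryRun(s drpcState) step {
	inDryRun := s.Action == "Failover" && s.DryRun
	switch {
	case inDryRun && s.Completed:
		return stepDone
	case inDryRun:
		return stepWait
	default:
		return stepStart
	}
}

func main() {
	// First run: nothing requested yet, so the command starts the dry-run.
	fmt.Println(planDryRun(drpcState{}) == stepStart)
	// Re-run after losing the connection: resume waiting, no error.
	fmt.Println(planDryRun(drpcState{Action: "Failover", DryRun: true}) == stepWait)
}
```

The same idea would apply to `--abort`: if the DRPC is already back in its original state, report success.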
| ### Error: "DRPC has active action"
|
| The DRPC has an ongoing failover or relocate operation. Wait for it to complete
| or cancel it before starting a dry-run.
This shows we have preconditions to check. These should be listed in the issue.
The first step in this command should check the preconditions, and fail if any of them are not met.
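A sketch of what a precondition step could look like. The `drpc` type, the `precondition` list, and treating any progression other than `Completed` as an active action are assumptions for illustration, and the real list of preconditions still has to come from the issue:

```go
package main

import (
	"errors"
	"fmt"
)

// drpc is a hypothetical, simplified view of a DRPC used only in this sketch.
type drpc struct {
	Progression string // assumed to mirror the DRPC status progression
}

// precondition pairs a human-readable name with a check function.
type precondition struct {
	name  string
	check func(drpc) error
}

// noActiveAction rejects a DRPC with an ongoing failover or relocate.
// Treating every progression other than "Completed" as active is an assumption.
func noActiveAction(d drpc) error {
	if d.Progression != "Completed" {
		return errors.New("DRPC has active action")
	}
	return nil
}

// checkPreconditions runs the checks in order and stops at the first one that
// is not met, before the command changes anything on the cluster.
func checkPreconditions(d drpc, conds []precondition) error {
	for _, c := range conds {
		if err := c.check(d); err != nil {
			return fmt.Errorf("precondition %q not met: %w", c.name, err)
		}
	}
	return nil
}

func main() {
	conds := []precondition{
		{name: "no active action", check: noActiveAction},
	}
	// "FailingOver" is only an illustrative non-idle progression value.
	if err := checkPreconditions(drpc{Progression: "FailingOver"}, conds); err != nil {
		fmt.Println("cannot start dry-run:", err)
	}
}
```

Keeping the preconditions in one list makes it easy to mirror the list from the issue and to report which one failed.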
| 1. Check DRPC status:
| ```console
| $ kubectl get drpc my-app -n argocd -o yaml
| ```
The command needs a `--context` argument.
| 2. Check Ramen operator logs:
| ```console
| $ kubectl logs -n ramen-system deployment/ramen-hub-operator
| ```
The command needs a `--context` argument.
| 3. **State preservation**: Uses `last-action` label for safe abort
| 4. **Timeout protection**: 10-minute timeout prevents indefinite hangs
| 5. **Error detection**: Monitors DRPC conditions for failures
| 6. **Cancellation support**: Handles Ctrl+C gracefully
Most of the info here is not needed, and the most important thing is not mentioned: this command has risk. We create a split brain during the operation, and this will cause a full sync to restore the application protection. We should have a section about the risk of using this command, instead of a Safety section that misleads the user into thinking the command is risk-free.
I have answered them already
c0df2bf to
c50c0a8
Compare
@nirs Addressed your comments, PTAL
| ### Testing DR readiness
|
| Before a real disaster, test that failover will work:
This sentence is a bit confusing and may be what triggered Nir’s comment. As written, it suggests that users know a disaster is imminent and should test failover before it happens. I think the intended meaning is that users can use this feature to verify disaster recovery readiness by testing whether failover works ahead of time.
Maybe something like:
“Users can use this feature to test failover in advance and verify that they are prepared for a real disaster.”
- Change command structure from subcommand to flag-based
- Make command idempotent - allow re-running without errors
- Add progression check before starting dry-run
- Auto-generate report directory with timestamp
- Add version checking - fail gracefully if dry-run not supported
- Remove misleading language about safety from documentation
- Add accurate warnings about split-brain and full resync
- Remove automation and pre-migration use case examples
- Simplify documentation to focus on what command does
- Update all kubectl examples to include --context flag
- Move implementation details from docs to code comments
- Update dependencies to get DryRun support
- Fix all lint issues

Signed-off-by: Aman Agrawal <aman_31dec@yahoo.in>
c50c0a8 to
10035d6
Compare
Implements issue #404 - Add commands to manage dry-run failover testing.
Commands:
- `ramenctl failover --dry-run` - Test failover without affecting primary application
- `ramenctl failover --dry-run --abort` - Revert dry-run and return to original state

Key features:
Reference: RamenDR/ramen#2416
Closes: #404