Add dry-run failover commands #410

Draft
am-agrawa wants to merge 2 commits into RamenDR:main from am-agrawa:dry-run-failover-cmds
Conversation

@am-agrawa am-agrawa commented Apr 6, 2026

Member

Implements issue #404 - Add commands to manage dry-run failover testing.

Commands:

  • ramenctl failover --dry-run - Test failover without affecting primary application
  • ramenctl failover --dry-run --abort - Revert dry-run and return to original state

Key features:

  • Version checking - fails gracefully if Ramen doesn't support dry-run
  • Idempotent - can re-run if connection lost
  • Auto-generates timestamped test reports
  • Clear warnings about split-brain and full resync requirements

Reference: RamenDR/ramen#2416
Closes: #404

Implements issue RamenDR#404 - Add commands to manage dry-run failover testing.

This adds two commands:
- `ramenctl failover dry-run`: Test failover to secondary cluster without
  affecting the primary application
- `ramenctl failover dry-run --abort`: Revert the dry-run and return to
  original state

The implementation:
- Uses Ramen's dryRun field in DRPlacementControlSpec
- Relies on last-action label for safe state restoration
- Supports all three pre-dry-run states: Deployed, FailedOver, Relocated
- Includes comprehensive error handling and user feedback

IMPORTANT: This code requires Ramen PR #2416 to be merged first. The
DryRun field is not yet available in the current Ramen API version.

Reference: RamenDR/ramen#2416
Closes: RamenDR#404

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@am-agrawa am-agrawa force-pushed the dry-run-failover-cmds branch from 8837059 to e5dbb8d Compare April 6, 2026 13:25
@am-agrawa am-agrawa marked this pull request as draft April 6, 2026 14:02

@nirs nirs left a comment

Member

I looked at the documentation - before we have a clear design of the user flow, there is no point in submitting code changes.

The first step that we must complete is the design. See the questions I asked in the issue.

Comment thread docs/failover.md Outdated

> [!IMPORTANT]
> This command requires Ramen PR [#2416](https://github.com/RamenDR/ramen/pull/2416)
> to be merged. The `DryRun` field is not yet available in the current Ramen API version.

Member

This note should not be in the docs.

Note that ramenctl must work with any ramen version. The command should check if the installed ramen version supports this feature and fail gracefully if not.

Comment thread docs/failover.md Outdated
ramenctl failover [command]

Available Commands:
dry-run Test failover without affecting the primary application (DRY-RUN mode)

Member

`ramenctl failover dry-run` and `ramenctl failover dry-run --abort` do not make sense.

I think this should be a separate command like `failover-test`, or an option for the `failover` command.

Comment thread docs/failover.md Outdated

The failover dry-run command tests failover to the secondary cluster without
affecting the primary application. This allows you to verify DR readiness without
risk to production workloads.

Member

This is not true. Using this command puts your production workload at risk: if you need to do a real failover, the failover will require a full sync. The document should have a big warning about this.

Comment thread docs/failover.md Outdated
- Can be safely aborted and reverted

This is achieved by setting `dryRun: true` in the DRPC spec along with
`action: Failover`.
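
A minimal Go sketch of the request described above, building a JSON merge patch body that sets both fields. Only the `action` and `dryRun` field names come from the quoted docs; any other fields a failover request needs, and the client call that would send the patch, are left out.

```go
package main

import "encoding/json"

// dryRunFailoverPatch builds a JSON merge patch body that sets
// spec.action to "Failover" and spec.dryRun to true on a DRPC.
// Sending the patch to the cluster is intentionally not shown.
func dryRunFailoverPatch() ([]byte, error) {
	patch := map[string]any{
		"spec": map[string]any{
			"action": "Failover",
			"dryRun": true,
		},
	}
	return json.Marshal(patch)
}
```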

Member

The implementation suggests that the right user interface is:

ramenctl failover --dry-run ...

Comment thread docs/failover.md Outdated

✅ DRY-RUN failover test passed

💡 To abort this dry-run: ramenctl failover dry-run --abort --name appset-deploy-rbd --namespace argocd

Member

This is not consistent with other commands - we expect to see:

  • validating config step
  • failover dry run step

The failover step has substeps to show the progress.

There are no extra info lines.

Comment thread docs/failover.md Outdated

```console
$ ramenctl failover dry-run --abort --name my-app --namespace argocd
```

Member

This should not be an error. If we are already in dry-run mode, we should wait for completion and report success.

For example you may start the operation and then lose connection to the cluster. You should be able to run it again to complete the operation.

The command should be idempotent.
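
The idempotent behavior described here could be sketched in Go as a small decision function. The `drpcState` type and its fields are a hypothetical, simplified view of the DRPC, not Ramen's API.

```go
package main

// drpcState is a hypothetical, simplified view of a DRPC.
type drpcState struct {
	DryRun    bool // a dry-run failover was already requested
	Completed bool // the requested dry-run failover has finished
}

// step is what the command should do next.
type step int

const (
	stepStart step = iota // request the dry-run, then wait for completion
	stepWait              // already requested: only wait for completion
	stepDone              // already completed: report success
)

// nextStep makes re-running the command safe: it never treats an
// in-progress or completed dry-run as an error.
func nextStep(s drpcState) step {
	switch {
	case !s.DryRun:
		return stepStart
	case !s.Completed:
		return stepWait
	default:
		return stepDone
	}
}
```

With this shape, a user who lost the connection mid-operation runs the same command again and lands in `stepWait` or `stepDone` instead of getting an error.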

Comment thread docs/failover.md
### Error: "DRPC has active action"

The DRPC has an ongoing failover or relocate operation. Wait for it to complete
or cancel it before starting a dry-run.

Member

This shows we have preconditions to check. These should be listed in the issue.

The first step in this command should check the preconditions, and fail if any of them are not met.
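
A sketch of such a first step in Go, reporting every unmet precondition at once. The two preconditions used here are the ones mentioned in this thread (dry-run support and no active action); the `preconditions` type and its fields are hypothetical, and the real list belongs in the issue.

```go
package main

import (
	"errors"
	"fmt"
)

// preconditions is a hypothetical snapshot gathered before starting.
type preconditions struct {
	DryRunSupported bool   // the installed ramen supports dry-run
	ActiveAction    string // ongoing failover or relocate; empty if none
}

// checkPreconditions returns nil when all preconditions are met,
// otherwise a single error that joins every failure.
func checkPreconditions(p preconditions) error {
	var errs []error
	if !p.DryRunSupported {
		errs = append(errs, errors.New("installed ramen does not support dry-run"))
	}
	if p.ActiveAction != "" {
		errs = append(errs, fmt.Errorf("DRPC has active action %q", p.ActiveAction))
	}
	return errors.Join(errs...)
}
```

`errors.Join` (Go 1.20 and later) returns nil for an empty list, so the happy path needs no special case.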

Comment thread docs/failover.md
1. Check DRPC status:
```console
$ kubectl get drpc my-app -n argocd -o yaml
```

Member

The command needs a `--context` argument.

Comment thread docs/failover.md
2. Check Ramen operator logs:
```console
$ kubectl logs -n ramen-system deployment/ramen-hub-operator
```

Member

The command needs a `--context` argument.

Comment thread docs/failover.md Outdated
3. **State preservation**: Uses `last-action` label for safe abort
4. **Timeout protection**: 10-minute timeout prevents indefinite hangs
5. **Error detection**: Monitors DRPC conditions for failures
6. **Cancellation support**: Handles Ctrl+C gracefully

Member

Most of the info here is not needed, and the most important thing is not mentioned: this command has risk. We create a split brain during the operation, and this will cause a full sync to restore the application protection. We should have a section about the risk of using this command, instead of a Safety section that misleads the user into thinking the command is risk-free.

@am-agrawa am-agrawa marked this pull request as ready for review April 6, 2026 17:51
@am-agrawa

Member Author

> I looked at the documentation - before we have a clear design of the user flow, there is no point in submitting code changes.
>
> The first step that we must complete is the design. See the questions I asked in the issue.

I have answered them already

@am-agrawa am-agrawa force-pushed the dry-run-failover-cmds branch 5 times, most recently from c0df2bf to c50c0a8 Compare April 15, 2026 14:44
@am-agrawa

Member Author

> > I looked at the documentation - before we have a clear design of the user flow, there is no point in submitting code changes.
> > The first step that we must complete is the design. See the questions I asked in the issue.
>
> I have answered them already

@nirs Addressed your comments, PTAL

Comment thread docs/failover.md Outdated

### Testing DR readiness

Before a real disaster, test that failover will work:

Member

This sentence is a bit confusing and may be what triggered Nir’s comment. As written, it suggests that users know a disaster is imminent and should test failover before it happens. I think the intended meaning is that users can use this feature to verify disaster recovery readiness by testing whether failover works ahead of time.

Maybe something like:

“Users can use this feature to test failover in advance and verify that they are prepared for a real disaster.”

Member Author

Agreed, updated

- Change command structure from subcommand to flag-based
- Make command idempotent - allow re-running without errors
- Add progression check before starting dry-run
- Auto-generate report directory with timestamp
- Add version checking - fail gracefully if dry-run not supported
- Remove misleading language about safety from documentation
- Add accurate warnings about split-brain and full resync
- Remove automation and pre-migration use case examples
- Simplify documentation to focus on what command does
- Update all kubectl examples to include --context flag
- Move implementation details from docs to code comments
- Update dependencies to get DryRun support
- Fix all lint issues

Signed-off-by: Aman Agrawal <aman_31dec@yahoo.in>
@am-agrawa am-agrawa force-pushed the dry-run-failover-cmds branch from c50c0a8 to 10035d6 Compare April 16, 2026 15:02
@nirs nirs marked this pull request as draft April 28, 2026 11:31
Development

Successfully merging this pull request may close these issues.

Add command to Manage dry-run Failover

3 participants