fix(crypto): CRP-2979: Guard SKS old-file zeroing against shared inodes#10233
Draft
mbjorkqvist wants to merge 5 commits into
Draft
fix(crypto): CRP-2979: Guard SKS old-file zeroing against shared inodes#10233mbjorkqvist wants to merge 5 commits into
mbjorkqvist wants to merge 5 commits into
Conversation
write_secret_keys_to_disk_and_cleanup_old_file unconditionally zeroed sks_data.pb.old after every write. On filesystems where rename(2) reuses the destination inode rather than atomically swapping the directory entry (observed with virtiofs bind mounts on Docker Desktop / macOS), the post-rename .old hard link shares an inode with the just-written sks_data.pb, and the zeroing corrupts the live keystore — causing every replica to panic at startup with "error parsing SKS protobuf data". Extract the inode-aware "zero or unlink" logic that clean_up_old_sks already had into a shared helper and call it from both the startup cleanup path and the write path. When the paths share an inode, just unlink the extra link: the new content has already been written into that inode, so there is no separate old-inode copy left to scrub. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two tests that pre-create `.old` as a hard link to `proto_file` before calling `insert()` were named/commented as if they exercised the inode-sharing case. They don't: the cleanup error they assert on comes from the hard-link-creation step failing because `.old` already exists, not from any inode comparison. Rename them to reflect what they actually verify (stale `.old` left behind by a crashed prior process) and point the comments at `zeroize_or_unlink_old_file`'s unit tests for the post-rename inode-sharing coverage. Also extend the metrics test with an on-disk integrity assertion: reopen the keystore and confirm the inserted key survived. Pre-fix, on filesystems where rename(2) reuses the destination inode, the reopen would have panicked with "error parsing SKS protobuf data" — so this assertion is the end-to-end regression check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ircuit The helper's `try_exists` short-circuit on missing `.old` is essential — without it, the write path on a fresh keystore would call `are_hard_links_to_the_same_inode` on a non-existent path and surface a spurious `NotFound` cleanup error (a regression an earlier iteration of the fix had). Add a direct unit test so a future refactor that drops the short-circuit fails loudly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a data-loss bug in ProtoSecretKeyStore where post-write cleanup could zeroize the newly written keystore on filesystems that reuse the destination inode during rename(2) (e.g., virtiofs bind mounts), by guarding cleanup against shared inodes and unlinking instead of zeroing when appropriate.
Changes:
- Extracted shared cleanup logic into
zeroize_or_unlink_old_file, which unlinks.oldwhen it shares an inode with the liveproto_file, otherwise zeroizes and deletes as before. - Wired the helper into both startup cleanup (
clean_up_old_sks) and the write path (write_secret_keys_to_disk_and_cleanup_old_file). - Added/updated unit tests covering: shared-inode unlink behavior, missing
.oldno-op, and differing-inode zeroize+delete behavior; adjusted metrics/logging tests accordingly.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| rs/crypto/internal/crypto_service_provider/src/secret_key_store/proto_store.rs | Adds inode-aware helper and uses it in both startup cleanup and post-write cleanup to prevent keystore corruption on inode-reuse filesystems. |
| rs/crypto/internal/crypto_service_provider/src/secret_key_store/proto_store/tests.rs | Adds regression/unit tests for the new helper behavior and updates existing metrics/logging tests to reflect the new cleanup semantics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix a data-loss bug in
ProtoSecretKeyStorewhere the write path unconditionally zeroedsks_data.pb.oldafter every write, without checking whether it still shared an inode with the just-writtensks_data.pb.On filesystems where
rename(2)atomically swaps the destination directory entry to a new inode (Linux ext4/xfs/btrfs/tmpfs/…) the bug was latent:.oldalways pointed to the pre-rename inode, so zeroing scrubbed old plaintext as intended. On filesystems whererename(2)reuses the destination inode — observed with virtiofs bind mounts via Docker Desktop on macOS — the post-rename.oldshared the inode of the new keystore, and zeroing corrupted the file just written. The symptom was every replica panicking at startup with"error parsing SKS protobuf data"after the keystore came back truncated to the trailing map entry.The startup-cleanup path (
clean_up_old_sks) already had the inode guard viaare_hard_links_to_the_same_inode; the write path (write_secret_keys_to_disk_and_cleanup_old_file) did not. This PR extracts the guard into a shared helperzeroize_or_unlink_old_fileand wires it into both paths. When.oldandproto_fileshare an inode, the helper just unlinks the extra link — the new content has already been written into that inode, so there is no separate old-inode copy left to scrub and the secure-erase invariant is preserved.