Skip to content

fix: address audit findings in force_rebalance code paths#1357

Open
SLoeuillet wants to merge 4 commits intoAltinity:masterfrom
SLoeuillet:fix/force-rebalance-audit-bugs
Open

fix: address audit findings in force_rebalance code paths#1357
SLoeuillet wants to merge 4 commits intoAltinity:masterfrom
SLoeuillet:fix/force-rebalance-audit-bugs

Conversation

@SLoeuillet
Copy link
Copy Markdown
Contributor

Summary

Follow-up to #1351. A post-merge audit of the force_rebalance code found 4 bugs that weren't caught during review. This PR fixes them.

Bugs fixed

1. RebalancedFiles not persisted to metadata (CRITICAL)

pkg/metadata/table_metadata.goTableMetadata.Save() was copying Files, Parts, Checksums, Size, TotalBytes but not RebalancedFiles. This caused the rebalancing map to be lost when metadata was re-read from disk during resumable downloads, potentially causing archive files to be downloaded to wrong disks after resume.

2. Archive file name trimming used wrong disk prefix (CRITICAL)

pkg/backup/download.go:703 — In the compressed format download path:

partName := strings.TrimPrefix(strings.TrimSuffix(archiveFile, ext), capturedDisk+"_")

Archive files are named with the original disk prefix (e.g. default_part1.tar.gz), not the rebalanced disk name. When capturedDisk was hdd1 and archiveFile was default_part1.tar.gz, the TrimPrefix silently failed and partName contained the disk prefix. This broke the --hardlink-exists-files optimization (part lookup always failed). Fixed to use disk (original) for the trim.

3. Loop variable mutation in downloadDiffParts (CRITICAL)

pkg/backup/download.go:978 — The outer loop variable disk was mutated:

for disk, parts := range table.Parts {
    for i, part := range parts {
        // ...
        disk = part.RebalancedDisk  // mutation!
        // ...
        capturedDisk := disk
        // ...
        table.Parts[capturedDisk][idx] = partForDownload  // idx may be invalid for capturedDisk

idx came from iterating table.Parts[original_disk], so when capturedDisk was different (rebalanced), the index could be out of bounds for table.Parts[capturedDisk], causing panics or silent corruption of unrelated parts. Fixed by using a per-iteration activeDisk variable — same pattern we already applied in filesystemhelper.go for the same bug class.

4. Unclear error message for store path construction

pkg/filesystemhelper/filesystemhelper.go — The error can't build store path for rebalanced disk X (uuid=Y) didn't distinguish between a missing disk and an empty UUID. Split into two specific error messages for easier debugging.

Test plan

  • go build passes
  • go vet clean with -tags=integration
  • TestForceRebalance passes locally on ClickHouse 24.8 with race detector (82s)

🤖 Generated with Claude Code

SLoeuillet and others added 3 commits April 13, 2026 18:25
Fixes several bugs found during audit of PR Altinity#1351 force_rebalance logic:

1. RebalancedFiles not persisted to metadata (CRITICAL)
   TableMetadata.Save() was not copying the RebalancedFiles map, causing
   it to be lost on resumable downloads when metadata is re-read from
   disk. Added RebalancedFiles to the saved fields.

2. Archive file name trimming used wrong disk prefix (CRITICAL)
   In downloadTableData() compressed format path, partName was computed
   with TrimPrefix(archiveFile, capturedDisk+"_"). But archive files are
   named with the ORIGINAL disk prefix (e.g. "default_part1.tar.gz"),
   not the rebalanced disk name. The trim silently failed, breaking the
   hardlink-exists-files optimization. Use disk (original) for the trim.

3. Loop variable mutation in downloadDiffParts (CRITICAL)
   `disk` was mutated to `part.RebalancedDisk`, then used later as
   `capturedDisk := disk` to update `table.Parts[capturedDisk][idx]`.
   The idx came from iterating table.Parts[original_disk], so the
   update could go out of bounds or corrupt the wrong entry when the
   rebalanced and original disk names differed. Use activeDisk per
   iteration like we already do in filesystemhelper.

4. Unclear error message for store path construction
   Split the check into two distinct error messages so it's clear
   whether the disk is missing from diskMap or the UUID is empty.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Slach
Copy link
Copy Markdown
Collaborator

Slach commented Apr 23, 2026

your fixes stuck somewhere =(

unexpected checkResumeAlreadyProcessed error: ExecCmdOut: signal: killed

it means kill after timeout

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants