Multi-GPU Inference on H20 Causes Rotary Position Embedding Dimension Mismatch #26

Description

@pikanzz

Title:
Dimension mismatch in rotary position embedding when using multi-GPU inference with VideoAlign-based video reward model

Body:
Hi,

I am working with a video reward model based on VideoAlign. My training configuration uses FPS = 4.
During inference on videos longer than 5 seconds, single-GPU (A100-80G) inference runs out of memory (OOM).

To avoid this, I tried multi-GPU inference on H20.
I modified my inference code to wrap the model in nn.DataParallel when multiple GPUs are detected:

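import os

import torch
import torch.nn as nn

# DataConfig, ModelConfig, PEFTLoraConfig, TrainingConfig, load_configs_from_json,
# create_model_and_processor and load_model_from_checkpoint are project helpers
# (their imports were not included in this snippet).

class VideoVLMRewardInference: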
    def __init__(self, load_from_pretrained, load_from_pretrained_step=-1, device=None, dtype=torch.bfloat16):
        config_path = os.path.join(load_from_pretrained, "VideoReward/model_config.json")
        data_config, _, model_config, peft_lora_config, inference_config = load_configs_from_json(config_path)
        data_config = DataConfig(**data_config)
        model_config = ModelConfig(**model_config)
        peft_lora_config = PEFTLoraConfig(**peft_lora_config)

        training_args = TrainingConfig(
            load_from_pretrained=load_from_pretrained,
            load_from_pretrained_step=load_from_pretrained_step,
            gradient_checkpointing=False,
            disable_flash_attn2=False,
            bf16=(dtype == torch.bfloat16),
            fp16=(dtype == torch.float16),
            output_dir="",
        )
        
        model, processor, peft_config = create_model_and_processor(
            model_config=model_config,
            peft_lora_config=peft_lora_config,
            training_args=training_args,
        )

        model, checkpoint_step = load_model_from_checkpoint(
            model,
            load_from_pretrained,
            load_from_pretrained_step
        )
        model.eval()

        if torch.cuda.device_count() > 1:
            print(f"Using {torch.cuda.device_count()} GPUs for inference...")
            self.device = "cuda"
            model = nn.DataParallel(model)
        else:
            self.device = "cuda:0" if torch.cuda.is_available() else "cpu"

        self.model = model.to(self.device)
        self.processor = processor
        self.data_config = data_config
        self.inference_config = inference_config

When running multi-GPU inference, I get the following error:

  File ".../transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 372, in forward
    q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
  File ".../transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 255, in apply_rotary_pos_emb_vision
    output = (tensor * cos) + (rotate_half(tensor) * sin)
RuntimeError: The size of tensor a (8960) must match the size of tensor b (20480) at non-singleton dimension 1

Question:
How can I fix the dimension mismatch in rotary position embedding when switching from single-GPU to multi-GPU inference?
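
A possible explanation, offered as a sketch rather than a confirmed diagnosis: nn.DataParallel splits every tensor argument along dimension 0 across the GPUs, roughly the way torch.chunk does. If the processor output for Qwen2-VL is laid out as a flattened patch tensor of shape (total_patches, patch_dim) plus a video_grid_thw tensor with one (t, h, w) row per video, then a replica can receive only a slice of the patch rows while still computing rotary positions for the full grid. The pure-Python snippet below reproduces that arithmetic. The grid values, the GPU count, and the helper chunk_sizes are hypothetical and chosen for illustration only; they are not taken from the traceback above.

```python
import math


def chunk_sizes(total_rows, num_chunks):
    """Row counts you get when splitting total_rows rows into at most
    num_chunks pieces along dim 0, mimicking torch.chunk semantics."""
    size = math.ceil(total_rows / num_chunks)
    sizes = []
    remaining = total_rows
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


# Hypothetical single-video batch: one (t, h, w) patch grid.
video_grid_thw = [(4, 16, 24)]
total_patches = sum(t * h * w for t, h, w in video_grid_thw)

num_gpus = 2  # assumption for this illustration

# Each tensor is split independently along its own dim 0.
patch_rows_per_replica = chunk_sizes(total_patches, num_gpus)
grid_rows_per_replica = chunk_sizes(len(video_grid_thw), num_gpus)

# Replica 0 gets a slice of the patches but the whole grid row.
rows_received = patch_rows_per_replica[0]
positions_expected = sum(
    t * h * w for t, h, w in video_grid_thw[: grid_rows_per_replica[0]]
)

print(
    f"replica 0: {rows_received} patch rows, "
    f"grid implies {positions_expected} positions"
)
```

If this is indeed the cause, the mismatch comes from splitting the input rather than from the model itself. Commonly used alternatives are to shard the model across GPUs instead of the batch (for example, Hugging Face `from_pretrained` accepts `device_map="auto"`) or to run one process per GPU and give each process whole videos. Whether create_model_and_processor exposes either option is not something this snippet confirms.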
