
Transformers 5 and Qwen 3.5/3.6 official support #667

Open
Kovbo wants to merge 14 commits into main from transformers-5-official

Conversation

@Kovbo
Collaborator

@Kovbo Kovbo commented Apr 29, 2026

Summary

Updates ART’s backend stack to the official Transformers/vLLM/Unsloth releases and removes the older compatibility patches that were only needed before upstream Qwen 3.5/3.6 support landed.

Key changes:

  • Bumps backend deps to transformers==5.6.2, vllm==0.19.1, and newer Unsloth/Unsloth Zoo, TRL, Accelerate, PEFT, TorchAO, etc.
  • Switches to a forked Unsloth to resolve TRL and Transformers compatibility.
  • Removes ART’s Transformers/vLLM monkey patches for Qwen 3.5, rope validation, and mask preprocessing.
  • Adds chat_template_kwargs to InternalModelConfig and applies it consistently to both vLLM inference requests and local training tokenization (see the sketch after this list).
  • Updates Qwen 3.5/3.6 DeltaNet handling and removes the old Qwen MoE forced-merged rollout path.
  • Stops converting Unsloth/local MoE LoRA checkpoints before handing them to vLLM, since vLLM now supports PEFT fused target_parameters directly.
  • Keeps the remaining MoE LoRA conversion scoped to Megatron: Megatron still loads PEFT fused target-parameter MoE LoRA as ART's internal per-expert Megatron LoRA, then converts trained shards back to PEFT/vLLM-compatible fused tensors when merging checkpoint shards.
  • Adds and cleans up dedicated Megatron rollout_weights_mode="merged" support for Qwen 3.6 experiments: Megatron trains the LoRA adapter, while the dedicated vLLM server receives merged weights for rollout inference. This avoids relying on direct vLLM LoRA serving for Megatron target-parameter MoE adapters.
  • Megatron training also works without merging LoRA, achieved by rewriting exported Qwen 3.5/3.6 adapter keys to the vLLM/HF language_model.layers path.
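
As a sketch of the chat_template_kwargs plumbing mentioned above: the InternalModelConfig wiring is ART-internal, but the two call sites look roughly like this (the model name and kwarg values are illustrative, not taken from this PR):

from openai import OpenAI
from transformers import AutoTokenizer

chat_template_kwargs = {"enable_thinking": False}  # illustrative Qwen template kwarg
messages = [{"role": "user", "content": "Hello"}]

# Local training tokenization: forward the same kwargs to the chat template.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-35B-A3B")
prompt_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, **chat_template_kwargs
)

# vLLM inference: the OpenAI-compatible server accepts chat_template_kwargs
# in the request body, passed here via the client's extra_body.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=messages,
    extra_body={"chat_template_kwargs": chat_template_kwargs},
)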

@Kovbo Kovbo requested a review from vivekkalyan April 29, 2026 21:56
@Kovbo Kovbo requested review from FurtherAI May 1, 2026 00:00
@Kovbo Kovbo marked this pull request as ready for review May 1, 2026 00:00
@vivekkalyan
Collaborator

I think the new PEFT fused MoE <-> Megatron converter has the A/B orientation flipped for real PEFT target_parameters checkpoints.

I checked this against PEFT 0.18.1, transformers 5.6.2, and Qwen/Qwen3.5-35B-A3B. The model's expert parameter shapes are:

gate_up_proj: (256, 1024, 2048)
down_proj:    (256, 2048, 512)

For this target-parameter layout, PEFT saves:

mlp.experts.base_layer.lora_A.weight: (num_experts * r, 1024)
mlp.experts.base_layer.lora_B.weight: (2048, num_experts * r)
mlp.experts.lora_A.weight:            (num_experts * r, 2048)
mlp.experts.lora_B.weight:            (512, num_experts * r)
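
For reference, these shapes can be confirmed directly from a saved adapter; a minimal inspection sketch, assuming the default PEFT adapter_model.safetensors filename:

from safetensors import safe_open

# Print the shape of every LoRA tensor in the saved adapter.
with safe_open("adapter_model.safetensors", framework="pt") as f:
    for key in sorted(f.keys()):
        if "lora_A" in key or "lora_B" in key:
            print(key, tuple(f.get_slice(key).get_shape()))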

But the current converter maps PEFT lora_A -> Megatron lora_A and PEFT lora_B -> Megatron lora_B. For expert 0 with r=1, that produces:

gate_proj.lora_A: actual=(1, 1024), expected=(1, 2048)
gate_proj.lora_B: actual=(1024, 1), expected=(512, 1)
up_proj.lora_A:   actual=(1, 1024), expected=(1, 2048)
up_proj.lora_B:   actual=(1024, 1), expected=(512, 1)
down_proj.lora_A: actual=(1, 2048), expected=(1, 512)
down_proj.lora_B: actual=(512, 1), expected=(2048, 1)

So real Qwen MoE Megatron adapters should fail at load time with a shape mismatch, likely before training starts or during the initial merged-weight sync. The current unit test round-trips because its fixture uses the opposite tensor orientation from what PEFT actually writes for these target parameters.

Suggested fix: treat PEFT target-parameter tensors as transposed relative to Megatron module LoRA. In other words, for a fused expert base weight shaped [E, O, I]:

PEFT lora_A: [E * r, O]
PEFT lora_B: [I, E * r]

Megatron lora_A: [r, I]
Megatron lora_B: [O, r]
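
Here _reshape_expert_a/_reshape_expert_b (used below) are the converter's regrouping helpers; a rough sketch of what they would need to do under this orientation, assuming expert-major packing of the E * r dimension:

import torch

def _reshape_expert_a(key: str, tensor: torch.Tensor, *, rank: int):
    # PEFT lora_A for a fused [E, O, I] base arrives as [E * r, O];
    # split the packed leading dim into (E, r, O) for per-expert indexing.
    num_experts = tensor.shape[0] // rank
    return num_experts, tensor.reshape(num_experts, rank, tensor.shape[1])

def _reshape_expert_b(key: str, tensor: torch.Tensor, *, num_experts: int, rank: int):
    # PEFT lora_B for a fused [E, O, I] base arrives as [I, E * r];
    # regroup the packed trailing dim into (E, I, r) for per-expert indexing.
    return (
        tensor.reshape(tensor.shape[0], num_experts, rank)
        .permute(1, 0, 2)
        .contiguous()
    )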

For gate/up, the conversion should look more like:

num_experts, gate_up_peft_a = _reshape_expert_a(lora_a_key, lora_a, rank=rank)
gate_up_peft_b = _reshape_expert_b(
    lora_b_key, lora_b, num_experts=num_experts, rank=rank
)

# PEFT lora_A carries the fused output dim, so its gate/up halves
# become the per-expert Megatron lora_B factors.
gate_b, up_b = gate_up_peft_a.chunk(2, dim=2)

for expert_idx in range(num_experts):
    # PEFT lora_B carries the input dim, so it becomes Megatron lora_A.
    expert_a = gate_up_peft_b[expert_idx].T.contiguous()
    converted[f"{prefix}.{expert_idx}.gate_proj.lora_A.weight"] = expert_a
    converted[f"{prefix}.{expert_idx}.up_proj.lora_A.weight"] = expert_a.clone()
    converted[f"{prefix}.{expert_idx}.gate_proj.lora_B.weight"] = gate_b[expert_idx].T.contiguous()
    converted[f"{prefix}.{expert_idx}.up_proj.lora_B.weight"] = up_b[expert_idx].T.contiguous()

And down projection should swap similarly:

num_experts, down_peft_a = _reshape_expert_a(lora_a_key, lora_a, rank=rank)
down_peft_b = _reshape_expert_b(
    lora_b_key, lora_b, num_experts=num_experts, rank=rank
)

for expert_idx in range(num_experts):
    # Same flip as gate/up: PEFT lora_B supplies Megatron lora_A,
    # and PEFT lora_A supplies Megatron lora_B.
    converted[f"{prefix}.{expert_idx}.down_proj.lora_A.weight"] = down_peft_b[expert_idx].T.contiguous()
    converted[f"{prefix}.{expert_idx}.down_proj.lora_B.weight"] = down_peft_a[expert_idx].T.contiguous()

The inverse convert_megatron_moe_lora_to_peft_target_parameter also needs the matching inverse mapping; otherwise Megatron may load correctly, but the merged/exported PEFT/vLLM adapter will still be invalid.

I'd also update the converter test fixture to use the real PEFT target-parameter orientation:

gate_up_A = torch.empty(num_experts * rank, 2 * intermediate)
gate_up_B = torch.empty(hidden, num_experts * rank)
down_A = torch.empty(num_experts * rank, hidden)
down_B = torch.empty(intermediate, num_experts * rank)
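
A quick tensor-level sanity check of the proposed flip (a standalone sketch with toy sizes, independent of the converter code):

import torch

num_experts, rank, out_dim, in_dim = 4, 2, 6, 10  # toy sizes for a fused [E, O, I] base

# PEFT target-parameter layout as described above: lora_A [E*r, O], lora_B [I, E*r].
peft_a = torch.randn(num_experts * rank, out_dim)
peft_b = torch.randn(in_dim, num_experts * rank)

a_by_expert = peft_a.reshape(num_experts, rank, out_dim)
b_by_expert = peft_b.reshape(in_dim, num_experts, rank).permute(1, 0, 2)

for e in range(num_experts):
    megatron_a = b_by_expert[e].T  # PEFT lora_B -> Megatron lora_A, shape [r, I]
    megatron_b = a_by_expert[e].T  # PEFT lora_A -> Megatron lora_B, shape [O, r]
    assert megatron_a.shape == (rank, in_dim)
    assert megatron_b.shape == (out_dim, rank)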

@Kovbo
Collaborator Author

Kovbo commented May 1, 2026

@vivekkalyan, hmm. After applying that change, Megatron failed to load the adapter with a shape mismatch:

gate_proj: got (256, 1024, 8), expected (256, 2048, 8)

Then I inspected the actual PEFT checkpoint tensors produced for Qwen3.6, and they matched the original converter’s expected orientation, not the transposed orientation suggested by the other LLM.
