
Transformers 5 and Qwen 3.5/3.6 official support #667

Open
Kovbo wants to merge 14 commits into main from transformers-5-official

Conversation

@Kovbo
Collaborator

@Kovbo Kovbo commented Apr 29, 2026

Summary

Updates ART’s backend stack to the official Transformers/vLLM/Unsloth releases and removes the older compatibility patches that were only needed before upstream Qwen 3.5/3.6 support landed.

Key changes:

  • Bumps backend deps to transformers==5.6.2, vllm==0.19.1, and newer Unsloth/Unsloth Zoo, TRL, Accelerate, PEFT, TorchAO, etc.
  • Switches to a forked Unsloth to resolve TRL and Transformers compatibility.
  • Removes ART’s Transformers/vLLM monkey patches for Qwen 3.5, rope validation, and mask preprocessing.
  • Adds chat_template_kwargs to InternalModelConfig and applies it consistently to both vLLM inference requests and local training tokenization (see the sketch after this list).
  • Updates Qwen 3.5/3.6 DeltaNet handling and removes the old Qwen MoE forced-merged rollout path.
  • Stops converting Unsloth/local MoE LoRA checkpoints before handing them to vLLM, since vLLM now supports PEFT fused target_parameters directly.
  • Keeps the remaining MoE LoRA conversion scoped to Megatron: Megatron still loads PEFT fused target-parameter MoE LoRA as ART's internal per-expert Megatron LoRA, then converts trained shards back to PEFT/vLLM-compatible fused tensors when merging checkpoint shards.
  • Adds and cleans up dedicated Megatron rollout_weights_mode="merged" support for Qwen 3.6 experiments: Megatron trains the LoRA adapter, while the dedicated vLLM server receives merged weights for rollout inference. This avoids relying on direct vLLM LoRA serving for Megatron target-parameter MoE adapters.
  • Megatron training also works without merging LoRA, achieved by rewriting exported Qwen 3.5/3.6 adapter keys to the vLLM/HF language_model.layers path.
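
As a sketch of the chat_template_kwargs plumbing mentioned above: the InternalModelConfig wiring is ART-internal, but the two call sites look roughly like this (the model name and kwarg values are illustrative, not taken from this PR):

from openai import OpenAI
from transformers import AutoTokenizer

chat_template_kwargs = {"enable_thinking": False}  # illustrative Qwen template kwarg
messages = [{"role": "user", "content": "Hello"}]

# Local training tokenization: forward the same kwargs to the chat template.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-35B-A3B")
prompt_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, **chat_template_kwargs
)

# vLLM inference: the OpenAI-compatible server accepts chat_template_kwargs
# in the request body, passed here via the client's extra_body.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=messages,
    extra_body={"chat_template_kwargs": chat_template_kwargs},
)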

@Kovbo Kovbo requested a review from vivekkalyan April 29, 2026 21:56
@Kovbo Kovbo requested review from FurtherAI May 1, 2026 00:00
@Kovbo Kovbo marked this pull request as ready for review May 1, 2026 00:00
@vivekkalyan
Collaborator

I think the new PEFT fused MoE <-> Megatron converter has the A/B orientation flipped for real PEFT target_parameters checkpoints.

I checked this against PEFT 0.18.1, transformers 5.6.2, and Qwen/Qwen3.5-35B-A3B. The model's expert parameter shapes are:

gate_up_proj: (256, 1024, 2048)
down_proj:    (256, 2048, 512)

For this target-parameter layout, PEFT saves:

mlp.experts.base_layer.lora_A.weight: (num_experts * r, 1024)
mlp.experts.base_layer.lora_B.weight: (2048, num_experts * r)
mlp.experts.lora_A.weight:            (num_experts * r, 2048)
mlp.experts.lora_B.weight:            (512, num_experts * r)
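
For reference, these shapes can be confirmed directly from a saved adapter; a minimal inspection sketch, assuming the default PEFT adapter_model.safetensors filename:

from safetensors import safe_open

# Print the shape of every LoRA tensor in the saved adapter.
with safe_open("adapter_model.safetensors", framework="pt") as f:
    for key in sorted(f.keys()):
        if "lora_A" in key or "lora_B" in key:
            print(key, tuple(f.get_slice(key).get_shape()))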

But the current converter maps PEFT lora_A -> Megatron lora_A and PEFT lora_B -> Megatron lora_B. For expert 0 with r=1, that produces:

gate_proj.lora_A: actual=(1, 1024), expected=(1, 2048)
gate_proj.lora_B: actual=(1024, 1), expected=(512, 1)
up_proj.lora_A:   actual=(1, 1024), expected=(1, 2048)
up_proj.lora_B:   actual=(1024, 1), expected=(512, 1)
down_proj.lora_A: actual=(1, 2048), expected=(1, 512)
down_proj.lora_B: actual=(512, 1), expected=(2048, 1)

So real Qwen MoE Megatron adapters should fail at load time with a shape mismatch, likely before training starts or during the initial merged-weight sync. The current unit test round-trips because its fixture uses the opposite tensor orientation from what PEFT actually writes for these target parameters.

Suggested fix: treat PEFT target-parameter tensors as transposed relative to Megatron module LoRA. In other words, for a fused expert base weight shaped [E, O, I]:

PEFT lora_A: [E * r, O]
PEFT lora_B: [I, E * r]

Megatron lora_A: [r, I]
Megatron lora_B: [O, r]
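
Here _reshape_expert_a/_reshape_expert_b (used below) are the converter's regrouping helpers; a rough sketch of what they would need to do under this orientation, assuming expert-major packing of the E * r dimension:

import torch

def _reshape_expert_a(key: str, tensor: torch.Tensor, *, rank: int):
    # PEFT lora_A for a fused [E, O, I] base arrives as [E * r, O];
    # split the packed leading dim into (E, r, O) for per-expert indexing.
    num_experts = tensor.shape[0] // rank
    return num_experts, tensor.reshape(num_experts, rank, tensor.shape[1])

def _reshape_expert_b(key: str, tensor: torch.Tensor, *, num_experts: int, rank: int):
    # PEFT lora_B for a fused [E, O, I] base arrives as [I, E * r];
    # regroup the packed trailing dim into (E, I, r) for per-expert indexing.
    return (
        tensor.reshape(tensor.shape[0], num_experts, rank)
        .permute(1, 0, 2)
        .contiguous()
    )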

For gate/up, the conversion should look more like:

num_experts, gate_up_peft_a = _reshape_expert_a(lora_a_key, lora_a, rank=rank)
gate_up_peft_b = _reshape_expert_b(
    lora_b_key, lora_b, num_experts=num_experts, rank=rank
)

# PEFT lora_A carries the fused output dim, so its gate/up halves
# become the per-expert Megatron lora_B factors.
gate_b, up_b = gate_up_peft_a.chunk(2, dim=2)

for expert_idx in range(num_experts):
    # PEFT lora_B carries the input dim, so it becomes Megatron lora_A.
    expert_a = gate_up_peft_b[expert_idx].T.contiguous()
    converted[f"{prefix}.{expert_idx}.gate_proj.lora_A.weight"] = expert_a
    converted[f"{prefix}.{expert_idx}.up_proj.lora_A.weight"] = expert_a.clone()
    converted[f"{prefix}.{expert_idx}.gate_proj.lora_B.weight"] = gate_b[expert_idx].T.contiguous()
    converted[f"{prefix}.{expert_idx}.up_proj.lora_B.weight"] = up_b[expert_idx].T.contiguous()

And down projection should swap similarly:

num_experts, down_peft_a = _reshape_expert_a(lora_a_key, lora_a, rank=rank)
down_peft_b = _reshape_expert_b(
    lora_b_key, lora_b, num_experts=num_experts, rank=rank
)

for expert_idx in range(num_experts):
    # Same flip as gate/up: PEFT lora_B supplies Megatron lora_A,
    # and PEFT lora_A supplies Megatron lora_B.
    converted[f"{prefix}.{expert_idx}.down_proj.lora_A.weight"] = down_peft_b[expert_idx].T.contiguous()
    converted[f"{prefix}.{expert_idx}.down_proj.lora_B.weight"] = down_peft_a[expert_idx].T.contiguous()

The inverse convert_megatron_moe_lora_to_peft_target_parameter also needs the matching inverse mapping; otherwise Megatron may load correctly, but the merged/exported PEFT/vLLM adapter will still be invalid.

I'd also update the converter test fixture to use the real PEFT target-parameter orientation:

gate_up_A = torch.empty(num_experts * rank, 2 * intermediate)
gate_up_B = torch.empty(hidden, num_experts * rank)
down_A = torch.empty(num_experts * rank, hidden)
down_B = torch.empty(intermediate, num_experts * rank)
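
A quick tensor-level sanity check of the proposed flip (a standalone sketch with toy sizes, independent of the converter code):

import torch

num_experts, rank, out_dim, in_dim = 4, 2, 6, 10  # toy sizes for a fused [E, O, I] base

# PEFT target-parameter layout as described above: lora_A [E*r, O], lora_B [I, E*r].
peft_a = torch.randn(num_experts * rank, out_dim)
peft_b = torch.randn(in_dim, num_experts * rank)

a_by_expert = peft_a.reshape(num_experts, rank, out_dim)
b_by_expert = peft_b.reshape(in_dim, num_experts, rank).permute(1, 0, 2)

for e in range(num_experts):
    megatron_a = b_by_expert[e].T  # PEFT lora_B -> Megatron lora_A, shape [r, I]
    megatron_b = a_by_expert[e].T  # PEFT lora_A -> Megatron lora_B, shape [O, r]
    assert megatron_a.shape == (rank, in_dim)
    assert megatron_b.shape == (out_dim, rank)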

@Kovbo
Collaborator Author

Kovbo commented May 1, 2026

@vivekkalyan, hmm. After applying that change, Megatron failed to load the adapter with a shape mismatch:

gate_proj: got (256, 1024, 8), expected (256, 2048, 8)

Then I inspected the actual PEFT checkpoint tensors produced for Qwen3.6, and they matched the original converter’s expected orientation, not the transposed orientation suggested by the other LLM.
