Add SGLang disaggregated P/D inference and Primus Megatron-LM scaleout benchmarks + mad-slurm-multinode skill by mkuznet1 · Pull Request #174 · ROCm/MAD

mkuznet1 · 2026-07-01T18:00:05Z

Summary

Adds two new multi-node benchmark workloads (SGLang disaggregated prefill/decode
inference and Primus Megatron-LM scaleout training) together with the
mad-slurm-multinode skill that deploys and runs them on a fresh SLURM cluster,
plus the corresponding models.json entries.

Branches off the current develop tip (Primus v26.4, #172); 4 commits,
34 files.

What's included

SGLang disaggregated P/D inference (`1a3e80e`)

- Single full-overlay Dockerfiles that merge RCCL + MoRI + NIXL/Mooncake
  KV-transfer into one build (no base-image chaining):
  docker/sglang_disagg_inference_full_overlay.ubuntu.amd.Dockerfile and the
  *.oci-rdma62.* variant (rdma-core v62 baked in for OCI-CX7 hosts).
- scripts/sglang_disagg/run.sh native launcher with in-container
  rank-ordered node-IP discovery (ip_rendezvous.py).
- parse_to_csv.py rewritten for full metric extraction (best-throughput
  iteration) into the madengine perf-CSV schema.
- Resilient sweep in benchmark_xPyD.sh (fail-fast + per-point retries).
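
One way to read "fail-fast + per-point retries" is sketched below in Python. This is an interpretation, not the logic of benchmark_xPyD.sh (which is a shell script): the names `run_sweep`, `run_point`, and `max_attempts` are hypothetical, and the policy of aborting the sweep once a point exhausts its retries is an assumption.

```python
from typing import Callable, Iterable, List, Tuple


def run_sweep(
    points: Iterable[int],
    run_point: Callable[[int], bool],
    max_attempts: int = 3,
) -> Tuple[List[int], bool]:
    """Run each sweep point, retrying a failed point up to max_attempts times.

    Returns (completed_points, ok). The sweep stops at the first point that
    still fails after all attempts, so later points are never started.
    """
    completed: List[int] = []
    for point in points:
        for attempt in range(1, max_attempts + 1):
            if run_point(point):
                completed.append(point)
                break
            print(f"point {point}: attempt {attempt}/{max_attempts} failed")
        else:
            # Every attempt failed: abort instead of burning cluster time.
            return completed, False
    return completed, True
```

A caller would pass the sweep points (for example, concurrency levels) and a function that launches one benchmark point and reports success.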

Primus Megatron-LM scaleout (`348e360`)

- docker/primus_megatron_train_rccl_overlay.ubuntu.amd.Dockerfile on the
  rocm/primus:v26.4 base (RCCL from source + librccl smifix, optional
  rdma-core for Broadcom Thor2 bnxt_re).
- Scaleout scripts under scripts/primus_scaleout/megatron-lm/
  (run / setup / report .sh + report .py), covering Llama-3.1 8B/70B/405B.

mad-slurm-multinode skill (`cf07399`)

- SKILL.md + reference docs (cluster-types, deploy-bootstrap, manifests,
  launch-and-results, gotchas) and helper scripts (detect_cluster_env,
  preflight, validate_manifest).
- Cluster-agnostic mad.env templates for CX7/Mellanox-RoCE,
  AMD-AINIC/Pollara, and Broadcom-Thor2-RoCE.
- Manifest templates for the supported workloads only
  (sglang_disagg_deepseek-r1, primus llama-3.1-8b/-70b on v26.4).
- Sanitized per-archetype walkthroughs (no real node names, queues, or tokens).

Model configurations (`91fbb75`)

models.json: SGLang disaggregated DeepSeek-R1 inference, and Primus
Megatron-LM scaleout training for Llama-3.1 8B/70B/405B.

Notes / test plan

- The Primus overlay pins rocm/primus:v26.4; the RCCL-from-source overlay is
  version-sensitive, so a build + smoke run should be confirmed before merge.
- SGLang path validated on a multi-node DeepSeek-R1 run (perf CSV harvested into
  the madengine schema).
- No cluster-specific values, secrets, or codenames are committed; manifests and
  mad.env templates use <FILL_...> placeholders.
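
Because the templates ship with <FILL_...> placeholders, a static check for unfilled values is cheap to run before a launch. The Python sketch below shows the idea only; it is not the logic of validate_manifest.sh, and the manifest keys in the example (`slurm`, `partition`, `env`) are made up for illustration.

```python
import json
import re
from typing import Any, List

# Matches any remaining template placeholder such as <FILL_PARTITION>.
PLACEHOLDER = re.compile(r"<FILL_[^>]*>")


def find_placeholders(node: Any, path: str = "$") -> List[str]:
    """Return JSON paths whose string value still contains <FILL_...>."""
    hits: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits.extend(find_placeholders(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits.extend(find_placeholders(value, f"{path}[{index}]"))
    elif isinstance(node, str) and PLACEHOLDER.search(node):
        hits.append(path)
    return hits


# Hypothetical manifest fragment, not the real template schema.
manifest = json.loads(
    '{"slurm": {"partition": "<FILL_PARTITION>", "nodes": 2},'
    ' "env": ["NCCL_SOCKET_IFNAME=<FILL_IFNAME>"]}'
)
print(find_placeholders(manifest))
```

An empty result means no placeholder survived; any returned path points at a value that still needs to be filled in.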

…ooling

Introduce single full-overlay Dockerfiles for SGLang disaggregated prefill/decode inference that merge the RCCL, MoRI, and NIXL/Mooncake KV-transfer layers into one build (no base-image chaining), plus the supporting run and benchmark scripts.

Key changes:
- New Dockerfiles: `sglang_disagg_inference_full_overlay.ubuntu.amd.Dockerfile` and the `*.oci-rdma62.*` variant (rdma-core v62 baked in for OCI-CX7 hosts).
- `run.sh` native launcher with in-container rank-ordered node-IP discovery (`ip_rendezvous.py`), so `IPADDRS`/`SGLANG_NODE_IPS` need not be forwarded.
- `parse_to_csv.py` rewritten for comprehensive metric extraction (best-throughput iteration) into the madengine perf-CSV schema.
- Resilient benchmark sweep in `benchmark_xPyD.sh` (fail-fast + point retries) writing to the madengine-expected perf CSV path.
- Updated `sglang_disagg_mori_io_ep.sh` / `sglang_disagg_server.sh` and README.

Co-authored-by: Ilia Kosarev <Ilia.Kosarev@amd.com>

…ripts

Add a Primus Megatron-LM multi-node scaleout training image (candidate RCCL built from source) and the benchmark scripts that drive it and convert training output into the madengine perf-CSV format.

Key changes:
- New overlay Dockerfile `primus_megatron_train_rccl_overlay.ubuntu.amd.Dockerfile` on the `rocm/primus:v26.4` base (RCCL built from source + librccl smifix, optional rdma-core for Broadcom Thor2 bnxt_re).
- New scaleout scripts under `scripts/primus_scaleout/megatron-lm/`: `run.sh`, `primus_megatron-lm_benchmark_setup.sh`, `primus_megatron-lm_benchmark_report.sh`, and `primus_megatron-lm_benchmark_report.py` (covers Llama-3.1 8B/70B/405B).

Co-authored-by: Ilia Kosarev <Ilia.Kosarev@amd.com>

Introduce the mad-slurm-multinode skill, which deploys and runs madengine performance tests on an unprepared SLURM cluster from scratch and launches multi-node runs from manifest templates.

Key additions:
- `SKILL.md` plus reference docs (cluster-types, deploy-bootstrap, manifests, launch-and-results, gotchas) and helper scripts (`detect_cluster_env.sh`, `preflight.sh`, `validate_manifest.sh`).
- Cluster-agnostic `mad.env` templates for the CX7/Mellanox-RoCE, AMD-AINIC/Pollara, and Broadcom-Thor2-RoCE archetypes.
- Manifest templates for the supported workloads: `sglang_disagg_deepseek-r1` and Primus Megatron-LM scaleout `primus_llama-3.1-8b`/`-70b` (on `rocm/primus:v26.4`).
- Sanitized end-to-end example walkthroughs (no real node names, queues, or tokens) for each archetype.

Co-authored-by: Ilia Kosarev <Ilia.Kosarev@amd.com>

Add model entries in `models.json` for the new inference and training workloads.

Key additions:
- SGLang disaggregated DeepSeek-R1 inference configuration.
- Primus Megatron-LM scaleout training configurations for Llama-3.1 8B/70B/405B, each with its Dockerfile, scripts, and perf-result tracking.

Co-authored-by: Ilia Kosarev <Ilia.Kosarev@amd.com>

Copilot

Pull request overview

This PR adds two new multi-node benchmark workloads (SGLang disaggregated prefill/decode inference and Primus Megatron-LM scaleout training) plus a new mad-slurm-multinode skill to bootstrap a fresh SLURM cluster and run supported workloads, along with corresponding models.json registrations and Docker overlay images.

Changes:

Introduces a madengine-native SGLang disagg entrypoint (scripts/sglang_disagg/run.sh) with in-container rank-ordered IP discovery and updated benchmark parsing/CSV emission.
Adds Primus Megatron-LM scaleout benchmark runner + reporting scripts and a build-verified RCCL overlay image based on rocm/primus:v26.4.
Adds the mad-slurm-multinode skill documentation, templates, and static manifest validator to deploy/run these workloads on multiple SLURM cluster archetypes.

Reviewed changes

Copilot reviewed 33 out of 34 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
scripts/sglang_disagg/sglang_disagg_server.sh	Adds KV transfer backend selection/validation and DeepSeek-R1 configs for the non-MoRI launcher.
scripts/sglang_disagg/sglang_disagg_mori_io_ep.sh	Moves runtime build-layer responsibilities into the image and improves benchmark failure propagation; adds KV backend note.
scripts/sglang_disagg/run.sh	New madengine bridge entrypoint; maps env/topology and performs in-container IP rendezvous + optional weight staging.
scripts/sglang_disagg/README.MD	Documents the madengine entrypoint flow, overlays, and updated parsing path.
scripts/sglang_disagg/parse_to_csv.py	Rewritten to extract a fuller metric set and emit madengine perf CSV rows per metric/config.
scripts/sglang_disagg/ip_rendezvous.py	New stdlib-only TCP rendezvous helper to reconstruct rank-ordered node IPs in-container.
scripts/sglang_disagg/benchmark_xPyD.sh	Makes the sweep more resilient (retries, fail-fast), aligns parsing/output with `MAD_OUTPUT_CSV`.
scripts/primus_scaleout/megatron-lm/run.sh	New Primus scaleout runner selecting model + precision combinations based on detected device.
scripts/primus_scaleout/megatron-lm/primus_megatron-lm_benchmark_setup.sh	New setup script for Primus benchmark prerequisites (tokenizers, repo prep).
scripts/primus_scaleout/megatron-lm/primus_megatron-lm_benchmark_report.sh	New Megatron-LM benchmark driver that launches training and writes perf CSVs.
scripts/primus_scaleout/megatron-lm/primus_megatron-lm_benchmark_report.py	New parser to convert Primus training logs into CSV metrics (incl. running-average variants).
models.json	Registers new SGLang disagg DeepSeek-R1 workload and Primus Megatron-LM scaleout workloads.
docker/sglang_disagg_inference_full_overlay.ubuntu.amd.Dockerfile	Adds merged “full overlay” image build (RCCL+MoRI+RIXL/NIXL+Mooncake) with sanity checks.
docker/sglang_disagg_inference_full_overlay.oci-rdma62.ubuntu.amd.Dockerfile	OCI variant that additionally bakes rdma-core v62 for specific host stacks.
docker/primus_megatron_train_rccl_overlay.ubuntu.amd.Dockerfile	Adds build-verified RCCL-from-source overlay + optional rdma-core source build for Primus training images.
.gitignore	Un-ignores checked-in skill manifest template JSON files under the skill assets path.
.claude/skills/mad-slurm-multinode/SKILL.md	Adds the new SLURM bootstrap + run skill with required inputs, workflow, and responsibilities.
.claude/skills/mad-slurm-multinode/scripts/validate_manifest.sh	Adds a GPU-free static manifest validator (placeholders, env consistency, asset resolution).
.claude/skills/mad-slurm-multinode/scripts/preflight.sh	Adds a preflight checker for docker/SLURM/git/python/conda/GPU SMI/HF token prerequisites.
.claude/skills/mad-slurm-multinode/scripts/detect_cluster_env.sh	Adds a node inspection helper to propose archetype-specific network/RDMA settings.
.claude/skills/mad-slurm-multinode/references/manifests.md	Adds manifest anatomy + fill checklist guidance (secrets handling, mounts, interface consistency).
.claude/skills/mad-slurm-multinode/references/launch-and-results.md	Adds run/aggregation guidance and failure triage for multi-node madengine runs.
.claude/skills/mad-slurm-multinode/references/gotchas.md	Adds cross-cutting + per-workload pitfalls and validated workarounds/expectations.
.claude/skills/mad-slurm-multinode/references/deploy-bootstrap.md	Adds detailed idempotent bootstrap steps for fresh nodes (clone, conda, install, env).
.claude/skills/mad-slurm-multinode/references/cluster-types.md	Documents archetype-specific transport settings (CX7/AINIC/Thor2) and validation guidance.
.claude/skills/mad-slurm-multinode/examples/thor2-bnxt-walkthrough.md	Adds sanitized end-to-end example for Broadcom Thor2 clusters.
.claude/skills/mad-slurm-multinode/examples/cx7-roce-walkthrough.md	Adds sanitized end-to-end example for CX7/Mellanox RoCE clusters.
.claude/skills/mad-slurm-multinode/examples/amd-ainic-walkthrough.md	Adds sanitized end-to-end example for AMD AINIC/Pollara clusters.
.claude/skills/mad-slurm-multinode/assets/manifests/sglang_disagg_deepseek-r1.template.json	Adds a filled-workload template manifest for SGLang disagg DeepSeek-R1.
.claude/skills/mad-slurm-multinode/assets/manifests/primus_llama-3.1-8b.template.json	Adds a filled-workload template manifest for Primus 8B scaleout.
.claude/skills/mad-slurm-multinode/assets/manifests/primus_llama-3.1-70b.template.json	Adds a filled-workload template manifest for Primus 70B scaleout.
.claude/skills/mad-slurm-multinode/assets/mad.env/mad.env.thor2-bnxt.template	Adds archetype `mad.env` template for Broadcom Thor2 environments.
.claude/skills/mad-slurm-multinode/assets/mad.env/mad.env.cx7-roce.template	Adds archetype `mad.env` template for CX7/Mellanox RoCE environments.
.claude/skills/mad-slurm-multinode/assets/mad.env/mad.env.amd-ainic.template	Adds archetype `mad.env` template for AMD AINIC/Pollara environments.
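
For readers unfamiliar with the rendezvous pattern that the table attributes to ip_rendezvous.py, the stdlib-only Python sketch below shows one plausible shape: rank 0 listens, every other rank reports its (rank, IP) pair, and rank 0 replies with the rank-ordered list. The wire format (one JSON line each way) and the function names `serve`, `join`, and `local_demo` are assumptions; the actual helper may work differently.

```python
import json
import socket
import threading
from typing import List


def serve(listener: socket.socket, world_size: int, my_ip: str) -> List[str]:
    """Rank 0: collect one report per peer, then send everyone the ordered list."""
    ips = {0: my_ip}
    conns = []
    while len(ips) < world_size:
        conn, _ = listener.accept()
        report = json.loads(conn.makefile("r").readline())
        ips[int(report["rank"])] = report["ip"]
        conns.append(conn)
    ordered = [ips[rank] for rank in range(world_size)]
    payload = (json.dumps(ordered) + "\n").encode()
    for conn in conns:
        conn.sendall(payload)
        conn.close()
    return ordered


def join(host: str, port: int, rank: int, my_ip: str) -> List[str]:
    """Rank > 0: report our rank and IP, then wait for the ordered list."""
    with socket.create_connection((host, port), timeout=30) as conn:
        conn.sendall((json.dumps({"rank": rank, "ip": my_ip}) + "\n").encode())
        return json.loads(conn.makefile("r").readline())


def local_demo(world_size: int = 3) -> List[List[str]]:
    """Run all ranks as threads on localhost and return each rank's view."""
    listener = socket.create_server(("127.0.0.1", 0))  # port 0 picks a free port
    port = listener.getsockname()[1]
    views: List[List[str]] = [[] for _ in range(world_size)]

    def peer(rank: int) -> None:
        views[rank] = join("127.0.0.1", port, rank, f"10.0.0.{rank + 1}")

    # Start peers in reverse order to show that arrival order does not matter.
    threads = [
        threading.Thread(target=peer, args=(rank,))
        for rank in reversed(range(1, world_size))
    ]
    for thread in threads:
        thread.start()
    views[0] = serve(listener, world_size, "10.0.0.1")
    for thread in threads:
        thread.join()
    listener.close()
    return views
```

Every rank ends up with the same list indexed by rank, which is what allows the node IPs to be reconstructed inside the container without forwarding them from the host.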

Comments suppressed due to low confidence (1)

scripts/sglang_disagg/sglang_disagg_mori_io_ep.sh:191

KV_TRANSFER_BACKEND is appended into PREFILL_MODEL_CONFIG/DECODE_MODEL_CONFIG and later executed via eval, but it is not validated. This allows invalid backends and can enable shell-token injection via KV_TRANSFER_BACKEND.
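
The usual remedy for this class of issue is an allowlist check before the value ever reaches a command line. The Python sketch below shows the check in isolation; the set of backend names is an assumption for illustration, and in the shell launcher the equivalent guard would be a `case` match placed before the `eval`.

```python
# Assumed allowlist for illustration only; the authoritative set of
# supported backends should come from the launcher scripts.
ALLOWED_KV_BACKENDS = frozenset({"mooncake", "nixl", "mori"})


def validate_kv_backend(value: str) -> str:
    """Return value unchanged if it is allowlisted, else raise ValueError.

    An exact match against a fixed set rejects both unknown backends and
    strings carrying shell metacharacters such as ';' or '$('.
    """
    if value not in ALLOWED_KV_BACKENDS:
        raise ValueError(f"unsupported KV_TRANSFER_BACKEND: {value!r}")
    return value
```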

+export xP="${xP:-${SGLANG_DISAGG_PREFILL_NODES:-1}}"
+export yD="${yD:-${SGLANG_DISAGG_DECODE_NODES:-1}}"
+export NNODES="${NNODES:-${SGLANG_DISAGG_TOTAL_NODES:-${WORLD_SIZE:-$((xP + yD))}}}"
+export MASTER_PORT="${MASTER_PORT:-23731}"
+export IO_EP_TP_SIZE="${IO_EP_TP_SIZE:-${SGLANG_TP_SIZE:-8}}"
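
The nested `${VAR:-...}` defaults above are easy to misread, so here is a Python sketch that mirrors the same precedence for `xP`, `yD`, and `NNODES`. The helper names `first_set` and `resolve_topology` are hypothetical. As with bash `:-`, an empty value is treated the same as an unset one.

```python
from typing import Dict, Mapping, Optional


def first_set(env: Mapping[str, str], *names: str) -> Optional[str]:
    """Return the first variable that is set and non-empty, like ${VAR:-...}."""
    for name in names:
        value = env.get(name, "")
        if value != "":
            return value
    return None


def resolve_topology(env: Mapping[str, str]) -> Dict[str, int]:
    """Mirror the defaulting chain for xP, yD, and NNODES."""
    xp = int(first_set(env, "xP", "SGLANG_DISAGG_PREFILL_NODES") or "1")
    yd = int(first_set(env, "yD", "SGLANG_DISAGG_DECODE_NODES") or "1")
    explicit = first_set(env, "NNODES", "SGLANG_DISAGG_TOTAL_NODES", "WORLD_SIZE")
    # Only fall back to xP + yD when no explicit node count is provided.
    nnodes = int(explicit) if explicit is not None else xp + yd
    return {"xP": xp, "yD": yd, "NNODES": nnodes}
```

Note that an explicit `NNODES`, `SGLANG_DISAGG_TOTAL_NODES`, or `WORLD_SIZE` takes precedence even when it disagrees with `xP + yD`; the chain does not cross-check them.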


+    if dp_mode == "1":
+        backend = "mori_dp"
+    elif run_mori == "1":
+        backend = "mori_io"
    else:


raviguptaamd · 2026-07-03T00:28:07Z

+RUN_LOG_JOB_ID="${SLURM_JOB_ID:-0}"
+RUN_LOG_DIR="/run_logs/${RUN_LOG_JOB_ID}"
+mkdir -p "$RUN_LOG_DIR" 2>/dev/null || true
+
+LOG="${RUN_LOG_DIR}/benchmark_${RUN_LOG_JOB_ID}_${timestamp}_xP${xP}_yD${yD}_${MODEL_NAME}"


/tmp/run_logs is not NFS-bound. /run_logs should be an NFS-backed directory passed in through Docker environment variables; as written, this is incorrect for disagg.
cc: @lcskrishna @basemam
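
One reading of this request is that the log root should come from the environment rather than being hard-coded, and that a missing or unwritable directory should fail loudly instead of being masked by `|| true`. The Python sketch below shows that reading; the variable name `RUN_LOG_ROOT` is hypothetical, and the check only confirms the directory is writable, not that it is NFS-backed.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_run_log_dir(env: Mapping[str, str]) -> Path:
    """Build <root>/<job id> from the environment and verify it is usable."""
    root = env.get("RUN_LOG_ROOT", "")  # hypothetical variable name
    if not root:
        raise RuntimeError("RUN_LOG_ROOT is not set; refusing to guess a log dir")
    job_id = env.get("SLURM_JOB_ID") or "0"  # same fallback as the diff above
    log_dir = Path(root) / job_id
    log_dir.mkdir(parents=True, exist_ok=True)
    if not os.access(log_dir, os.W_OK):
        raise RuntimeError(f"log dir is not writable: {log_dir}")
    return log_dir
```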

+# Parse named arguments
+while [[ "$#" -gt 0 ]]; do
+    case $1 in
+        --model_repo) MODEL_REPO="$2"; shift ;;
+        *) echo "Unknown parameter passed: $1"; usage ;;
+    esac
+    shift
+done


+fi
+
+# Run primus pytorch setup script
+echo "Running setup script to download tokenizers"
+bash ./primus_megatron-lm_benchmark_setup.sh -m $model


+# Parse named arguments
+while [[ "$#" -gt 0 ]]; do
+    case $1 in
+        -m) MODEL_NAME="$2"; shift ;;
+        *) echo "Unknown parameter passed: $1"; usage ;;
+    esac
+    shift
+done


+cd /workspace/Primus
+git pull


raviguptaamd · 2026-07-03T00:29:56Z

Why is this file added?

raviguptaamd · 2026-07-03T00:30:49Z

Core sglang_disagg is being changed; this needs a thorough review.

mkuznet1 and others added 4 commits July 1, 2026 17:51

i-kosarev requested a review from Copilot July 1, 2026 18:26

Copilot started reviewing on behalf of i-kosarev July 1, 2026 18:27

Copilot AI reviewed Jul 1, 2026

raviguptaamd requested changes Jul 3, 2026

i-kosarev mentioned this pull request Jul 3, 2026

Fix Primus Llama-3.1-8B scaleout: GBS normalize + NCCL_NET_PLUGIN for v26.4 RCCL overlay mkuznet1/MAD#1

Open

5 tasks

mkuznet1 wants to merge 4 commits into ROCm:develop from mkuznet1:aicomnet_dev_public

3 participants
