Add SGLang disaggregated P/D inference and Primus Megatron-LM scaleout benchmarks + mad-slurm-multinode skill #174

Draft
mkuznet1 wants to merge 4 commits into ROCm:develop from mkuznet1:aicomnet_dev_public

Conversation

mkuznet1 commented Jul 1, 2026

Summary

Adds two new multi-node benchmark workloads (SGLang disaggregated prefill/decode
inference and Primus Megatron-LM scaleout training) together with the
mad-slurm-multinode skill that deploys and runs them on a fresh SLURM cluster,
plus the corresponding models.json entries.

Branches off the current develop tip (Primus v26.4, #172); 4 commits,
34 files.

What's included

SGLang disaggregated P/D inference (1a3e80e)

  • Single full-overlay Dockerfiles that merge RCCL + MoRI + NIXL/Mooncake
    KV-transfer into one build (no base-image chaining):
    docker/sglang_disagg_inference_full_overlay.ubuntu.amd.Dockerfile and the
    *.oci-rdma62.* variant (rdma-core v62 baked in for OCI-CX7 hosts).
  • scripts/sglang_disagg/run.sh native launcher with in-container
    rank-ordered node-IP discovery (ip_rendezvous.py).
  • parse_to_csv.py rewritten for full metric extraction (best-throughput
    iteration) into the madengine perf-CSV schema.
  • Resilient sweep in benchmark_xPyD.sh (fail-fast + per-point retries).
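One way to read "fail-fast + per-point retries" is sketched below in Python: each sweep point is retried a bounded number of times, and the sweep aborts as soon as one point exhausts its attempts. This is an illustration only, not the logic of `benchmark_xPyD.sh`; the names `run_sweep`, `run_point`, and `max_attempts` are hypothetical.

```python
def run_sweep(points, run_point, max_attempts=3):
    """Run every sweep point, retrying each up to max_attempts times.

    Assumption: "fail-fast" means the sweep stops at the first point that
    still fails after its retries, so later points are never started.
    """
    results = {}
    for point in points:
        last_error = None
        for _attempt in range(max_attempts):
            try:
                results[point] = run_point(point)
                break  # this point succeeded, move on to the next one
            except Exception as exc:  # a real runner would catch narrower errors
                last_error = exc
        else:
            # No attempt succeeded: abort the whole sweep.
            raise RuntimeError(
                f"sweep point {point!r} failed after {max_attempts} attempts"
            ) from last_error
    return results
```

Here `run_point` would wrap a single benchmark invocation (for example one concurrency level) and raise on failure. Transient failures cost a retry, while a persistently broken point ends the run before it wastes further cluster time.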

Primus Megatron-LM scaleout (348e360)

  • docker/primus_megatron_train_rccl_overlay.ubuntu.amd.Dockerfile on the
    rocm/primus:v26.4 base (RCCL from source + librccl smifix, optional
    rdma-core for Broadcom Thor2 bnxt_re).
  • Scaleout scripts under scripts/primus_scaleout/megatron-lm/
    (run / setup / report .sh + report .py), covering Llama-3.1 8B/70B/405B.

mad-slurm-multinode skill (cf07399)

  • SKILL.md + reference docs (cluster-types, deploy-bootstrap, manifests,
    launch-and-results, gotchas) and helper scripts (detect_cluster_env,
    preflight, validate_manifest).
  • Cluster-agnostic mad.env templates for CX7/Mellanox-RoCE,
    AMD-AINIC/Pollara, and Broadcom-Thor2-RoCE.
  • Manifest templates for the supported workloads only
    (sglang_disagg_deepseek-r1, primus llama-3.1-8b/-70b on v26.4).
  • Sanitized per-archetype walkthroughs (no real node names, queues, or tokens).

Model configurations (91fbb75)

  • models.json: SGLang disaggregated DeepSeek-R1 inference, and Primus
    Megatron-LM scaleout training for Llama-3.1 8B/70B/405B.

Notes / test plan

  • The Primus overlay pins rocm/primus:v26.4; the RCCL-from-source overlay is
    version-sensitive, so a build + smoke run should be confirmed before merge.
  • SGLang path validated on a multi-node DeepSeek-R1 run (perf CSV harvested into
    the madengine schema).
  • No cluster-specific values, secrets, or codenames are committed; manifests and
    mad.env templates use <FILL_...> placeholders.
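A check for leftover placeholders could look roughly like the Python sketch below. It is not the `validate_manifest` helper from this PR. The regular expression assumes every placeholder has the form `<FILL_` followed by text and a closing `>`, and the function name and the keys in the usage example are made up for illustration.

```python
import re

# Assumption: placeholders look like <FILL_SOMETHING>; the exact names are not
# given in the PR description.
PLACEHOLDER = re.compile(r"<FILL_[^<>]*>")


def find_placeholders(node, path="$"):
    """Return (json_path, placeholder) pairs for every unfilled string value."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_placeholders(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits += find_placeholders(value, f"{path}[{index}]")
    elif isinstance(node, str):
        hits += [(path, match) for match in PLACEHOLDER.findall(node)]
    return hits
```

Calling `find_placeholders(json.load(f))` on a manifest and refusing to launch when the result is non-empty catches a template that was copied but never filled in, without needing a GPU or a SLURM allocation.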

mkuznet1 and others added 4 commits July 1, 2026 17:51
…ooling

Introduce single full-overlay Dockerfiles for SGLang disaggregated
prefill/decode inference that merge the RCCL, MoRI, and NIXL/Mooncake
KV-transfer layers into one build (no base-image chaining), plus the
supporting run and benchmark scripts.

Key changes:
- New Dockerfiles: `sglang_disagg_inference_full_overlay.ubuntu.amd.Dockerfile`
  and the `*.oci-rdma62.*` variant (rdma-core v62 baked in for OCI-CX7 hosts).
- `run.sh` native launcher with in-container rank-ordered node-IP discovery
  (`ip_rendezvous.py`), so `IPADDRS`/`SGLANG_NODE_IPS` need not be forwarded.
- `parse_to_csv.py` rewritten for comprehensive metric extraction
  (best-throughput iteration) into the madengine perf-CSV schema.
- Resilient benchmark sweep in `benchmark_xPyD.sh` (fail-fast + point retries)
  writing to the madengine-expected perf CSV path.
- Updated `sglang_disagg_mori_io_ep.sh` / `sglang_disagg_server.sh` and README.

Co-authored-by: Ilia Kosarev <Ilia.Kosarev@amd.com>
…ripts

Add a Primus Megatron-LM multi-node scaleout training image (candidate RCCL
built from source) and the benchmark scripts that drive it and convert
training output into the madengine perf-CSV format.

Key changes:
- New overlay Dockerfile
  `primus_megatron_train_rccl_overlay.ubuntu.amd.Dockerfile` on the
  `rocm/primus:v26.4` base (RCCL built from source + librccl smifix, optional
  rdma-core for Broadcom Thor2 bnxt_re).
- New scaleout scripts under `scripts/primus_scaleout/megatron-lm/`: `run.sh`,
  `primus_megatron-lm_benchmark_setup.sh`,
  `primus_megatron-lm_benchmark_report.sh`, and
  `primus_megatron-lm_benchmark_report.py` (covers Llama-3.1 8B/70B/405B).

Co-authored-by: Ilia Kosarev <Ilia.Kosarev@amd.com>
Introduce the mad-slurm-multinode skill, which deploys and runs madengine
performance tests on an unprepared SLURM cluster from scratch and launches
multi-node runs from manifest templates.

Key additions:
- `SKILL.md` plus reference docs (cluster-types, deploy-bootstrap, manifests,
  launch-and-results, gotchas) and helper scripts (`detect_cluster_env.sh`,
  `preflight.sh`, `validate_manifest.sh`).
- Cluster-agnostic `mad.env` templates for the CX7/Mellanox-RoCE,
  AMD-AINIC/Pollara, and Broadcom-Thor2-RoCE archetypes.
- Manifest templates for the supported workloads: `sglang_disagg_deepseek-r1`
  and Primus Megatron-LM scaleout `primus_llama-3.1-8b`/`-70b` (on
  `rocm/primus:v26.4`).
- Sanitized end-to-end example walkthroughs (no real node names, queues, or
  tokens) for each archetype.

Co-authored-by: Ilia Kosarev <Ilia.Kosarev@amd.com>
Add model entries in `models.json` for the new inference and training
workloads.

Key additions:
- SGLang disaggregated DeepSeek-R1 inference configuration.
- Primus Megatron-LM scaleout training configurations for Llama-3.1
  8B/70B/405B, each with its Dockerfile, scripts, and perf-result tracking.

Co-authored-by: Ilia Kosarev <Ilia.Kosarev@amd.com>

Copilot AI left a comment

Pull request overview

This PR adds two new multi-node benchmark workloads (SGLang disaggregated prefill/decode inference and Primus Megatron-LM scaleout training) plus a new mad-slurm-multinode skill to bootstrap a fresh SLURM cluster and run supported workloads, along with corresponding models.json registrations and Docker overlay images.

Changes:

  • Introduces a madengine-native SGLang disagg entrypoint (scripts/sglang_disagg/run.sh) with in-container rank-ordered IP discovery and updated benchmark parsing/CSV emission.
  • Adds Primus Megatron-LM scaleout benchmark runner + reporting scripts and a build-verified RCCL overlay image based on rocm/primus:v26.4.
  • Adds the mad-slurm-multinode skill documentation, templates, and static manifest validator to deploy/run these workloads on multiple SLURM cluster archetypes.
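The in-container, rank-ordered IP discovery mentioned in the first bullet can be pictured with the minimal stdlib-only sketch below. It is not the contents of `ip_rendezvous.py`. The function names and the line-delimited JSON wire format are assumptions made for illustration: every node reports its rank and IP to one rendezvous socket, and all nodes receive the same list sorted by rank.

```python
import json
import socket
import threading  # used by callers to run the server side in the background


def _recv_line(sock):
    """Read bytes until a newline (or EOF) and return the decoded line."""
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            break
        buf += chunk
    return buf.decode()


def serve_rendezvous(server_sock, world_size):
    """Collect one {rank, ip} report per node, then reply with the ordered list.

    Assumption: ranks are unique; a duplicate rank would leave this loop
    waiting, so a production helper would also need a timeout.
    """
    reports, conns = {}, []
    while len(reports) < world_size:
        conn, _addr = server_sock.accept()
        msg = json.loads(_recv_line(conn))
        reports[int(msg["rank"])] = msg["ip"]
        conns.append(conn)
    ordered = [reports[rank] for rank in sorted(reports)]
    payload = (json.dumps(ordered) + "\n").encode()
    for conn in conns:
        conn.sendall(payload)
        conn.close()
    return ordered


def report_and_wait(host, port, rank, ip, timeout=10.0):
    """Send this node's rank and IP, then block until the ordered list arrives."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall((json.dumps({"rank": rank, "ip": ip}) + "\n").encode())
        return json.loads(_recv_line(sock))
```

In this sketch the node hosting the rendezvous would run `serve_rendezvous` in a background thread on an already listening socket and also call `report_and_wait` itself, so every rank ends up with an identical list regardless of the order in which nodes connect.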

Reviewed changes

Copilot reviewed 33 out of 34 changed files in this pull request and generated 7 comments.

Show a summary per file
| File | Description |
| --- | --- |
| `scripts/sglang_disagg/sglang_disagg_server.sh` | Adds KV transfer backend selection/validation and DeepSeek-R1 configs for the non-MoRI launcher. |
| `scripts/sglang_disagg/sglang_disagg_mori_io_ep.sh` | Moves runtime build-layer responsibilities into the image and improves benchmark failure propagation; adds KV backend note. |
| `scripts/sglang_disagg/run.sh` | New madengine bridge entrypoint; maps env/topology and performs in-container IP rendezvous + optional weight staging. |
| `scripts/sglang_disagg/README.MD` | Documents the madengine entrypoint flow, overlays, and updated parsing path. |
| `scripts/sglang_disagg/parse_to_csv.py` | Rewritten to extract a fuller metric set and emit madengine perf CSV rows per metric/config. |
| `scripts/sglang_disagg/ip_rendezvous.py` | New stdlib-only TCP rendezvous helper to reconstruct rank-ordered node IPs in-container. |
| `scripts/sglang_disagg/benchmark_xPyD.sh` | Makes the sweep more resilient (retries, fail-fast), aligns parsing/output with MAD_OUTPUT_CSV. |
| `scripts/primus_scaleout/megatron-lm/run.sh` | New Primus scaleout runner selecting model + precision combinations based on detected device. |
| `scripts/primus_scaleout/megatron-lm/primus_megatron-lm_benchmark_setup.sh` | New setup script for Primus benchmark prerequisites (tokenizers, repo prep). |
| `scripts/primus_scaleout/megatron-lm/primus_megatron-lm_benchmark_report.sh` | New Megatron-LM benchmark driver that launches training and writes perf CSVs. |
| `scripts/primus_scaleout/megatron-lm/primus_megatron-lm_benchmark_report.py` | New parser to convert Primus training logs into CSV metrics (incl. running-average variants). |
| `models.json` | Registers new SGLang disagg DeepSeek-R1 workload and Primus Megatron-LM scaleout workloads. |
| `docker/sglang_disagg_inference_full_overlay.ubuntu.amd.Dockerfile` | Adds merged “full overlay” image build (RCCL+MoRI+RIXL/NIXL+Mooncake) with sanity checks. |
| `docker/sglang_disagg_inference_full_overlay.oci-rdma62.ubuntu.amd.Dockerfile` | OCI variant that additionally bakes rdma-core v62 for specific host stacks. |
| `docker/primus_megatron_train_rccl_overlay.ubuntu.amd.Dockerfile` | Adds build-verified RCCL-from-source overlay + optional rdma-core source build for Primus training images. |
| `.gitignore` | Un-ignores checked-in skill manifest template JSON files under the skill assets path. |
| `.claude/skills/mad-slurm-multinode/SKILL.md` | Adds the new SLURM bootstrap + run skill with required inputs, workflow, and responsibilities. |
| `.claude/skills/mad-slurm-multinode/scripts/validate_manifest.sh` | Adds a GPU-free static manifest validator (placeholders, env consistency, asset resolution). |
| `.claude/skills/mad-slurm-multinode/scripts/preflight.sh` | Adds a preflight checker for docker/SLURM/git/python/conda/GPU SMI/HF token prerequisites. |
| `.claude/skills/mad-slurm-multinode/scripts/detect_cluster_env.sh` | Adds a node inspection helper to propose archetype-specific network/RDMA settings. |
| `.claude/skills/mad-slurm-multinode/references/manifests.md` | Adds manifest anatomy + fill checklist guidance (secrets handling, mounts, interface consistency). |
| `.claude/skills/mad-slurm-multinode/references/launch-and-results.md` | Adds run/aggregation guidance and failure triage for multi-node madengine runs. |
| `.claude/skills/mad-slurm-multinode/references/gotchas.md` | Adds cross-cutting + per-workload pitfalls and validated workarounds/expectations. |
| `.claude/skills/mad-slurm-multinode/references/deploy-bootstrap.md` | Adds detailed idempotent bootstrap steps for fresh nodes (clone, conda, install, env). |
| `.claude/skills/mad-slurm-multinode/references/cluster-types.md` | Documents archetype-specific transport settings (CX7/AINIC/Thor2) and validation guidance. |
| `.claude/skills/mad-slurm-multinode/examples/thor2-bnxt-walkthrough.md` | Adds sanitized end-to-end example for Broadcom Thor2 clusters. |
| `.claude/skills/mad-slurm-multinode/examples/cx7-roce-walkthrough.md` | Adds sanitized end-to-end example for CX7/Mellanox RoCE clusters. |
| `.claude/skills/mad-slurm-multinode/examples/amd-ainic-walkthrough.md` | Adds sanitized end-to-end example for AMD AINIC/Pollara clusters. |
| `.claude/skills/mad-slurm-multinode/assets/manifests/sglang_disagg_deepseek-r1.template.json` | Adds a filled-workload template manifest for SGLang disagg DeepSeek-R1. |
| `.claude/skills/mad-slurm-multinode/assets/manifests/primus_llama-3.1-8b.template.json` | Adds a filled-workload template manifest for Primus 8B scaleout. |
| `.claude/skills/mad-slurm-multinode/assets/manifests/primus_llama-3.1-70b.template.json` | Adds a filled-workload template manifest for Primus 70B scaleout. |
| `.claude/skills/mad-slurm-multinode/assets/mad.env/mad.env.thor2-bnxt.template` | Adds archetype mad.env template for Broadcom Thor2 environments. |
| `.claude/skills/mad-slurm-multinode/assets/mad.env/mad.env.cx7-roce.template` | Adds archetype mad.env template for CX7/Mellanox RoCE environments. |
| `.claude/skills/mad-slurm-multinode/assets/mad.env/mad.env.amd-ainic.template` | Adds archetype mad.env template for AMD AINIC/Pollara environments. |

Comments suppressed due to low confidence (1)

scripts/sglang_disagg/sglang_disagg_mori_io_ep.sh:191

  • KV_TRANSFER_BACKEND is appended into PREFILL_MODEL_CONFIG/DECODE_MODEL_CONFIG and later executed via eval, but it is not validated. This allows invalid backends and can enable shell-token injection via KV_TRANSFER_BACKEND.
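The usual remedy for this class of problem is to accept only values from a fixed allowlist before the string reaches anything that is evaluated. The Python sketch below illustrates the idea; the script itself is shell, where a `case` statement would play the same role. The set of backend names is an assumption based on the transports named in this PR, not a list taken from the script, and `validated_backend` is a hypothetical name.

```python
# Assumption: the accepted names are illustrative; the real script may accept
# a different set.
ALLOWED_BACKENDS = frozenset({"mooncake", "nixl", "mori"})


def validated_backend(raw):
    """Return raw unchanged if it is an allowed backend name, else raise.

    The comparison is against the exact string, so values that carry shell
    metacharacters (";", "$(", backticks, spaces) can never pass.
    """
    if raw not in ALLOWED_BACKENDS:
        allowed = ", ".join(sorted(ALLOWED_BACKENDS))
        raise ValueError(f"unsupported KV transfer backend {raw!r}; expected one of: {allowed}")
    return raw
```

Because the value is compared against a closed set rather than sanitized, there is no escaping logic to get wrong, and an invalid backend fails before any server process starts.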

Comment on lines +68 to +72
export xP="${xP:-${SGLANG_DISAGG_PREFILL_NODES:-1}}"
export yD="${yD:-${SGLANG_DISAGG_DECODE_NODES:-1}}"
export NNODES="${NNODES:-${SGLANG_DISAGG_TOTAL_NODES:-${WORLD_SIZE:-$((xP + yD))}}}"
export MASTER_PORT="${MASTER_PORT:-23731}"
export IO_EP_TP_SIZE="${IO_EP_TP_SIZE:-${SGLANG_TP_SIZE:-8}}"
Comment on lines +171 to 175
if dp_mode == "1":
    backend = "mori_dp"
elif run_mori == "1":
    backend = "mori_io"
else:
Comment on lines +5 to +9
RUN_LOG_JOB_ID="${SLURM_JOB_ID:-0}"
RUN_LOG_DIR="/run_logs/${RUN_LOG_JOB_ID}"
mkdir -p "$RUN_LOG_DIR" 2>/dev/null || true

LOG="${RUN_LOG_DIR}/benchmark_${RUN_LOG_JOB_ID}_${timestamp}_xP${xP}_yD${yD}_${MODEL_NAME}"

/tmp/run_logs is not NFS-bound. /run_logs should be an NFS-backed directory passed in through Docker environment variables; this is incorrect for disagg.
cc: @lcskrishna @basemam
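One possible shape for the requested behaviour is to take the log root from an environment variable that the container launcher sets to an NFS-backed mount, as in the Python sketch below. The variable name `RUN_LOG_ROOT` is hypothetical and does not come from this PR; the `SLURM_JOB_ID` fallback of `0` mirrors the quoted snippet.

```python
from pathlib import PurePosixPath


def resolve_run_log_dir(env, default_root="/run_logs"):
    """Build the per-job log directory from an environment mapping.

    Assumption: RUN_LOG_ROOT is a made-up variable that would point at a
    shared, NFS-backed directory bind-mounted into every container.
    """
    root = env.get("RUN_LOG_ROOT") or default_root
    job_id = env.get("SLURM_JOB_ID") or "0"
    return PurePosixPath(root) / job_id
```

Passing `os.environ` as `env` keeps the function easy to test while letting every node in a disaggregated run write to the same shared location.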

Comment on lines +30 to +37
# Parse named arguments
while [[ "$#" -gt 0 ]]; do
    case $1 in
        --model_repo) MODEL_REPO="$2"; shift ;;
        *) echo "Unknown parameter passed: $1"; usage ;;
    esac
    shift
done
Comment on lines +67 to +71
fi

# Run primus pytorch setup script
echo "Running setup script to download tokenizers"
bash ./primus_megatron-lm_benchmark_setup.sh -m $model
Comment on lines +29 to +36
# Parse named arguments
while [[ "$#" -gt 0 ]]; do
    case $1 in
        -m) MODEL_NAME="$2"; shift ;;
        *) echo "Unknown parameter passed: $1"; usage ;;
    esac
    shift
done
Comment on lines +40 to +41
cd /workspace/Primus
git pull
(No newline at end of file)

Why is this file added?

Core sglang_disagg is being changed; this needs a thorough review.

3 participants