Add SGLang disaggregated P/D inference and Primus Megatron-LM scaleout benchmarks + mad-slurm-multinode skill #174
…ooling

Introduce single full-overlay Dockerfiles for SGLang disaggregated prefill/decode inference that merge the RCCL, MoRI, and NIXL/Mooncake KV-transfer layers into one build (no base-image chaining), plus the supporting run and benchmark scripts.

Key changes:
- New Dockerfiles: `sglang_disagg_inference_full_overlay.ubuntu.amd.Dockerfile` and the `*.oci-rdma62.*` variant (rdma-core v62 baked in for OCI-CX7 hosts).
- `run.sh` native launcher with in-container rank-ordered node-IP discovery (`ip_rendezvous.py`), so `IPADDRS`/`SGLANG_NODE_IPS` need not be forwarded.
- `parse_to_csv.py` rewritten for comprehensive metric extraction (best-throughput iteration) into the madengine perf-CSV schema.
- Resilient benchmark sweep in `benchmark_xPyD.sh` (fail-fast + point retries) writing to the madengine-expected perf CSV path.
- Updated `sglang_disagg_mori_io_ep.sh` / `sglang_disagg_server.sh` and README.

Co-authored-by: Ilia Kosarev <Ilia.Kosarev@amd.com>
…ripts

Add a Primus Megatron-LM multi-node scaleout training image (candidate RCCL built from source) and the benchmark scripts that drive it and convert training output into the madengine perf-CSV format.

Key changes:
- New overlay Dockerfile `primus_megatron_train_rccl_overlay.ubuntu.amd.Dockerfile` on the `rocm/primus:v26.4` base (RCCL built from source + librccl smifix, optional rdma-core for Broadcom Thor2 bnxt_re).
- New scaleout scripts under `scripts/primus_scaleout/megatron-lm/`: `run.sh`, `primus_megatron-lm_benchmark_setup.sh`, `primus_megatron-lm_benchmark_report.sh`, and `primus_megatron-lm_benchmark_report.py` (covers Llama-3.1 8B/70B/405B).

Co-authored-by: Ilia Kosarev <Ilia.Kosarev@amd.com>
Introduce the mad-slurm-multinode skill, which deploys and runs madengine performance tests on an unprepared SLURM cluster from scratch and launches multi-node runs from manifest templates.

Key additions:
- `SKILL.md` plus reference docs (cluster-types, deploy-bootstrap, manifests, launch-and-results, gotchas) and helper scripts (`detect_cluster_env.sh`, `preflight.sh`, `validate_manifest.sh`).
- Cluster-agnostic `mad.env` templates for the CX7/Mellanox-RoCE, AMD-AINIC/Pollara, and Broadcom-Thor2-RoCE archetypes.
- Manifest templates for the supported workloads: `sglang_disagg_deepseek-r1` and Primus Megatron-LM scaleout `primus_llama-3.1-8b`/`-70b` (on `rocm/primus:v26.4`).
- Sanitized end-to-end example walkthroughs (no real node names, queues, or tokens) for each archetype.

Co-authored-by: Ilia Kosarev <Ilia.Kosarev@amd.com>
Add model entries in `models.json` for the new inference and training workloads.

Key additions:
- SGLang disaggregated DeepSeek-R1 inference configuration.
- Primus Megatron-LM scaleout training configurations for Llama-3.1 8B/70B/405B, each with its Dockerfile, scripts, and perf-result tracking.

Co-authored-by: Ilia Kosarev <Ilia.Kosarev@amd.com>
Pull request overview
This PR adds two new multi-node benchmark workloads (SGLang disaggregated prefill/decode inference and Primus Megatron-LM scaleout training) plus a new mad-slurm-multinode skill to bootstrap a fresh SLURM cluster and run supported workloads, along with corresponding models.json registrations and Docker overlay images.
Changes:
- Introduces a madengine-native SGLang disagg entrypoint (`scripts/sglang_disagg/run.sh`) with in-container rank-ordered IP discovery and updated benchmark parsing/CSV emission.
- Adds Primus Megatron-LM scaleout benchmark runner + reporting scripts and a build-verified RCCL overlay image based on `rocm/primus:v26.4`.
- Adds the `mad-slurm-multinode` skill documentation, templates, and static manifest validator to deploy/run these workloads on multiple SLURM cluster archetypes.
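The "benchmark parsing/CSV emission" step can be pictured with a minimal Python sketch: pick the best-throughput iteration and emit one CSV row per metric. This is an illustration only, not the PR's `parse_to_csv.py`. The column names (`model`, `performance`, `metric`), the metric keys, and the `output_throughput` selection key are assumptions made for the example.

```python
import csv
import io

# Assumed column layout for illustration; the real madengine schema may differ.
FIELDNAMES = ["model", "performance", "metric"]


def best_iteration(iterations):
    """Return the iteration dict with the highest output throughput."""
    return max(iterations, key=lambda it: it["output_throughput"])


def to_perf_rows(model, config, iterations):
    """Emit one row per metric, taken from the best-throughput iteration."""
    best = best_iteration(iterations)
    return [
        {"model": f"{model}_{config}", "performance": value, "metric": name}
        for name, value in best.items()
    ]


def write_csv(rows, stream):
    """Write header plus rows to any text stream (file or StringIO)."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical measurements for a 1-prefill / 1-decode configuration.
    runs = [
        {"output_throughput": 100.0, "ttft_ms": 50.0},
        {"output_throughput": 120.0, "ttft_ms": 60.0},
    ]
    buffer = io.StringIO()
    write_csv(to_perf_rows("deepseek-r1", "1P1D", runs), buffer)
    print(buffer.getvalue())
```

Taking every metric from the same iteration keeps a row set internally consistent, rather than mixing the best throughput from one run with the best latency from another.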
Reviewed changes
Copilot reviewed 33 out of 34 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| scripts/sglang_disagg/sglang_disagg_server.sh | Adds KV transfer backend selection/validation and DeepSeek-R1 configs for the non-MoRI launcher. |
| scripts/sglang_disagg/sglang_disagg_mori_io_ep.sh | Moves runtime build-layer responsibilities into the image and improves benchmark failure propagation; adds KV backend note. |
| scripts/sglang_disagg/run.sh | New madengine bridge entrypoint; maps env/topology and performs in-container IP rendezvous + optional weight staging. |
| scripts/sglang_disagg/README.MD | Documents the madengine entrypoint flow, overlays, and updated parsing path. |
| scripts/sglang_disagg/parse_to_csv.py | Rewritten to extract a fuller metric set and emit madengine perf CSV rows per metric/config. |
| scripts/sglang_disagg/ip_rendezvous.py | New stdlib-only TCP rendezvous helper to reconstruct rank-ordered node IPs in-container. |
| scripts/sglang_disagg/benchmark_xPyD.sh | Makes the sweep more resilient (retries, fail-fast), aligns parsing/output with MAD_OUTPUT_CSV. |
| scripts/primus_scaleout/megatron-lm/run.sh | New Primus scaleout runner selecting model + precision combinations based on detected device. |
| scripts/primus_scaleout/megatron-lm/primus_megatron-lm_benchmark_setup.sh | New setup script for Primus benchmark prerequisites (tokenizers, repo prep). |
| scripts/primus_scaleout/megatron-lm/primus_megatron-lm_benchmark_report.sh | New Megatron-LM benchmark driver that launches training and writes perf CSVs. |
| scripts/primus_scaleout/megatron-lm/primus_megatron-lm_benchmark_report.py | New parser to convert Primus training logs into CSV metrics (incl. running-average variants). |
| models.json | Registers new SGLang disagg DeepSeek-R1 workload and Primus Megatron-LM scaleout workloads. |
| docker/sglang_disagg_inference_full_overlay.ubuntu.amd.Dockerfile | Adds merged “full overlay” image build (RCCL+MoRI+RIXL/NIXL+Mooncake) with sanity checks. |
| docker/sglang_disagg_inference_full_overlay.oci-rdma62.ubuntu.amd.Dockerfile | OCI variant that additionally bakes rdma-core v62 for specific host stacks. |
| docker/primus_megatron_train_rccl_overlay.ubuntu.amd.Dockerfile | Adds build-verified RCCL-from-source overlay + optional rdma-core source build for Primus training images. |
| .gitignore | Un-ignores checked-in skill manifest template JSON files under the skill assets path. |
| .claude/skills/mad-slurm-multinode/SKILL.md | Adds the new SLURM bootstrap + run skill with required inputs, workflow, and responsibilities. |
| .claude/skills/mad-slurm-multinode/scripts/validate_manifest.sh | Adds a GPU-free static manifest validator (placeholders, env consistency, asset resolution). |
| .claude/skills/mad-slurm-multinode/scripts/preflight.sh | Adds a preflight checker for docker/SLURM/git/python/conda/GPU SMI/HF token prerequisites. |
| .claude/skills/mad-slurm-multinode/scripts/detect_cluster_env.sh | Adds a node inspection helper to propose archetype-specific network/RDMA settings. |
| .claude/skills/mad-slurm-multinode/references/manifests.md | Adds manifest anatomy + fill checklist guidance (secrets handling, mounts, interface consistency). |
| .claude/skills/mad-slurm-multinode/references/launch-and-results.md | Adds run/aggregation guidance and failure triage for multi-node madengine runs. |
| .claude/skills/mad-slurm-multinode/references/gotchas.md | Adds cross-cutting + per-workload pitfalls and validated workarounds/expectations. |
| .claude/skills/mad-slurm-multinode/references/deploy-bootstrap.md | Adds detailed idempotent bootstrap steps for fresh nodes (clone, conda, install, env). |
| .claude/skills/mad-slurm-multinode/references/cluster-types.md | Documents archetype-specific transport settings (CX7/AINIC/Thor2) and validation guidance. |
| .claude/skills/mad-slurm-multinode/examples/thor2-bnxt-walkthrough.md | Adds sanitized end-to-end example for Broadcom Thor2 clusters. |
| .claude/skills/mad-slurm-multinode/examples/cx7-roce-walkthrough.md | Adds sanitized end-to-end example for CX7/Mellanox RoCE clusters. |
| .claude/skills/mad-slurm-multinode/examples/amd-ainic-walkthrough.md | Adds sanitized end-to-end example for AMD AINIC/Pollara clusters. |
| .claude/skills/mad-slurm-multinode/assets/manifests/sglang_disagg_deepseek-r1.template.json | Adds a filled-workload template manifest for SGLang disagg DeepSeek-R1. |
| .claude/skills/mad-slurm-multinode/assets/manifests/primus_llama-3.1-8b.template.json | Adds a filled-workload template manifest for Primus 8B scaleout. |
| .claude/skills/mad-slurm-multinode/assets/manifests/primus_llama-3.1-70b.template.json | Adds a filled-workload template manifest for Primus 70B scaleout. |
| .claude/skills/mad-slurm-multinode/assets/mad.env/mad.env.thor2-bnxt.template | Adds archetype mad.env template for Broadcom Thor2 environments. |
| .claude/skills/mad-slurm-multinode/assets/mad.env/mad.env.cx7-roce.template | Adds archetype mad.env template for CX7/Mellanox RoCE environments. |
| .claude/skills/mad-slurm-multinode/assets/mad.env/mad.env.amd-ainic.template | Adds archetype mad.env template for AMD AINIC/Pollara environments. |
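
The table describes `ip_rendezvous.py` as a stdlib-only TCP rendezvous helper that reconstructs rank-ordered node IPs in-container. A minimal sketch of that general idea follows. It is not the PR's implementation: the function names, the JSON line protocol, and the choice of rank 0 as collector are all assumptions for illustration.

```python
import json
import socket
import threading


def serve(listener, world_size, own_ip):
    """Rank 0: collect one (rank, ip) report per peer, reply with the ordered list."""
    ips = {0: own_ip}
    peers = []
    while len(ips) < world_size:
        conn, _ = listener.accept()
        with conn.makefile("r") as reader:
            report = json.loads(reader.readline())
        ips[int(report["rank"])] = report["ip"]
        peers.append(conn)
    ordered = [ips[rank] for rank in range(world_size)]
    payload = (json.dumps(ordered) + "\n").encode()
    for conn in peers:
        conn.sendall(payload)
        conn.close()
    return ordered


def join(host, port, rank, own_ip):
    """Rank > 0: report own address, then block until the ordered list arrives."""
    with socket.create_connection((host, port), timeout=30) as conn:
        conn.sendall((json.dumps({"rank": rank, "ip": own_ip}) + "\n").encode())
        with conn.makefile("r") as reader:
            return json.loads(reader.readline())


def demo_on_localhost(world_size=3):
    """Run every rank as a thread on 127.0.0.1 to exercise the exchange."""
    listener = socket.create_server(("127.0.0.1", 0))  # port 0: OS picks a free port
    port = listener.getsockname()[1]
    seen = {}

    def peer(rank):
        # The 10.0.0.x addresses are made-up stand-ins for real node IPs.
        seen[rank] = join("127.0.0.1", port, rank, f"10.0.0.{rank}")

    threads = [threading.Thread(target=peer, args=(r,)) for r in range(1, world_size)]
    for thread in threads:
        thread.start()
    seen[0] = serve(listener, world_size, "10.0.0.0")
    for thread in threads:
        thread.join()
    listener.close()
    return seen


if __name__ == "__main__":
    print(demo_on_localhost())
```

Because every rank ends up with the same list indexed by rank, the launcher would not need the scheduler to forward an address list into the container. A production version would also need retries while rank 0 is still starting and a guard against duplicate ranks, both omitted here.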
Comments suppressed due to low confidence (1)
scripts/sglang_disagg/sglang_disagg_mori_io_ep.sh:191
- KV_TRANSFER_BACKEND is appended into PREFILL_MODEL_CONFIG/DECODE_MODEL_CONFIG and later executed via eval, but it is not validated. This allows invalid backends and can enable shell-token injection via KV_TRANSFER_BACKEND.
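
One common way to close this kind of gap is an allow-list check before the value reaches any string that is later evaluated. The sketch below shows the pattern in Python for brevity; in the shell launcher the equivalent would be a `case` statement. The backend names and the `--kv-backend` flag are placeholders assumed for the example, not values confirmed by the PR.

```python
import shlex

# Assumed set of accepted backends, for illustration only.
ALLOWED_KV_BACKENDS = frozenset({"mooncake", "nixl", "mori"})


def validated_kv_backend(raw):
    """Return the backend name if it is on the allow-list, else raise."""
    value = raw.strip()
    if value not in ALLOWED_KV_BACKENDS:
        raise ValueError(f"unsupported KV_TRANSFER_BACKEND: {value!r}")
    return value


def append_backend(config, raw):
    """Append the backend to a config string; the flag name is hypothetical."""
    backend = validated_kv_backend(raw)
    # Quoting is redundant after the allow-list check but is cheap defence in depth.
    return f"{config} --kv-backend {shlex.quote(backend)}"


if __name__ == "__main__":
    print(append_backend("--tp 8", "mooncake"))
    try:
        append_backend("--tp 8", "nixl; rm -rf /")
    except ValueError as err:
        print("rejected:", err)
```

Rejecting unknown values outright addresses both halves of the comment: invalid backends fail early, and shell metacharacters never reach `eval`.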
    export xP="${xP:-${SGLANG_DISAGG_PREFILL_NODES:-1}}"
    export yD="${yD:-${SGLANG_DISAGG_DECODE_NODES:-1}}"
    export NNODES="${NNODES:-${SGLANG_DISAGG_TOTAL_NODES:-${WORLD_SIZE:-$((xP + yD))}}}"
    export MASTER_PORT="${MASTER_PORT:-23731}"
    export IO_EP_TP_SIZE="${IO_EP_TP_SIZE:-${SGLANG_TP_SIZE:-8}}"
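
The nested `${A:-${B:-default}}` expansions above form a fallback chain: the first variable that is set and non-empty wins, otherwise the literal default applies. The Python sketch below mirrors that resolution order for the topology variables so the precedence is easy to check. The helper names `first_set` and `resolve_topology` are invented for this illustration.

```python
def first_set(env, *names, default):
    """Return the first non-empty value among names, like bash ${A:-${B:-default}}."""
    for name in names:
        value = env.get(name, "")
        if value != "":  # ':-' treats empty strings the same as unset
            return value
    return default


def resolve_topology(env):
    """Resolve prefill, decode, and total node counts in the same order as the script."""
    xp = int(first_set(env, "xP", "SGLANG_DISAGG_PREFILL_NODES", default="1"))
    yd = int(first_set(env, "yD", "SGLANG_DISAGG_DECODE_NODES", default="1"))
    nnodes = int(
        first_set(
            env,
            "NNODES",
            "SGLANG_DISAGG_TOTAL_NODES",
            "WORLD_SIZE",
            default=str(xp + yd),  # last resort: prefill + decode
        )
    )
    return {"xP": xp, "yD": yd, "NNODES": nnodes}


if __name__ == "__main__":
    print(resolve_topology({}))
    print(resolve_topology({"SGLANG_DISAGG_PREFILL_NODES": "2", "yD": "4", "WORLD_SIZE": "8"}))
```

Note that an explicit `NNODES`, `SGLANG_DISAGG_TOTAL_NODES`, or `WORLD_SIZE` overrides the `xP + yD` sum, so the two can disagree if they are set inconsistently.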
    if dp_mode == "1":
        backend = "mori_dp"
    elif run_mori == "1":
        backend = "mori_io"
    else:
    RUN_LOG_JOB_ID="${SLURM_JOB_ID:-0}"
    RUN_LOG_DIR="/run_logs/${RUN_LOG_JOB_ID}"
    mkdir -p "$RUN_LOG_DIR" 2>/dev/null || true

    LOG="${RUN_LOG_DIR}/benchmark_${RUN_LOG_JOB_ID}_${timestamp}_xP${xP}_yD${yD}_${MODEL_NAME}"
`/tmp/run_logs` is not NFS-bound. `/run_logs` should be an NFS-backed directory passed in through Docker environment variables; this is incorrect for disagg.
cc: @lcskrishna @basemam
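
One way to read this request: the log root comes from the environment, and the script fails loudly when the directory cannot be created or written instead of swallowing the error with `|| true`. A minimal Python sketch of that behaviour follows. `RUN_LOG_ROOT` is a hypothetical variable name chosen for the example, and the check cannot confirm that the path is actually NFS-backed; that still has to be guaranteed by the mount passed to the container.

```python
import os
from pathlib import Path


def resolve_run_log_dir(env):
    """Build <root>/<job id>, creating it and raising if it is not writable."""
    # RUN_LOG_ROOT is a made-up name; /run_logs matches the path in the snippet.
    root = env.get("RUN_LOG_ROOT") or "/run_logs"
    job_id = env.get("SLURM_JOB_ID") or "0"
    log_dir = Path(root) / job_id
    # No error suppression: a failed mkdir surfaces immediately as OSError.
    log_dir.mkdir(parents=True, exist_ok=True)
    if not os.access(log_dir, os.W_OK):
        raise PermissionError(f"run log directory is not writable: {log_dir}")
    return log_dir


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as shared_root:
        print(resolve_run_log_dir({"RUN_LOG_ROOT": shared_root, "SLURM_JOB_ID": "42"}))
```

With a shared root, logs from all prefill and decode nodes of one job land under the same job-ID directory, which is what a multi-node disaggregated run needs for later aggregation.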
    # Parse named arguments
    while [[ "$#" -gt 0 ]]; do
        case $1 in
            --model_repo) MODEL_REPO="$2"; shift ;;
            *) echo "Unknown parameter passed: $1"; usage ;;
        esac
        shift
    done
    fi

    # Run primus pytorch setup script
    echo "Running setup script to download tokenizers"
    bash ./primus_megatron-lm_benchmark_setup.sh -m $model
    # Parse named arguments
    while [[ "$#" -gt 0 ]]; do
        case $1 in
            -m) MODEL_NAME="$2"; shift ;;
            *) echo "Unknown parameter passed: $1"; usage ;;
        esac
        shift
    done
    cd /workspace/Primus
    git pull
Why is this file added?
Core sglang_disagg is being changed; this needs a thorough review.
Summary
Adds two new multi-node benchmark workloads (SGLang disaggregated prefill/decode inference and Primus Megatron-LM scaleout training) together with the `mad-slurm-multinode` skill that deploys and runs them on a fresh SLURM cluster, plus the corresponding `models.json` entries.

Branches off the current develop tip (Primus v26.4, #172); 4 commits, 34 files.

What's included

SGLang disaggregated P/D inference (1a3e80e)
- Single full-overlay Dockerfiles merging the RCCL, MoRI, and NIXL/Mooncake KV-transfer layers into one build (no base-image chaining): `docker/sglang_disagg_inference_full_overlay.ubuntu.amd.Dockerfile` and the `*.oci-rdma62.*` variant (rdma-core v62 baked in for OCI-CX7 hosts).
- `scripts/sglang_disagg/run.sh` native launcher with in-container rank-ordered node-IP discovery (`ip_rendezvous.py`).
- `parse_to_csv.py` rewritten for full metric extraction (best-throughput iteration) into the madengine perf-CSV schema.
- Resilient benchmark sweep in `benchmark_xPyD.sh` (fail-fast + per-point retries).

Primus Megatron-LM scaleout (348e360)
- `docker/primus_megatron_train_rccl_overlay.ubuntu.amd.Dockerfile` on the `rocm/primus:v26.4` base (RCCL from source + librccl smifix, optional rdma-core for Broadcom Thor2 bnxt_re).
- `scripts/primus_scaleout/megatron-lm/` (run / setup / report .sh + report .py), covering Llama-3.1 8B/70B/405B.

mad-slurm-multinode skill (cf07399)
- `SKILL.md` + reference docs (cluster-types, deploy-bootstrap, manifests, launch-and-results, gotchas) and helper scripts (detect_cluster_env, preflight, validate_manifest).
- Cluster-agnostic `mad.env` templates for CX7/Mellanox-RoCE, AMD-AINIC/Pollara, and Broadcom-Thor2-RoCE.
- Manifest templates for the supported workloads (`sglang_disagg_deepseek-r1`, primus `llama-3.1-8b`/`-70b` on v26.4).

Model configurations (91fbb75)
- `models.json`: SGLang disaggregated DeepSeek-R1 inference, and Primus Megatron-LM scaleout training for Llama-3.1 8B/70B/405B.

Notes / test plan
- … `rocm/primus:v26.4`; the RCCL-from-source overlay is version-sensitive, so a build + smoke run should be confirmed before merge.
- … the madengine schema).
- `mad.env` templates use `<FILL_...>` placeholders.
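
Since the templates ship with `<FILL_...>` placeholders, a static check for values that were never filled in is a natural pre-launch guard. The sketch below walks a parsed JSON manifest and reports every remaining placeholder with its location. It is an illustration of the idea, not the PR's `validate_manifest.sh`; the manifest keys and placeholder names in the demo are invented.

```python
import json
import re

# Matches tokens such as <FILL_IFNAME>; the exact naming rule is an assumption.
PLACEHOLDER = re.compile(r"<FILL_[A-Za-z0-9_]*>")


def find_placeholders(node, path="$"):
    """Return (json_path, placeholder) pairs for every unfilled string value."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_placeholders(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits += find_placeholders(value, f"{path}[{index}]")
    elif isinstance(node, str):
        hits += [(path, match) for match in PLACEHOLDER.findall(node)]
    return hits


if __name__ == "__main__":
    # Hypothetical manifest fragment with two values left unfilled.
    manifest = json.loads(
        '{"env": {"NCCL_SOCKET_IFNAME": "<FILL_IFNAME>", "HF_HOME": "/data/hf"},'
        ' "nodes": ["<FILL_NODE_0>", "n1"]}'
    )
    for location, token in find_placeholders(manifest):
        print(f"{location}: {token} still needs a value")
```

A check like this needs no GPU or cluster access, so it can run on a login node before any job is submitted.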