[vLLM/SGLang] multi-node by podkidyshev · Pull Request #918 · NVIDIA/cloudai

podkidyshev · 2026-06-09T15:59:16Z

Summary

Enable multi-node support for vLLM and SGLang

Test Plan

Automed CI
Manual runs

Additional Notes

coderabbitai · 2026-06-09T15:59:23Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a1b02d90-f20b-4084-bf9e-46f952689508

📥 Commits

Reviewing files that changed from the base of the PR and between 0a98469 and dd1d605.

📒 Files selected for processing (1)

src/cloudai/workloads/vllm/slurm_command_gen_strategy.py

📝 Walkthrough

Walkthrough

Generalizes multi-node aggregated/disaggregated serving, adds role-aware node sizing, Ray orchestration for vLLM, refactors Slurm script generation and healthcheck routing, provides multinode SBATCH examples, updates tests, and refreshes documentation and scenarios.

Changes

Multi-node LLM Serving Support

Layer / File(s)	Summary
Configuration and documentation `conf/experimental/sglang/test_scenario/sglang.toml`, `conf/experimental/vllm/test_scenario/vllm.toml`, `conf/experimental/vllm/test/vllm.toml`, `doc/workloads/sglang.rst`, `doc/workloads/vllm.rst`	Adds heavy multinode test scenarios and updates workload docs with GPU selection precedence, multi-node semantics, and healthcheck fields.
Core LLM serving infrastructure `src/cloudai/workloads/common/llm_serving.py`	Generalizes GPU-id splitting to arbitrary multi-node, adds `num_nodes`, per-role node-count validation, standardized role srun prefixes, node-setup exports, `serve_healthcheck`/`proxy_router_healthcheck`, `render_serve_launch`, and cleanup hooks.
SGLang distributed launch `src/cloudai/workloads/sglang/slurm_command_gen_strategy.py`, `tests/ref_data/sglang-*.sbatch`	Implements role-aware preambles with role-specific dist-init ports and overrides `render_serve_launch` for distributed SGLang launches; updates reference sbatch scripts for explicit node-role targeting and healthcheck waits.
vLLM Ray backend orchestration `src/cloudai/workloads/vllm/vllm.py`, `src/cloudai/workloads/vllm/slurm_command_gen_strategy.py`, `src/cloudai/workloads/vllm/__init__.py`, `tests/ref_data/vllm-*.sbatch`	Adds `VllmRayStartArgs`, Ray head/worker args, conditional Ray-backed serve command generation, in-job Ray cluster orchestration in Slurm scripts, Ray PID/cleanup handling, and constraint-check updates to account for per-role node multipliers.
Reference SBATCH scripts and fixtures `tests/ref_data/sglang-.sbatch`, `tests/ref_data/vllm-.sbatch`	Adds `sglang-multinode.sbatch` and `vllm-multinode.sbatch`; rewrites disaggregated scripts to deterministic `PREFILL_/DECODE_NODES` slices and explicit `--nodelist` scheduling.
Test coverage `tests/test_acceptance.py`, `tests/workloads/common/test_llm_serving.py`, `tests/workloads/sglang/test_command_gen_strategy_slurm.py`, `tests/workloads/vllm/test_command_gen_strategy_slurm.py`, `tests/workloads/vllm/test_workload.py`	Adds multinode acceptance variants, updates GPU-ID precedence tests, enforces `>2` disagg role-size requirements, adds vLLM Ray/healthcheck tests, and extends constraint-check unit tests for multinode cases.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

srivatsankrishnan
jeffnvidia

Poem

🐰 A rabbit's ode to distributed hops—
Rows of GPUs line up like carrots bright,
Ray heads hum, workers scatter to their spots,
Health checks tap each role until they're right,
Multi-node serving — a hop, a joy, a byte!

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title '[vLLM/SGLang] multi-node' directly describes the main feature addition—enabling multi-node support for both vLLM and SGLang workloads, which is the primary objective of this changeset.
Description check	✅ Passed	The PR description is related to the changeset, mentioning multi-node support enablement for vLLM and SGLang with test plans, though it could provide more detail about implementation specifics.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch ipod/llm-multi-node

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@conf/experimental/vllm/test/vllm-heavy.toml`:
- Around line 21-24: The git_repos entry currently pins vLLM to the branch name
commit = "main", which makes builds non-reproducible; update the commit field in
the [[git_repos]] block (the block containing url =
"https://github.com/vllm-project/vllm.git" and mount_as = "/vllm_repo") to a
specific commit hash string (replace "main" with the exact SHA1 commit hash for
the desired vLLM revision) so the repository checkout is deterministic.

In `@doc/workloads/sglang.rst`:
- Around line 144-161: The docs example is ambiguous about tensor-parallel size:
clarify that the tp setting in the scenario.toml (the [Tests.cmd_args.decode].tp
value) is a per-node TP size (each serving node started by sg_lang.launch_server
uses tp GPUs locally) rather than a cluster-wide span; update the text near the
multi-node example (id "sglang.multi_node", test_template_name "sglang",
CUDA_VISIBLE_DEVICES example) to explicitly state that tp is applied per-node
and how it interacts with CUDA_VISIBLE_DEVICES and number of nodes.

In `@doc/workloads/vllm.rst`:
- Around line 142-159: Update the multi-node vLLM example to clarify tensor
parallelism scope: explicitly state whether tensor_parallel_size in the
[[Tests]] block is cluster-wide or per-node, and either change the example value
tensor_parallel_size = 2 to tensor_parallel_size = 8 (to reflect using all 8
GPUs given CUDA_VISIBLE_DEVICES = "0,1,2,3" on two nodes) or add a short note
under the [Tests.cmd_args.decode] entry explaining that tensor_parallel_size is
cluster-wide (or per-node) and describing the implications for GPU utilization;
reference the tensor_parallel_size field, the CUDA_VISIBLE_DEVICES value, and
the vllm.multi_node test id when making the clarification.

In `@tests/ref_data/vllm-multinode.sbatch`:
- Around line 91-107: The cleanup() function currently SIGTERM's backgrounded
PIDs (SERVE_RAY_PID, SERVE_PID) but does not stop the Ray cluster started with
ray start --head; modify cleanup() to explicitly run ray stop --force (or ray
shutdown) against the head (use RAY_ADDRESS or SERVE_NODE:${SERVE_RAY_PORT} if
available) before/after killing SERVE_RAY_PID and SERVE_PID, ignore non-zero
exit codes, and ensure this is done even if vllm serve was exec'd so the Ray
head is deterministically shut down on test teardown.

In `@tests/test_acceptance.py`:
- Around line 636-650: The sglang-multinode test definition uses decode.tp=2 (in
SglangCmdArgs for "sglang-multinode") while the multi-node scenario TOMLs
(conf/experimental/sglang/test_scenario/sglang.toml and sglang-heavy.toml)
specify num_nodes=2 and tp=8, causing a mismatch; fix by aligning the
tensor-parallelism settings: either update the test in tests/test_acceptance.py
(the "sglang-multinode" SglangCmdArgs.decode.tp value and CUDA_VISIBLE_DEVICES)
to match tp=8 used by the scenario TOMLs, or change the TOML tp fields to 2 if 2
is intended, and add a brief comment documenting the chosen alignment so future
changes remain consistent.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 31eb0958-d1de-4cea-b76b-17ead0960dde

📥 Commits

Reviewing files that changed from the base of the PR and between d999b13 and 98bf489.

📒 Files selected for processing (25)

conf/experimental/sglang/test/sglang-heavy.toml
conf/experimental/sglang/test_scenario/sglang-heavy.toml
conf/experimental/sglang/test_scenario/sglang.toml
conf/experimental/vllm/test/vllm-heavy.toml
conf/experimental/vllm/test_scenario/vllm-heavy-perf.toml
conf/experimental/vllm/test_scenario/vllm-heavy.toml
conf/experimental/vllm/test_scenario/vllm.toml
doc/workloads/sglang.rst
doc/workloads/vllm.rst
src/cloudai/workloads/common/llm_serving.py
src/cloudai/workloads/sglang/slurm_command_gen_strategy.py
src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
src/cloudai/workloads/vllm/vllm.py
tests/ref_data/sglang-disagg-2nodes.sbatch
tests/ref_data/sglang-disagg.sbatch
tests/ref_data/sglang-multinode.sbatch
tests/ref_data/vllm-disagg-2nodes.sbatch
tests/ref_data/vllm-disagg.sbatch
tests/ref_data/vllm-multinode.sbatch
tests/ref_data/vllm.sbatch
tests/test_acceptance.py
tests/workloads/common/test_llm_serving.py
tests/workloads/sglang/test_command_gen_strategy_slurm.py
tests/workloads/vllm/test_command_gen_strategy_slurm.py
tests/workloads/vllm/test_workload.py

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

src/cloudai/workloads/vllm/vllm.py (1)
172-188: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep constraint_check() consistent with the shared disaggregated node planner.

For num_nodes > 2, this falls back to prefill_nodes = decode_nodes = 1 whenever the role sizes are unset, and it never checks that the two role counts sum to tr.nnodes. The common Slurm helper now rejects both cases, so an invalid multinode config can pass definition-time validation here and only fail later when the sbatch is generated. Reuse the shared role-count validation here or return False on the same conditions.
Suggested fix
         prefill_nodes = 1
         decode_nodes = 1
         if num_nodes > 2:
             prefill_nodes_value = self.cmd_args.prefill.num_nodes
             decode_nodes_value = self.cmd_args.decode.num_nodes
-            prefill_nodes = prefill_nodes_value if isinstance(prefill_nodes_value, int) else prefill_nodes
-            decode_nodes = decode_nodes_value if isinstance(decode_nodes_value, int) else decode_nodes
+            if not isinstance(prefill_nodes_value, int) or not isinstance(decode_nodes_value, int):
+                logging.error(
+                    "Disaggregated vLLM with %s nodes requires both prefill.num_nodes and decode.num_nodes.",
+                    num_nodes,
+                )
+                return False
+            if prefill_nodes_value + decode_nodes_value != num_nodes:
+                logging.error(
+                    "Configured prefill/decode nodes (%s + %s) must match allocated nodes (%s).",
+                    prefill_nodes_value,
+                    decode_nodes_value,
+                    num_nodes,
+                )
+                return False
+            prefill_nodes = prefill_nodes_value
+            decode_nodes = decode_nodes_value
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/vllm/vllm.py` around lines 172 - 188, The
constraint_check logic incorrectly defaults prefill_nodes and decode_nodes to 1
for num_nodes > 2 and never verifies that the role counts sum to the total
nodes; update the block handling num_nodes > 2 to (a) read prefill_nodes_value =
self.cmd_args.prefill.num_nodes and decode_nodes_value =
self.cmd_args.decode.num_nodes, (b) if either value is not an int or if
prefill_nodes_value + decode_nodes_value != num_nodes then return False, and (c)
otherwise set prefill_nodes and decode_nodes from those values and proceed to
call _validate_vllm_parallelism_constraints using calculate_prefill_gpu_ids and
calculate_decode_gpu_ids multiplied by the validated role counts; reference
calculate_prefill_gpu_ids, calculate_decode_gpu_ids, self.cmd_args.prefill,
self.cmd_args.decode, and _validate_vllm_parallelism_constraints when making the
changes.
tests/workloads/vllm/test_command_gen_strategy_slurm.py (1)
398-428: ⚠️ Potential issue | 🟠 Major

Disaggregated mode ignores custom VllmCmdArgs.healthcheck for role servers

src/cloudai/workloads/common/llm_serving.py hardcodes disaggregated role-server readiness checks to http://{prefill_host}:{prefill_port}/health and http://{decode_host}:{decode_port}/health; the compatibility logic only swaps /health ⇄ /healthcheck, so custom endpoints (e.g. /ready) will not be waited on.

The test suite does assert prefill/decode /health in disaggregated flows, but it doesn’t assert that a custom cmd_args.healthcheck is honored on those prefill/decode ports—only the router/proxy wait is exercised via proxy_router_healthcheck.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/workloads/vllm/test_command_gen_strategy_slurm.py` around lines 398 -
428, The disaggregated readiness-check logic in
src/cloudai/workloads/common/llm_serving.py currently hardcodes "/health" for
role-server endpoints and only applies the /health ⇄ /healthcheck compatibility
swap, so custom VllmCmdArgs.healthcheck values are ignored; update the code that
builds disaggregated prefill/decode wait_for_health URLs to use
vllm.cmd_args.healthcheck (falling back to "/health" if unset) and also add the
compatibility variant (if healthcheck ends with "/health" add "/healthcheck" and
vice versa) so both legacy and custom endpoints are waited on; look for
references to VllmCmdArgs.healthcheck, the disaggregated readiness construction
in the function/method that generates role-server health checks (used by
VllmSlurmCommandGenStrategy._gen_srun_command), and replace the hardcoded
"/health" occurrences with this computed health path + compatibility variant.
src/cloudai/workloads/common/llm_serving.py (1)
836-850: ⚠️ Potential issue | 🟠 Major

Fix disaggregated prefill/decode readiness endpoint mismatch with docs

doc/workloads/vllm.rst states that in disaggregated mode healthcheck controls the prefill/decode readiness endpoint, but the implementation hardcodes the prefill/decode wait URLs to ...:{prefill_port}/health and ...:{decode_port}/health (src/cloudai/workloads/common/llm_serving.py, lines 836-841). By contrast, aggregated mode correctly waits on ...:{serve_port}{self.tdef.cmd_args.healthcheck} (around lines 780-783).

While vllm/slurm_command_gen_strategy.py adds compatibility only between /health and /healthcheck, this does not cover arbitrary custom readiness paths—so the doc’s “custom paths are used as configured” guarantee can be broken for disaggregated prefill/decode waits.

Proxy/router behavior is consistent: it waits on proxy_router_healthcheck, which falls back to healthcheck only when proxy_healthcheck isn’t set (vllm/slurm_command_gen_strategy.py, lines ~129-133), matching the doc.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/common/llm_serving.py` around lines 836 - 850, The
disaggregated prefill/decode wait URLs are hardcoded to "/health" in
generate_wait_for_health_block calls; change them to use the configured
healthcheck path (same logic used by aggregated mode and
proxy_router_healthcheck) so custom readiness endpoints are respected. Update
the two URL entries that build
f"http://{self.disaggregated_role_host('prefill')}:{self.prefill_port}/health"
and f"http://{self.disaggregated_role_host('decode')}:{self.decode_port}/health"
to instead incorporate the configured health path (e.g., use
self.tdef.cmd_args.healthcheck or the same proxy_router_healthcheck fallback
logic) so generate_wait_for_health_block, disaggregated_role_host, prefill_port,
decode_port, serve_port and proxy_router_healthcheck all use the same
healthcheck resolution.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Around line 836-850: The disaggregated prefill/decode wait URLs are hardcoded
to "/health" in generate_wait_for_health_block calls; change them to use the
configured healthcheck path (same logic used by aggregated mode and
proxy_router_healthcheck) so custom readiness endpoints are respected. Update
the two URL entries that build
f"http://{self.disaggregated_role_host('prefill')}:{self.prefill_port}/health"
and f"http://{self.disaggregated_role_host('decode')}:{self.decode_port}/health"
to instead incorporate the configured health path (e.g., use
self.tdef.cmd_args.healthcheck or the same proxy_router_healthcheck fallback
logic) so generate_wait_for_health_block, disaggregated_role_host, prefill_port,
decode_port, serve_port and proxy_router_healthcheck all use the same
healthcheck resolution.

In `@src/cloudai/workloads/vllm/vllm.py`:
- Around line 172-188: The constraint_check logic incorrectly defaults
prefill_nodes and decode_nodes to 1 for num_nodes > 2 and never verifies that
the role counts sum to the total nodes; update the block handling num_nodes > 2
to (a) read prefill_nodes_value = self.cmd_args.prefill.num_nodes and
decode_nodes_value = self.cmd_args.decode.num_nodes, (b) if either value is not
an int or if prefill_nodes_value + decode_nodes_value != num_nodes then return
False, and (c) otherwise set prefill_nodes and decode_nodes from those values
and proceed to call _validate_vllm_parallelism_constraints using
calculate_prefill_gpu_ids and calculate_decode_gpu_ids multiplied by the
validated role counts; reference calculate_prefill_gpu_ids,
calculate_decode_gpu_ids, self.cmd_args.prefill, self.cmd_args.decode, and
_validate_vllm_parallelism_constraints when making the changes.

In `@tests/workloads/vllm/test_command_gen_strategy_slurm.py`:
- Around line 398-428: The disaggregated readiness-check logic in
src/cloudai/workloads/common/llm_serving.py currently hardcodes "/health" for
role-server endpoints and only applies the /health ⇄ /healthcheck compatibility
swap, so custom VllmCmdArgs.healthcheck values are ignored; update the code that
builds disaggregated prefill/decode wait_for_health URLs to use
vllm.cmd_args.healthcheck (falling back to "/health" if unset) and also add the
compatibility variant (if healthcheck ends with "/health" add "/healthcheck" and
vice versa) so both legacy and custom endpoints are waited on; look for
references to VllmCmdArgs.healthcheck, the disaggregated readiness construction
in the function/method that generates role-server health checks (used by
VllmSlurmCommandGenStrategy._gen_srun_command), and replace the hardcoded
"/health" occurrences with this computed health path + compatibility variant.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 7ffecac9-cc89-4616-8304-5504dd9d4de1

📥 Commits

Reviewing files that changed from the base of the PR and between 98bf489 and 61b5499.

📒 Files selected for processing (11)

doc/workloads/sglang.rst
doc/workloads/vllm.rst
src/cloudai/workloads/common/llm_serving.py
src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
src/cloudai/workloads/vllm/vllm.py
tests/ref_data/vllm-disagg-2nodes.sbatch
tests/ref_data/vllm-disagg.sbatch
tests/ref_data/vllm-multinode.sbatch
tests/ref_data/vllm.sbatch
tests/workloads/common/test_llm_serving.py
tests/workloads/vllm/test_command_gen_strategy_slurm.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Around line 654-657: The common role_server_healthcheck property currently
returns a hardcoded "/health", so disaggregated startup waits never use the
configured tdef.cmd_args.healthcheck; update role_server_healthcheck (the
`@property` on the class in src/cloudai/workloads/common/llm_serving.py) to return
self.tdef.cmd_args.healthcheck or "/health" (and preserve the vLLM compatibility
mapping where "/healthcheck" should be normalized to "/health" if you need the
same behavior), so non-vLLM strategies honor the configured healthcheck value.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 6cf8c2ad-3a40-4938-8e8a-62ff0049d782

📥 Commits

Reviewing files that changed from the base of the PR and between e884f15 and 56a91d6.

📒 Files selected for processing (13)

conf/experimental/vllm/test/vllm-heavy.toml
doc/workloads/sglang.rst
doc/workloads/vllm.rst
src/cloudai/workloads/common/llm_serving.py
src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
src/cloudai/workloads/vllm/vllm.py
tests/ref_data/sglang-multinode.sbatch
tests/ref_data/vllm-disagg-2nodes.sbatch
tests/ref_data/vllm-disagg.sbatch
tests/ref_data/vllm-multinode.sbatch
tests/test_acceptance.py
tests/workloads/vllm/test_command_gen_strategy_slurm.py
tests/workloads/vllm/test_workload.py

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/cloudai/workloads/common/llm_serving.py (1)

287-291: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

used_gpus_count() undercounts GPUs for multi-node disaggregated runs with explicit role gpu_ids.

On Line 290, the explicit prefill.gpu_ids/decode.gpu_ids path returns only per-node GPU totals. With multi-node disaggregation (prefill.num_nodes/decode.num_nodes > 1), this inflates tps-per-gpu because cluster-wide GPU usage is undercounted.

💡 Suggested fix

         prefill_gpu_ids = tdef.cmd_args.prefill.gpu_ids
         decode_gpu_ids = tdef.cmd_args.decode.gpu_ids
         if prefill_gpu_ids and decode_gpu_ids:
-            return len(parse_gpu_ids(prefill_gpu_ids)) + len(parse_gpu_ids(decode_gpu_ids))
+            prefill_nodes = tdef.cmd_args.prefill.num_nodes if isinstance(tdef.cmd_args.prefill.num_nodes, int) else 1
+            decode_nodes = tdef.cmd_args.decode.num_nodes if isinstance(tdef.cmd_args.decode.num_nodes, int) else 1
+            return (
+                len(parse_gpu_ids(prefill_gpu_ids)) * prefill_nodes
+                + len(parse_gpu_ids(decode_gpu_ids)) * decode_nodes
+            )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/common/llm_serving.py` around lines 287 - 291,
used_gpus_count() currently sums parse_gpu_ids(prefill.gpu_ids) and
parse_gpu_ids(decode.gpu_ids) but only returns per-node GPU counts when explicit
gpu_ids are provided; for multi-node disaggregated runs you must multiply the
per-node GPU count by the corresponding num_nodes
(tdef.cmd_args.prefill.num_nodes and tdef.cmd_args.decode.num_nodes) before
summing. Update the explicit-path in used_gpus_count to compute
len(parse_gpu_ids(prefill.gpu_ids)) * max(1, tdef.cmd_args.prefill.num_nodes)
and similarly for decode, then return their sum so cluster-wide GPU usage is
correctly counted; reference the symbols used_gpus_count,
tdef.cmd_args.prefill.gpu_ids, tdef.cmd_args.prefill.num_nodes,
tdef.cmd_args.decode.gpu_ids, tdef.cmd_args.decode.num_nodes, and parse_gpu_ids.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/vllm/slurm_command_gen_strategy.py`:
- Around line 57-75: The serializer currently leaves non-dict scalar strings
unquoted (in _format_ray_value and used by _serialize_ray_start_args), causing
shell-splitting when embedding the generated args into bash -lc; update
_format_ray_value to shlex.quote string instances (not only dicts) before
returning, ensuring string scalars are shell-quoted when
_serialize_ray_start_args builds the "--key=value" segments for ray start (leave
dict handling as-is and preserve boolean handling in _serialize_ray_start_args).

In `@src/cloudai/workloads/vllm/vllm.py`:
- Around line 191-223: The validator rejects the explicit single-node
disaggregated config prefill.num_nodes=1 and decode.num_nodes=1 because
prefill_nodes_value + decode_nodes_value != num_nodes; update the check in the
vLLM validation (using prefill_nodes_value, decode_nodes_value, num_nodes,
prefill_nodes, decode_nodes) to allow the special-case when num_nodes == 1 and
both role counts are 1 (treat it as valid and set prefill_nodes/decode_nodes
accordingly); concretely, change the sum-check condition to skip the error if
(num_nodes == 1 && prefill_nodes_value == 1 && decode_nodes_value == 1) so the
explicit 1/1 case is accepted.

In `@tests/workloads/vllm/test_command_gen_strategy_slurm.py`:
- Around line 399-429: The test is letting callers override Ray topology flags
(head/address/block/port) via VllmRayStartArgs, which we must prevent; update
VllmSlurmCommandGenStrategy._gen_srun_command to ignore topology-defining fields
from user-provided VllmRayStartArgs and only honor resource/telemetry knobs
(num_gpus, num_cpus, dashboard_host, disable_usage_stats) when building the "ray
start" srun command; specifically, filter the incoming VllmRayStartArgs for
ray_head and ray_worker to drop/override head, address, block, and port before
rendering flags so orchestration controls topology while still allowing
num_gpus/num_cpus/dashboard_host/disable_usage_stats to be applied.

---

Outside diff comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Around line 287-291: used_gpus_count() currently sums
parse_gpu_ids(prefill.gpu_ids) and parse_gpu_ids(decode.gpu_ids) but only
returns per-node GPU counts when explicit gpu_ids are provided; for multi-node
disaggregated runs you must multiply the per-node GPU count by the corresponding
num_nodes (tdef.cmd_args.prefill.num_nodes and tdef.cmd_args.decode.num_nodes)
before summing. Update the explicit-path in used_gpus_count to compute
len(parse_gpu_ids(prefill.gpu_ids)) * max(1, tdef.cmd_args.prefill.num_nodes)
and similarly for decode, then return their sum so cluster-wide GPU usage is
correctly counted; reference the symbols used_gpus_count,
tdef.cmd_args.prefill.gpu_ids, tdef.cmd_args.prefill.num_nodes,
tdef.cmd_args.decode.gpu_ids, tdef.cmd_args.decode.num_nodes, and parse_gpu_ids.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f3a123c8-d2cf-4eae-802d-0731d342443a

📥 Commits

Reviewing files that changed from the base of the PR and between 56a91d6 and 7df6955.

📒 Files selected for processing (22)

conf/experimental/sglang/test_scenario/sglang-heavy.toml
conf/experimental/sglang/test_scenario/sglang.toml
conf/experimental/vllm/test/vllm.toml
conf/experimental/vllm/test_scenario/vllm-heavy.toml
conf/experimental/vllm/test_scenario/vllm.toml
doc/workloads/sglang.rst
doc/workloads/vllm.rst
src/cloudai/workloads/common/llm_serving.py
src/cloudai/workloads/sglang/slurm_command_gen_strategy.py
src/cloudai/workloads/vllm/__init__.py
src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
src/cloudai/workloads/vllm/vllm.py
tests/ref_data/sglang-disagg-2nodes.sbatch
tests/ref_data/sglang-disagg.sbatch
tests/ref_data/sglang-multinode.sbatch
tests/ref_data/vllm-disagg-2nodes.sbatch
tests/ref_data/vllm-disagg.sbatch
tests/ref_data/vllm-multinode.sbatch
tests/test_acceptance.py
tests/workloads/sglang/test_command_gen_strategy_slurm.py
tests/workloads/vllm/test_command_gen_strategy_slurm.py
tests/workloads/vllm/test_workload.py

implement multi-node for vllm/sglang

ff43106

podkidyshev self-assigned this Jun 9, 2026

podkidyshev added the enhancement New feature or request label Jun 9, 2026

podkidyshev added 11 commits June 9, 2026 19:08

expand set of test configs to cover multi-node setups

7e36158

fix ray readiness probe

edf2afd

update vllm healthchecking

0996164

add heavy vllm test cases

c9f3ec7

fix heavy vllm config

0857352

further improve configs

9abedcf

set proper gpu ids

1111fd7

heavy perf vllm conf

7503613

added sglang heavy config

5e4cea5

fix sglang multi-node

4c535d6

fix unittest

98bf489

podkidyshev marked this pull request as ready for review June 11, 2026 10:52

podkidyshev requested review from jeffnvidia and srivatsankrishnan as code owners June 11, 2026 10:52

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread conf/experimental/vllm/test/vllm-heavy.toml Outdated

Comment thread doc/workloads/sglang.rst Outdated

Comment thread doc/workloads/vllm.rst Outdated

Comment thread tests/ref_data/vllm-multinode.sbatch

Comment thread tests/test_acceptance.py

podkidyshev added 2 commits June 11, 2026 13:25

backwards compatibility preserved

61b5499

remove redundant tests

e884f15

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Address LLM multi-node review comments

56a91d6

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread src/cloudai/workloads/common/llm_serving.py Outdated

podkidyshev force-pushed the ipod/llm-multi-node branch 4 times, most recently from 41a8eab to 62bf7cf Compare June 11, 2026 17:00

Keep vLLM healthcheck compatibility

9ea8e17

podkidyshev force-pushed the ipod/llm-multi-node branch from 62bf7cf to 9ea8e17 Compare June 11, 2026 18:18

podkidyshev added 8 commits June 11, 2026 20:50

Make standard LLM scenarios use four visible GPUs

d4b53de

Reference regular LLM workloads from heavy scenarios

675e1b4

Keep heavy LLM scenarios to eight-node topology

f4d3661

remove redundant vllm-heavy-perf

c379044

Fix vLLM scenario healthcheck and GPU visibility

b11155c

add explicit ray sections

a5fe32e

simplify ray args model

4142cf1

minor issues and docs fixes

7df6955

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

Comment thread src/cloudai/workloads/vllm/slurm_command_gen_strategy.py

Comment thread src/cloudai/workloads/vllm/vllm.py

Comment thread tests/workloads/vllm/test_command_gen_strategy_slurm.py

podkidyshev added 2 commits June 12, 2026 14:37

resolve ai comments

0a98469

revert quotation

dd1d605

amaslenn approved these changes Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[vLLM/SGLang] multi-node#918

[vLLM/SGLang] multi-node#918
podkidyshev wants to merge 26 commits into
mainfrom
ipod/llm-multi-node

podkidyshev commented Jun 9, 2026

Uh oh!

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

podkidyshev commented Jun 9, 2026

Summary

Test Plan

Additional Notes

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading