Skip to content

[vLLM/SGLang] multi-node#918

Open
podkidyshev wants to merge 26 commits into
mainfrom
ipod/llm-multi-node
Open

[vLLM/SGLang] multi-node#918
podkidyshev wants to merge 26 commits into
mainfrom
ipod/llm-multi-node

Conversation

@podkidyshev

Copy link
Copy Markdown
Contributor

Summary

  • Enable multi-node support for vLLM and SGLang

Test Plan

  • Automed CI
  • Manual runs

Additional Notes

@podkidyshev podkidyshev self-assigned this Jun 9, 2026
@podkidyshev podkidyshev added the enhancement New feature or request label Jun 9, 2026
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a1b02d90-f20b-4084-bf9e-46f952689508

📥 Commits

Reviewing files that changed from the base of the PR and between 0a98469 and dd1d605.

📒 Files selected for processing (1)
  • src/cloudai/workloads/vllm/slurm_command_gen_strategy.py

📝 Walkthrough

Walkthrough

Generalizes multi-node aggregated/disaggregated serving, adds role-aware node sizing, Ray orchestration for vLLM, refactors Slurm script generation and healthcheck routing, provides multinode SBATCH examples, updates tests, and refreshes documentation and scenarios.

Changes

Multi-node LLM Serving Support

Layer / File(s) Summary
Configuration and documentation
conf/experimental/sglang/test_scenario/sglang*.toml, conf/experimental/vllm/test_scenario/vllm*.toml, conf/experimental/vllm/test/vllm.toml, doc/workloads/sglang.rst, doc/workloads/vllm.rst
Adds heavy multinode test scenarios and updates workload docs with GPU selection precedence, multi-node semantics, and healthcheck fields.
Core LLM serving infrastructure
src/cloudai/workloads/common/llm_serving.py
Generalizes GPU-id splitting to arbitrary multi-node, adds num_nodes, per-role node-count validation, standardized role srun prefixes, node-setup exports, serve_healthcheck/proxy_router_healthcheck, render_serve_launch, and cleanup hooks.
SGLang distributed launch
src/cloudai/workloads/sglang/slurm_command_gen_strategy.py, tests/ref_data/sglang-*.sbatch
Implements role-aware preambles with role-specific dist-init ports and overrides render_serve_launch for distributed SGLang launches; updates reference sbatch scripts for explicit node-role targeting and healthcheck waits.
vLLM Ray backend orchestration
src/cloudai/workloads/vllm/vllm.py, src/cloudai/workloads/vllm/slurm_command_gen_strategy.py, src/cloudai/workloads/vllm/__init__.py, tests/ref_data/vllm-*.sbatch
Adds VllmRayStartArgs, Ray head/worker args, conditional Ray-backed serve command generation, in-job Ray cluster orchestration in Slurm scripts, Ray PID/cleanup handling, and constraint-check updates to account for per-role node multipliers.
Reference SBATCH scripts and fixtures
tests/ref_data/sglang-*.sbatch, tests/ref_data/vllm-*.sbatch
Adds sglang-multinode.sbatch and vllm-multinode.sbatch; rewrites disaggregated scripts to deterministic PREFILL_/DECODE_NODES slices and explicit --nodelist scheduling.
Test coverage
tests/test_acceptance.py, tests/workloads/common/test_llm_serving.py, tests/workloads/sglang/test_command_gen_strategy_slurm.py, tests/workloads/vllm/test_command_gen_strategy_slurm.py, tests/workloads/vllm/test_workload.py
Adds multinode acceptance variants, updates GPU-ID precedence tests, enforces >2 disagg role-size requirements, adds vLLM Ray/healthcheck tests, and extends constraint-check unit tests for multinode cases.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • srivatsankrishnan
  • jeffnvidia

Poem

🐰 A rabbit's ode to distributed hops—
Rows of GPUs line up like carrots bright,
Ray heads hum, workers scatter to their spots,
Health checks tap each role until they're right,
Multi-node serving — a hop, a joy, a byte!

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title '[vLLM/SGLang] multi-node' directly describes the main feature addition—enabling multi-node support for both vLLM and SGLang workloads, which is the primary objective of this changeset.
Description check ✅ Passed The PR description is related to the changeset, mentioning multi-node support enablement for vLLM and SGLang with test plans, though it could provide more detail about implementation specifics.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ipod/llm-multi-node

Comment @coderabbitai help to get the list of available commands and usage tips.

@podkidyshev podkidyshev marked this pull request as ready for review June 11, 2026 10:52

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@conf/experimental/vllm/test/vllm-heavy.toml`:
- Around line 21-24: The git_repos entry currently pins vLLM to the branch name
commit = "main", which makes builds non-reproducible; update the commit field in
the [[git_repos]] block (the block containing url =
"https://github.com/vllm-project/vllm.git" and mount_as = "/vllm_repo") to a
specific commit hash string (replace "main" with the exact SHA1 commit hash for
the desired vLLM revision) so the repository checkout is deterministic.

In `@doc/workloads/sglang.rst`:
- Around line 144-161: The docs example is ambiguous about tensor-parallel size:
clarify that the tp setting in the scenario.toml (the [Tests.cmd_args.decode].tp
value) is a per-node TP size (each serving node started by sg_lang.launch_server
uses tp GPUs locally) rather than a cluster-wide span; update the text near the
multi-node example (id "sglang.multi_node", test_template_name "sglang",
CUDA_VISIBLE_DEVICES example) to explicitly state that tp is applied per-node
and how it interacts with CUDA_VISIBLE_DEVICES and number of nodes.

In `@doc/workloads/vllm.rst`:
- Around line 142-159: Update the multi-node vLLM example to clarify tensor
parallelism scope: explicitly state whether tensor_parallel_size in the
[[Tests]] block is cluster-wide or per-node, and either change the example value
tensor_parallel_size = 2 to tensor_parallel_size = 8 (to reflect using all 8
GPUs given CUDA_VISIBLE_DEVICES = "0,1,2,3" on two nodes) or add a short note
under the [Tests.cmd_args.decode] entry explaining that tensor_parallel_size is
cluster-wide (or per-node) and describing the implications for GPU utilization;
reference the tensor_parallel_size field, the CUDA_VISIBLE_DEVICES value, and
the vllm.multi_node test id when making the clarification.

In `@tests/ref_data/vllm-multinode.sbatch`:
- Around line 91-107: The cleanup() function currently SIGTERM's backgrounded
PIDs (SERVE_RAY_PID, SERVE_PID) but does not stop the Ray cluster started with
ray start --head; modify cleanup() to explicitly run ray stop --force (or ray
shutdown) against the head (use RAY_ADDRESS or SERVE_NODE:${SERVE_RAY_PORT} if
available) before/after killing SERVE_RAY_PID and SERVE_PID, ignore non-zero
exit codes, and ensure this is done even if vllm serve was exec'd so the Ray
head is deterministically shut down on test teardown.

In `@tests/test_acceptance.py`:
- Around line 636-650: The sglang-multinode test definition uses decode.tp=2 (in
SglangCmdArgs for "sglang-multinode") while the multi-node scenario TOMLs
(conf/experimental/sglang/test_scenario/sglang.toml and sglang-heavy.toml)
specify num_nodes=2 and tp=8, causing a mismatch; fix by aligning the
tensor-parallelism settings: either update the test in tests/test_acceptance.py
(the "sglang-multinode" SglangCmdArgs.decode.tp value and CUDA_VISIBLE_DEVICES)
to match tp=8 used by the scenario TOMLs, or change the TOML tp fields to 2 if 2
is intended, and add a brief comment documenting the chosen alignment so future
changes remain consistent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 31eb0958-d1de-4cea-b76b-17ead0960dde

📥 Commits

Reviewing files that changed from the base of the PR and between d999b13 and 98bf489.

📒 Files selected for processing (25)
  • conf/experimental/sglang/test/sglang-heavy.toml
  • conf/experimental/sglang/test_scenario/sglang-heavy.toml
  • conf/experimental/sglang/test_scenario/sglang.toml
  • conf/experimental/vllm/test/vllm-heavy.toml
  • conf/experimental/vllm/test_scenario/vllm-heavy-perf.toml
  • conf/experimental/vllm/test_scenario/vllm-heavy.toml
  • conf/experimental/vllm/test_scenario/vllm.toml
  • doc/workloads/sglang.rst
  • doc/workloads/vllm.rst
  • src/cloudai/workloads/common/llm_serving.py
  • src/cloudai/workloads/sglang/slurm_command_gen_strategy.py
  • src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
  • src/cloudai/workloads/vllm/vllm.py
  • tests/ref_data/sglang-disagg-2nodes.sbatch
  • tests/ref_data/sglang-disagg.sbatch
  • tests/ref_data/sglang-multinode.sbatch
  • tests/ref_data/vllm-disagg-2nodes.sbatch
  • tests/ref_data/vllm-disagg.sbatch
  • tests/ref_data/vllm-multinode.sbatch
  • tests/ref_data/vllm.sbatch
  • tests/test_acceptance.py
  • tests/workloads/common/test_llm_serving.py
  • tests/workloads/sglang/test_command_gen_strategy_slurm.py
  • tests/workloads/vllm/test_command_gen_strategy_slurm.py
  • tests/workloads/vllm/test_workload.py

Comment thread conf/experimental/vllm/test/vllm-heavy.toml Outdated
Comment thread doc/workloads/sglang.rst Outdated
Comment thread doc/workloads/vllm.rst Outdated
Comment thread tests/ref_data/vllm-multinode.sbatch
Comment thread tests/test_acceptance.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/cloudai/workloads/vllm/vllm.py (1)

172-188: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep constraint_check() consistent with the shared disaggregated node planner.

For num_nodes > 2, this falls back to prefill_nodes = decode_nodes = 1 whenever the role sizes are unset, and it never checks that the two role counts sum to tr.nnodes. The common Slurm helper now rejects both cases, so an invalid multinode config can pass definition-time validation here and only fail later when the sbatch is generated. Reuse the shared role-count validation here or return False on the same conditions.

Suggested fix
         prefill_nodes = 1
         decode_nodes = 1
         if num_nodes > 2:
             prefill_nodes_value = self.cmd_args.prefill.num_nodes
             decode_nodes_value = self.cmd_args.decode.num_nodes
-            prefill_nodes = prefill_nodes_value if isinstance(prefill_nodes_value, int) else prefill_nodes
-            decode_nodes = decode_nodes_value if isinstance(decode_nodes_value, int) else decode_nodes
+            if not isinstance(prefill_nodes_value, int) or not isinstance(decode_nodes_value, int):
+                logging.error(
+                    "Disaggregated vLLM with %s nodes requires both prefill.num_nodes and decode.num_nodes.",
+                    num_nodes,
+                )
+                return False
+            if prefill_nodes_value + decode_nodes_value != num_nodes:
+                logging.error(
+                    "Configured prefill/decode nodes (%s + %s) must match allocated nodes (%s).",
+                    prefill_nodes_value,
+                    decode_nodes_value,
+                    num_nodes,
+                )
+                return False
+            prefill_nodes = prefill_nodes_value
+            decode_nodes = decode_nodes_value
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/vllm/vllm.py` around lines 172 - 188, The
constraint_check logic incorrectly defaults prefill_nodes and decode_nodes to 1
for num_nodes > 2 and never verifies that the role counts sum to the total
nodes; update the block handling num_nodes > 2 to (a) read prefill_nodes_value =
self.cmd_args.prefill.num_nodes and decode_nodes_value =
self.cmd_args.decode.num_nodes, (b) if either value is not an int or if
prefill_nodes_value + decode_nodes_value != num_nodes then return False, and (c)
otherwise set prefill_nodes and decode_nodes from those values and proceed to
call _validate_vllm_parallelism_constraints using calculate_prefill_gpu_ids and
calculate_decode_gpu_ids multiplied by the validated role counts; reference
calculate_prefill_gpu_ids, calculate_decode_gpu_ids, self.cmd_args.prefill,
self.cmd_args.decode, and _validate_vllm_parallelism_constraints when making the
changes.
tests/workloads/vllm/test_command_gen_strategy_slurm.py (1)

398-428: ⚠️ Potential issue | 🟠 Major

Disaggregated mode ignores custom VllmCmdArgs.healthcheck for role servers

  • src/cloudai/workloads/common/llm_serving.py hardcodes disaggregated role-server readiness checks to http://{prefill_host}:{prefill_port}/health and http://{decode_host}:{decode_port}/health; the compatibility logic only swaps /health/healthcheck, so custom endpoints (e.g. /ready) will not be waited on.
  • The test suite does assert prefill/decode /health in disaggregated flows, but it doesn’t assert that a custom cmd_args.healthcheck is honored on those prefill/decode ports—only the router/proxy wait is exercised via proxy_router_healthcheck.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/workloads/vllm/test_command_gen_strategy_slurm.py` around lines 398 -
428, The disaggregated readiness-check logic in
src/cloudai/workloads/common/llm_serving.py currently hardcodes "/health" for
role-server endpoints and only applies the /health ⇄ /healthcheck compatibility
swap, so custom VllmCmdArgs.healthcheck values are ignored; update the code that
builds disaggregated prefill/decode wait_for_health URLs to use
vllm.cmd_args.healthcheck (falling back to "/health" if unset) and also add the
compatibility variant (if healthcheck ends with "/health" add "/healthcheck" and
vice versa) so both legacy and custom endpoints are waited on; look for
references to VllmCmdArgs.healthcheck, the disaggregated readiness construction
in the function/method that generates role-server health checks (used by
VllmSlurmCommandGenStrategy._gen_srun_command), and replace the hardcoded
"/health" occurrences with this computed health path + compatibility variant.
src/cloudai/workloads/common/llm_serving.py (1)

836-850: ⚠️ Potential issue | 🟠 Major

Fix disaggregated prefill/decode readiness endpoint mismatch with docs

doc/workloads/vllm.rst states that in disaggregated mode healthcheck controls the prefill/decode readiness endpoint, but the implementation hardcodes the prefill/decode wait URLs to ...:{prefill_port}/health and ...:{decode_port}/health (src/cloudai/workloads/common/llm_serving.py, lines 836-841). By contrast, aggregated mode correctly waits on ...:{serve_port}{self.tdef.cmd_args.healthcheck} (around lines 780-783).

While vllm/slurm_command_gen_strategy.py adds compatibility only between /health and /healthcheck, this does not cover arbitrary custom readiness paths—so the doc’s “custom paths are used as configured” guarantee can be broken for disaggregated prefill/decode waits.

Proxy/router behavior is consistent: it waits on proxy_router_healthcheck, which falls back to healthcheck only when proxy_healthcheck isn’t set (vllm/slurm_command_gen_strategy.py, lines ~129-133), matching the doc.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/common/llm_serving.py` around lines 836 - 850, The
disaggregated prefill/decode wait URLs are hardcoded to "/health" in
generate_wait_for_health_block calls; change them to use the configured
healthcheck path (same logic used by aggregated mode and
proxy_router_healthcheck) so custom readiness endpoints are respected. Update
the two URL entries that build
f"http://{self.disaggregated_role_host('prefill')}:{self.prefill_port}/health"
and f"http://{self.disaggregated_role_host('decode')}:{self.decode_port}/health"
to instead incorporate the configured health path (e.g., use
self.tdef.cmd_args.healthcheck or the same proxy_router_healthcheck fallback
logic) so generate_wait_for_health_block, disaggregated_role_host, prefill_port,
decode_port, serve_port and proxy_router_healthcheck all use the same
healthcheck resolution.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Around line 836-850: The disaggregated prefill/decode wait URLs are hardcoded
to "/health" in generate_wait_for_health_block calls; change them to use the
configured healthcheck path (same logic used by aggregated mode and
proxy_router_healthcheck) so custom readiness endpoints are respected. Update
the two URL entries that build
f"http://{self.disaggregated_role_host('prefill')}:{self.prefill_port}/health"
and f"http://{self.disaggregated_role_host('decode')}:{self.decode_port}/health"
to instead incorporate the configured health path (e.g., use
self.tdef.cmd_args.healthcheck or the same proxy_router_healthcheck fallback
logic) so generate_wait_for_health_block, disaggregated_role_host, prefill_port,
decode_port, serve_port and proxy_router_healthcheck all use the same
healthcheck resolution.

In `@src/cloudai/workloads/vllm/vllm.py`:
- Around line 172-188: The constraint_check logic incorrectly defaults
prefill_nodes and decode_nodes to 1 for num_nodes > 2 and never verifies that
the role counts sum to the total nodes; update the block handling num_nodes > 2
to (a) read prefill_nodes_value = self.cmd_args.prefill.num_nodes and
decode_nodes_value = self.cmd_args.decode.num_nodes, (b) if either value is not
an int or if prefill_nodes_value + decode_nodes_value != num_nodes then return
False, and (c) otherwise set prefill_nodes and decode_nodes from those values
and proceed to call _validate_vllm_parallelism_constraints using
calculate_prefill_gpu_ids and calculate_decode_gpu_ids multiplied by the
validated role counts; reference calculate_prefill_gpu_ids,
calculate_decode_gpu_ids, self.cmd_args.prefill, self.cmd_args.decode, and
_validate_vllm_parallelism_constraints when making the changes.

In `@tests/workloads/vllm/test_command_gen_strategy_slurm.py`:
- Around line 398-428: The disaggregated readiness-check logic in
src/cloudai/workloads/common/llm_serving.py currently hardcodes "/health" for
role-server endpoints and only applies the /health ⇄ /healthcheck compatibility
swap, so custom VllmCmdArgs.healthcheck values are ignored; update the code that
builds disaggregated prefill/decode wait_for_health URLs to use
vllm.cmd_args.healthcheck (falling back to "/health" if unset) and also add the
compatibility variant (if healthcheck ends with "/health" add "/healthcheck" and
vice versa) so both legacy and custom endpoints are waited on; look for
references to VllmCmdArgs.healthcheck, the disaggregated readiness construction
in the function/method that generates role-server health checks (used by
VllmSlurmCommandGenStrategy._gen_srun_command), and replace the hardcoded
"/health" occurrences with this computed health path + compatibility variant.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 7ffecac9-cc89-4616-8304-5504dd9d4de1

📥 Commits

Reviewing files that changed from the base of the PR and between 98bf489 and 61b5499.

📒 Files selected for processing (11)
  • doc/workloads/sglang.rst
  • doc/workloads/vllm.rst
  • src/cloudai/workloads/common/llm_serving.py
  • src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
  • src/cloudai/workloads/vllm/vllm.py
  • tests/ref_data/vllm-disagg-2nodes.sbatch
  • tests/ref_data/vllm-disagg.sbatch
  • tests/ref_data/vllm-multinode.sbatch
  • tests/ref_data/vllm.sbatch
  • tests/workloads/common/test_llm_serving.py
  • tests/workloads/vllm/test_command_gen_strategy_slurm.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Around line 654-657: The common role_server_healthcheck property currently
returns a hardcoded "/health", so disaggregated startup waits never use the
configured tdef.cmd_args.healthcheck; update role_server_healthcheck (the
`@property` on the class in src/cloudai/workloads/common/llm_serving.py) to return
self.tdef.cmd_args.healthcheck or "/health" (and preserve the vLLM compatibility
mapping where "/healthcheck" should be normalized to "/health" if you need the
same behavior), so non-vLLM strategies honor the configured healthcheck value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 6cf8c2ad-3a40-4938-8e8a-62ff0049d782

📥 Commits

Reviewing files that changed from the base of the PR and between e884f15 and 56a91d6.

📒 Files selected for processing (13)
  • conf/experimental/vllm/test/vllm-heavy.toml
  • doc/workloads/sglang.rst
  • doc/workloads/vllm.rst
  • src/cloudai/workloads/common/llm_serving.py
  • src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
  • src/cloudai/workloads/vllm/vllm.py
  • tests/ref_data/sglang-multinode.sbatch
  • tests/ref_data/vllm-disagg-2nodes.sbatch
  • tests/ref_data/vllm-disagg.sbatch
  • tests/ref_data/vllm-multinode.sbatch
  • tests/test_acceptance.py
  • tests/workloads/vllm/test_command_gen_strategy_slurm.py
  • tests/workloads/vllm/test_workload.py

Comment thread src/cloudai/workloads/common/llm_serving.py Outdated
@podkidyshev podkidyshev force-pushed the ipod/llm-multi-node branch 4 times, most recently from 41a8eab to 62bf7cf Compare June 11, 2026 17:00
@podkidyshev podkidyshev force-pushed the ipod/llm-multi-node branch from 62bf7cf to 9ea8e17 Compare June 11, 2026 18:18

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/workloads/common/llm_serving.py (1)

287-291: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

used_gpus_count() undercounts GPUs for multi-node disaggregated runs with explicit role gpu_ids.

On Line 290, the explicit prefill.gpu_ids/decode.gpu_ids path returns only per-node GPU totals. With multi-node disaggregation (prefill.num_nodes/decode.num_nodes > 1), this inflates tps-per-gpu because cluster-wide GPU usage is undercounted.

💡 Suggested fix
         prefill_gpu_ids = tdef.cmd_args.prefill.gpu_ids
         decode_gpu_ids = tdef.cmd_args.decode.gpu_ids
         if prefill_gpu_ids and decode_gpu_ids:
-            return len(parse_gpu_ids(prefill_gpu_ids)) + len(parse_gpu_ids(decode_gpu_ids))
+            prefill_nodes = tdef.cmd_args.prefill.num_nodes if isinstance(tdef.cmd_args.prefill.num_nodes, int) else 1
+            decode_nodes = tdef.cmd_args.decode.num_nodes if isinstance(tdef.cmd_args.decode.num_nodes, int) else 1
+            return (
+                len(parse_gpu_ids(prefill_gpu_ids)) * prefill_nodes
+                + len(parse_gpu_ids(decode_gpu_ids)) * decode_nodes
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/common/llm_serving.py` around lines 287 - 291,
used_gpus_count() currently sums parse_gpu_ids(prefill.gpu_ids) and
parse_gpu_ids(decode.gpu_ids) but only returns per-node GPU counts when explicit
gpu_ids are provided; for multi-node disaggregated runs you must multiply the
per-node GPU count by the corresponding num_nodes
(tdef.cmd_args.prefill.num_nodes and tdef.cmd_args.decode.num_nodes) before
summing. Update the explicit-path in used_gpus_count to compute
len(parse_gpu_ids(prefill.gpu_ids)) * max(1, tdef.cmd_args.prefill.num_nodes)
and similarly for decode, then return their sum so cluster-wide GPU usage is
correctly counted; reference the symbols used_gpus_count,
tdef.cmd_args.prefill.gpu_ids, tdef.cmd_args.prefill.num_nodes,
tdef.cmd_args.decode.gpu_ids, tdef.cmd_args.decode.num_nodes, and parse_gpu_ids.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/vllm/slurm_command_gen_strategy.py`:
- Around line 57-75: The serializer currently leaves non-dict scalar strings
unquoted (in _format_ray_value and used by _serialize_ray_start_args), causing
shell-splitting when embedding the generated args into bash -lc; update
_format_ray_value to shlex.quote string instances (not only dicts) before
returning, ensuring string scalars are shell-quoted when
_serialize_ray_start_args builds the "--key=value" segments for ray start (leave
dict handling as-is and preserve boolean handling in _serialize_ray_start_args).

In `@src/cloudai/workloads/vllm/vllm.py`:
- Around line 191-223: The validator rejects the explicit single-node
disaggregated config prefill.num_nodes=1 and decode.num_nodes=1 because
prefill_nodes_value + decode_nodes_value != num_nodes; update the check in the
vLLM validation (using prefill_nodes_value, decode_nodes_value, num_nodes,
prefill_nodes, decode_nodes) to allow the special-case when num_nodes == 1 and
both role counts are 1 (treat it as valid and set prefill_nodes/decode_nodes
accordingly); concretely, change the sum-check condition to skip the error if
(num_nodes == 1 && prefill_nodes_value == 1 && decode_nodes_value == 1) so the
explicit 1/1 case is accepted.

In `@tests/workloads/vllm/test_command_gen_strategy_slurm.py`:
- Around line 399-429: The test is letting callers override Ray topology flags
(head/address/block/port) via VllmRayStartArgs, which we must prevent; update
VllmSlurmCommandGenStrategy._gen_srun_command to ignore topology-defining fields
from user-provided VllmRayStartArgs and only honor resource/telemetry knobs
(num_gpus, num_cpus, dashboard_host, disable_usage_stats) when building the "ray
start" srun command; specifically, filter the incoming VllmRayStartArgs for
ray_head and ray_worker to drop/override head, address, block, and port before
rendering flags so orchestration controls topology while still allowing
num_gpus/num_cpus/dashboard_host/disable_usage_stats to be applied.

---

Outside diff comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Around line 287-291: used_gpus_count() currently sums
parse_gpu_ids(prefill.gpu_ids) and parse_gpu_ids(decode.gpu_ids) but only
returns per-node GPU counts when explicit gpu_ids are provided; for multi-node
disaggregated runs you must multiply the per-node GPU count by the corresponding
num_nodes (tdef.cmd_args.prefill.num_nodes and tdef.cmd_args.decode.num_nodes)
before summing. Update the explicit-path in used_gpus_count to compute
len(parse_gpu_ids(prefill.gpu_ids)) * max(1, tdef.cmd_args.prefill.num_nodes)
and similarly for decode, then return their sum so cluster-wide GPU usage is
correctly counted; reference the symbols used_gpus_count,
tdef.cmd_args.prefill.gpu_ids, tdef.cmd_args.prefill.num_nodes,
tdef.cmd_args.decode.gpu_ids, tdef.cmd_args.decode.num_nodes, and parse_gpu_ids.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f3a123c8-d2cf-4eae-802d-0731d342443a

📥 Commits

Reviewing files that changed from the base of the PR and between 56a91d6 and 7df6955.

📒 Files selected for processing (22)
  • conf/experimental/sglang/test_scenario/sglang-heavy.toml
  • conf/experimental/sglang/test_scenario/sglang.toml
  • conf/experimental/vllm/test/vllm.toml
  • conf/experimental/vllm/test_scenario/vllm-heavy.toml
  • conf/experimental/vllm/test_scenario/vllm.toml
  • doc/workloads/sglang.rst
  • doc/workloads/vllm.rst
  • src/cloudai/workloads/common/llm_serving.py
  • src/cloudai/workloads/sglang/slurm_command_gen_strategy.py
  • src/cloudai/workloads/vllm/__init__.py
  • src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
  • src/cloudai/workloads/vllm/vllm.py
  • tests/ref_data/sglang-disagg-2nodes.sbatch
  • tests/ref_data/sglang-disagg.sbatch
  • tests/ref_data/sglang-multinode.sbatch
  • tests/ref_data/vllm-disagg-2nodes.sbatch
  • tests/ref_data/vllm-disagg.sbatch
  • tests/ref_data/vllm-multinode.sbatch
  • tests/test_acceptance.py
  • tests/workloads/sglang/test_command_gen_strategy_slurm.py
  • tests/workloads/vllm/test_command_gen_strategy_slurm.py
  • tests/workloads/vllm/test_workload.py

Comment thread src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
Comment thread src/cloudai/workloads/vllm/vllm.py
Comment thread tests/workloads/vllm/test_command_gen_strategy_slurm.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants