[vLLM/SGLang] multi-node #918
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID:
📒 Files selected for processing (1)
📝 Walkthrough
Generalizes multi-node aggregated/disaggregated serving, adds role-aware node sizing, Ray orchestration for vLLM, refactors Slurm script generation and healthcheck routing, provides multinode SBATCH examples, updates tests, and refreshes documentation and scenarios.
Changes: Multi-node LLM Serving Support
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@conf/experimental/vllm/test/vllm-heavy.toml`:
- Around line 21-24: The git_repos entry currently pins vLLM to the branch name
commit = "main", which makes builds non-reproducible; update the commit field in
the [[git_repos]] block (the block containing url =
"https://github.com/vllm-project/vllm.git" and mount_as = "/vllm_repo") to a
specific commit hash string (replace "main" with the exact SHA1 commit hash for
the desired vLLM revision) so the repository checkout is deterministic.
In `@doc/workloads/sglang.rst`:
- Around line 144-161: The docs example is ambiguous about tensor-parallel size:
clarify that the tp setting in the scenario.toml (the [Tests.cmd_args.decode].tp
value) is a per-node TP size (each serving node started by sg_lang.launch_server
uses tp GPUs locally) rather than a cluster-wide span; update the text near the
multi-node example (id "sglang.multi_node", test_template_name "sglang",
CUDA_VISIBLE_DEVICES example) to explicitly state that tp is applied per-node
and how it interacts with CUDA_VISIBLE_DEVICES and number of nodes.
In `@doc/workloads/vllm.rst`:
- Around line 142-159: Update the multi-node vLLM example to clarify tensor
parallelism scope: explicitly state whether tensor_parallel_size in the
[[Tests]] block is cluster-wide or per-node, and either change the example value
tensor_parallel_size = 2 to tensor_parallel_size = 8 (to reflect using all 8
GPUs given CUDA_VISIBLE_DEVICES = "0,1,2,3" on two nodes) or add a short note
under the [Tests.cmd_args.decode] entry explaining that tensor_parallel_size is
cluster-wide (or per-node) and describing the implications for GPU utilization;
reference the tensor_parallel_size field, the CUDA_VISIBLE_DEVICES value, and
the vllm.multi_node test id when making the clarification.
In `@tests/ref_data/vllm-multinode.sbatch`:
- Around line 91-107: The cleanup() function currently SIGTERM's backgrounded
PIDs (SERVE_RAY_PID, SERVE_PID) but does not stop the Ray cluster started with
ray start --head; modify cleanup() to explicitly run ray stop --force (or ray
shutdown) against the head (use RAY_ADDRESS or SERVE_NODE:${SERVE_RAY_PORT} if
available) before/after killing SERVE_RAY_PID and SERVE_PID, ignore non-zero
exit codes, and ensure this is done even if vllm serve was exec'd so the Ray
head is deterministically shut down on test teardown.
In `@tests/test_acceptance.py`:
- Around line 636-650: The sglang-multinode test definition uses decode.tp=2 (in
SglangCmdArgs for "sglang-multinode") while the multi-node scenario TOMLs
(conf/experimental/sglang/test_scenario/sglang.toml and sglang-heavy.toml)
specify num_nodes=2 and tp=8, causing a mismatch; fix by aligning the
tensor-parallelism settings: either update the test in tests/test_acceptance.py
(the "sglang-multinode" SglangCmdArgs.decode.tp value and CUDA_VISIBLE_DEVICES)
to match tp=8 used by the scenario TOMLs, or change the TOML tp fields to 2 if 2
is intended, and add a brief comment documenting the chosen alignment so future
changes remain consistent.
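The per-node versus cluster-wide question raised above for `tp` and `tensor_parallel_size` comes down to simple arithmetic. Below is a minimal sketch, assuming hypothetical helper names (`visible_gpus_per_node`, `tp_scope`) that are not part of CloudAI, vLLM, or SGLang. It only classifies a TP value against the two possible readings and makes no claim about which reading either engine actually applies.

```python
def visible_gpus_per_node(cuda_visible_devices: str) -> int:
    """Count GPU ids in a CUDA_VISIBLE_DEVICES-style string such as "0,1,2,3"."""
    return len([gpu for gpu in cuda_visible_devices.split(",") if gpu.strip()])


def tp_scope(tp: int, num_nodes: int, cuda_visible_devices: str) -> str:
    """Classify a TP value (hypothetical helper, for illustration only).

    Returns "per-node" if tp equals the GPUs visible on one node,
    "cluster-wide" if it equals GPUs-per-node times num_nodes,
    and "neither" otherwise.
    """
    per_node = visible_gpus_per_node(cuda_visible_devices)
    cluster_wide = per_node * num_nodes
    if tp == per_node:
        return "per-node"
    if tp == cluster_wide:
        return "cluster-wide"
    return "neither"


if __name__ == "__main__":
    # Two nodes, four visible GPUs each, as in the documentation example.
    for tp in (2, 4, 8):
        print(tp, tp_scope(tp, num_nodes=2, cuda_visible_devices="0,1,2,3"))
```

With two nodes and `CUDA_VISIBLE_DEVICES = "0,1,2,3"`, a value of 2 matches neither reading, which is the ambiguity the documentation comments ask to resolve.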
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 31eb0958-d1de-4cea-b76b-17ead0960dde
📒 Files selected for processing (25)
conf/experimental/sglang/test/sglang-heavy.toml
conf/experimental/sglang/test_scenario/sglang-heavy.toml
conf/experimental/sglang/test_scenario/sglang.toml
conf/experimental/vllm/test/vllm-heavy.toml
conf/experimental/vllm/test_scenario/vllm-heavy-perf.toml
conf/experimental/vllm/test_scenario/vllm-heavy.toml
conf/experimental/vllm/test_scenario/vllm.toml
doc/workloads/sglang.rst
doc/workloads/vllm.rst
src/cloudai/workloads/common/llm_serving.py
src/cloudai/workloads/sglang/slurm_command_gen_strategy.py
src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
src/cloudai/workloads/vllm/vllm.py
tests/ref_data/sglang-disagg-2nodes.sbatch
tests/ref_data/sglang-disagg.sbatch
tests/ref_data/sglang-multinode.sbatch
tests/ref_data/vllm-disagg-2nodes.sbatch
tests/ref_data/vllm-disagg.sbatch
tests/ref_data/vllm-multinode.sbatch
tests/ref_data/vllm.sbatch
tests/test_acceptance.py
tests/workloads/common/test_llm_serving.py
tests/workloads/sglang/test_command_gen_strategy_slurm.py
tests/workloads/vllm/test_command_gen_strategy_slurm.py
tests/workloads/vllm/test_workload.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/cloudai/workloads/vllm/vllm.py (1)
172-188: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Keep `constraint_check()` consistent with the shared disaggregated node planner.
For `num_nodes > 2`, this falls back to `prefill_nodes = decode_nodes = 1` whenever the role sizes are unset, and it never checks that the two role counts sum to `tr.nnodes`. The common Slurm helper now rejects both cases, so an invalid multinode config can pass definition-time validation here and only fail later when the sbatch is generated. Reuse the shared role-count validation here or return `False` on the same conditions.
Suggested fix
     prefill_nodes = 1
     decode_nodes = 1
     if num_nodes > 2:
         prefill_nodes_value = self.cmd_args.prefill.num_nodes
         decode_nodes_value = self.cmd_args.decode.num_nodes
    -    prefill_nodes = prefill_nodes_value if isinstance(prefill_nodes_value, int) else prefill_nodes
    -    decode_nodes = decode_nodes_value if isinstance(decode_nodes_value, int) else decode_nodes
    +    if not isinstance(prefill_nodes_value, int) or not isinstance(decode_nodes_value, int):
    +        logging.error(
    +            "Disaggregated vLLM with %s nodes requires both prefill.num_nodes and decode.num_nodes.",
    +            num_nodes,
    +        )
    +        return False
    +    if prefill_nodes_value + decode_nodes_value != num_nodes:
    +        logging.error(
    +            "Configured prefill/decode nodes (%s + %s) must match allocated nodes (%s).",
    +            prefill_nodes_value,
    +            decode_nodes_value,
    +            num_nodes,
    +        )
    +        return False
    +    prefill_nodes = prefill_nodes_value
    +    decode_nodes = decode_nodes_value
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/vllm/vllm.py` around lines 172 - 188, The constraint_check logic incorrectly defaults prefill_nodes and decode_nodes to 1 for num_nodes > 2 and never verifies that the role counts sum to the total nodes; update the block handling num_nodes > 2 to (a) read prefill_nodes_value = self.cmd_args.prefill.num_nodes and decode_nodes_value = self.cmd_args.decode.num_nodes, (b) if either value is not an int or if prefill_nodes_value + decode_nodes_value != num_nodes then return False, and (c) otherwise set prefill_nodes and decode_nodes from those values and proceed to call _validate_vllm_parallelism_constraints using calculate_prefill_gpu_ids and calculate_decode_gpu_ids multiplied by the validated role counts; reference calculate_prefill_gpu_ids, calculate_decode_gpu_ids, self.cmd_args.prefill, self.cmd_args.decode, and _validate_vllm_parallelism_constraints when making the changes.
tests/workloads/vllm/test_command_gen_strategy_slurm.py (1)
398-428: ⚠️ Potential issue | 🟠 Major
Disaggregated mode ignores custom `VllmCmdArgs.healthcheck` for role servers
- `src/cloudai/workloads/common/llm_serving.py` hardcodes disaggregated role-server readiness checks to `http://{prefill_host}:{prefill_port}/health` and `http://{decode_host}:{decode_port}/health`; the compatibility logic only swaps `/health` ⇄ `/healthcheck`, so custom endpoints (e.g. `/ready`) will not be waited on.
- The test suite does assert prefill/decode `/health` in disaggregated flows, but it doesn’t assert that a custom `cmd_args.healthcheck` is honored on those prefill/decode ports; only the router/proxy wait is exercised via `proxy_router_healthcheck`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/workloads/vllm/test_command_gen_strategy_slurm.py` around lines 398 - 428, The disaggregated readiness-check logic in src/cloudai/workloads/common/llm_serving.py currently hardcodes "/health" for role-server endpoints and only applies the /health ⇄ /healthcheck compatibility swap, so custom VllmCmdArgs.healthcheck values are ignored; update the code that builds disaggregated prefill/decode wait_for_health URLs to use vllm.cmd_args.healthcheck (falling back to "/health" if unset) and also add the compatibility variant (if healthcheck ends with "/health" add "/healthcheck" and vice versa) so both legacy and custom endpoints are waited on; look for references to VllmCmdArgs.healthcheck, the disaggregated readiness construction in the function/method that generates role-server health checks (used by VllmSlurmCommandGenStrategy._gen_srun_command), and replace the hardcoded "/health" occurrences with this computed health path + compatibility variant.
src/cloudai/workloads/common/llm_serving.py (1)
836-850: ⚠️ Potential issue | 🟠 Major
Fix disaggregated prefill/decode readiness endpoint mismatch with docs
`doc/workloads/vllm.rst` states that in disaggregated mode `healthcheck` controls the prefill/decode readiness endpoint, but the implementation hardcodes the prefill/decode wait URLs to `...:{prefill_port}/health` and `...:{decode_port}/health` (`src/cloudai/workloads/common/llm_serving.py`, lines 836-841). By contrast, aggregated mode correctly waits on `...:{serve_port}{self.tdef.cmd_args.healthcheck}` (around lines 780-783).
While `vllm/slurm_command_gen_strategy.py` adds compatibility only between `/health` and `/healthcheck`, this does not cover arbitrary custom readiness paths, so the doc’s “custom paths are used as configured” guarantee can be broken for disaggregated prefill/decode waits.
Proxy/router behavior is consistent: it waits on `proxy_router_healthcheck`, which falls back to `healthcheck` only when `proxy_healthcheck` isn’t set (`vllm/slurm_command_gen_strategy.py`, lines ~129-133), matching the doc.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/common/llm_serving.py` around lines 836 - 850, The disaggregated prefill/decode wait URLs are hardcoded to "/health" in generate_wait_for_health_block calls; change them to use the configured healthcheck path (same logic used by aggregated mode and proxy_router_healthcheck) so custom readiness endpoints are respected. Update the two URL entries that build f"http://{self.disaggregated_role_host('prefill')}:{self.prefill_port}/health" and f"http://{self.disaggregated_role_host('decode')}:{self.decode_port}/health" to instead incorporate the configured health path (e.g., use self.tdef.cmd_args.healthcheck or the same proxy_router_healthcheck fallback logic) so generate_wait_for_health_block, disaggregated_role_host, prefill_port, decode_port, serve_port and proxy_router_healthcheck all use the same healthcheck resolution.
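The healthcheck resolution requested in the comments above (configured path first, `/health` as fallback, plus the `/health` ⇄ `/healthcheck` compatibility variant) can be sketched as follows. This is a minimal illustration, not the CloudAI implementation: `healthcheck_paths` and `role_wait_urls` are hypothetical names, and the host and port values are placeholders.

```python
from typing import List, Optional


def healthcheck_paths(configured: Optional[str]) -> List[str]:
    """Return the configured path, then its /health <-> /healthcheck counterpart if one exists.

    Hypothetical helper: falls back to "/health" when nothing is configured.
    """
    path = configured or "/health"
    paths = [path]
    if path.endswith("/healthcheck"):
        # Legacy spelling configured: also accept the shorter "/health" form.
        paths.append(path[: -len("check")])
    elif path.endswith("/health"):
        # Short spelling configured: also accept "/healthcheck".
        paths.append(path + "check")
    return paths


def role_wait_urls(host: str, port: int, configured: Optional[str]) -> List[str]:
    """Build the readiness URLs to poll for one role server (prefill or decode)."""
    return [f"http://{host}:{port}{path}" for path in healthcheck_paths(configured)]


if __name__ == "__main__":
    print(role_wait_urls("prefill-node", 8100, "/ready"))
    print(role_wait_urls("decode-node", 8200, None))
```

A custom path such as `/ready` yields a single URL, while the two built-in spellings each yield both variants, so a wait loop can accept whichever endpoint the server exposes.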
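The role-count validation requested for `constraint_check()` in `vllm.py` can be isolated into a small function. The sketch below mirrors the suggested fix under stated assumptions: `resolve_role_nodes` is a hypothetical name, it returns `None` where the real method would return `False`, and it keeps the one-node-per-role default for allocations of at most two nodes as in the quoted snippet.

```python
import logging
from typing import Optional, Tuple


def resolve_role_nodes(
    num_nodes: int,
    prefill_nodes: Optional[int],
    decode_nodes: Optional[int],
) -> Optional[Tuple[int, int]]:
    """Return (prefill_nodes, decode_nodes), or None if the configuration is invalid.

    Hypothetical helper for illustration; not part of CloudAI.
    """
    if num_nodes <= 2:
        # Same default as the quoted snippet: one node per role.
        return 1, 1
    if not isinstance(prefill_nodes, int) or not isinstance(decode_nodes, int):
        logging.error(
            "Disaggregated vLLM with %s nodes requires both prefill.num_nodes and decode.num_nodes.",
            num_nodes,
        )
        return None
    if prefill_nodes + decode_nodes != num_nodes:
        logging.error(
            "Configured prefill/decode nodes (%s + %s) must match allocated nodes (%s).",
            prefill_nodes,
            decode_nodes,
            num_nodes,
        )
        return None
    return prefill_nodes, decode_nodes


if __name__ == "__main__":
    print(resolve_role_nodes(4, 1, 3))     # valid split
    print(resolve_role_nodes(4, None, 3))  # missing role size
    print(resolve_role_nodes(4, 2, 3))     # sum does not match
```

Keeping the check in one function would let definition-time validation and sbatch generation reject the same configurations, which is the consistency the comment asks for.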
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 7ffecac9-cc89-4616-8304-5504dd9d4de1
📒 Files selected for processing (11)
doc/workloads/sglang.rst
doc/workloads/vllm.rst
src/cloudai/workloads/common/llm_serving.py
src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
src/cloudai/workloads/vllm/vllm.py
tests/ref_data/vllm-disagg-2nodes.sbatch
tests/ref_data/vllm-disagg.sbatch
tests/ref_data/vllm-multinode.sbatch
tests/ref_data/vllm.sbatch
tests/workloads/common/test_llm_serving.py
tests/workloads/vllm/test_command_gen_strategy_slurm.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Around line 654-657: The common role_server_healthcheck property currently
returns a hardcoded "/health", so disaggregated startup waits never use the
configured tdef.cmd_args.healthcheck; update role_server_healthcheck (the
`@property` on the class in src/cloudai/workloads/common/llm_serving.py) to return
self.tdef.cmd_args.healthcheck or "/health" (and preserve the vLLM compatibility
mapping where "/healthcheck" should be normalized to "/health" if you need the
same behavior), so non-vLLM strategies honor the configured healthcheck value.
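A property that honors the configured value might look like the sketch below. `CmdArgs` and `RoleServer` are hypothetical stand-ins for the real CloudAI classes, and the `normalize_legacy` switch is an assumption used here to show the optional vLLM-style `/healthcheck` to `/health` mapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CmdArgs:
    """Hypothetical stand-in for tdef.cmd_args; only the field used here."""

    healthcheck: Optional[str] = None


@dataclass
class RoleServer:
    """Hypothetical stand-in for the strategy class that owns the property."""

    cmd_args: CmdArgs
    normalize_legacy: bool = False  # assumption: enabled only where vLLM compatibility is wanted

    @property
    def role_server_healthcheck(self) -> str:
        # Use the configured path, falling back to "/health" when unset or empty.
        path = self.cmd_args.healthcheck or "/health"
        if self.normalize_legacy and path == "/healthcheck":
            return "/health"
        return path


if __name__ == "__main__":
    print(RoleServer(CmdArgs()).role_server_healthcheck)
    print(RoleServer(CmdArgs("/ready")).role_server_healthcheck)
    print(RoleServer(CmdArgs("/healthcheck"), normalize_legacy=True).role_server_healthcheck)
```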
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 6cf8c2ad-3a40-4938-8e8a-62ff0049d782
📒 Files selected for processing (13)
conf/experimental/vllm/test/vllm-heavy.toml
doc/workloads/sglang.rst
doc/workloads/vllm.rst
src/cloudai/workloads/common/llm_serving.py
src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
src/cloudai/workloads/vllm/vllm.py
tests/ref_data/sglang-multinode.sbatch
tests/ref_data/vllm-disagg-2nodes.sbatch
tests/ref_data/vllm-disagg.sbatch
tests/ref_data/vllm-multinode.sbatch
tests/test_acceptance.py
tests/workloads/vllm/test_command_gen_strategy_slurm.py
tests/workloads/vllm/test_workload.py
41a8eab to 62bf7cf
62bf7cf to 9ea8e17
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/workloads/common/llm_serving.py (1)
287-291: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
`used_gpus_count()` undercounts GPUs for multi-node disaggregated runs with explicit role `gpu_ids`.
On Line 290, the explicit `prefill.gpu_ids`/`decode.gpu_ids` path returns only per-node GPU totals. With multi-node disaggregation (`prefill.num_nodes`/`decode.num_nodes` > 1), this inflates `tps-per-gpu` because cluster-wide GPU usage is undercounted.
💡 Suggested fix
     prefill_gpu_ids = tdef.cmd_args.prefill.gpu_ids
     decode_gpu_ids = tdef.cmd_args.decode.gpu_ids
     if prefill_gpu_ids and decode_gpu_ids:
    -    return len(parse_gpu_ids(prefill_gpu_ids)) + len(parse_gpu_ids(decode_gpu_ids))
    +    prefill_nodes = tdef.cmd_args.prefill.num_nodes if isinstance(tdef.cmd_args.prefill.num_nodes, int) else 1
    +    decode_nodes = tdef.cmd_args.decode.num_nodes if isinstance(tdef.cmd_args.decode.num_nodes, int) else 1
    +    return (
    +        len(parse_gpu_ids(prefill_gpu_ids)) * prefill_nodes
    +        + len(parse_gpu_ids(decode_gpu_ids)) * decode_nodes
    +    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/common/llm_serving.py` around lines 287 - 291, used_gpus_count() currently sums parse_gpu_ids(prefill.gpu_ids) and parse_gpu_ids(decode.gpu_ids) but only returns per-node GPU counts when explicit gpu_ids are provided; for multi-node disaggregated runs you must multiply the per-node GPU count by the corresponding num_nodes (tdef.cmd_args.prefill.num_nodes and tdef.cmd_args.decode.num_nodes) before summing. Update the explicit-path in used_gpus_count to compute len(parse_gpu_ids(prefill.gpu_ids)) * max(1, tdef.cmd_args.prefill.num_nodes) and similarly for decode, then return their sum so cluster-wide GPU usage is correctly counted; reference the symbols used_gpus_count, tdef.cmd_args.prefill.gpu_ids, tdef.cmd_args.prefill.num_nodes, tdef.cmd_args.decode.gpu_ids, tdef.cmd_args.decode.num_nodes, and parse_gpu_ids.
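The cluster-wide count described above is per-node GPUs multiplied by the role's node count, summed over both roles. The sketch below is self-contained, so it takes the four values as plain arguments instead of a `tdef` object, and its `parse_gpu_ids` is a simplified stand-in that may differ from the real helper.

```python
from typing import List, Optional


def parse_gpu_ids(gpu_ids: str) -> List[int]:
    """Simplified stand-in (assumption): parse "0,1,2" into [0, 1, 2]."""
    return [int(part) for part in gpu_ids.split(",") if part.strip()]


def role_gpu_total(gpu_ids: str, num_nodes: Optional[int]) -> int:
    """GPUs used by one role across all of its nodes; an unset node count means one node."""
    nodes = num_nodes if isinstance(num_nodes, int) else 1
    return len(parse_gpu_ids(gpu_ids)) * max(1, nodes)


def used_gpus_count(
    prefill_gpu_ids: str,
    prefill_nodes: Optional[int],
    decode_gpu_ids: str,
    decode_nodes: Optional[int],
) -> int:
    """Cluster-wide GPU count for a disaggregated run with explicit role gpu_ids."""
    return role_gpu_total(prefill_gpu_ids, prefill_nodes) + role_gpu_total(decode_gpu_ids, decode_nodes)


if __name__ == "__main__":
    # 4 GPUs x 2 prefill nodes + 4 GPUs x 3 decode nodes
    print(used_gpus_count("0,1,2,3", 2, "0,1,2,3", 3))
```

Without the multiplication the same configuration would count only 8 GPUs, which is how a per-GPU throughput metric ends up overstated.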
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/workloads/vllm/slurm_command_gen_strategy.py`:
- Around line 57-75: The serializer currently leaves non-dict scalar strings
unquoted (in _format_ray_value and used by _serialize_ray_start_args), causing
shell-splitting when embedding the generated args into bash -lc; update
_format_ray_value to shlex.quote string instances (not only dicts) before
returning, ensuring string scalars are shell-quoted when
_serialize_ray_start_args builds the "--key=value" segments for ray start (leave
dict handling as-is and preserve boolean handling in _serialize_ray_start_args).
In `@src/cloudai/workloads/vllm/vllm.py`:
- Around line 191-223: The validator rejects the explicit single-node
disaggregated config prefill.num_nodes=1 and decode.num_nodes=1 because
prefill_nodes_value + decode_nodes_value != num_nodes; update the check in the
vLLM validation (using prefill_nodes_value, decode_nodes_value, num_nodes,
prefill_nodes, decode_nodes) to allow the special-case when num_nodes == 1 and
both role counts are 1 (treat it as valid and set prefill_nodes/decode_nodes
accordingly); concretely, change the sum-check condition to skip the error if
(num_nodes == 1 && prefill_nodes_value == 1 && decode_nodes_value == 1) so the
explicit 1/1 case is accepted.
In `@tests/workloads/vllm/test_command_gen_strategy_slurm.py`:
- Around line 399-429: The test is letting callers override Ray topology flags
(head/address/block/port) via VllmRayStartArgs, which we must prevent; update
VllmSlurmCommandGenStrategy._gen_srun_command to ignore topology-defining fields
from user-provided VllmRayStartArgs and only honor resource/telemetry knobs
(num_gpus, num_cpus, dashboard_host, disable_usage_stats) when building the "ray
start" srun command; specifically, filter the incoming VllmRayStartArgs for
ray_head and ray_worker to drop/override head, address, block, and port before
rendering flags so orchestration controls topology while still allowing
num_gpus/num_cpus/dashboard_host/disable_usage_stats to be applied.
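The two `ray start` findings (shell-quote string values, and keep user arguments from overriding topology flags) can be combined in one serializer. This is a sketch under assumptions, not the code in `slurm_command_gen_strategy.py`: the function names drop the leading underscore, user arguments are modeled as a plain dict, and JSON encoding for dict values is a guess at the existing dict handling.

```python
import json
import shlex
from typing import Any, Dict, List

# Flags that orchestration, not the user, should control (as listed in the review comment).
TOPOLOGY_KEYS = {"head", "address", "block", "port"}


def format_ray_value(value: Any) -> str:
    """Render one value so it survives embedding in a `bash -lc` command string."""
    if isinstance(value, dict):
        # Assumption: dicts are passed as quoted JSON.
        return shlex.quote(json.dumps(value))
    if isinstance(value, str):
        return shlex.quote(value)
    return str(value)


def serialize_ray_start_args(user_args: Dict[str, Any]) -> List[str]:
    """Turn user-provided args into `--key=value` flags, dropping topology keys."""
    flags: List[str] = []
    for key, value in user_args.items():
        if key in TOPOLOGY_KEYS or value is None:
            continue
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            # Booleans become bare flags when true and are omitted when false.
            if value:
                flags.append(flag)
            continue
        flags.append(f"{flag}={format_ray_value(value)}")
    return flags


if __name__ == "__main__":
    args = {
        "head": True,
        "address": "10.0.0.1:6379",
        "num_gpus": 4,
        "dashboard_host": "0.0.0.0",
        "disable_usage_stats": True,
    }
    print(" ".join(serialize_ray_start_args(args)))
```

`shlex.quote` leaves values made only of safe characters unchanged, so typical host names and numbers render as before, while values containing spaces or shell metacharacters are wrapped in single quotes.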
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: f3a123c8-d2cf-4eae-802d-0731d342443a
📒 Files selected for processing (22)
conf/experimental/sglang/test_scenario/sglang-heavy.toml
conf/experimental/sglang/test_scenario/sglang.toml
conf/experimental/vllm/test/vllm.toml
conf/experimental/vllm/test_scenario/vllm-heavy.toml
conf/experimental/vllm/test_scenario/vllm.toml
doc/workloads/sglang.rst
doc/workloads/vllm.rst
src/cloudai/workloads/common/llm_serving.py
src/cloudai/workloads/sglang/slurm_command_gen_strategy.py
src/cloudai/workloads/vllm/__init__.py
src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
src/cloudai/workloads/vllm/vllm.py
tests/ref_data/sglang-disagg-2nodes.sbatch
tests/ref_data/sglang-disagg.sbatch
tests/ref_data/sglang-multinode.sbatch
tests/ref_data/vllm-disagg-2nodes.sbatch
tests/ref_data/vllm-disagg.sbatch
tests/ref_data/vllm-multinode.sbatch
tests/test_acceptance.py
tests/workloads/sglang/test_command_gen_strategy_slurm.py
tests/workloads/vllm/test_command_gen_strategy_slurm.py
tests/workloads/vllm/test_workload.py