feat(templates): add bev-model-training-robotics template + test #719
maxpumperla wants to merge 9 commits
Add BEV (Bird's-Eye View) model training template for robotics applications. This template demonstrates an end-to-end distributed training pipeline using Ray Data for multi-camera preprocessing and Ray Train for DDP training.

Template includes:
- README.ipynb with complete walkthrough of BEV training pipeline
- Ray Data pipelines for NuScenes dataset preprocessing
- Ray Train configuration with 2-worker DDP training on L4 GPUs
- Three architecture diagrams (training pipeline, camera transform, distributed arch)
- GPU compute configs for AWS (g6.4xlarge) and GCE (g2-standard-4)
- Test script for validation

Family: Robotics
Target libraries: Ray Data, Ray Train
Workload types: Distributed training, Vision training

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
/test-template bev-model-training-robotics
Pull request overview
Adds a new “BEV Model Training for Robotics” template to the templates catalog, intended to demonstrate an end-to-end distributed BEV training pipeline using Ray Data (CPU preprocessing) and Ray Train (DDP training), along with configs and a CI validation test.
Changes:
- Introduces a new `bev-model-training-robotics` template (README.md/README.ipynb, requirements, metadata, diagrams).
- Adds AWS/GCP compute configs for a 2-worker GPU setup.
- Registers the template in `BUILD.yaml` and adds a papermill-based test runner.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/bev-model-training-robotics/tests.sh | Adds papermill-based execution of the template notebook for validation. |
| templates/bev-model-training-robotics/requirements.txt | Declares template-specific Python dependencies. |
| templates/bev-model-training-robotics/README.md | Markdown walkthrough for dataset staging, Ray Data preprocessing, and Ray Train DDP training. |
| templates/bev-model-training-robotics/README.ipynb | Notebook version of the walkthrough intended to be executed in the workspace. |
| templates/bev-model-training-robotics/metadata.json | Adds template metadata (intent, structure, dependencies, diagrams). |
| templates/bev-model-training-robotics/diagrams/*.xml | Adds architecture/flow diagrams for the template. |
| configs/bev-model-training-robotics/aws.yaml | Adds AWS cluster sizing for the template. |
| configs/bev-model-training-robotics/gce.yaml | Adds GCP cluster sizing for the template. |
| BUILD.yaml | Registers the new template with image, compute configs, and test command. |

    },
    "structure": {
      "archetype": "single-notebook",
      "primary_notebook": "bev_model_training_robotics.ipynb"

    "name": "python",
    "version": "3.12.0"

    @@ -0,0 +1,4 @@
    #!/usr/bin/env bash
    set -euo pipefail
    pip install papermill

    # Block entries that would escape target directory
    if not str(member_path).startswith(str(path)):
        raise RuntimeError(f"Blocked path traversal in tar member: {member.name}")

    tar.extractall(path)
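The guard in the excerpt above compares string prefixes. A prefix test can accept a sibling directory whose name merely begins with the target's name (for example `/data/nuscenes_evil` checked against `/data/nuscenes`). The following is a minimal sketch of a component-wise check; `is_within_directory` and `prefix_check` are hypothetical helpers written for illustration and are not taken from the template.

```python
import os


def is_within_directory(base: str, target: str) -> bool:
    """Return True if `target` is `base` itself or a path beneath it.

    Hypothetical helper for illustration; not part of the template.
    """
    base_abs = os.path.abspath(base)
    target_abs = os.path.abspath(target)
    # commonpath compares whole path components, so "/data/nuscenes_evil"
    # is not treated as living inside "/data/nuscenes".
    return os.path.commonpath([base_abs, target_abs]) == base_abs


def prefix_check(base: str, target: str) -> bool:
    """The plain string-prefix comparison, shown only for contrast."""
    return os.path.abspath(target).startswith(os.path.abspath(base))
```

Note that `os.path.abspath` normalizes `..` segments but does not resolve symlinks. On Python versions that support tarfile extraction filters, passing `filter="data"` to `tar.extractall` may be another option worth evaluating.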

    with torch.cuda.amp.autocast(enabled=(device.type == "cuda"), dtype=torch.float16):
        logits = model(images)
        loss = F.cross_entropy(logits, labels)

    if train_mode:
        # Gradient accumulation
        scaler.scale(loss / grad_accum).backward()
        step += 1
        if step % grad_accum == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
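The loop above divides the loss by `grad_accum` before each backward pass. A minimal pure-Python sketch of why that scaling matters follows, assuming a toy one-parameter least-squares model and equal-sized micro-batches; every name below is hypothetical and none of it is the template's code. It also counts how many micro-batches are left without an optimizer step under the `step % grad_accum == 0` rule when the batch count is not a multiple of `grad_accum` (whether the notebook flushes that tail is not visible in this excerpt).

```python
def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list[float], ys: list[float], grad_accum: int) -> float:
    """Sum micro-batch gradients, each scaled by 1 / grad_accum.

    Assumes len(xs) is divisible by grad_accum so micro-batches are equal-sized.
    """
    micro = len(xs) // grad_accum
    total = 0.0
    for i in range(grad_accum):
        part = slice(i * micro, (i + 1) * micro)
        total += grad_mse(w, xs[part], ys[part]) / grad_accum
    return total


def optimizer_steps(num_batches: int, grad_accum: int) -> tuple[int, int]:
    """Return (optimizer steps taken, micro-batches accumulated but not applied)."""
    return num_batches // grad_accum, num_batches % grad_accum
```

With the scaling in place, the accumulated gradient matches the gradient of one large batch, so the effective batch size grows without changing the learning-rate semantics.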
Per repo convention, add completion time estimate to first notebook cell. Regenerated README.md to match.
Leftover from the template-generating skill — not read by tmpl-publish or any tooling, not part of the canonical template layout, and no other template carries one. Signed-off-by: Aydin Abiar <aydin@anyscale.com>
#311 failed `ModuleNotFoundError: nuscenes` (reqs never installed); full extract incl. unused sweeps/ overran 1800s. Adds pip-install cell; selective extract (skip sweeps/, guard kept, drop from idempotency check, nsweeps 5→1); parametrize epochs/subset (defaults 3/200, CI 1/16); materialize datasets; sweep (no_grad eval, torch.amp, worker-side alloc-conf, ~4 GB doc). Signed-off-by: Aydin Abiar <aydin@anyscale.com>
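The selective extract described in this commit can be sketched as a member filter that skips everything under `sweeps/`. This is an illustrative sketch only: `select_members`, `build_demo_archive`, and `kept_names` are hypothetical names, and the archive entries other than `sweeps/` are placeholders rather than the dataset's actual layout.

```python
import io
import tarfile


def select_members(tar: tarfile.TarFile, skip_prefixes: tuple[str, ...]):
    """Yield archive members that are not under any skipped directory prefix."""
    for member in tar.getmembers():
        # Some archives store names with a leading "./"; normalize it away.
        name = member.name[2:] if member.name.startswith("./") else member.name
        if any(name == p.rstrip("/") or name.startswith(p) for p in skip_prefixes):
            continue
        yield member


def build_demo_archive() -> bytes:
    """Build a tiny in-memory tar with placeholder entries for the demo."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in ("samples/cam.jpg", "sweeps/cam.jpg", "maps/map.png"):
            data = b"x"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def kept_names(archive: bytes, skip_prefixes: tuple[str, ...] = ("sweeps/",)) -> list[str]:
    """List the member names that would be extracted after filtering."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return [m.name for m in select_members(tar, skip_prefixes)]
```

In real use the generator could be passed as `members=` to `tar.extractall`, with the path-traversal guard still applied to each member, as the commit notes ("guard kept").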
/test-template bev-model-training-robotics
nuscenes-devkit transitively pulls GUI opencv-python (needs libGL, absent on the headless ray-llm image); pin opencv-python-headless and force-reinstall it so the headless cv2 owns the namespace, fixing ImportError: libGL.so.1 at NuScenes init (build #333). Signed-off-by: Aydin Abiar <aydin@anyscale.com>
/test-template bev-model-training-robotics
nuscenes-devkit==1.1.11 pins matplotlib<3.6, which has no py311 wheel, so pip source-builds it (flaky freetype/sourceforge fetch) and the whole `pip install` aborts before nuscenes installs. Install nuscenes with --no-deps and bring its runtime deps as py311 wheels (pyquaternion, pycocotools, shapely<2.0, descartes, fire). Supersedes the prior commit's opencv handling: the ray-llm image already ships opencv-python-headless 4.13 (vllm requires >=4.13), so the earlier ==4.10.0.84 pin + --force-reinstall downgraded the image and broke vllm. Pin >=4.13 instead; --no-deps also keeps GUI opencv-python out, so cv2 stays headless without the force-reinstall. Signed-off-by: Aydin Abiar <aydin@anyscale.com>
/test-template bev-model-training-robotics
Listing opencv-python-headless>=4.13.0 in requirements.txt made pip honor its declared numpy>=2 during `pip install -r requirements.txt`, upgrading the base image's numpy 1.26 -> 2.x and corrupting the numpy-1.x stack (vllm/scipy/torch) — build #341. The base image already ships opencv-python-headless 4.13, and installing nuscenes-devkit --no-deps keeps GUI opencv-python out, so cv2 stays importable (no libGL) without listing opencv at all. Drop the pin and lock numpy<2 so nothing upgrades the base's numpy. Signed-off-by: Aydin Abiar <aydin@anyscale.com>
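A silent numpy upgrade like the one described here can be caught early with a fail-fast version check. The sketch below is an assumption about how such a check might look, not something the PR adds; `major_version`, `installed_major`, and `check_numpy_below_2` are hypothetical helpers built on the standard-library `importlib.metadata`.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '1.26.4'."""
    return int(version.split(".", 1)[0])


def installed_major(package: str):
    """Major version of an installed distribution, or None if it is absent."""
    try:
        return major_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


def check_numpy_below_2() -> None:
    """Raise if an install step pulled numpy 2.x onto a numpy-1.x image."""
    major = installed_major("numpy")
    if major is not None and major >= 2:
        raise RuntimeError(f"numpy major version {major} found; expected < 2")
```

Calling `check_numpy_below_2()` right after any install step would surface the mismatch before downstream imports fail in less obvious ways.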
/test-template bev-model-training-robotics
nuscenes-devkit 1.1.11 targets matplotlib<3.6 and calls FigureCanvas.set_window_title in its render helpers; matplotlib removed it in 3.6 (the image has 3.7.4), so the lidar->camera projection render (render_pointcloud_in_image) raised AttributeError at build #342. Add a no-op set_window_title shim as a code cell before the first render so all visualization cells run headless. The camera and lidar+underlay_map renders already passed on 3.7.4, so this is the isolated remaining break. Signed-off-by: Aydin Abiar <aydin@anyscale.com>
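A no-op shim of the kind this commit describes might look like the sketch below. The exact code in the notebook cell is not shown on this page, so this is an assumption: `install_noop_method` is a hypothetical helper, and patching `FigureCanvasBase` (the base class of matplotlib canvases) is one plausible place to attach the missing method.

```python
def install_noop_method(cls: type, name: str) -> bool:
    """Attach a do-nothing method `name` to `cls` if it is missing.

    Returns True when the shim was installed, False when the attribute
    already existed and was left untouched.
    """
    if hasattr(cls, name):
        return False

    def _noop(self, *args, **kwargs):
        return None

    setattr(cls, name, _noop)
    return True


try:
    # Canvas classes inherit from FigureCanvasBase, so patching it covers them.
    from matplotlib.backend_bases import FigureCanvasBase
except ImportError:
    FigureCanvasBase = None  # matplotlib is not installed; nothing to patch

if FigureCanvasBase is not None:
    install_noop_method(FigureCanvasBase, "set_window_title")
```

Because the helper only adds the method when it is absent, running the cell on an older matplotlib that still provides `set_window_title` leaves the original behavior in place.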
/test-template bev-model-training-robotics
Ray Data preprocessing failed on the WORKER nodes with libGL.so.1 (#343): the notebook's runtime pip cell only ran on the head, and --no-deps isn't honored cluster-wide, so workers lacked nuscenes / libgl1. BYOD bakes the verified deps + libgl1 into every node instead. Switch BUILD.yaml to cluster_env.byod (us-docker.pkg.dev/anyscale-workspace-templates/workspace-templates/bev-model-training-robotics:2.55.1, digest sha256:ccb792b940460a20b72e37a4cc9097f1d8dacdc9689fb6c1d1f21cebd447eb36); drop the notebook pip cell + requirements.txt; keep the set_window_title shim. Signed-off-by: Aydin Abiar <aydin@anyscale.com>
/test-template bev-model-training-robotics