feat(templates): add bev-model-training-robotics template + test #719

Open
maxpumperla wants to merge 9 commits into main from mp_bev_template

@maxpumperla

Contributor

Add a BEV (Bird's-Eye View) model training template for robotics applications. The template demonstrates an end-to-end distributed training pipeline that uses Ray Data for multi-camera preprocessing and Ray Train for DDP training.

Template includes:

  • README.ipynb with complete walkthrough of BEV training pipeline
  • Ray Data pipelines for NuScenes dataset preprocessing
  • Ray Train configuration with 2-worker DDP training on L4 GPUs
  • Three architecture diagrams (training pipeline, camera transform, distributed arch)
  • GPU compute configs for AWS (g6.4xlarge) and GCE (g2-standard-4)
  • Test script for validation

Family: Robotics
Target libraries: Ray Data, Ray Train
Workload types: Distributed training, Vision training


Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
maxpumperla requested a review from Aydin-ab on May 28, 2026 14:51
maxpumperla requested a review from a team as a code owner on May 28, 2026 14:51
@maxpumperla

Contributor Author

/test-template bev-model-training-robotics

Copilot AI left a comment

Contributor


Pull request overview

Adds a new “BEV Model Training for Robotics” template to the templates catalog, intended to demonstrate an end-to-end distributed BEV training pipeline using Ray Data (CPU preprocessing) and Ray Train (DDP training), along with configs and a CI validation test.

Changes:

  • Introduces a new bev-model-training-robotics template (README.md/README.ipynb, requirements, metadata, diagrams).
  • Adds AWS/GCP compute configs for a 2-worker GPU setup.
  • Registers the template in BUILD.yaml and adds a papermill-based test runner.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/bev-model-training-robotics/tests.sh | Adds papermill-based execution of the template notebook for validation. |
| templates/bev-model-training-robotics/requirements.txt | Declares template-specific Python dependencies. |
| templates/bev-model-training-robotics/README.md | Markdown walkthrough for dataset staging, Ray Data preprocessing, and Ray Train DDP training. |
| templates/bev-model-training-robotics/README.ipynb | Notebook version of the walkthrough intended to be executed in the workspace. |
| templates/bev-model-training-robotics/metadata.json | Adds template metadata (intent, structure, dependencies, diagrams). |
| templates/bev-model-training-robotics/diagrams/*.xml | Adds architecture/flow diagrams for the template. |
| configs/bev-model-training-robotics/aws.yaml | Adds AWS cluster sizing for the template. |
| configs/bev-model-training-robotics/gce.yaml | Adds GCP cluster sizing for the template. |
| BUILD.yaml | Registers the new template with image, compute configs, and test command. |


  },
  "structure": {
    "archetype": "single-notebook",
    "primary_notebook": "bev_model_training_robotics.ipynb"
Comment on lines +11 to +12
"name": "python",
"version": "3.12.0"
@@ -0,0 +1,4 @@
#!/usr/bin/env bash
set -euo pipefail
pip install papermill
Comment on lines +154 to +158
# Block entries that would escape target directory
if not str(member_path).startswith(str(path)):
    raise RuntimeError(f"Blocked path traversal in tar member: {member.name}")

tar.extractall(path)
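
A note on the guard above: comparing stringified paths with `startswith` also accepts sibling directories that merely share a prefix (for example, `/data/nuscenes_evil` passes a check against `/data/nuscenes`). The following is a minimal sketch of a component-wise check, assuming Python 3.9+ for `Path.is_relative_to`. The helper name `is_within` is hypothetical and does not appear in the PR.

```python
from pathlib import Path


def is_within(base: Path, candidate: Path) -> bool:
    """Return True if `candidate`, resolved against `base`, stays inside `base`.

    `is_within` is an illustrative helper, not part of the template.
    """
    base = base.resolve()
    # Joining first means absolute member names replace `base` entirely,
    # and resolve() collapses any ".." segments before the comparison.
    target = (base / candidate).resolve()
    # is_relative_to compares whole path components, not string prefixes.
    return target == base or target.is_relative_to(base)
```

In an extraction loop this would be called as `is_within(Path(path), Path(member.name))` for each member before extracting. On interpreters that support tarfile extraction filters (Python 3.12, and earlier versions with the security backport), `tar.extractall(path, filter="data")` may be a simpler alternative worth considering.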
Comment on lines +845 to +847
with torch.cuda.amp.autocast(enabled=(device.type == "cuda"), dtype=torch.float16):
    logits = model(images)
    loss = F.cross_entropy(logits, labels)
Comment on lines +849 to +857
if train_mode:
    # Gradient accumulation
    scaler.scale(loss / grad_accum).backward()
    step += 1
    if step % grad_accum == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
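
One thing worth checking in an accumulation loop shaped like the one above: when the number of batches is not a multiple of `grad_accum`, the gradients from the trailing batches are applied only if the surrounding code flushes them after the loop. The sketch below is pure Python and models only the stepping schedule, not the training itself. The function name and signature are hypothetical and are not taken from the notebook.

```python
def optimizer_step_indices(num_batches: int, grad_accum: int) -> list[int]:
    """Return the 0-based batch indices after which the optimizer should step.

    Illustrative only: models when to call scaler.step()/optimizer.step().
    """
    if grad_accum < 1:
        raise ValueError("grad_accum must be >= 1")
    steps = []
    for i in range(num_batches):
        is_boundary = (i + 1) % grad_accum == 0
        is_last = i == num_batches - 1
        # Step on every accumulation boundary, and once more on the final
        # batch so a partial accumulation window is not silently dropped.
        if is_boundary or is_last:
            steps.append(i)
    return steps
```

For 10 batches with `grad_accum=4`, this schedule steps after batches 3, 7, and 9. A check on `step % grad_accum == 0` alone would step only after 3 and 7.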

maxpumperla and others added 3 commits May 28, 2026 17:01
Per repo convention, add completion time estimate to first notebook cell.
Regenerated README.md to match.
Leftover from the template-generating skill — not read by tmpl-publish or any tooling, not part of the canonical template layout, and no other template carries one.

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
#311 failed `ModuleNotFoundError: nuscenes` (reqs never installed); full extract incl. unused sweeps/ overran 1800s. Adds pip-install cell; selective extract (skip sweeps/, guard kept, drop from idempotency check, nsweeps 5→1); parametrize epochs/subset (defaults 3/200, CI 1/16); materialize datasets; sweep (no_grad eval, torch.amp, worker-side alloc-conf, ~4 GB doc).

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
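
The commit above parametrizes epochs and subset size with defaults of 3 and 200 and CI values of 1 and 16, but the conversation does not show the mechanism. The sketch below shows one possible way to express that split, using environment variables. The variable names `BEV_NUM_EPOCHS` and `BEV_SUBSET_SIZE`, the `RunConfig` class, and `load_run_config` are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the two knobs named in the commit message."""
    num_epochs: int
    subset_size: int


def load_run_config(env: Optional[Mapping[str, str]] = None) -> RunConfig:
    """Read run settings, falling back to the interactive defaults (3 / 200)."""
    env = os.environ if env is None else env
    return RunConfig(
        # Variable names are assumptions; CI would override them with 1 and 16.
        num_epochs=int(env.get("BEV_NUM_EPOCHS", "3")),
        subset_size=int(env.get("BEV_SUBSET_SIZE", "200")),
    )
```

Since the test script runs the notebook through papermill, a tagged parameters cell overridden at execution time would be another natural fit. The environment-variable form is used here only to keep the sketch free of dependencies.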
@Aydin-ab

Contributor

/test-template bev-model-training-robotics

nuscenes-devkit transitively pulls GUI opencv-python (needs libGL, absent on the headless ray-llm image); pin opencv-python-headless and force-reinstall it so the headless cv2 owns the namespace, fixing ImportError: libGL.so.1 at NuScenes init (build #333).

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
@Aydin-ab

Contributor

/test-template bev-model-training-robotics

nuscenes-devkit==1.1.11 pins matplotlib<3.6, which has no py311 wheel, so pip source-builds it (flaky freetype/sourceforge fetch) and the whole `pip install` aborts before nuscenes installs. Install nuscenes with --no-deps and bring its runtime deps as py311 wheels (pyquaternion, pycocotools, shapely<2.0, descartes, fire).

Supersedes the prior commit's opencv handling: the ray-llm image already ships opencv-python-headless 4.13 (vllm requires >=4.13), so the earlier ==4.10.0.84 pin + --force-reinstall downgraded the image and broke vllm. Pin >=4.13 instead; --no-deps also keeps GUI opencv-python out, so cv2 stays headless without the force-reinstall.

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
@Aydin-ab

Contributor

/test-template bev-model-training-robotics

Listing opencv-python-headless>=4.13.0 in requirements.txt made pip honor its declared numpy>=2 during `pip install -r requirements.txt`, upgrading the base image's numpy 1.26 -> 2.x and corrupting the numpy-1.x stack (vllm/scipy/torch) — build #341.

The base image already ships opencv-python-headless 4.13, and installing nuscenes-devkit --no-deps keeps GUI opencv-python out, so cv2 stays importable (no libGL) without listing opencv at all. Drop the pin and lock numpy<2 so nothing upgrades the base's numpy.

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
@Aydin-ab

Contributor

/test-template bev-model-training-robotics

nuscenes-devkit 1.1.11 targets matplotlib<3.6 and calls FigureCanvas.set_window_title in its render helpers; matplotlib removed it in 3.6 (the image has 3.7.4), so the lidar->camera projection render (render_pointcloud_in_image) raised AttributeError at build #342. Add a no-op set_window_title shim as a code cell before the first render so all visualization cells run headless. The camera and lidar+underlay_map renders already passed on 3.7.4, so this is the isolated remaining break.

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
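
The notebook's actual shim cell is not shown in this conversation. The sketch below is one way such a no-op `set_window_title` shim could look. It assumes the method is patched onto `matplotlib.backend_bases.FigureCanvasBase` so that every canvas subclass inherits it. The helper `add_noop_method` is hypothetical.

```python
def add_noop_method(cls, name: str) -> bool:
    """Attach a do-nothing method `name` to `cls` if it is missing.

    Returns True if the shim was installed, False if the attribute existed.
    """
    if hasattr(cls, name):
        return False

    def _noop(self, *args, **kwargs):
        # Accept and ignore any arguments; a window title has no effect headless.
        return None

    _noop.__name__ = name
    setattr(cls, name, _noop)
    return True


try:
    # Assumption: patching the base canvas class is sufficient, because
    # backend canvases such as the Agg canvas subclass FigureCanvasBase.
    from matplotlib.backend_bases import FigureCanvasBase
except ImportError:
    FigureCanvasBase = None  # matplotlib not installed; nothing to patch

if FigureCanvasBase is not None:
    add_noop_method(FigureCanvasBase, "set_window_title")
```

Because the helper checks `hasattr` first, the cell is safe to re-run and leaves older matplotlib versions, which still define the method, untouched.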
@Aydin-ab

Contributor

/test-template bev-model-training-robotics

Ray Data preprocessing failed on the WORKER nodes with libGL.so.1 (#343): the notebook's runtime pip cell only ran on the head, and --no-deps isn't honored cluster-wide, so workers lacked nuscenes / libgl1. BYOD bakes the verified deps + libgl1 into every node instead.

Switch BUILD.yaml to cluster_env.byod (us-docker.pkg.dev/anyscale-workspace-templates/workspace-templates/bev-model-training-robotics:2.55.1, digest sha256:ccb792b940460a20b72e37a4cc9097f1d8dacdc9689fb6c1d1f21cebd447eb36); drop the notebook pip cell + requirements.txt; keep the set_window_title shim.

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
@Aydin-ab

Contributor

/test-template bev-model-training-robotics

3 participants