Compact Dynamic State Representation for Fine-Grained Robot Manipulation
STAIR extends static compact state representations (StaMo) with a dynamic compact feature that captures short-range local dynamics in a [-delta, +delta] neighborhood — improving fine-grained manipulation tasks such as peg insertion, furniture assembly, and flower arrangement.
Fine manipulation requires more than knowing "what is there" — it requires knowing how the local state changes near the target region. STAIR proposes:
- Static compact state (StaMo): captures what is there.
- Dynamic compact feature (STAIR): captures how the local state evolves around the contact region.
The core contribution is a 2-stage VLA training recipe that uses UniVAM as a frozen dynamic feature encoder, replacing WoG's FutureEnc+QFormer pipeline. VLM trainability follows the same freeze configuration mechanism as WoG.
observation + instruction
↓
VLA backbone (PrismaticVLM, trainability from config)
↓ cognition feature [B, 1, 4096]
↓
UniVAM encoder (frozen) + learnable projection
↓ per_token [B, 2, 4096]
↓
DiT flow-matching action head
↓
robot action [Ta, 7]
Stage 1 — Train: VLM modules selected by config + projection head + DiT action head. Freeze: UniVAM encoder.
Stage 2 — Train: VLM modules selected by config + oformer (QFormer) + DiT action head. Freeze: UniVAM encoder + Stage 1 projection target.
At inference the oformer replaces UniVAM — no video clip is needed at deploy time.
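The two-stage schedule above can be summarized as a small lookup table. This is a minimal sketch for orientation only: the group names (`vlm_selected_by_config`, `projection_head`, and so on) are illustrative labels chosen here, not attribute names taken from the STAIR code, and `StagePlan`, `is_trainable`, and `INFERENCE_GROUPS` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagePlan:
    """Which module groups receive gradients in a given training stage."""
    trainable: frozenset
    frozen: frozenset


# Illustrative group labels (assumptions), mirroring the Stage 1 / Stage 2 text.
STAGE_PLANS = {
    1: StagePlan(
        trainable=frozenset({"vlm_selected_by_config", "projection_head", "dit_action_head"}),
        frozen=frozenset({"univam_encoder"}),
    ),
    2: StagePlan(
        trainable=frozenset({"vlm_selected_by_config", "oformer", "dit_action_head"}),
        frozen=frozenset({"univam_encoder", "stage1_projection"}),
    ),
}

# At inference the oformer stands in for UniVAM, so no video encoder is listed.
INFERENCE_GROUPS = frozenset({"vlm", "oformer", "dit_action_head"})


def is_trainable(stage: int, group: str) -> bool:
    """Return True if `group` is trained in `stage`, False if it is frozen."""
    plan = STAGE_PLANS[stage]
    if group in plan.trainable:
        return True
    if group in plan.frozen:
        return False
    raise KeyError(f"unknown module group for stage {stage}: {group}")
```

In a real training script, a table like this would typically drive `requires_grad` on the corresponding parameter groups; how STAIR wires this up is determined by its config and is not shown here.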
STAIR/
├── stair/ # Main STAIR package (WoG-derived, adapted for STAIR)
│ ├── action_model/ # DiT flow-matching action decoder
│ ├── conf/ # draccus dataclass configs (Stage1 + Stage2 registered)
│ ├── deploy/ # STAIRSimPolicy + STAIRRealPolicy (SimplerEnv interface)
│ ├── eval/ # SimplerEnvAdapter (gripper postprocessing)
│ ├── prismatic/ # Vendored OpenVLA/Prismatic backbone + data pipeline
│ ├── scripts/ # train.py + eval_simpler.py entry points
│ ├── training/ # FSDP training strategies
│ ├── utils/ # VideoFrameBuffer, image preprocessing
│ └── vla/ # STAIR_VLA model, QFormer, DynamicFeatureEncoder
├── third_party/
│ ├── UniVAM/ # Vendored dynamic feature encoder
│ └── SimplerEnv/ # Official evaluation benchmark submodule
└── pyproject.toml
The repository is intended to be installed inside a conda environment, with uv used for faster package resolution and installation.
UniVAM is vendored under third_party/UniVAM. SimplerEnv is tracked as an official submodule on its main branch:
git clone --recurse-submodules <STAIR-repo-url>
cd STAIR
If the repository was cloned without submodules, initialize SimplerEnv explicitly:
git submodule update --init --recursive third_party/SimplerEnv
third_party/WoG is only an architecture reference and remains outside the tracked repository.
| Layer | Purpose | Install target |
|---|---|---|
| Base | Import STAIR modules and lightweight model code | -e . |
| Dev | Local CPU syntax and lint checks | -e ".[dev]" |
| Train | H200 OpenVLA/OXE training stack | -e ".[train]" |
| UniVAM | Frozen dynamic encoder, prioritized for training compatibility | -e third_party/UniVAM |
| Eval | STAIR policy wrapper + SimplerEnv real-to-sim evaluator | -e ".[eval]" plus SimplerEnv installs |
UniVAM currently pins torch==2.8.0, so STAIR targets the same torch 2.8 environment instead of WoG's older torch==2.2.0 pin.
Use this setup for lightweight local checks and code editing. No CUDA, pretrained weights, OXE data, or SimplerEnv install is needed.
conda create -n stair python=3.10 -y
conda activate stair
python -m pip install -U uv
uv pip install -e ".[dev]"
Use this for full STAIR training with OpenVLA/Prismatic, RLDS/OXE, and UniVAM. Install CUDA PyTorch first so UniVAM and STAIR both resolve against the same torch build.
conda create -n stair python=3.10 -y
conda activate stair
python -m pip install -U uv
# Pick the CUDA wheel index that matches the server driver. cu128 is the default target for torch==2.8.0.
uv pip install torch==2.8.0 torchvision --index-url https://download.pytorch.org/whl/cu128
# STAIR package and OpenVLA/OXE training dependencies.
uv pip install -e ".[train]"
# UniVAM dynamic encoder from the vendored third-party directory. Use --no-deps because
# STAIR pins the shared runtime stack in pyproject.toml.
uv pip install --no-deps -e third_party/UniVAM
# FlashAttention is required by the Prismatic/OpenVLA backbone. Install after torch is present.
MAX_JOBS=4 uv pip install flash-attn --no-build-isolation
TorchCodec is required by UniVAM video decoding in full workflows. Install the build matching the H200 torch/CUDA stack by following the TorchCodec release instructions for PyTorch 2.8.
The one-command setup script follows the same dependency isolation policy:
./scripts/setup_stair_env.sh
make setup
# Downloads config/tokenizer files from NousResearch/Llama-2-7b-hf
# → stair/prismatic/models/backbones/llm/llama2_config/
# OpenVLA-7b (VLA backbone)
mkdir -p pretrained/openvla-7b-prismatic
cd pretrained/openvla-7b-prismatic
git clone git@hf.co:openvla/openvla-7b-prismatic . && git lfs pull
cd -
# UniVAM weights
python third_party/UniVAM/download_models.py
# UniVAM training checkpoints are directories containing Wan22VM.pth and Projector.pth.
Current STAIR training uses the same RLDS/OXE data path as WoG/OpenVLA. The selected mixture is bridge_rt_1, which expands to:
| Mixture item | Dataset directory under data_root_dir | Transform |
|---|---|---|
| BridgeV2 | bridge_orig | bridge_orig_dataset_transform |
| Fractal / RT-1 | fractal20220817_data | rt1_dataset_transform |
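Before launching a long run, it can help to confirm that both dataset directories from the table exist under the data root. The following is a minimal sketch; `missing_dataset_dirs` and `BRIDGE_RT_1_DIRS` are hypothetical names introduced here and are not part of the STAIR package.

```python
from pathlib import Path

# Directory names taken from the bridge_rt_1 mixture table.
BRIDGE_RT_1_DIRS = ("bridge_orig", "fractal20220817_data")


def missing_dataset_dirs(data_root_dir, required=BRIDGE_RT_1_DIRS):
    """Return the required dataset directories that are absent under data_root_dir."""
    root = Path(data_root_dir)
    return [name for name in required if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs("data/oxe")
    if missing:
        print("missing under data/oxe:", ", ".join(missing))
    else:
        print("all bridge_rt_1 dataset directories found")
```

This only checks that the directories exist; it does not validate the RLDS contents inside them.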
On the remote server, point --data_root_dir at the full OXE root:
ls data/oxe/bridge_orig data/oxe/fractal20220817_data
The current data flow is:
data/oxe
-> OXE mixture "bridge_rt_1"
-> RLDS standardization transform
-> current observation pixel_values for VLM at 224x224
-> keyframe-centered dynamic_video for dynamic encoder at [5, 3, 256, 256]
-> action chunk [16, 7]
The STAIR stage configs default to bridge_rt_1. Training commands still pass it explicitly so the active dataset is visible in logs.
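The keyframe-centered clip in the data flow can be illustrated with a small index calculation. This is a sketch under stated assumptions: that the clip takes `delta` frames on each side of the keyframe (so `2 * delta + 1` frames, which matches 5 frames for `delta = 2`), that `stride` spaces neighboring frames, and that out-of-range indices are clamped to the episode boundaries. The actual sampling logic in the STAIR data pipeline may differ, and `keyframe_clip_indices` is a hypothetical helper.

```python
def keyframe_clip_indices(keyframe, delta=2, stride=4, episode_len=None):
    """Frame indices for a clip centered on `keyframe`.

    Offsets run over [-delta, +delta], each multiplied by `stride`.
    If `episode_len` is given, indices are clamped to [0, episode_len - 1]
    (an assumed edge-handling policy that repeats the boundary frame).
    """
    indices = [keyframe + k * stride for k in range(-delta, delta + 1)]
    if episode_len is not None:
        indices = [min(max(i, 0), episode_len - 1) for i in indices]
    return indices


# A keyframe in the middle of an episode yields five evenly spaced frames.
print(keyframe_clip_indices(20, delta=2, stride=4, episode_len=100))
# Near the start of an episode, clamping repeats the first frame.
print(keyframe_clip_indices(3, delta=2, stride=4, episode_len=100))
```

With the defaults used in the training commands (`--dynamic_clip_delta 2`, `--dynamic_clip_stride 4`), the clip spans 16 frames of the episode on either side of the keyframe under these assumptions.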
If the UniVAM checkpoint is not ready yet, use the explicit mock encoder path to test the VLA training pipeline. This keeps image/video dimensions at the target STAIR shape but does not produce meaningful dynamic features.
torchrun --standalone --nnodes 1 --nproc-per-node 8 stair/scripts/train.py \
--vla.type stair-dinosiglip-224px+oxe+stage1 \
--stage 1 \
--dynamic_encoder_type mock \
--vla.data_mix bridge_rt_1 \
--video_num_frames 5 \
--dynamic_clip_delta 2 \
--dynamic_clip_stride 4 \
--sample_strategy dense \
--video_image_size "[224,224]" \
--data_root_dir data/oxe \
--run_root_dir runs/ \
--run_id stair_stage1_mock_pipeline \
--wandb_project stair
The scripts/ directory contains launchers for the two-stage training flow. They default to the real UniVAM teacher path, bridge_rt_1, data/oxe, 5 keyframe-centered dynamic video frames, stride 4, the latest third_party/UniVAM/configs/bridgev2.yaml, and 256x256 dynamic video inputs. A real UniVAM run requires UNIVAM_CHECKPOINT to point to the checkpoint directory containing Wan22VM.pth and Projector.pth.
UNIVAM_CHECKPOINT=pretrained/univam/bridgev2 ./scripts/train_stage1.sh
UNIVAM_CHECKPOINT=pretrained/univam/bridgev2 ./scripts/train_stage2.sh
Stage 1 defaults to the local OpenVLA checkpoint downloaded above:
pretrained/openvla-7b-prismatic/checkpoints/step-295000-epoch-40-loss=0.2200.pt
Common overrides:
NPROC_PER_NODE=8 DATA_ROOT_DIR=data/oxe RUN_ID=stair_stage1_v1 ./scripts/train_stage1.sh
NPROC_PER_NODE=1 DATA_ROOT_DIR=data/oxe RUN_ID=stair_stage1_debug ./scripts/train_stage1.sh
BASE_VLM=pretrained/openvla-7b-prismatic PRETRAINED_CHECKPOINT=pretrained/openvla-7b-prismatic/checkpoints/step-295000-epoch-40-loss=0.2200.pt ./scripts/train_stage1.sh
STAGE1_RUN_ID=stair_stage1_v1 RUN_ID=stair_stage2_v1 ./scripts/train_stage2.sh
STAGE1_CHECKPOINT=runs/stair_stage1_v1/checkpoints/step-XXXXXX-epoch-XX-loss=X.XXXX.pt ./scripts/train_stage2.sh
UNIVAM_CONFIG_PATH=third_party/UniVAM/configs/fractal.yaml VIDEO_IMAGE_SIZE="[256,256]" ./scripts/train_stage1.sh
For pipeline smoke tests without a trained UniVAM encoder, opt into the mock encoder explicitly:
DYNAMIC_ENCODER_TYPE=mock VIDEO_IMAGE_SIZE="[224,224]" ./scripts/train_stage1.sh
STAGE2_TRAIN_MODE=direct DYNAMIC_ENCODER_TYPE=mock PRETRAINED_CHECKPOINT=pretrained/openvla-7b-prismatic/checkpoints/step-295000-epoch-40-loss=0.2200.pt ./scripts/train_stage2.sh
Install SimplerEnv after the training environment above. It requires Vulkan for rendering.
Use the STAIR training environment as the base eval environment. This keeps Torch, Transformers, Prismatic/OpenVLA, and UniVAM aligned with the checkpoint that will be loaded. Do not install SimplerEnv's full RT-1/Octo dependency set for STAIR-only evaluation; that path pulls TensorFlow/JAX packages that are unrelated to STAIR and can make dependency resolution fragile.
# Pull official SimplerEnv and its ManiSkill2_real2sim nested submodule.
git submodule update --init --recursive third_party/SimplerEnv
# STAIR eval helper dependencies.
uv pip install -e ".[eval]"
# ManiSkill2 real-to-sim environments. Keep numpy<2.0; 1.26.4 matches SimplerEnv's IK stack.
uv pip install numpy==1.26.4
uv pip install -e third_party/SimplerEnv/ManiSkill2_real2sim/
# SimplerEnv evaluator.
uv pip install -e third_party/SimplerEnv/
# Optional only if evaluating RT-1 or Octo baselines through the same checkout.
# uv pip install tensorflow==2.15.0
# uv pip install -r third_party/SimplerEnv/requirements_full_install.txt
# Validate headless Vulkan before running the evaluator.
DISPLAY="" CUDA_VISIBLE_DEVICES=0 VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json vulkaninfo --summary
On headless servers, the GLFW/X11 warning is expected because SimplerEnv disables DISPLAY. The fatal failures are SAPIEN/SVulkan messages such as "Vulkan is incompatible with your driver", "Some required Vulkan extension is not present", or a following "Segmentation fault".
libvulkan1 only installs the Vulkan loader. SimplerEnv rendering also needs the NVIDIA Vulkan/GLX/EGL user-space driver libraries that match the kernel driver. vulkaninfo --summary must report an NVIDIA device; if it reports Mesa llvmpipe, Vulkan is using CPU software rendering and SimplerEnv will fail.
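The NVIDIA-versus-llvmpipe distinction can be checked programmatically by scanning the `deviceName` lines that `vulkaninfo --summary` prints. This is a minimal sketch: `classify_vulkan_devices` is a hypothetical helper, not part of STAIR or SimplerEnv, and it assumes the summary lists devices as `deviceName = <name>` lines.

```python
import re


def classify_vulkan_devices(summary_text):
    """Classify devices listed in `vulkaninfo --summary` output.

    Returns the device names, whether any is an NVIDIA device, and
    whether every listed device is the llvmpipe software rasterizer.
    """
    names = [n.strip() for n in re.findall(r"deviceName\s*=\s*(.+)", summary_text)]
    has_nvidia = any("nvidia" in n.lower() for n in names)
    only_software = bool(names) and all("llvmpipe" in n.lower() for n in names)
    return {"devices": names, "has_nvidia": has_nvidia, "only_software": only_software}


if __name__ == "__main__":
    import subprocess

    # Requires vulkan-tools; run on the compute node that will host the evaluator.
    result = subprocess.run(
        ["vulkaninfo", "--summary"], capture_output=True, text=True, check=False
    )
    report = classify_vulkan_devices(result.stdout)
    print(report)
    if not report["has_nvidia"]:
        raise SystemExit("no NVIDIA Vulkan device reported")
```

A result with `has_nvidia` false and `only_software` true corresponds to the CPU software rendering case described above.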
Check the compute node before launching STAIR:
nvidia-smi
find /usr/share /etc -path '*vulkan*icd*' -name '*nvidia*.json' -print
find /usr/share /etc -path '*glvnd*egl_vendor*' -name '*nvidia*.json' -print
ldconfig -p | grep libGLX_nvidia
find /usr /etc /lib /lib64 -name 'libGLX_nvidia.so*' -print
DISPLAY="" CUDA_VISIBLE_DEVICES=0 VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json vulkaninfo --summary
If dpkg -l does not list NVIDIA packages but the files above exist on the GPU instance, that is acceptable: the GPU runtime may inject driver files from the host instead of installing them through apt in the image. If libGLX_nvidia.so is missing on the GPU instance, install or expose the NVIDIA GL/Vulkan user-space package that matches the driver shown by nvidia-smi (libnvidia-gl-<driver-version> / full nvidia-driver-<driver-version> on Ubuntu-style systems). On clusters or containers, the host may expose CUDA compute libraries but not graphics/Vulkan libraries; the job runtime must include NVIDIA graphics capability and the host Vulkan/GLVND files.
For CPU-image-to-GPU-instance workflows, install only the generic runtime dependencies in the networked CPU image, then let the GPU instance provide the matching NVIDIA driver files:
apt-get update
apt-get install -y \
vulkan-tools libvulkan1 \
libxext6 libx11-6 libx11-xcb1 libxcb1 \
libxrandr2 libxrender1 libxi6 libxfixes3 \
libglvnd0 libglx0 libegl1
The helper script runs this check by default; set CHECK_VULKAN=0 only after vulkaninfo --summary reports NVIDIA on the same compute node.
torchrun --standalone --nnodes 1 --nproc-per-node 8 stair/scripts/train.py \
--vla.type stair-dinosiglip-224px+oxe+stage1 \
--vla.base_vlm pretrained/openvla-7b-prismatic \
--stage 1 \
--pretrained_checkpoint pretrained/openvla-7b-prismatic/checkpoints/step-295000-epoch-40-loss=0.2200.pt \
--univam_config_path third_party/UniVAM/configs/bridgev2.yaml \
--univam_checkpoint pretrained/univam/bridgev2 \
--vla.data_mix bridge_rt_1 \
--video_num_frames 5 \
--dynamic_clip_delta 2 \
--dynamic_clip_stride 4 \
--sample_strategy dense \
--video_image_size "[256,256]" \
--data_root_dir data/oxe \
--run_root_dir runs/ \
--run_id stair_stage1_v1 \
--wandb_project stair

torchrun --standalone --nnodes 1 --nproc-per-node 8 stair/scripts/train.py \
--vla.type stair-dinosiglip-224px+oxe+stage2 \
--stage 2 \
--pretrained_checkpoint runs/stair_stage1_v1/checkpoints/step-XXXXXX.pt \
--univam_config_path third_party/UniVAM/configs/bridgev2.yaml \
--univam_checkpoint pretrained/univam/bridgev2 \
--vla.data_mix bridge_rt_1 \
--video_num_frames 5 \
--dynamic_clip_delta 2 \
--dynamic_clip_stride 4 \
--sample_strategy dense \
--video_image_size "[256,256]" \
--data_root_dir data/oxe \
--run_root_dir runs/ \
--run_id stair_stage2_v1 \
--wandb_project stair
The public checkout keeps third_party/SimplerEnv on the official SimplerEnv main branch. Official SimplerEnv does not register a --policy-model stair branch, so the supported STAIR entry point builds the STAIR policy inside this repository and uses SimplerEnv as the environment dependency:
stair/scripts/eval_simpler.py
-> stair.deploy.stair_policy_loader.build_stair_simpler_policy(...)
-> STAIRSimPolicy
-> simpler_env.make(...)
-> rollout success summary
The WoG-style full SimplerEnv evaluator integration requires a SimplerEnv entry point that registers STAIR as a policy model. That integration is not part of the official SimplerEnv submodule.
Target tasks — fine-grained, contact-sensitive:
| Task | Octo-small baseline |
|---|---|
| widowx_put_eggplant_in_basket | 56.9% |
| widowx_carrot_on_plate | 9.7% |
| widowx_stack_cube | 4.2% |
| widowx_spoon_on_towel | 47.2% |
| google_robot_place_apple_in_closed_top_drawer | 21.3% |
Use the same task names as SimplerEnv:
| Group | Task names |
|---|---|
| Google Robot coke can | google_robot_pick_coke_can, google_robot_pick_horizontal_coke_can, google_robot_pick_vertical_coke_can, google_robot_pick_standing_coke_can |
| Google Robot object | google_robot_pick_object |
| Google Robot move near | google_robot_move_near, google_robot_move_near_v0, google_robot_move_near_v1 |
| Google Robot drawer | google_robot_open_drawer, google_robot_open_top_drawer, google_robot_open_middle_drawer, google_robot_open_bottom_drawer, google_robot_close_drawer, google_robot_close_top_drawer, google_robot_close_middle_drawer, google_robot_close_bottom_drawer |
| Google Robot place in drawer | google_robot_place_in_closed_drawer, google_robot_place_in_closed_top_drawer, google_robot_place_in_closed_middle_drawer, google_robot_place_in_closed_bottom_drawer, google_robot_place_apple_in_closed_top_drawer |
| WidowX Bridge | widowx_spoon_on_towel, widowx_carrot_on_plate, widowx_stack_cube, widowx_put_eggplant_in_basket |
The paper metric table in simpler_env/utils/metrics.py uses these nine main tasks:
google_robot_pick_coke_can
google_robot_move_near
google_robot_open_drawer
google_robot_close_drawer
google_robot_place_apple_in_closed_top_drawer
widowx_spoon_on_towel
widowx_carrot_on_plate
widowx_stack_cube
widowx_put_eggplant_in_basket
Use stair/scripts/eval_simpler.py for one SimplerEnv task.
python stair/scripts/eval_simpler.py \
--checkpoint runs/stair_stage2_v1/checkpoints/step-XXXXXX.pt \
--policy_setup widowx_bridge \
--task widowx_put_eggplant_in_basket \
--num_episodes 50
| Flag | Default | Meaning |
|---|---|---|
| --action-model-type | DiT-B | Must match the trained STAIR action head |
| --num-inference-steps | 10 | Euler flow-matching integration steps in STAIR_VLA.predict_action |
| --unnorm-key | inferred from --policy-setup | bridge_orig for WidowX, fractal20220817_data for Google Robot |
| --base-vlm | checkpoint config value | Local OpenVLA/Prismatic run directory; set this to avoid Hugging Face downloads |
| --action-scale | 1.0 | Multiplier applied after action unnormalization |
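To make the role of `--num-inference-steps` concrete, the sketch below shows fixed-step Euler integration of a velocity field, the general technique behind flow-matching sampling. It is not the code in `STAIR_VLA.predict_action`: the time direction (noise at t = 0, sample at t = 1), the uniform step schedule, and the names `euler_integrate` and `toy_velocity` are assumptions made for illustration. In a real model the velocity would come from the DiT action head conditioned on the VLM features.

```python
def euler_integrate(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps.

    `x0` may be a float or any array type that supports `+` and scalar `*`
    (for example a NumPy array shaped like an action chunk).
    """
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


# Toy stand-in for a learned velocity field: it points from the current
# state toward a fixed target along a straight-line path.
TARGET = 0.5


def toy_velocity(x, t):
    return (TARGET - x) / (1.0 - t)


if __name__ == "__main__":
    sample = euler_integrate(toy_velocity, x0=-1.0, num_steps=10)
    print(sample)
```

For this straight-line toy field the velocity is constant along the trajectory, so Euler integration lands on the target regardless of step count. A learned field is generally curved, which is why the number of steps trades accuracy against inference latency.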
Results are written under --logging-dir using SimplerEnv's nested naming scheme:
runs/simpler_eval/<checkpoint>/<scene>/<control_mode>/<env>/<robot_pose>/<success_or_failure_video>.mp4
runs/simpler_eval/<checkpoint>/<scene>/<control_mode>/<env>/<robot_pose>/actions/*.png
| File | Purpose |
|---|---|
| stair/vla/vla.py | STAIR_VLA: 2-stage model with UniVAM + oformer |
| stair/vla/dynamic_feature.py | DynamicFeatureEncoder: frozen UniVAM + learnable projection |
| stair/vla/qformer.py | QFormer: oformer maps LLM tokens → per_token at inference |
| stair/action_model/action_model.py | ActionModel: DiT flow-matching decoder |
| stair/deploy/stair_policy_sim.py | STAIRSimPolicy: SimplerEnv policy interface |
| stair/scripts/train.py | Training entry point (stage 1 or 2) |
| stair/scripts/eval_simpler.py | SimplerEnv evaluation entry point |
- StaMo — static compact state representation (predecessor)
- World Guidance (WoG) — architecture this codebase is based on
- UniVAM — dynamic video feature encoder (frozen plug-in)
- SimplerEnv — simulation evaluation benchmark