STAIR: Compact State as Unified Interface for Robot Control

Compact Dynamic State Representation for Fine-Grained Robot Manipulation

STAIR extends static compact state representations (StaMo) with a dynamic compact feature that captures short-range local dynamics in a [-delta, +delta] neighborhood — improving fine-grained manipulation tasks such as peg insertion, furniture assembly, and flower arrangement.

Research Overview

Fine manipulation requires more than knowing "what is there" — it requires knowing how the local state changes near the target region. STAIR proposes:

Static compact state (StaMo): captures what is there.
Dynamic compact feature (STAIR): captures how the local state evolves around the contact region.

The core contribution is a 2-stage VLA training recipe that uses UniVAM as a frozen dynamic feature encoder, replacing WoG's FutureEnc+QFormer pipeline. VLM trainability follows the same freeze configuration mechanism as WoG.

observation + instruction
        ↓
VLA backbone (PrismaticVLM, trainability from config)
        ↓  cognition feature [B, 1, 4096]
        ↓
UniVAM encoder (frozen) + learnable projection
        ↓  per_token [B, 2, 4096]
        ↓
DiT flow-matching action head
        ↓
robot action [Ta, 7]

Stage 1 — Train: VLM modules selected by config + projection head + DiT action head. Freeze: UniVAM encoder.
Stage 2 — Train: VLM modules selected by config + oformer (QFormer) + DiT action head. Freeze: UniVAM encoder + Stage 1 projection target.
At inference the oformer replaces UniVAM — no video clip is needed at deploy time.
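
The stage split above can be summarized as a small lookup. This is only a sketch: the module group names (`projection`, `oformer`, `dit_action_head`, `univam_encoder`) are illustrative labels rather than the attribute names used in `stair/vla/vla.py`, and it assumes that "Stage 1 projection target" means the projection head stays frozen in Stage 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagePlan:
    trainable: frozenset
    frozen: frozenset


def stage_plan(stage: int, vlm_modules_from_config: frozenset) -> StagePlan:
    """Return which module groups train in a given stage (names are illustrative)."""
    if stage == 1:
        # Stage 1: projection head and DiT learn against the frozen UniVAM encoder.
        return StagePlan(
            trainable=vlm_modules_from_config | {"projection", "dit_action_head"},
            frozen=frozenset({"univam_encoder"}),
        )
    if stage == 2:
        # Stage 2: the oformer learns to reproduce the Stage 1 projection output
        # (assumption: the projection head is kept frozen as the target).
        return StagePlan(
            trainable=vlm_modules_from_config | {"oformer", "dit_action_head"},
            frozen=frozenset({"univam_encoder", "projection"}),
        )
    raise ValueError(f"stage must be 1 or 2, got {stage}")
```

The VLM entries are passed in rather than hard-coded because the README states that VLM trainability comes from the freeze configuration.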

Repository Structure

STAIR/
├── stair/                     # Main STAIR package (WoG-derived, adapted for STAIR)
│   ├── action_model/          # DiT flow-matching action decoder
│   ├── conf/                  # draccus dataclass configs (Stage1 + Stage2 registered)
│   ├── deploy/                # STAIRSimPolicy + STAIRRealPolicy (SimplerEnv interface)
│   ├── eval/                  # SimplerEnvAdapter (gripper postprocessing)
│   ├── prismatic/             # Vendored OpenVLA/Prismatic backbone + data pipeline
│   ├── scripts/               # train.py + eval_simpler.py entry points
│   ├── training/              # FSDP training strategies
│   ├── utils/                 # VideoFrameBuffer, image preprocessing
│   └── vla/                   # STAIR_VLA model, QFormer, DynamicFeatureEncoder
├── third_party/
│   ├── UniVAM/                # Vendored dynamic feature encoder
│   └── SimplerEnv/            # Official evaluation benchmark submodule
└── pyproject.toml

Installation

The repository is intended to be installed inside a conda environment, with uv used for faster package resolution and installation.

Clone

UniVAM is vendored under third_party/UniVAM. SimplerEnv is tracked as an official submodule on its main branch:

git clone --recurse-submodules <STAIR-repo-url>
cd STAIR

If the repository was cloned without submodules, initialize SimplerEnv explicitly:

git submodule update --init --recursive third_party/SimplerEnv

third_party/WoG is only an architecture reference and remains outside the tracked repository.

Dependency Layers

Layer	Purpose	Install target
Base	Import STAIR modules and lightweight model code	`-e .`
Dev	Local CPU syntax and lint checks	`-e ".[dev]"`
Train	H200 OpenVLA/OXE training stack	`-e ".[train]"`
UniVAM	Frozen dynamic encoder, prioritized for training compatibility	`-e third_party/UniVAM`
Eval	STAIR policy wrapper + SimplerEnv real-to-sim evaluator	`-e ".[eval]"` plus SimplerEnv installs

UniVAM currently pins torch==2.8.0, so the STAIR environment targets torch 2.8 instead of WoG's older torch==2.2.0 pin.

1. Local Development — Mac / CPU Debug

For lightweight local checks and code editing. No CUDA, pretrained weights, OXE data, or SimplerEnv install is needed.

conda create -n stair python=3.10 -y
conda activate stair
python -m pip install -U uv

uv pip install -e ".[dev]"

2. Training — H200 Server

Use this for full STAIR training with OpenVLA/Prismatic, RLDS/OXE, and UniVAM. Install CUDA PyTorch first so UniVAM and STAIR both resolve against the same torch build.

conda create -n stair python=3.10 -y
conda activate stair
python -m pip install -U uv

# Pick the CUDA wheel index that matches the server driver. cu128 is the default target for torch==2.8.0.
uv pip install torch==2.8.0 torchvision --index-url https://download.pytorch.org/whl/cu128

# STAIR package and OpenVLA/OXE training dependencies.
uv pip install -e ".[train]"

# UniVAM dynamic encoder from the vendored third-party directory. Use --no-deps because
# STAIR already pins the shared runtime stack in pyproject.toml.
uv pip install --no-deps -e third_party/UniVAM

# FlashAttention is required by the Prismatic/OpenVLA backbone. Install after torch is present.
MAX_JOBS=4 uv pip install flash-attn --no-build-isolation

TorchCodec is required by UniVAM video decoding in full workflows. Install the build matching the H200 torch/CUDA stack by following the TorchCodec release instructions for PyTorch 2.8.
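
After the install steps, it can help to confirm that the environment resolved to the pinned torch build. The check below is a minimal sketch using only the standard library. The helper names are hypothetical, and it assumes the installed version string may carry a local tag such as `+cu128`.

```python
from importlib import metadata


def matches_pin(installed: str, pin: str = "2.8.0") -> bool:
    """True when the installed version equals the pin, ignoring a local tag like '+cu128'."""
    return installed.split("+", 1)[0] == pin


def installed_torch_version():
    """Return the installed torch version string, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


def describe_torch_install() -> str:
    version = installed_torch_version()
    if version is None:
        return "torch is not installed"
    return f"torch {version}: matches pin = {matches_pin(version)}"
```

Calling `print(describe_torch_install())` inside the `stair` environment reports whether the pin held.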

The one-command setup script follows the same dependency isolation policy:

./scripts/setup_stair_env.sh

Llama-2 Tokenizer Config

make setup
# Downloads config/tokenizer files from NousResearch/Llama-2-7b-hf
# → stair/prismatic/models/backbones/llm/llama2_config/

Pretrained Weights

# OpenVLA-7b (VLA backbone)
mkdir -p pretrained/openvla-7b-prismatic
cd pretrained/openvla-7b-prismatic
git clone git@hf.co:openvla/openvla-7b-prismatic . && git lfs pull
cd -

# UniVAM weights
python third_party/UniVAM/download_models.py
# UniVAM training checkpoints are directories containing Wan22VM.pth and Projector.pth.

OXE Data

Current STAIR training uses the same RLDS/OXE data path as WoG/OpenVLA. The selected mixture is bridge_rt_1, which expands to:

Mixture item	Dataset directory under `data_root_dir`	Transform
BridgeV2	`bridge_orig`	`bridge_orig_dataset_transform`
Fractal / RT-1	`fractal20220817_data`	`rt1_dataset_transform`

On the remote server, point --data_root_dir at the full OXE root:

ls data/oxe/bridge_orig data/oxe/fractal20220817_data
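
The same check can be scripted before a long job is queued. This is a sketch under the assumption that each mixture item lives in a directory directly under `data_root_dir`, as the table above lists. The function name is hypothetical.

```python
from pathlib import Path

# Directory names taken from the bridge_rt_1 table above.
BRIDGE_RT_1_DIRS = ("bridge_orig", "fractal20220817_data")


def missing_dataset_dirs(data_root_dir, required=BRIDGE_RT_1_DIRS):
    """Return the required dataset directories that are absent under data_root_dir."""
    root = Path(data_root_dir)
    return [name for name in required if not (root / name).is_dir()]
```

An empty list from `missing_dataset_dirs("data/oxe")` means both datasets are in place.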

The current data flow is:

data/oxe
  -> OXE mixture "bridge_rt_1"
  -> RLDS standardization transform
  -> current observation pixel_values for VLM at 224x224
  -> keyframe-centered dynamic_video for dynamic encoder at [5, 3, 256, 256]
  -> action chunk [16, 7]

The STAIR stage configs default to bridge_rt_1. Training commands still pass it explicitly so the active dataset is visible in logs.
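
To make the keyframe-centered clip concrete, the sketch below computes frame indices for a [-delta, +delta] window. It rests on two assumptions not confirmed by this README: that the stride multiplies each offset, and that indices falling outside the episode are clamped to its first or last frame. The function name is hypothetical.

```python
def clip_indices(keyframe: int, episode_len: int, delta: int = 2, stride: int = 4):
    """Frame indices for a keyframe-centered clip of 2 * delta + 1 frames.

    Assumption: offsets are k * stride for k in [-delta, +delta], and
    out-of-range indices are clamped to the episode boundaries.
    """
    if episode_len <= 0:
        raise ValueError("episode_len must be positive")
    last = episode_len - 1
    return [min(max(keyframe + k * stride, 0), last) for k in range(-delta, delta + 1)]


print(clip_indices(10, 100))  # [2, 6, 10, 14, 18]
print(clip_indices(1, 100))   # [0, 0, 1, 5, 9]
```

With `delta=2` the clip has five frames, which matches `--video_num_frames 5` and the leading dimension of the `[5, 3, 256, 256]` dynamic video tensor.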

Pipeline Smoke Test with Mock Encoder

If the UniVAM checkpoint is not ready yet, use the explicit mock encoder path to test the VLA training pipeline. This keeps image/video dimensions at the target STAIR shape but does not produce meaningful dynamic features.

torchrun --standalone --nnodes 1 --nproc-per-node 8 stair/scripts/train.py \
  --vla.type stair-dinosiglip-224px+oxe+stage1 \
  --stage 1 \
  --dynamic_encoder_type mock \
  --vla.data_mix bridge_rt_1 \
  --video_num_frames 5 \
  --dynamic_clip_delta 2 \
  --dynamic_clip_stride 4 \
  --sample_strategy dense \
  --video_image_size "[224,224]" \
  --data_root_dir data/oxe \
  --run_root_dir runs/ \
  --run_id stair_stage1_mock_pipeline \
  --wandb_project stair

One-command Training Scripts

The scripts/ directory contains launchers for the two-stage training flow. They default to the real UniVAM teacher path, bridge_rt_1, data/oxe, 5 keyframe-centered dynamic video frames, stride 4, the latest third_party/UniVAM/configs/bridgev2.yaml, and 256x256 dynamic video inputs. A real UniVAM run requires UNIVAM_CHECKPOINT to point to the checkpoint directory containing Wan22VM.pth and Projector.pth.

UNIVAM_CHECKPOINT=pretrained/univam/bridgev2 ./scripts/train_stage1.sh
UNIVAM_CHECKPOINT=pretrained/univam/bridgev2 ./scripts/train_stage2.sh
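
A quick pre-flight check of the checkpoint directory might look like the following sketch. It only verifies that the two files named above exist and does not load them. The helper names are hypothetical and are not part of the launcher scripts.

```python
import os
from pathlib import Path

# File names stated in this README for a UniVAM training checkpoint directory.
REQUIRED_FILES = ("Wan22VM.pth", "Projector.pth")


def missing_checkpoint_files(checkpoint_dir):
    """Return the required checkpoint files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def check_from_env(env=os.environ):
    """Describe the state of the directory named by UNIVAM_CHECKPOINT."""
    checkpoint = env.get("UNIVAM_CHECKPOINT")
    if not checkpoint:
        return "UNIVAM_CHECKPOINT is not set"
    missing = missing_checkpoint_files(checkpoint)
    if missing:
        return f"{checkpoint} is missing: {', '.join(missing)}"
    return f"{checkpoint} looks complete"
```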

Stage 1 defaults to the local OpenVLA checkpoint downloaded above:

pretrained/openvla-7b-prismatic/checkpoints/step-295000-epoch-40-loss=0.2200.pt

Common overrides:

NPROC_PER_NODE=8 DATA_ROOT_DIR=data/oxe RUN_ID=stair_stage1_v1 ./scripts/train_stage1.sh
NPROC_PER_NODE=1 DATA_ROOT_DIR=data/oxe RUN_ID=stair_stage1_debug ./scripts/train_stage1.sh
BASE_VLM=pretrained/openvla-7b-prismatic PRETRAINED_CHECKPOINT=pretrained/openvla-7b-prismatic/checkpoints/step-295000-epoch-40-loss=0.2200.pt ./scripts/train_stage1.sh
STAGE1_RUN_ID=stair_stage1_v1 RUN_ID=stair_stage2_v1 ./scripts/train_stage2.sh
STAGE1_CHECKPOINT=runs/stair_stage1_v1/checkpoints/step-XXXXXX-epoch-XX-loss=X.XXXX.pt ./scripts/train_stage2.sh
UNIVAM_CONFIG_PATH=third_party/UniVAM/configs/fractal.yaml VIDEO_IMAGE_SIZE="[256,256]" ./scripts/train_stage1.sh

For pipeline smoke tests without a trained UniVAM encoder, opt into the mock encoder explicitly:

DYNAMIC_ENCODER_TYPE=mock VIDEO_IMAGE_SIZE="[224,224]" ./scripts/train_stage1.sh
STAGE2_TRAIN_MODE=direct DYNAMIC_ENCODER_TYPE=mock PRETRAINED_CHECKPOINT=pretrained/openvla-7b-prismatic/checkpoints/step-295000-epoch-40-loss=0.2200.pt ./scripts/train_stage2.sh

3. Evaluation — SimplerEnv

Install after the training environment above. Requires Vulkan for rendering.

Use the STAIR training environment as the base eval environment. This keeps Torch, Transformers, Prismatic/OpenVLA, and UniVAM aligned with the checkpoint that will be loaded. Do not install SimplerEnv's full RT-1/Octo dependency set for STAIR-only evaluation; that path pulls TensorFlow/JAX packages that are unrelated to STAIR and can make dependency resolution fragile.

# Pull official SimplerEnv and its ManiSkill2_real2sim nested submodule.
git submodule update --init --recursive third_party/SimplerEnv

# STAIR eval helper dependencies.
uv pip install -e ".[eval]"

# ManiSkill2 real-to-sim environments. Keep numpy<2.0; 1.26.4 matches SimplerEnv's IK stack.
uv pip install numpy==1.26.4
uv pip install -e third_party/SimplerEnv/ManiSkill2_real2sim/

# SimplerEnv evaluator.
uv pip install -e third_party/SimplerEnv/

# Optional only if evaluating RT-1 or Octo baselines through the same checkout.
# uv pip install tensorflow==2.15.0
# uv pip install -r third_party/SimplerEnv/requirements_full_install.txt

# Validate headless Vulkan before running the evaluator.
DISPLAY="" CUDA_VISIBLE_DEVICES=0 VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json vulkaninfo --summary

On headless servers, the GLFW/X11 warning is expected because SimplerEnv disables DISPLAY. The fatal failures are SAPIEN/SVulkan messages such as "Vulkan is incompatible with your driver" or "Some required Vulkan extension is not present", or a segmentation fault that follows them.

libvulkan1 only installs the Vulkan loader. SimplerEnv rendering also needs the NVIDIA Vulkan/GLX/EGL user-space driver libraries that match the kernel driver. vulkaninfo --summary must report an NVIDIA device; if it reports Mesa llvmpipe, Vulkan is using CPU software rendering and SimplerEnv will fail.
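
The NVIDIA-versus-llvmpipe distinction can be checked programmatically. The sketch below classifies the text printed by `vulkaninfo --summary`. It assumes each device appears on a line of the form `deviceName = <name>`, and the function names are hypothetical.

```python
import subprocess


def classify_vulkan_devices(summary_text: str) -> str:
    """Return 'nvidia', 'software', or 'unknown' from vulkaninfo --summary text."""
    names = [
        line.split("=", 1)[1].strip().lower()
        for line in summary_text.splitlines()
        if "deviceName" in line and "=" in line
    ]
    if any("nvidia" in name for name in names):
        return "nvidia"
    if any("llvmpipe" in name for name in names):
        return "software"  # CPU rendering; SimplerEnv is expected to fail
    return "unknown"


def run_vulkaninfo() -> str:
    """Run vulkaninfo --summary and classify its output (needs vulkan-tools)."""
    result = subprocess.run(
        ["vulkaninfo", "--summary"], capture_output=True, text=True, check=False
    )
    return classify_vulkan_devices(result.stdout)
```

Only a result of `nvidia` indicates that rendering will run on the GPU.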

Check the compute node before launching STAIR:

nvidia-smi
find /usr/share /etc -path '*vulkan*icd*' -name '*nvidia*.json' -print
find /usr/share /etc -path '*glvnd*egl_vendor*' -name '*nvidia*.json' -print
ldconfig -p | grep libGLX_nvidia
find /usr /etc /lib /lib64 -name 'libGLX_nvidia.so*' -print

DISPLAY="" CUDA_VISIBLE_DEVICES=0 VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json vulkaninfo --summary

If dpkg -l does not list NVIDIA packages but the files above exist on the GPU instance, that is acceptable: the GPU runtime may inject driver files from the host instead of installing them through apt in the image. If libGLX_nvidia.so is missing on the GPU instance, install or expose the NVIDIA GL/Vulkan user-space package that matches the driver shown by nvidia-smi (libnvidia-gl-<driver-version> / full nvidia-driver-<driver-version> on Ubuntu-style systems). On clusters or containers, the host may expose CUDA compute libraries but not graphics/Vulkan libraries; the job runtime must include NVIDIA graphics capability and the host Vulkan/GLVND files.

For CPU-image-to-GPU-instance workflows, install only the generic runtime dependencies in the networked CPU image, then let the GPU instance provide the matching NVIDIA driver files:

apt-get update
apt-get install -y \
  vulkan-tools libvulkan1 \
  libxext6 libx11-6 libx11-xcb1 libxcb1 \
  libxrandr2 libxrender1 libxi6 libxfixes3 \
  libglvnd0 libglx0 libegl1

The helper script runs this check by default; set CHECK_VULKAN=0 only after vulkaninfo --summary reports NVIDIA on the same compute node.

Stage 1 — Train projection head + DiT

torchrun --standalone --nnodes 1 --nproc-per-node 8 stair/scripts/train.py \
  --vla.type stair-dinosiglip-224px+oxe+stage1 \
  --vla.base_vlm pretrained/openvla-7b-prismatic \
  --stage 1 \
  --pretrained_checkpoint pretrained/openvla-7b-prismatic/checkpoints/step-295000-epoch-40-loss=0.2200.pt \
  --univam_config_path third_party/UniVAM/configs/bridgev2.yaml \
  --univam_checkpoint pretrained/univam/bridgev2 \
  --vla.data_mix bridge_rt_1 \
  --video_num_frames 5 \
  --dynamic_clip_delta 2 \
  --dynamic_clip_stride 4 \
  --sample_strategy dense \
  --video_image_size "[256,256]" \
  --data_root_dir data/oxe \
  --run_root_dir runs/ \
  --run_id stair_stage1_v1 \
  --wandb_project stair

Stage 2 — Train oformer + DiT

torchrun --standalone --nnodes 1 --nproc-per-node 8 stair/scripts/train.py \
  --vla.type stair-dinosiglip-224px+oxe+stage2 \
  --stage 2 \
  --pretrained_checkpoint runs/stair_stage1_v1/checkpoints/step-XXXXXX.pt \
  --univam_config_path third_party/UniVAM/configs/bridgev2.yaml \
  --univam_checkpoint pretrained/univam/bridgev2 \
  --vla.data_mix bridge_rt_1 \
  --video_num_frames 5 \
  --dynamic_clip_delta 2 \
  --dynamic_clip_stride 4 \
  --sample_strategy dense \
  --video_image_size "[256,256]" \
  --data_root_dir data/oxe \
  --run_root_dir runs/ \
  --run_id stair_stage2_v1 \
  --wandb_project stair

Evaluation (SimplerEnv)

The public checkout keeps third_party/SimplerEnv on the official SimplerEnv main branch. Official SimplerEnv does not register a --policy-model stair branch, so the supported STAIR entry point builds the STAIR policy inside this repository and uses SimplerEnv as the environment dependency:

stair/scripts/eval_simpler.py
  -> stair.deploy.stair_policy_loader.build_stair_simpler_policy(...)
  -> STAIRSimPolicy
  -> simpler_env.make(...)
  -> rollout success summary

The WoG-style full SimplerEnv evaluator integration requires a SimplerEnv entry point that registers STAIR as a policy model. That integration is not part of the official SimplerEnv submodule.

Target tasks — fine-grained, contact-sensitive:

Task	Octo-small baseline
`widowx_put_eggplant_in_basket`	56.9%
`widowx_carrot_on_plate`	9.7%
`widowx_stack_cube`	4.2%
`widowx_spoon_on_towel`	47.2%
`google_robot_place_apple_in_closed_top_drawer`	21.3%

SimplerEnv Tasks

Use the same task names as SimplerEnv:

Group	Task names
Google Robot coke can	`google_robot_pick_coke_can`, `google_robot_pick_horizontal_coke_can`, `google_robot_pick_vertical_coke_can`, `google_robot_pick_standing_coke_can`
Google Robot object	`google_robot_pick_object`
Google Robot move near	`google_robot_move_near`, `google_robot_move_near_v0`, `google_robot_move_near_v1`
Google Robot drawer	`google_robot_open_drawer`, `google_robot_open_top_drawer`, `google_robot_open_middle_drawer`, `google_robot_open_bottom_drawer`, `google_robot_close_drawer`, `google_robot_close_top_drawer`, `google_robot_close_middle_drawer`, `google_robot_close_bottom_drawer`
Google Robot place in drawer	`google_robot_place_in_closed_drawer`, `google_robot_place_in_closed_top_drawer`, `google_robot_place_in_closed_middle_drawer`, `google_robot_place_in_closed_bottom_drawer`, `google_robot_place_apple_in_closed_top_drawer`
WidowX Bridge	`widowx_spoon_on_towel`, `widowx_carrot_on_plate`, `widowx_stack_cube`, `widowx_put_eggplant_in_basket`

The paper metric table in simpler_env/utils/metrics.py uses these nine main tasks:

google_robot_pick_coke_can
google_robot_move_near
google_robot_open_drawer
google_robot_close_drawer
google_robot_place_apple_in_closed_top_drawer
widowx_spoon_on_towel
widowx_carrot_on_plate
widowx_stack_cube
widowx_put_eggplant_in_basket

Single-Task Evaluation

Use stair/scripts/eval_simpler.py for one SimplerEnv task.

python stair/scripts/eval_simpler.py \
  --checkpoint runs/stair_stage2_v1/checkpoints/step-XXXXXX.pt \
  --policy_setup widowx_bridge \
  --task widowx_put_eggplant_in_basket \
  --num_episodes 50
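
When looping over several tasks, the policy setup and unnormalization key can be derived from the task-name prefix. This sketch follows the `bridge_orig` / `fractal20220817_data` mapping given in the flag table of this README. The `google_robot` setup name is an assumption based on SimplerEnv's naming, and the helper itself is hypothetical.

```python
def infer_setup(task: str):
    """Map a SimplerEnv task name to (policy_setup, unnorm_key)."""
    if task.startswith("widowx_"):
        return ("widowx_bridge", "bridge_orig")
    if task.startswith("google_robot_"):
        # Assumption: "google_robot" is the policy setup name for these tasks.
        return ("google_robot", "fractal20220817_data")
    raise ValueError(f"unrecognized task prefix: {task}")


print(infer_setup("widowx_put_eggplant_in_basket"))  # ('widowx_bridge', 'bridge_orig')
```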

Useful STAIR-Specific Flags

Flag	Default	Meaning
`--action-model-type`	`DiT-B`	Must match the trained STAIR action head
`--num-inference-steps`	`10`	Euler flow-matching integration steps in `STAIR_VLA.predict_action`
`--unnorm-key`	inferred from `--policy-setup`	`bridge_orig` for WidowX, `fractal20220817_data` for Google Robot
`--base-vlm`	checkpoint config value	Local OpenVLA/Prismatic run directory; set this to avoid Hugging Face downloads
`--action-scale`	`1.0`	Multiplier applied after action unnormalization
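
For intuition about `--num-inference-steps`, here is a toy Euler integration of a flow-matching velocity field. It is not the code in `STAIR_VLA.predict_action`. It assumes the common convention of integrating from noise at t=0 to an action chunk at t=1, uses plain Python lists instead of tensors, and replaces the DiT with a hand-written velocity field toward a known target.

```python
import random


def euler_sample(velocity_fn, horizon=16, action_dim=7, num_steps=10, seed=0):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the action-chunk shape [horizon, action_dim].
    x = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [[xi + dt * vi for xi, vi in zip(xr, vr)] for xr, vr in zip(x, v)]
    return x


def toy_velocity(target):
    """Velocity of the straight-line path from the current point to `target`."""
    def fn(x, t):
        return [[(g - xi) / (1.0 - t) for xi, g in zip(xr, gr)]
                for xr, gr in zip(x, target)]
    return fn


target = [[0.5] * 7 for _ in range(16)]
chunk = euler_sample(toy_velocity(target))
print(len(chunk), len(chunk[0]))  # 16 7
```

With this toy field the final Euler step lands on the target, which makes the loop easy to verify. A learned velocity field only approximates such a path, so the step count trades accuracy against latency.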

Results are written under --logging-dir using SimplerEnv's nested naming scheme:

runs/simpler_eval/<checkpoint>/<scene>/<control_mode>/<env>/<robot_pose>/<success_or_failure_video>.mp4
runs/simpler_eval/<checkpoint>/<scene>/<control_mode>/<env>/<robot_pose>/actions/*.png

Key Files

File	Purpose
`stair/vla/vla.py`	`STAIR_VLA` — 2-stage model with UniVAM + oformer
`stair/vla/dynamic_feature.py`	`DynamicFeatureEncoder` — frozen UniVAM + learnable projection
`stair/vla/qformer.py`	`QFormer` — oformer maps LLM tokens → per_token at inference
`stair/action_model/action_model.py`	`ActionModel` — DiT flow-matching decoder
`stair/deploy/stair_policy_sim.py`	`STAIRSimPolicy` — SimplerEnv policy interface
`stair/scripts/train.py`	Training entry point (stage 1 or 2)
`stair/scripts/eval_simpler.py`	SimplerEnv evaluation entry point

Related Works

StaMo — static compact state representation (predecessor)
World Guidance (WoG) — architecture this codebase is based on
UniVAM — dynamic video feature encoder (frozen plug-in)
SimplerEnv — simulation evaluation benchmark
