We introduce LDA-1B, a robot foundation model that scales through universal embodied data ingestion by jointly learning dynamics, policy, and visual forecasting, assigning distinct roles to data of varying quality.
- ✨ Key Features
- 🛠 Environment Setup
- 🧩 Model Architecture
- 💡 Training & Evaluation
- 🙏 Acknowledgements
- ✍️ Citation
## ✨ Key Features

| Feature | Description |
|---|---|
| Unified Multi-Task Learning | Single MMDiT backbone jointly predicts future visual features (DINOv3 tokens) and 16-step action chunks |
| Data Quality Hierarchy | High-quality teleop → policy learning; Low-quality scripted → dynamics learning; No-annotation videos → visual forecasting |
| Latent Dynamics Modeling | Predicts future latent visual features instead of pixels → better generalization |
| Cross-Embodiment | Pre-trained on multiple embodiments (Agibot, Unitree-G1, Human, etc.) |
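The data-quality hierarchy above can be sketched as a simple routing rule that assigns each sample a co-training objective based on its annotation quality. The function and field names below are illustrative assumptions, not the released implementation; the task names follow the paper's objectives.

```python
# Hypothetical sketch of LDA-1B's data-quality routing: each sample is
# assigned a training objective according to its annotation quality.
def route_objective(sample: dict) -> str:
    """Map a data sample to a co-training objective by annotation quality."""
    if sample.get("teleoperated") and sample.get("has_actions"):
        return "policy"            # high-quality teleop -> policy learning
    if sample.get("has_actions"):
        return "dynamics"          # low-quality scripted -> dynamics learning
    return "visual_forecasting"    # action-free video -> visual forecasting

samples = [
    {"teleoperated": True, "has_actions": True},    # human teleop demo
    {"teleoperated": False, "has_actions": True},   # scripted rollout
    {"teleoperated": False, "has_actions": False},  # unannotated video
]
print([route_objective(s) for s in samples])
# → ['policy', 'dynamics', 'visual_forecasting']
```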
- [2026-02-12] We release LDA-1B; check out our paper here.
## 🛠 Environment Setup

```bash
git clone https://github.com/jiangranlv/latent-dynamics-action.git LDA
cd LDA
```

Create and activate a conda environment with the required dependencies, for example:

```bash
# Create a conda environment
conda create -n LDA python=3.10
conda activate LDA

# Install requirements
pip install -r requirements.txt

# Install FlashAttention2 with a version compatible with your PyTorch and CUDA versions
pip install flash-attn --no-build-isolation

# Install LDA
pip install -e .
```

Follow the instructions in Qwen3-VL and DINOv3 to download the pretrained VLM and vision encoder, or download the pretrained weights directly from the following link:
## 🧩 Model Architecture

LDA adopts a multimodal diffusion transformer architecture that jointly denoises action chunks and future visual latents under multiple co-training objectives, conditioned on VLM tokens, diffusion timesteps, and task embeddings.
Core components:
- Language and Vision Encoder: Qwen3-VL (4B) → extracts semantic information
- Latent Visual Representation: DINOv3-ViT-S → extracts spatial features (frozen during training)
- MM-DiT Backbone: a 16-layer multimodal diffusion transformer (`hidden_dim=1536`, `num_heads=32`)
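For concreteness, the backbone hyperparameters quoted above can be gathered into a small config sketch. The class and field names are illustrative assumptions, not the repository's actual configuration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MMDiTConfig:
    """Illustrative MM-DiT backbone hyperparameters (names are hypothetical)."""
    num_layers: int = 16        # 16-layer multimodal diffusion transformer
    hidden_dim: int = 1536      # shared token width across modalities
    num_heads: int = 32         # attention heads per layer
    action_chunk_len: int = 16  # 16-step action chunks denoised per pass

    @property
    def head_dim(self) -> int:
        # Per-head width must divide the hidden dimension evenly
        assert self.hidden_dim % self.num_heads == 0
        return self.hidden_dim // self.num_heads

print(MMDiTConfig().head_dim)  # → 48
```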
Below is a description of the MM-DiT forward pass.
| Stage | Operation | Details |
|---|---|---|
| 1. Input Tokenization | • Image tokens: DINOv3 patch embeddings (`[B, N_img, D]`)<br>• Action tokens: linear projection of action chunks (`[B, N_act, D]`)<br>• VLM tokens: Qwen3-VL instruction embeddings (`[B, N_vlm, D]`) | All tokens share hidden dimension D=1536 |
| 2. Self-Attention (Image + Action) | • Image and action tokens compute separate Q/K/V projections<br>• Tokens are concatenated<br>• Shared self-attention over the combined sequence | Enables joint reasoning between visual observations and actions |
| 3. Cross-Attention (VLM → Image/Action) | • VLM tokens serve as queries<br>• Image/action tokens serve as keys & values<br>• Two parallel cross-attention streams: VLM → Image (spatial grounding) and VLM → Action (task conditioning) | Injects the VLM's semantic information into the generation of action tokens and latent image tokens |
| 4. AdaLN-Zero Conditioning | Per-layer modulation of attention + MLP outputs via:<br>• Diffusion timestep t<br>• Task embedding (4-way categorical: Policy / Forward Dynamics / Inverse Dynamics / Visual Forecasting) | Dynamically adjusts the model's behavior based on the diffusion schedule and task objective |
| 5. Output Heads | • Latent dynamics head: predicts future DINOv3 tokens<br>• Action head: predicts denoised 16-step action chunks | All four tasks are trained jointly within a single unified framework |
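Stage 2 above (per-modality Q/K/V projections followed by shared self-attention over the concatenated image and action sequence) can be sketched in a few lines of NumPy. Dimensions are shrunk from the real `hidden_dim=1536` for readability, the weights are random stand-ins, and all names here are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
N_img, N_act, D = 8, 16, 64  # toy sizes; the real model uses D=1536

img_tokens = rng.normal(size=(N_img, D))
act_tokens = rng.normal(size=(N_act, D))

# Separate Q/K/V projections per modality (random stand-in weights)
proj = lambda: rng.normal(size=(D, D)) / np.sqrt(D)
q = np.concatenate([img_tokens @ proj(), act_tokens @ proj()])
k = np.concatenate([img_tokens @ proj(), act_tokens @ proj()])
v = np.concatenate([img_tokens @ proj(), act_tokens @ proj()])

# Shared self-attention over the combined (image + action) sequence
attn_out = softmax(q @ k.T / np.sqrt(D)) @ v

# Split back into modality streams for the next stage
img_out, act_out = attn_out[:N_img], attn_out[N_img:]
print(img_out.shape, act_out.shape)  # → (8, 64) (16, 64)
```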
## 💡 Training & Evaluation

We provide training and evaluation scripts for the RoboCasa-GR1 dataset. Follow the steps described in Robocasa_tabletop to reproduce our results.
We also provide a demo dataset for quick debugging and validation.
You can launch training by running this script.
Make sure to update the following arguments in the script before execution:
- `base_vlm`: local path to the Qwen3 checkpoint
- `vision_encoder_path`: local path to the DINOv3 checkpoint
- `data_root_dir`: dataset root directory
- `data_mix`: target dataset name, defined in `data_config.py`
- `run_root_dir`: directory for saving checkpoints
- `run_id`: name used for the current training run
In addition to closed-loop evaluation in simulation (interactive execution with environment feedback), we also provide an open-loop evaluation interface for offline assessment. Open-loop evaluation quantitatively measures model performance by comparing predicted action sequences against ground-truth demonstrations from the dataset, without environment interaction.
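As a minimal illustration of what open-loop evaluation computes, the sketch below scores predicted action chunks against ground-truth demonstrations with mean squared error. The function name and the choice of MSE as the metric are assumptions for illustration, not the repository's exact evaluation code.

```python
import numpy as np

def open_loop_mse(pred_chunks, gt_chunks):
    """MSE between predicted and ground-truth action chunks.

    Both arrays are shaped [num_chunks, chunk_len, action_dim],
    e.g. chunk_len=16 for LDA's 16-step action chunks.
    """
    pred, gt = np.asarray(pred_chunks), np.asarray(gt_chunks)
    assert pred.shape == gt.shape, "prediction/ground-truth shape mismatch"
    return float(np.mean((pred - gt) ** 2))

# Perfect predictions score 0; a constant offset of 0.5 scores 0.25
gt = np.zeros((4, 16, 7))  # 4 chunks of 16 steps, 7-DoF actions
print(open_loop_mse(gt, gt))        # → 0.0
print(open_loop_mse(gt + 0.5, gt))  # → 0.25
```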
```bash
bash LDA/scripts/eval_scripts/eval_lerobot_datasets_LDA.sh
```
The following features are planned for future implementation:
- Pre-trained model checkpoints.
- Pre-training data.
- Data preprocessing scripts.
## 🙏 Acknowledgements

Our code is built upon starVLA and mmdit. These codebases serve as an essential foundation for our implementation, and we deeply appreciate the time, effort, and expertise their authors have shared with the community.
## ✍️ Citation

If you find our work useful, please cite us:
```bibtex
@article{lyu2026lda,
  title={LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion},
  author={Lyu, Jiangran and Liu, Kai and Zhang, Xuheng and Liao, Haoran and Feng, Yusen and Zhu, Wenxuan and Shen, Tingrui and Chen, Jiayi and Zhang, Jiazhao and Dong, Yifei and others},
  journal={arXiv preprint arXiv:2602.12215},
  year={2026}
}
```
This work and the dataset are licensed under CC BY-NC 4.0.
