A Transformer-free Vision-Language-Action model for real-time edge robotics.
FluidVLA is a research program that replaces quadratic self-attention with a local reaction-diffusion dynamic for vision, video, control, and ultimately embedded Vision-Language-Action systems.
The goal is not to propose a slightly lighter Transformer. The goal is to change the computational primitive itself to achieve better compatibility with continuous video, bounded memory, adaptive compute, and edge robotics.
Validation on a synthetic 7-DOF pick-and-place:
- Best Val MSE: 0.01345
- Latency: ~4.1 ms per step
- Effective FPS: ~244 Hz
- Adaptive compute: 1 / 12 steps post-training in the reported regime
| Resolution | FluidVLA VRAM | ViT VRAM (estimated) |
|---|---|---|
| 128x128 | 59.2 MB | ~900 MB |
| 256x256 | 202.9 MB | ~14,000 MB |
| Frames | FluidVLA VRAM | Transformer VRAM (estimated) |
|---|---|---|
| 8 | 36.9 MB | ~272 MB |
| 16 | 62.9 MB | ~1,088 MB |
| 32 | 114.6 MB | ~4,352 MB |
| Dataset | Accuracy | Params |
|---|---|---|
| MNIST | 91.75% | ~0.30M |
| CIFAR-10 | 62.25% | ~0.31M |
| Config | Val Dice | Latency | VRAM |
|---|---|---|---|
| PDE OFF | 0.9129 | ~60 ms | ~632 MiB |
| PDE ON | 0.9177 | ~89 ms | ~632 MiB |
Measured state as of March 10, 2026 on the real dataset local/so101_balle_bol_test:
- Dataset: 44 episodes / 18,081 samples / 2 cameras / 15 fps
- Bridge validated: LeRobot -> FluidVLA `.npz` conversion
- Smoke training: `best.pt` generated with `action_dim=6`, `proprio_dim=6`
- Full run front-only V1: Val MSE 16.69330 at epoch 10 -- converged but produces a fixed point at live deployment
- Live diagnostic V1: frozen policy, constant raw deltas regardless of image (confirmed with `filter_alpha=1.0`)
- V2 pipeline implemented: spatial pool 4x4, normalized delta-actions, cosine loss, optional action chunking
- V2 re-conversion in progress: `--delta-actions --filter-static 0.5 --subsample-static 4`
- Measured delta statistics: `mean=[0.17, -0.77, -1.98, -0.25, -0.08, -3.55]`, `std=[3.67, 5.07, 4.55, 2.39, 1.68, 8.59]`
- Static frames (`|delta| < 0.5 deg`): 0/18,081 (0.0%) -- dataset is fully usable as-is
- V2 training: pending
- V2 live test: pending
Current VLAs almost all inherit from Transformers originally designed for NLP. In embedded robotics, this decision quickly hits the KV-cache memory and latency wall.
As spatial resolution and temporal length increase, computational pressure explodes and forces destructive trade-offs: shortened context, video degraded to a few frames, discontinuous perception, inability to properly reason about continuous motion.
The central intuition of the project is simple: a robot that only sees the world through sparse snapshots cannot properly model physics, velocity, or continuous manipulation.
FluidVLA replaces the attention matrix with a reaction-diffusion core with lightweight memory.
Instead of making all tokens communicate via a dense matrix, information propagates locally in space and time, with a small global memory state and an iterative computation that can stop early when the scene stabilizes.
| Term | Role |
|---|---|
| multi-scale local diffusion | propagates information locally in space, at several scales |
| nonlinear per-position reaction | transforms features at each position, a pointwise nonlinearity |
| lightweight global memory | small shared state summarizing the whole field |
| low-resolution local memory | coarse spatial state carried across iterations |
| learned integration step | step size of the iterative update, learned during training |
Adaptive compute / Turing Equilibrium.
The model can reduce the number of PDE steps at inference when the scene is simple or stable. In practice, some phases converge in very few iterations while more complex scenes retain more computation steps.
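The loop described above can be sketched in a few lines. This is a NumPy toy, not the project's learned PyTorch layer (`fluid_layer.py`): the diffusion coefficient, reaction nonlinearity, step size, and stopping rule below are all illustrative assumptions.

```python
import numpy as np

def laplacian(u):
    # 5-point stencil with replicated edges: information only moves locally
    up = np.pad(u, 1, mode="edge")
    return (up[:-2, 1:-1] + up[2:, 1:-1]
            + up[1:-1, :-2] + up[1:-1, 2:] - 4.0 * u)

def pde_forward(u, max_steps=12, epsilon=0.02, dt=0.1, alpha=0.5):
    """One forward pass: diffusion spreads information, a pointwise reaction
    acts per position, and iteration stops early once the field stabilizes.
    In the real model alpha and dt are learned; here they are fixed toys."""
    steps_used = 0
    for _ in range(max_steps):
        update = alpha * laplacian(u) + np.tanh(u) - u  # diffusion + reaction
        u = u + dt * update
        steps_used += 1
        turbulence = np.abs(update).mean()  # mean update magnitude
        if turbulence < epsilon:            # adaptive early stop
            break
    return u, steps_used

# an already-stable field stops immediately; an excited one uses all steps
u_stable, n_stable = pde_forward(np.zeros((8, 8)))
u_excited, n_excited = pde_forward(np.full((8, 8), 2.0))
```

The same dial appears later in the Step 3 epsilon sweep: raising `epsilon` trades a little quality for far fewer iterations, without retraining.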
git clone https://github.com/infinition/FluidVLA.git
cd FluidVLA
pip install -e .
python -m pytest tests/ -v
python experiments/step0_mnist/train_step0.py --dataset mnist --model small
python experiments/step0_mnist/train_step0.py --dataset cifar10 --model small --batch_size 512
python experiments/step1_video/train_step1_video.py --epochs 30 --d_model 128
python experiments/step2_sim/isaac_env.py --mode synthetic --episodes 1000 --image_size 64
python experiments/step2_sim/train_step2.py --dataset ./data/step2_sim --epochs 50 --batch_size 32 --d_model 128
python fluidvla_server.py --port 7860
FluidVLA/
├── fluidvla/ Core package
│ └── core/
│ ├── diffusion.py Multi-scale Laplacian operators
│ ├── fluid_layer.py Reaction-diffusion PDE layer
│ ├── fluid_layer3d.py 3D volumetric variant
│ ├── vision_models.py Image classifier
│ ├── video_models.py Video encoder
│ ├── vla_models.py VLA action heads
│ ├── fluid_medical_model.py 3D medical segmentation
│ └── ...
├── experiments/
│ ├── step0_mnist/ Image classification
│ ├── step1_video/ Video + adaptive compute
│ ├── step1b_medical_msd/ Medical 3D segmentation
│ ├── step2_sim/ Isaac Sim pick & place
│ ├── step2a_synthetic/ Synthetic imitation learning
│ ├── step2d_so101_urdf/ URDF viewer
│ ├── step3_lerobot/ Real robot (LeRobot SO-101)
│ └── _archive/
├── tests/ Smoke tests
├── fluidvla_server.py Web platform server
├── fluidvla_platform/ Web UI and dataset explorer
├── setup.py Package installation
├── requirements.txt Dependencies
└── LICENSE MIT
Design principles:
- a single main root README,
- a stable facade via fluidvla.core for public imports,
- a local README per step for commands and usage,
- a `data/` subfolder per experiment,
- legacy variants go to `_archive/`.
- experiments/README.md
- experiments/step0_mnist/README.md
- experiments/step1_video/README.md
- experiments/step1b_medical_msd/README.md
- experiments/step2_sim/README.md
- experiments/step2a_synthetic/README.md
- experiments/step2d_so101_urdf/README.md
- experiments/step3_lerobot/README.md
Validates that reaction-diffusion learns visual features without attention.
| Dataset | Accuracy | Params | Status |
|---|---|---|---|
| MNIST | 91.75% | ~0.30M | measured |
| CIFAR-10 | 62.25% | ~0.31M | measured |
Correct reading: mechanistic validation, not a SOTA benchmark.
| Resolution | Pixels | FluidVLA VRAM | ViT estimated |
|---|---|---|---|
| 32x32 | 1,024 | 13.3 MB | ~15 MB |
| 64x64 | 4,096 | 22.8 MB | ~60 MB |
| 128x128 | 16,384 | 59.2 MB | ~900 MB |
| 256x256 | 65,536 | 202.9 MB | ~14,000 MB |
On this point, the signal is genuinely impressive. Even with the necessary caution that the ViT column remains an estimate, the gap observed on the FluidVLA side is massive enough to justify the architectural interest.
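The quadratic driver behind the ViT column is easy to reproduce with back-of-envelope arithmetic. The sketch below assumes one token per pixel, 8 heads, and fp16 scores; these are illustrative assumptions, not the exact assumptions behind the table.

```python
def attn_scores_mb(n_tokens: int, n_heads: int = 8, bytes_per: int = 2) -> float:
    """Memory of one layer's attention score matrices: heads * N^2 entries."""
    return n_tokens ** 2 * n_heads * bytes_per / 2 ** 20

# doubling resolution quadruples the token count, so score memory grows 16x
small = attn_scores_mb(32 * 32)   # 32x32 image, one token per pixel
large = attn_scores_mb(64 * 64)   # 64x64 image
```

Whatever the exact tokenization, the N^2 term guarantees this 16x-per-resolution-doubling growth for any dense-attention design, which is what the FluidVLA column avoids.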
Step 1 is the most important architectural validation after classification.
Changes historically integrated in this step:
- correction of a legacy motion loss that provided no real signal,
- addition of a proper spatial gradient loss,
- cleaned VRAM benchmark,
- separation between stop turbulence and differentiable turbulence,
- instrumentation: `steps_used`, `final_turbulence`, `min_turbulence`.
| Mode | Settings | Steps | Test MSE | Status |
|---|---|---|---|---|
| Quality | `epsilon=0.08, min_steps=3, stop_patience=2` | 12.00 | 0.17804 | measured |
| Adaptive | `epsilon=0.09, min_steps=3, stop_patience=2` | 7.50 | 0.19695 | measured |
| Aggressive | `epsilon~0.11-0.12` | 3.00 | 0.20317 | measured |
This result shows that the compute dial is not a presentation gimmick but a real quality/compute trade-off control at inference.
This branch has two historical layers:
- an initial BrainTumour 3D prototype, now archived,
- a generalized pipeline over the 10 MSD tasks, now canonical.
Consolidated setup:
- dataset: MSD `Task01_BrainTumour`,
- 4 MRI modalities, volume shape `(240, 240, 155, 4)`,
- main protocol: 16 train / 4 val / 5 epochs,
- loss: Cross-Entropy + Soft Dice,
- crop: `128^3`,
- model: `FluidBotMedical3D` with `d_model=32`, `n_layers=2`, `max_steps=6`, i.e. `16,632` parameters.
The historical medical step mainly produced an important methodological result: the discovery and correction of a real diffusion calibration bug.
| Config | Val Dice | Reading |
|---|---|---|
| PDE OFF | 0.8490 | control baseline |
| PDE ON before fix | 0.3846 | invalid comparison, diffusion broken |
| PDE ON fixed, scale=0.05 | 0.8756 | becomes competitive again |
| PDE ON fixed, scale=0.08 | 0.8867 | best point of the ablation |
Main controlled result:
| Config | Val Dice | Latency | VRAM | Delta Dice |
|---|---|---|---|---|
| PDE OFF | 0.9129 | ~60 ms | ~632 MiB | - |
| PDE ON, scale=0.08 | 0.9177 | ~89 ms | ~632 MiB | +0.0048 |
The gain is not enormous, but it is real, measurable, and achieved without memory explosion.
Reported baseline comparison against U-Net 3D:
| Model | Val Dice | Params | Latency | VRAM |
|---|---|---|---|---|
| FluidVLA PDE ON | 0.9177 | 16,632 | ~44 ms GPU | ~632 MiB |
| UNet3D_Tiny | 0.8494 | 88,278 | 272 ms CPU | ~326 MiB |
| UNet3D_Std | 0.8233 | 5,603,746 | 2704 ms CPU | ~1891 MiB |
This comparison should be read as a signal of potential, not as a universally locked benchmark victory, since latencies are not measured on the same hardware.
The current canonical pipeline covers:
- all 10 Medical Segmentation Decathlon tasks,
- single-modal and multi-modal inputs,
- FluidVLA training,
- U-Net 3D Tiny and Std baselines,
- unified inference,
- slice, multislice, 3D PNG and HTML renders,
- comparable splits via the same seed.
Supported tasks:
- Task01_BrainTumour
- Task02_Heart
- Task03_Liver
- Task04_Hippocampus
- Task05_Prostate
- Task06_Lung
- Task07_Pancreas
- Task08_HepaticVessel
- Task09_Spleen
- Task10_Colon
| Metric | Value | Status |
|---|---|---|
| Best Val MSE | 0.01345 | measured |
| Latency | ~4.1 ms | measured |
| Effective FPS | ~244 Hz | measured |
| Adaptive compute | 1/12 steps post-training | measured |
A vision + proprio + action stack that converges, holds around 4.1 ms, and loops at ~244 Hz on consumer GPU is already a very strong result for this phase.
| Metric | Value | Status |
|---|---|---|
| Black frames | 0 / 139 | measured |
| Red cube detection | 100% | measured |
| Green target detection | 100% | measured |
| Capture latency | ~1.6 ms/step | measured |
| Oracle success | 100% | measured |
Status: in progress.
Status: visualization and demonstration tool present.
Status: active, with first offline validation on real LeRobot data.
Objectives:
- real-time latency on embedded hardware,
- bounded memory footprint,
- stable perception-action loop on real hardware.
A Jetson latency projection of ~40 ms appears in the repository; it should be treated as an estimate until a proper full benchmark is finalized.
Goal: transform a real local LeRobot dataset into episodes compatible with the FluidVLA data contract.
Results measured on our local SO-101 dataset local/so101_balle_bol_test / so101_balle_bol_dashboard_01:
- 44 episodes,
- 18,081 frames,
- 2 recorded cameras: `observation.images.front` and `observation.images.wrist`,
- `fps=15`, `action_dim=6`, `proprio_dim=6`.
Bridge implementation: experiments/step3_lerobot/convert_lerobot_dataset.py
Validated smoke conversion:
| Element | Value | Status |
|---|---|---|
| Source dataset | `so101_balle_bol_dashboard_01` | measured |
| Camera used for smoke test | `observation.images.front` | measured |
| Converted episodes | 2 | measured |
| Output format | `frames=(steps,3,4,96,96)` | measured |
| Output proprio | `(steps,6)` | measured |
| Output actions | `(steps,6)` | measured |
Correct reading: real LeRobot data can already be injected into the FluidVLA pipeline without depending on the historical Step 3 prototype.
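The data contract above can be checked mechanically. This sketch round-trips a synthetic episode through a `.npz` archive; the key names follow the outputs listed in the table, while the path and per-episode file layout are assumptions (real files come from `convert_lerobot_dataset.py`).

```python
import io
import numpy as np

def check_episode(npz_bytes: bytes) -> int:
    """Validate one converted episode against the Step 3 data contract."""
    ep = np.load(io.BytesIO(npz_bytes))
    frames, proprios, actions = ep["frames"], ep["proprios"], ep["actions"]
    steps = frames.shape[0]
    assert frames.shape == (steps, 3, 4, 96, 96)  # (steps, C, T, H, W)
    assert proprios.shape == (steps, 6)           # one value per joint
    assert actions.shape == (steps, 6)            # one target per joint
    return steps

# synthetic stand-in for a converted episode
buf = io.BytesIO()
np.savez(
    buf,
    frames=np.zeros((10, 3, 4, 96, 96), dtype=np.float32),
    proprios=np.zeros((10, 6), dtype=np.float32),
    actions=np.zeros((10, 6), dtype=np.float32),
)
steps = check_episode(buf.getvalue())
```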
Goal: verify that FluidVLA learns offline from real SO-101 demonstrations collected with LeRobot.
Dedicated script: experiments/step3_lerobot/train_lerobot_so101.py
Smoke training measured on the converted subset so101_front_smoketest:
| Metric | Value | Status |
|---|---|---|
| Samples | 843 | measured |
| Parameters | ~0.16M | measured |
| Train MSE at epoch 1 | 2125.58554 | measured |
| Val MSE at epoch 1 | 1494.26098 | measured |
| Generated checkpoint | `best.pt` | measured |
Correct reading: this is not yet a final performance result, but an integration and offline learning validation on our real LeRobot data.
Full single-camera front run currently in progress on so101_front_full:
| Epoch | Train MSE | Val MSE | Val L1 | Eval latency | Status |
|---|---|---|---|---|---|
| 1 | 183.42098 | 22.78010 | 2.68786 | 6.40 ms | measured |
| 2 | 20.78908 | 20.58880 | 2.56215 | 6.29 ms | measured |
| 3 | 18.82856 | 19.26452 | 2.43343 | 6.77 ms | measured |
| 4 | 18.01675 | 18.03719 | 2.27493 | 6.55 ms | best provisional point |
| 5 | 17.62973 | 18.93611 | 2.65407 | 6.81 ms | slight val regression |
| 6 | 17.09176 | 18.58472 | 2.41866 | 6.93 ms | partial stabilization |
| 7 | 16.80891 | 17.82738 | 2.25727 | 6.77 ms | new best point |
| 8 | 16.65505 | 17.39248 | 2.21785 | 7.35 ms | new best point |
| 9 | 16.21888 | 17.01312 | 2.19877 | 7.64 ms | new best point |
| 10 | 16.29993 | 16.69330 | 2.19973 | 7.43 ms | new best point |
| 11 | 15.78486 | 17.39699 | 2.26212 | 6.67 ms | val regression, best remains epoch 10 |
| 12 | 15.63500 | 17.35693 | 2.38019 | 6.74 ms | high plateau, best remains epoch 10 |
Active configuration for this run:
- dataset: `data/step3_lerobot/so101_front_full`,
- checkpoints: `checkpoints/step3_lerobot/so101_front_full`,
- model: `~0.52M` parameters,
- input: `(3, 4, 224, 224)`,
- `action_dim=6`, `proprio_dim=6`, `max_steps=12`, `epsilon=0.02`, `batch_size=16`, `epochs=40`.
Correct reading: the important proof is already acquired. FluidVLA does not merely load this real data, it learns on it with a real validation decrease. The next milestone is now consolidation of the best checkpoint followed by a proper benchmark.
Status: partially started, not yet properly validated, and not yet tested in a live loop on the robot.
Current state:
- benchmark script available,
- benchmark run on smoke test checkpoint,
- benchmark run on the current best real checkpoint,
- smoke latency numbers available but non-canonical due to GPU contention or under-trained model,
- adaptive compute not yet observed on this smoke test.
Smoke benchmark currently measured on so101_front_smoketest:
| Metric | Value | Status |
|---|---|---|
| Mean latency | 77.84 ms | measured but non-canonical |
| P95 latency | 93.35 ms | measured but non-canonical |
| FPS | 12.85 | measured but non-canonical |
| Avg steps | 12.0 | measured |
| Dynamic steps | 12.0 | measured |
Dedicated benchmark on the current best real checkpoint so101_front_full/best.pt:
| Metric | Value | Status |
|---|---|---|
| Mean latency | 112.80 ms | measured, strict GPU benchmark |
| P50 latency | 109.48 ms | measured |
| P95 latency | 133.55 ms | measured |
| P99 latency | 153.56 ms | measured |
| FPS | 8.87 | measured |
| Avg steps | 12.0 | measured |
| Static steps | 12.0 | measured |
| Dynamic steps | 12.0 | measured |
Runtime epsilon sweep on the same checkpoint, without retraining:
| Epsilon | Mean latency | FPS | Avg steps | Static steps | Dynamic steps | Reading |
|---|---|---|---|---|---|---|
| 0.20 | 99.31 ms | 10.07 | 12.0 | 12.0 | 12.0 | no triggering |
| 0.30 | 40.33 ms | 24.80 | 4.33 | 9.33 | 4.33 | first clear triggering |
| 0.40 | 44.09 ms | 22.68 | 4.33 | 7.0 | 4.33 | adaptive active |
| 0.50 | 32.98 ms | 30.32 | 3.67 | 4.0 | 4.33 | very promising zone |
| 0.60 | 34.96 ms | 28.61 | 3.67 | 3.33 | 3.67 | adaptive active |
| 0.80 | 34.06 ms | 29.36 | 3.67 | 3.0 | 3.67 | adaptive active |
| 1.00 | 33.87 ms | 29.53 | 3.67 | 3.0 | 3.67 | adaptive active |
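The sweep above changes only the inference-time stopping threshold. A minimal harness for this kind of sweep looks like the following; `run_policy(epsilon)` is a hypothetical callable that runs one forward pass at the given threshold and returns the number of PDE steps used (the real scripts live in `experiments/step3_lerobot/`).

```python
import time

def sweep_epsilon(run_policy, epsilons, n_trials=20):
    """Runtime-only sweep: no retraining, only the stop threshold changes."""
    results = {}
    for eps in epsilons:
        t0 = time.perf_counter()
        steps = [run_policy(eps) for _ in range(n_trials)]
        dt_ms = (time.perf_counter() - t0) * 1000 / n_trials
        results[eps] = {
            "mean_latency_ms": dt_ms,
            "avg_steps": sum(steps) / len(steps),
        }
    return results

# dummy policy mimicking the table: no triggering below ~0.3, then ~4 steps
demo = sweep_epsilon(lambda eps: 12 if eps <= 0.2 else 4, [0.2, 0.5])
```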
Quality control on a clean validation split with the same checkpoint:
| Epsilon | Val MSE | Val L1 | Val steps | Batch latency val | Reading |
|---|---|---|---|---|---|
| 0.02 | 16.69330 | 2.19973 | 12.0 | 7.14 ms | quality reference |
| 0.30 | 16.69442 | 2.20014 | 9.33 | 6.62 ms | nearly identical quality |
| 0.50 | 17.05135 | 2.27522 | 4.0 | 3.14 ms | slight quality degradation, large compute gain |
Correct reading for Step 3 at this stage:
- offline real learning on SO-101 is validated,
- the structural memory advantage of FluidVLA is supported by the benchmarks already present in the Spatial Memory Scaling and Video Scaling sections,
- but we have not yet finalized a canonical memory measurement specific to this Step 3 run nor a live test on the robot,
- and the real-time / adaptive compute demonstration on this real checkpoint is becoming credible via the runtime epsilon sweep: the most interesting point observed so far is `epsilon=0.30`, which maintains nearly identical validation quality while finally triggering early stopping; `epsilon=0.50` goes further on compute but begins to slightly degrade quality.
Current adaptive compute diagnostic on real data:
- during training, early stop is disabled by design in the PDE core,
- in evaluation, the full real run still stays at `12.0/12` steps,
- the current Step 3 setting `epsilon=0.02` is stricter than the video presets that show adaptation,
- therefore one must clearly distinguish "real pipeline that learns" from "real adaptive compute already validated": the former is acquired, the latter remains to be calibrated,
- one must also clearly distinguish "architectural memory advantage already measured in the project" from "canonical memory profile of the real Step 3 runtime", the latter still needing proper measurement.
What will count as canonical Step 3 results:
- full training on the complete real dataset,
- clean GPU benchmark without concurrent load,
- dedicated benchmark of the best real checkpoint with explored adaptive settings,
- live inference connected to the current LeRobot stack,
- then, ideally, a dual-camera `front + wrist` variant.
The first live test of the V1 checkpoint (so101_front_full/best.pt, Val MSE = 16.69330) on the real SO-101 robot revealed a frozen policy: joints barely vary regardless of the image, confirmed by disabling all smoothing (filter_alpha=1.0).
Root cause diagnosis -- 6 causes identified, ranked by impact:
1. `AdaptiveAvgPool3d(1)` in `vla_models.py` crushes all spatial information from the PDE backbone before the decision. The model no longer knows where objects are in the image.
2. Absolute action supervision + MSE on a dataset where action ~ proprio (gap ~1-3 deg) directly rewards immobility.
3. No statistical normalization of actions or proprios -- raw scales (shoulder_lift: +/-90 deg vs gripper: 0-100) imbalance the MSE.
4. Single-step prediction without temporal horizon -- no chunking, no trajectory, guaranteed convergence toward a static attractor under ambiguity.
5. Front-only insufficient for the fine grasping phase (no wrist view).
6. Action MLP too small (134->256->128->6) to exploit the backbone signal.
Confirmation from logs: raw_delta values are constant at +/-0.001 deg during 480+ consecutive steps, confirming the model learned a fixed point and not a policy.
What the logs prove against other hypotheses:
- It is not the EMA filter (constant even with `alpha=1.0`),
- It is not `max_delta` (deltas are well below the limit),
- It is not the camera (open and feeding the loop),
- It is not the adaptive compute (PDE steps vary normally).
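Cause (1) and its correction can be illustrated without PyTorch. The NumPy stand-ins below mimic `AdaptiveAvgPool3d(1)` versus `AdaptiveAvgPool3d((1, 4, 4))` on a single-channel feature map (the feature maps are synthetic): global pooling erases object position, a 4x4 grid retains it.

```python
import numpy as np

def global_pool(feat):
    # AdaptiveAvgPool3d(1)-style collapse: one scalar per channel
    return feat.mean(axis=(-2, -1))

def grid_pool(feat, g=4):
    # AdaptiveAvgPool3d((1, g, g))-style: keep a g x g grid per channel
    c, h, w = feat.shape
    return feat.reshape(c, g, h // g, g, w // g).mean(axis=(2, 4))

# same "object" activation at two different positions
a = np.zeros((1, 16, 16)); a[0, 2, 2] = 1.0     # top-left
b = np.zeros((1, 16, 16)); b[0, 13, 13] = 1.0   # bottom-right

assert np.allclose(global_pool(a), global_pool(b))   # position erased
assert not np.allclose(grid_pool(a), grid_pool(b))   # position retained
```

This is exactly why a policy trained through the global pool can only regress the dataset-average action: where the cube is never reaches the action head.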
V2 corrections implemented:
| # | Problem | File | Correction |
|---|---|---|---|
| 1 | Global pool crushes spatial info | `vla_models.py` | `AdaptiveAvgPool3d((1, 4, 4))` -- retains 16 spatial positions |
| 2 | Absolute action biased toward immobility | `convert_lerobot_dataset.py` | `--delta-actions` mode: target = normalized displacement |
| 3 | No state/action normalization | `convert_lerobot_dataset.py` | Normalizes proprios and actions by mean/std, saves `norm_stats.json` |
| 4 | No temporal chunking | `vla_models.py`, `train_lerobot_so101.py` | Optional `chunk_size` (1 by default, 4-8 recommended) |
| 5 | Action MLP undersized | `vla_models.py` | New `SpatialActionHead`: 2054->512->256->6*chunk |
| 6 | No directional signal in the loss | `train_lerobot_so101.py` | Cosine direction loss added |
Architectural impact: the PDE core is strictly unchanged. No modifications in fluid_layer.py, video_models.py, diffusion.py. Corrections target exclusively the pooling layer, the action head, the supervision formulation, and preprocessing.
V2 modified files:
- `fluidvla/core/vla_models.py` -- `SpatialActionHead` + `FluidBotVLA` with `spatial_pool_size` and `chunk_size`
- `fluidvla/core/fluid_model.py` -- added `SpatialActionHead` export
- `fluidvla/core/__init__.py` -- added `SpatialActionHead` export
- `experiments/step3_lerobot/convert_lerobot_dataset.py` -- normalization, delta-actions, static filtering
- `experiments/step3_lerobot/train_lerobot_so101.py` -- cosine loss, chunking, V2 config in checkpoint
- `experiments/step3_lerobot/lerobot_inference.py` -- `ActionDenormalizer`, normalized proprio, chunk execution
Normalization statistics measured on the dataset:
proprio mean : [ -1.29 -71.31 67.14 74.21 10.03 13.57]
proprio std : [14.03 42.11 35.66 9.47 12.44 15.86]
delta mean : [ 0.17 -0.77 -1.98 -0.25 -0.08 -3.55]
delta std : [3.67 5.07 4.55 2.39 1.68 8.59]
Static frames (max|delta| < 0.5°): 0/18,081 (0.0%)
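The V2 supervision can be sketched from the statistics above. The helper names are hypothetical; only the delta/normalization formulation and the cosine direction loss follow the corrections described in this section.

```python
import numpy as np

def make_targets(proprio, next_proprio, mean, std):
    """V2 target: normalized joint displacement instead of absolute action."""
    delta = next_proprio - proprio
    return (delta - mean) / std

def cosine_direction_loss(pred, target, eps=1e-8):
    """Penalize moving the wrong way in joint space, independent of magnitude."""
    num = (pred * target).sum(axis=-1)
    den = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1) + eps
    return float((1.0 - num / den).mean())

# measured per-joint statistics from the block above
delta_mean = np.array([0.17, -0.77, -1.98, -0.25, -0.08, -3.55])
delta_std = np.array([3.67, 5.07, 4.55, 2.39, 1.68, 8.59])

# a transition that moves exactly by the dataset-average delta normalizes to ~0
avg_target = make_targets(np.zeros(6), delta_mean, delta_mean, delta_std)
```

Note how this reverses the V1 failure mode: under absolute-action MSE, "output the proprio" was near-optimal; under normalized deltas, a constant output is immediately penalized in both magnitude and direction.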
Current status:
- V2 dataset re-conversion: in progress,
- V2 training (`spatial_pool_size=4`, `chunk_size=1`, `--cosine_loss`): pending,
- V2 live test: pending.
For the continuation of Step 3, improvement tracks are organized by cost and risk level.
Level 0 - Critical corrections (V2 -- in progress)
- frozen policy diagnostic -- done, 6 root causes identified,
- V2 pipeline implementation (spatial pool, delta-actions, normalization, cosine loss) -- done,
- V2 training and validation that the model learns non-zero deltas -- to do,
- V2 live test on the robot -- to do.
Level 1 - Quick Wins runtime
- epsilon runtime tuning, now validated as the main lever to activate adaptive compute without retraining; `epsilon=0.30` is the current best candidate,
- FP16 inference to reduce GPU cost and runtime memory footprint,
- systematic replacement of `torch.no_grad()` with `torch.inference_mode()` in dedicated inference paths,
- stabilization of a latency-oriented Step 3 runtime preset with clear documentation of the quality/compute trade-off.
Level 2 - Software turbo
- `torch.compile()` trial on the Step 3 inference loop,
- TensorRT export exploration for stabilized Step 3 checkpoints,
- comparative benchmark: native PyTorch vs compiled PyTorch vs TensorRT,
- canonical measurement of latency, VRAM, and jitter for each backend.
Level 3 - Robotic and perception improvements
- action chunking -- implemented in V2, to validate with `chunk_size=4` after `chunk_size=1` validation,
- dual-camera `front + wrist` fusion to reduce visual ambiguity, especially on depth and grasping phases,
- comparison of `front-only`, `wrist-only`, then `front + wrist`,
- validation of the dual-camera effect on action quality and, potentially, on natural reduction of adaptive steps,
- cleaner camera calibration and geometric consistency between views before proper synchronous fusion,
- possible addition of lightweight action smoothing or low-pass filter to limit jitter during live deployment,
- test of a hybrid policy + robot safeguards mode for initial real trials.
Level 4 - System architecture
- stricter separation of camera, inference, and motor control loops,
- software jitter reduction for extended live tests,
- study of a lower-level inference loop if Python becomes the bottleneck,
- option of retraining with lower `max_steps` if latency must be constrained by design rather than at runtime only.
Level 5 - Data, robustness, and learning efficiency
- dataset curation to remove or tag the noisiest, most hesitant, or heavily occluded episodes,
- comparison of full dataset vs cleaner subset to measure the effect on model stability and confidence,
- analytical segmentation of demos into `approach`, `grasp`, `transport`, `place` phases to understand where the model hesitates most,
- visual robustness tests beyond simple Val MSE: occlusions, blur, small lighting variations, camera displacement, modest placement changes,
- comparison of action quality under perturbations between `front`, `wrist`, and the future `front + wrist`.
Level 6 - Policy outputs and confidence control
- action normalization -- done in V2 (mean/std normalization per joint, delta-actions),
- addition of a confidence or uncertainty score to slow down or stabilize commands in ambiguous cases,
- possibility of retaining the previous action or limiting action delta when confidence drops,
- explicit instrumentation of zones where the policy becomes hesitant during live deployment.
Level 7 - Compression and edge variants
- creation of a dedicated `edge` preset with smaller image, fewer frames, or reduced `d_model`,
- distillation of a strong Step 3 checkpoint toward a lighter and faster variant,
- comparison between purely runtime optimization and cost reduction through model design,
- targeted retraining with lower `max_steps` if runtime optimizations alone are insufficient.
Level 8 - Profiling and industrialization
- layer-by-layer profiling to identify the real bottleneck between image encoder, PDE core, CPU/GPU copies, and camera capture,
- canonical benchmark of jitter and temporal stability, not just mean latency,
- backend comparison: native PyTorch vs `torch.compile()` vs TensorRT on the same checkpoints,
- exploration of a lower-level rewrite only if stabilized Python results show that software runtime becomes the real blocker.
Correct reading: levels 1 and 2 remain the next most cost-effective levers. Levels 3 and 4 become priorities once a first stable live mode exists. Levels 5 to 8 serve to transform a working prototype into a more robust, explainable, and credible edge robotics stack.
Recommended Step 3 LeRobot battle plan
To avoid scattering efforts, the recommended order is as follows:
1. frozen policy diagnostic -- done,
2. V2 pipeline implementation -- done,
3. V2 dataset re-conversion -- in progress,
4. V2 training with `spatial_pool_size=4`, `chunk_size=1`, `--cosine_loss`,
5. verify that the V2 model learns non-zero deltas (Val MSE on normalized deltas < 1.0),
6. V2 live test on the robot,
7. if V2 works -> move to `chunk_size=4` then `chunk_size=8`,
8. `epsilon=0.30` as default runtime preset,
9. FP16 inference + `torch.inference_mode()`,
10. `wrist-only` baseline, then `front + wrist` comparison,
11. `torch.compile()` + TensorRT,
12. heavy system optimization if necessary.
Correct reading: points 1 and 2 are done, point 3 is in progress. Points 4 to 6 are the current blocker. Points 7 to 12 become relevant once the robot moves.
To avoid any loss of rigor, the repository distinguishes three categories.

Measured:
- Step 0 classification results,
- image and video benchmarks present in the project,
- Step 1 adaptive compute calibration,
- Step 2a synthetic imitation learning,
- Step 2b Isaac Sim camera validation,
- historical medical 3D validation on BrainTumour,
- active MSD medical pipeline,
- real LeRobot -> FluidVLA `.npz` bridge validated on the local SO-101 dataset,
- Step 3 smoke training offline on real LeRobot data with `action_dim=6` and `proprio_dim=6`,
- complete conversion of the real SO-101 `front` dataset over 44 episodes,
- full Step 3 single-camera `front` run with observed convergence and best `Val MSE` measured at `16.69330`,
- first V1 live test on the real SO-101 robot: frozen policy confirmed (raw deltas constant at +/-0.001 deg),
- V2 diagnostic: 6 root causes identified and documented,
- V2 pipeline implemented: spatial pool 4x4, normalized delta-actions, cosine loss, action chunking,
- dataset normalization statistics measured: `delta_std=[3.67, 5.07, 4.55, 2.39, 1.68, 8.59]`,
- static frames in the dataset: 0/18,081 (0.0%) with a 0.5 deg threshold.
Estimated:
- some Transformer comparisons used as scaling intuition,
- Jetson projection,
- extrapolations beyond exact benchmark.
Not yet implemented or validated:
- persistent BeliefField memory,
- imaginary rollout,
- local text module,
- symbols and scratchpad,
- native dual-camera fusion,
- V2 training validation and V2 live test on the robot,
- full real hardware benchmark,
- canonical adaptive compute validation on real robot data.
- Step 0: attention-free classification
- Step 1: video + scaling + adaptive compute
- Step 1b: 3D medical segmentation and volumetric validation
- Step 2a: synthetic imitation learning
- Step 2b: Isaac camera validation
- Step 2c: more physical collection and training
- Step 2d: URDF viewer and 3D demo
- Step 3a: real LeRobot dataset bridge -> FluidVLA offline, validated
- Step 3b: offline single-camera training on full real SO-101 dataset, V1 converged (Val MSE 16.69), V2 in progress
- Step 3b-v2: frozen policy diagnostic + V2 corrections (spatial pool, delta-actions, normalization, cosine loss), implemented, training pending
- Step 3c: proper benchmark and adaptive compute validation on real checkpoint
- Step 3d: dual-camera `front + wrist` fusion on real LeRobot data
- Step 3e: live SO-101 inference via current LeRobot stack, first V1 test executed (frozen policy), V2 test pending
- Step 3f: embedded Jetson / edge robotics benchmark
- leaky memory / synaptic fatigue,
- dynamic spatial pruning,
- semantic inhibition,
- modulated spatial integration,
- lightweight Hebbian adaptation.
- persistent BeliefField,
- Imaginary Rollout,
- distillation of heavier teachers toward a lighter PDE student.
- local text diffusion,
- symbolic anchors,
- external scratchpad,
- reasoning rollout,
- extensions toward certain scientific domains.
BeliefField is the persistent memory component designed to maintain a latent state across multiple calls, without falling back to a giant KV-cache.
BeliefField_t = λ · BeliefField_{t-1} + Write(u_t, obs_t) - Decay(BeliefField_{t-1})
Action_t = Policy(u_t, Read(BeliefField_t))
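The recurrence can be made concrete with a minimal toy. In the design, `Write`, `Read`, and `Decay` are learned operators; here they are replaced by fixed stand-ins (everything in this sketch is an assumption), which is enough to see the key property: the state persists across calls yet stays bounded, unlike a growing KV-cache.

```python
import numpy as np

class BeliefField:
    """Toy of: BF_t = lam * BF_{t-1} + Write(u_t, obs_t) - Decay(BF_{t-1})."""

    def __init__(self, dim, lam=0.95, decay=0.01):
        self.state = np.zeros(dim)   # fixed-size latent memory
        self.lam = lam               # retention factor
        self.decay = decay           # forgetting rate

    def step(self, u, obs):
        write = np.tanh(u + obs)            # Write(u_t, obs_t), bounded
        decay = self.decay * self.state     # Decay(BF_{t-1})
        self.state = self.lam * self.state + write - decay
        return self.state                    # Read is identity in this toy

bf = BeliefField(dim=4)
for _ in range(200):                         # 200 calls, constant memory
    s = bf.step(np.ones(4), np.ones(4))
```

Because the write term is bounded and `lam - decay < 1`, the state converges to a finite fixed point instead of accumulating, regardless of how many steps are run.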
Imaginary Rollout is the short-range planning component: a few candidate latent futures, a fast action selection, and a local see -> simulate -> act mechanism.
Symbolic anchors aim to manipulate a small set of salient symbols or identities, at a cost that scales with the number of symbols rather than with the full spatial field.
The ScratchPad is distinct from BeliefField:
- BeliefField for continuous latent memory,
- ScratchPad for explicit structured memory needed for certain multi-step reasoning tasks.
- Root README enriched to track Step 3 like other repository experiments.
- Real LeRobot -> FluidVLA bridge implemented and validated on the local SO-101 dataset.
- Smoke conversion validated on 2 episodes with coherent `frames`, `proprios`, `actions` outputs.
- Smoke training validated with generation of `best.pt`, `history.json`, and `benchmark.json`.
- Full `front` conversion completed on 44 episodes in `data/step3_lerobot/so101_front_full`.
- Full single-camera `front` run launched; log in `checkpoints/step3_lerobot/so101_front_full_train.log`.
- The full run then passed its first plateau: epoch 7, `Val MSE = 17.82738`, `Val L1 = 2.25727`, eval latency `~6.77 ms`.
- The run continues improving at epoch 8 with `Val MSE = 17.39248` and `Val L1 = 2.21785`.
- Epochs 9 and 10 extend this trend with a new best point at `Val MSE = 16.69330` at epoch 10.
- Epoch 11 drops on the train side but rises in validation to `17.39699`, leaving the best checkpoint unchanged at epoch 10.
- Epoch 12 currently confirms a validation plateau above epoch 10 with `Val MSE = 17.35693` and `Val L1 = 2.38019`.
- Code diagnostic confirmed: adaptive compute cannot activate in train mode and has not yet emerged in eval with `epsilon=0.02`.
- Runtime epsilon sweep executed on `best.pt`: no effect up to `0.20`, then clear triggering from `0.30` onward with `avg_steps ~ 4.33` and latency around `40 ms`; the `0.50-1.00` zone drops to around `3.67` steps and `~33-35 ms`.
- Quality control then executed on a clean validation split: `epsilon=0.30` maintains nearly the same quality as `0.02`, while `0.50` slightly degrades validation to gain further compute savings.
- First live test of the V1 checkpoint (`so101_front_full/best.pt`) on the real SO-101 robot.
- Result: frozen policy, the robot barely moves. Raw deltas constant at +/-0.001 deg during 480+ steps.
- Confirmed with `filter_alpha=1.0` and `max_delta=6`: the problem comes from the model, not the filter or limits.
- Full diagnostic executed on the 6 pipeline files (vla_models, video_models, fluid_layer, train, convert, inference).
- 6 root causes identified: (1) global pool destroys spatial info, (2) absolute action biased toward immobility, (3) no normalization, (4) no chunking, (5) MLP too small, (6) no directional signal.
- V2 pipeline implemented: `vla_models.py` (SpatialActionHead, spatial_pool_size, chunk_size), `convert_lerobot_dataset.py` (delta-actions, normalization, static filtering), `train_lerobot_so101.py` (cosine loss, chunking, V2 config), `lerobot_inference.py` (ActionDenormalizer, normalized proprio, chunk execution).
- Facades updated: `fluid_model.py` and `__init__.py` export `SpatialActionHead`.
- PDE core strictly unchanged (`fluid_layer.py`, `video_models.py`, `diffusion.py`).
- V2 dataset re-conversion launched with `--delta-actions --filter-static 0.5 --subsample-static 4`.
- Delta statistics measured: `mean=[0.17, -0.77, -1.98, -0.25, -0.08, -3.55]`, `std=[3.67, 5.07, 4.55, 2.39, 1.68, 8.59]`.
- Unexpected result: 0% static frames (|delta| < 0.5 deg) -- the dataset is more dynamic than expected; the problem is clearly algorithmic.
- Verify that V2 re-conversion is complete and that `norm_stats.json` is present.
- Launch V2 training: `--spatial_pool_size 4 --chunk_size 1 --cosine_loss --epochs 100`.
- Monitor that Val MSE (on normalized deltas) drops below 1.0 -- a sign the model is learning something beyond immobility.
- If it converges -> test live on the robot and verify that raw_delta varies with the scene.
- If raw_delta varies -> move to `chunk_size=4` for a planning horizon.
- If everything remains frozen -> investigate whether the 4x4 spatial pool is sufficient or whether a larger MLP / more data is needed.
The repository contains a standalone local server for orchestrating training, inference, and model comparison.
Currently integrated features:
- dataset scanning,
- automatic checkpoint scanning,
- training and inference job launching,
- PNG, HTML, and JSON file rendering,
- modular SPA for experiments,
- native Dataset Explorer,
- REST API and WebSocket streams.
Useful entry points:
- fluidvla_server.py
- fluidvla_platform/dataset_explorer.py
- fluidvla_platform/interactive.html
- start_platform.bat
- fluidvla/core/README.md
- experiments/step1b_medical_msd
| Directory | Produced by | Consumed by |
|---|---|---|
| `data/step1_video/` | download or Moving MNIST generation | `train_step1_video.py` |
| `data/step1b_medical_msd/<Task>/` | manual import of MSD datasets | `train_fluidvla_msd.py`, `train_unet3d_msd.py`, `infer_msd.py` |
| `data/step2_sim/` | `isaac_env.py --mode synthetic` | `train_step2.py` |
| `data/step2a_synthetic/` | `synthetic_env.py` | `train_step2a.py` |
| `data/step2c_isaac/` | `isaac_env.py --mode collect` | fine-tuning / more physical Step 2 |
| `data/step3_lerobot/` | conversion from real local LeRobot datasets | `train_lerobot_so101.py`, Step 3 benchmarks |
| `checkpoints/fluidvla/<Task>/` | `train_fluidvla_msd.py` | `infer_msd.py`, web platform |
| `checkpoints/unet3d/<Task>/` | `train_unet3d_msd.py` | `infer_msd.py`, web platform |
| `checkpoints/step2_sim/` | `train_step2.py` | Isaac evaluation / fine-tuning |
| `checkpoints/step2a_synthetic/` | `train_step2a.py` | URDF viewer or controlled resumptions |
| `checkpoints/step2c_isaac/` | `train_step2.py` on `data/step2c_isaac/` | URDF viewer and Step 3 bridge |
| `checkpoints/step2d_so101_urdf/` | `so101_urdf_viewer.py` | demonstration and inspection |
| `checkpoints/step3_lerobot/` | `train_lerobot_so101.py` on converted real LeRobot data | benchmark, future live and edge passes |
The `data/README.md` file records these conventions in compact form.
The project does not yet have a finalized public preprint or definitive DOI. In the meantime, the recommended citation form is a provisional software citation.
@software{fluidvla_prototype,
title = {FluidVLA: Transformer-Free Vision-Language-Action via Reaction-Diffusion PDEs},
author = {infinition},
year = {2026},
note = {Research prototype, code repository},
url = {https://github.com/infinition/FluidVLA}
}

When a preprint or stabilized public release exists, this section should be replaced with the canonical reference.
- Status: research prototype
- Paper: not released yet
- Code: active and evolving
- Claim level: strong architectural evidence on several axes, but not yet a final product or definitive real-hardware paper result
Project maintained by infinition.
If a public page, contact email, or official mirror repository needs to be exposed, this section is the proper place to add it.