
BAAI just dropped ExoActor—a framework that turns video generation models into a robot's imagination. Instead of hand-coding every joint trajectory, you just tell the robot what to do and it "imagines" the execution as a video, then physically performs it.

How It Works

ExoActor has a three-stage pipeline that decouples intent from execution:

  1. Video Generation — A task instruction ("pick up the bottle") and an initial camera view are fed into video models like Kling 3. The system decomposes the instruction into atomic sub-actions (approach → bend → grasp → lift), then generates a third-person video of a human executing those steps.

  2. Motion Estimation — The generated video is analyzed by GENMO (whole-body) and WiLoR (hands) to extract 3D poses, trajectories, and hand configurations frame by frame.

  3. Motion Execution — The extracted motion is fed into a SONIC controller running on a Unitree G1 humanoid robot. The robot physically reproduces the "imagined" demonstration.
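To make the data flow concrete, here is a minimal sketch of the three stages in Python. The function names, data classes, and stubbed bodies are illustrative assumptions, not ExoActor's released API; they only mirror the stage boundaries described above.

```python
"""Minimal sketch of an ExoActor-style pipeline; names and data shapes are illustrative."""
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class ImaginedDemo:
    """Stage 1 output: a generated third-person video of a human doing the task."""
    sub_actions: List[str]                            # e.g. ["approach", "bend", "grasp", "lift"]
    frames: List[Any] = field(default_factory=list)   # RGB frames from the video model


@dataclass
class EstimatedMotion:
    """Stage 2 output: per-frame 3D motion recovered from the generated video."""
    body_poses: List[Any]                             # whole-body poses (GENMO in the paper)
    hand_poses: List[Any]                             # hand configurations (WiLoR in the paper)


def generate_task_video(instruction: str, initial_view: Any) -> ImaginedDemo:
    """Stage 1: decompose the instruction into atomic sub-actions and prompt a
    video model to render a human executing them. Stubbed with a fixed plan."""
    return ImaginedDemo(sub_actions=["approach", "bend", "grasp", "lift"])


def estimate_motion(demo: ImaginedDemo) -> EstimatedMotion:
    """Stage 2: run whole-body and hand pose estimators over every generated frame (stubbed)."""
    return EstimatedMotion(body_poses=[], hand_poses=[])


def execute_motion(motion: EstimatedMotion) -> None:
    """Stage 3: hand the estimated motion to the robot's whole-body controller."""
    print(f"tracking {len(motion.body_poses)} body poses on the robot")


def imagine_and_act(instruction: str, initial_view: Any) -> None:
    demo = generate_task_video(instruction, initial_view)
    motion = estimate_motion(demo)
    execute_motion(motion)


imagine_and_act("pick up the bottle", initial_view=None)
```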

The clever trick is an embodiment transfer step. Video models are trained on humans, not robots. So ExoActor first converts the robot's observation into a human-like reference frame using Nano Banana Pro (Gemini 3.1 Pro) — preserving scene geometry while abstracting away the robot's morphology.
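In practice that step amounts to a single image-editing call that swaps the robot for a human while keeping the scene fixed. The sketch below is a hedged illustration: the `image_editor` callable and the prompt wording are assumptions, not the paper's actual interface to Nano Banana Pro.

```python
def embodiment_transfer(robot_view, image_editor):
    """Rewrite the robot's observation into a human-centric reference image.

    `image_editor` stands in for whatever instruction-following image model is
    available; the prompt below is an illustrative guess, not the paper's.
    """
    prompt = (
        "Replace the humanoid robot in this scene with an adult human standing "
        "in the same position and posture. Keep all objects, the room layout, "
        "lighting and camera viewpoint unchanged."
    )
    return image_editor(prompt=prompt, image=robot_view)
```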

Benchmarks & Performance

ExoActor was validated on three difficulty tiers:

Level | Type | Examples
B (Easy) | Locomotion & navigation | Walking around obstacles, goal-directed navigation
A (Moderate) | Coarse interaction | Sweeping items off a table, sitting on a chair
S (Challenging) | Fine-grained manipulation | Picking up a bottle and placing it upright in a basket

Latency breakdown per second of video:

Module | Time (s)
Embodiment transfer | 10.7
Task decomposition | 2.5
Video generation | 13.2
Whole-body motion estimation | 2.9
Hand motion estimation | 16.4

Hand motion estimation is the bottleneck, taking 16.4 seconds per second of video. The current pipeline is entirely offline — you generate the full video before the robot moves.
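Summing the table explains the "45+ seconds" figure that shows up in the limitations below: about 45.7 s of processing for every second of generated action, with hand pose estimation alone accounting for more than a third of it.

```python
# Per-module processing time per second of generated video, from the table above.
latency_s = {
    "embodiment transfer": 10.7,
    "task decomposition": 2.5,
    "video generation": 13.2,
    "whole-body motion estimation": 2.9,
    "hand motion estimation": 16.4,
}

total = sum(latency_s.values())                 # 45.7 s per second of video
bottleneck = max(latency_s, key=latency_s.get)
print(f"total: {total:.1f} s, bottleneck: {bottleneck}")   # hand motion estimation
```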

What Makes It Different

"The uncomfortable part is that direct retargeting of human motion to a robot often introduces spatial errors. Just feeding the estimated motion directly into the controller preserves geometric fidelity better."

This insight from the paper is worth sitting with. Most approaches try to map human joint angles to robot equivalents (retargeting). ExoActor skips that entirely — it estimates motion from the generated video and lets the controller figure out the physical execution. The result is better spatial alignment for tasks that need precise positioning.
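One way to see the contrast: retargeting converts human joint angles into robot joint targets and tracks those, while the ExoActor route keeps the estimated task-space motion (for example hand and torso trajectories) as the tracking target and lets the controller solve for joint commands. The schematic below uses made-up function and method names purely to illustrate the two routes; it is not the paper's implementation.

```python
def retargeting_route(human_joint_angles, joint_mapper, controller):
    """Classic route: map human joints to robot joints, then track joint targets.
    Mismatches in limb lengths or missing degrees of freedom in `joint_mapper`
    surface as spatial drift of the hands and feet."""
    robot_joint_targets = joint_mapper(human_joint_angles)
    controller.track_joint_targets(robot_joint_targets)


def direct_motion_route(estimated_keypoints, controller):
    """ExoActor-style route: keep the estimated 3D motion as the target and let
    the whole-body controller work out joint commands that reach those points,
    which preserves where the hands actually end up in space."""
    controller.track_task_space_targets(estimated_keypoints)
```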

ExoActor also outperformed traditional RL and imitation learning baselines in interaction-rich scenarios. The video generation model implicitly encodes interaction dynamics that are hard to capture with hand-designed rewards or limited demonstration data.

Competitor comparison:

Method | Approach | Key Limitation
ExoActor | Generated video → motion estimation → execution | Offline, open-loop
VideoMimic | Visual imitation from real video | Needs real demos
X-Humanoid | Human→humanoid video editing | Focuses on data generation, not control
EgoActor | VLM-based locomotion prediction | Same lab, different paradigm

Limitations (There Are Several)

"Current video models prioritize visual plausibility over physical laws."

The generated videos look convincing but hallucinate contact dynamics: the imagined demonstration can show a grasp succeeding even though the underlying physics never check out, so the robot fails when it tries to reproduce it.

Other issues:

  • Open-loop execution: No real-time feedback. The robot commits to the generated plan.
  • Wrist orientation: Fine-grained wrist rotations (vertical vs. horizontal grasps) are unreliable from monocular video.
  • Latency: roughly 45.7 seconds of processing per second of action rules out real-time control.

The authors acknowledge these openly and point to "streaming task imagination" — generating short video segments and executing them concurrently — as the path forward.
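That "streaming" idea maps naturally onto a producer/consumer loop: imagine the next short segment while the robot is still executing the previous one. A rough sketch, with `generate_segment` and `execute_segment` left as stand-ins for the per-segment pipeline and the controller interface:

```python
import queue
import threading


def streaming_imagination(instruction, num_segments, generate_segment, execute_segment):
    """Overlap generation and execution: segment i+1 is imagined while segment i
    is being executed, instead of waiting for the full video up front."""
    segments: queue.Queue = queue.Queue(maxsize=1)

    def producer():
        for i in range(num_segments):
            segments.put(generate_segment(instruction, i))  # slow: video + pose estimation
        segments.put(None)                                   # sentinel: nothing left to imagine

    threading.Thread(target=producer, daemon=True).start()

    while (segment := segments.get()) is not None:
        execute_segment(segment)                             # robot plays back this chunk


# Toy run with stand-in stubs:
streaming_imagination(
    "pick up the bottle", 4,
    generate_segment=lambda task, i: f"{task} / segment {i}",
    execute_segment=print,
)
```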

The Bigger Picture

BAAI is building a family of related systems. ExoActor pairs with EgoActor (VLM-based locomotion prediction) from the same lab. The shared goal is moving humanoid control away from task-specific training and toward generative, generalizable behavior.

Video generation models hold an enormous amount of implicit world knowledge — how objects behave, how humans interact with them, what sequences of actions lead to outcomes. ExoActor is one of the first frameworks to channel that knowledge directly into physical execution.

It's offline, it's slow, and it makes physics mistakes. But the direction is right. A robot that can "imagine" its actions before executing them beats a robot that needs 10,000 human demonstrations for every new task.

Sources

https://arxiv.org/html/2604.27711v1
https://baai-agents.github.io/ExoActor/
https://huggingface.co/papers/2604.27711
https://arxiv.org/abs/2604.27711
https://baai-agents.github.io/RoboNoid/EgoActor/