PRIMAL

Physically Reactive and Interactive Motor Model for Avatar Learning

ICCV 2025


Yan Zhang1, Yao Feng1,3, Alpár Cseke1, Nitin Saini1, Nathan Bajandas1, Nicolas Heron1, Michael J. Black2

1 Meshcapade, 2 Max Planck Institute for Intelligent Systems, Tübingen, 3 Stanford University

Teaser figure.


To build the motor system of an interactive avatar, it is essential to develop a generative motion model that drives the body to move through 3D space in a perpetual, realistic, controllable, and responsive manner. Although motion generation has been extensively studied, most methods do not support "embodied intelligence" due to their offline setting, slow speed, limited motion lengths, or unnatural movements. To overcome these limitations, we propose PRIMAL, an autoregressive diffusion model that is learned with a two-stage paradigm, inspired by recent advances in foundation models. In the pretraining stage, the model learns motion dynamics from a large number of sub-second motion segments, providing "motor primitives" from which more complex motions are built. In the adaptation phase, we employ a ControlNet-like adaptor to fine-tune motor control for semantic action generation and spatial target reaching. Experiments show that physics effects emerge from our training. Given a single-frame initial state, our model not only generates unbounded, realistic, and controllable motion, but also enables the avatar to be responsive to induced impulses in real time. In addition, we can effectively and efficiently adapt our base model to few-shot personalized actions and the task of spatial control. Evaluations show that our proposed method outperforms state-of-the-art baselines. We leverage the model to create a real-time character animation system in Unreal Engine that is highly responsive and natural.
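For illustration, the sketch below shows one way a ControlNet-like adaptor can be attached to a frozen pretrained denoiser: a trainable copy of the backbone injects its features back through zero-initialized projections. The module names, feature sizes, and the control signal are assumptions made for this sketch, not the released implementation.

    # Hedged sketch of a ControlNet-style adaptor on top of a frozen motion denoiser.
    # Module names, dimensions, and the control signal are illustrative assumptions.
    import copy
    import torch
    import torch.nn as nn

    class MotionDenoiser(nn.Module):
        """Toy stand-in for the pretrained diffusion backbone (timestep embedding omitted)."""
        def __init__(self, motion_dim=135, hidden_dim=256, num_layers=4):
            super().__init__()
            self.in_proj = nn.Linear(motion_dim, hidden_dim)
            self.blocks = nn.ModuleList(
                [nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
                 for _ in range(num_layers)])
            self.out_proj = nn.Linear(hidden_dim, motion_dim)

        def forward(self, x_noisy, extra_residuals=None):
            # x_noisy: (batch, frames, motion_dim), a noisy ~0.5 s motion segment
            h = self.in_proj(x_noisy)
            for i, block in enumerate(self.blocks):
                if extra_residuals is not None:
                    h = h + extra_residuals[i]   # inject adaptor features per block
                h = block(h)
            return self.out_proj(h)

    class ControlAdaptor(nn.Module):
        """Trainable copy of the backbone; its features are fed back through
        zero-initialized projections, so training starts exactly at the base model."""
        def __init__(self, base, control_dim=3):
            super().__init__()
            self.copy = copy.deepcopy(base)
            for p in self.copy.parameters():
                p.requires_grad_(True)           # the copy is trainable even if the base is frozen
            self.control_proj = nn.Linear(control_dim, base.in_proj.out_features)
            self.zero_projs = nn.ModuleList(
                [nn.Linear(base.in_proj.out_features, base.in_proj.out_features)
                 for _ in base.blocks])
            for proj in self.zero_projs:         # zero init: no effect before fine-tuning
                nn.init.zeros_(proj.weight)
                nn.init.zeros_(proj.bias)

        def forward(self, x_noisy, control):
            # control: (batch, frames, control_dim), e.g. a spatial target or action code
            h = self.copy.in_proj(x_noisy) + self.control_proj(control)
            residuals = []
            for block, proj in zip(self.copy.blocks, self.zero_projs):
                h = block(h)
                residuals.append(proj(h))
            return residuals

    base = MotionDenoiser()
    for p in base.parameters():
        p.requires_grad_(False)                  # the pretrained base stays frozen
    adaptor = ControlAdaptor(base)
    x = torch.randn(2, 15, 135)                  # 15 frames at 30 fps ~ 0.5 s
    control = torch.randn(2, 15, 3)              # e.g. a spatial target per frame
    pred = base(x, extra_residuals=adaptor(x, control))

In this sketch, only the adaptor receives gradients during adaptation, so fine-tuning stays cheap while the pretrained motion prior remains untouched.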

Demo Videos

More insights into the 0.5-second motion segment formulation.

The key novelty is the formulation: generating a 0.5-second motion segment given only a single-frame initial state. This contrasts with prior work that generates a long future motion conditioned on a past motion. The benefits include reduced overfitting, easier model training, and an avatar that can react to impulses and to classifier-based guidance.
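As a minimal sketch of this formulation (not the PRIMAL API), the rollout below repeatedly denoises a short segment from a single-frame state and re-seeds from the last generated frame; the optional impulse hook stands in for real-time perturbations. DummyDiffusionModel, denoise_step, and the 135-dimensional state are placeholders.

    # Sketch of autoregressive rollout: repeatedly denoise a short segment from a
    # single-frame state and re-seed from the last generated frame. All interfaces
    # here (DummyDiffusionModel, denoise_step, impulse_fn) are hypothetical.
    import torch

    class DummyDiffusionModel:
        """Placeholder exposing the interface the rollout assumes."""
        num_steps = 8
        def denoise_step(self, x, t, cond):
            # Pretend denoising: pull the noisy segment toward the conditioning frame.
            return 0.9 * x + 0.1 * cond.unsqueeze(1)

    @torch.no_grad()
    def generate_segment(model, state, num_frames=15):
        """Generate one ~0.5 s segment (num_frames poses) from a single-frame state."""
        x = torch.randn(state.shape[0], num_frames, state.shape[1])
        for t in reversed(range(model.num_steps)):   # schematic reverse diffusion
            x = model.denoise_step(x, t, cond=state)
        return x

    @torch.no_grad()
    def rollout(model, init_state, num_segments=52, impulse_fn=None):
        """Chain segments into long motion; 52 segments x 15 frames = 780 frames."""
        state, frames = init_state, []
        for i in range(num_segments):
            if impulse_fn is not None:
                state = impulse_fn(state, i)         # e.g. perturb velocities by an impulse
            segment = generate_segment(model, state)
            frames.append(segment)
            state = segment[:, -1]                   # last frame seeds the next segment
        return torch.cat(frames, dim=1)

    motion = rollout(DummyDiffusionModel(), torch.zeros(1, 135), num_segments=4)
    print(motion.shape)                              # torch.Size([1, 60, 135])

Because each segment is conditioned on just one frame, an impulse applied to that state immediately influences the next generated segment.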

To better understand the benefits of our formulation, we compare two settings that are identical except for the motion length: ours generates 15 frames given 1 frame, while the baseline generates 40 frames given 20 frames. We replace in-context conditioning with cross-attention to handle the multi-frame condition (a sketch of both schemes follows the clips below). Both models successfully overfit a 229-frame ballet sequence, and both reproduce the ballet motion when given the first frame(s).

Ballet motion for training.

Ours, given the first frame.

Baseline, given the first 20 frames.
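The sketch below contrasts the two conditioning schemes mentioned above, assuming a generic transformer denoiser: in-context conditioning prepends the conditioning frame(s) to the token sequence, while cross-attention treats them as memory. Dimensions and modules are illustrative assumptions, not the exact architectures used in the comparison.

    # Illustrative contrast between in-context conditioning and cross-attention
    # conditioning for a transformer denoiser. Shapes and modules are assumptions.
    import torch
    import torch.nn as nn

    class InContextDenoiser(nn.Module):
        """Self-attention over [conditioning frames ; noisy frames] as one sequence."""
        def __init__(self, motion_dim=135, hidden_dim=256):
            super().__init__()
            self.proj = nn.Linear(motion_dim, hidden_dim)
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True),
                num_layers=2)
            self.out = nn.Linear(hidden_dim, motion_dim)

        def forward(self, x_noisy, cond_frames):
            tokens = torch.cat([self.proj(cond_frames), self.proj(x_noisy)], dim=1)
            h = self.encoder(tokens)
            return self.out(h[:, cond_frames.shape[1]:])   # drop the conditioning positions

    class CrossAttentionDenoiser(nn.Module):
        """Noisy frames attend to the conditioning frames through cross-attention."""
        def __init__(self, motion_dim=135, hidden_dim=256):
            super().__init__()
            self.proj = nn.Linear(motion_dim, hidden_dim)
            self.decoder = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(hidden_dim, nhead=4, batch_first=True),
                num_layers=2)
            self.out = nn.Linear(hidden_dim, motion_dim)

        def forward(self, x_noisy, cond_frames):
            h = self.decoder(tgt=self.proj(x_noisy), memory=self.proj(cond_frames))
            return self.out(h)

    # "Ours": 15 frames from 1 conditioning frame; "baseline": 40 frames from 20.
    ours = InContextDenoiser()(torch.randn(2, 15, 135), torch.randn(2, 1, 135))
    base = CrossAttentionDenoiser()(torch.randn(2, 40, 135), torch.randn(2, 20, 135))
    print(ours.shape, base.shape)                          # (2, 15, 135) (2, 40, 135)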

First, we generate 780 future frames given the last frame(s) of that ballet sequence, and use the ASR metric to measure the foot-skating ratio (lower is better; a heuristic sketch of such a metric follows the clips below). Ours keeps producing ballet stably, whereas the baseline gradually degrades as time progresses.

Ours, ASR = 0.08.

Baseline, ASR = 0.12.
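For reference, foot skating is commonly quantified as the fraction of frames in which a foot is near the ground yet still slides horizontally; the sketch below implements such a heuristic. The thresholds, frame rate, and joint layout are assumptions and may differ from the exact ASR computation used here.

    # Heuristic foot-skating ratio: fraction of frames where a foot is near the
    # ground (in contact) but still moves horizontally. Thresholds are assumptions.
    import numpy as np

    def skating_ratio(foot_pos, fps=30, height_thresh=0.05, vel_thresh=0.10):
        """foot_pos: (frames, num_feet, 3) foot joint positions in meters, z-up."""
        heights = foot_pos[:-1, :, 2]                        # foot height at frame t
        disp = foot_pos[1:, :, :2] - foot_pos[:-1, :, :2]    # horizontal displacement
        speed = np.linalg.norm(disp, axis=-1) * fps          # horizontal speed in m/s
        skating = (heights < height_thresh) & (speed > vel_thresh)
        return skating.any(axis=1).mean()                    # fraction of skating frames

    # Example: a foot that stays on the floor while drifting forward counts as skating.
    feet = np.zeros((780, 2, 3))
    feet[:, 0, 0] = np.linspace(0.0, 5.0, 780)    # left foot slides 5 m along the ground
    print(float(skating_ratio(feet)))             # 1.0: every frame shows skating

Under a definition like this, ASR = 0.08 would mean that roughly 8% of frames exhibit foot skating.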

Second, we generate 156 frames, conditioned on frame(s) from a different walking sequence. Ours produces a fast and natural transition to ballet, whereas the baseline produces severe artifacts.

Ours, ASR = 0.06.

Baseline, ASR = 0.3.

These results indicate that our formulation makes the model generalize better with respect to motion length and motion semantics.

Citation

   @article{zhang2025primal,
      title={PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning},
      author={Zhang, Yan and Feng, Yao and Cseke, Alp{\'a}r and Saini, Nitin and Bajandas, Nathan and Heron, Nicolas and Black, Michael J},
      journal={arXiv preprint arXiv:2503.17544},
      year={2025}
   }