We are More than Our Joints

Predicting How 3D Bodies Move

"We are more than our joints", or MOJO for short, is a solution to stochastic motion prediction of expressive 3D bodies. Given a short motion from the past, MOJO generates diverse plausible motions in the near future.



A key step towards understanding human behavior is the prediction of 3D human motion. Successful solutions have many applications in human tracking, HCI, and graphics. Most previous work focuses on predicting a time series of future 3D joint locations given a sequence of 3D joints from the past. This Euclidean formulation generally works better than predicting pose in terms of joint rotations. Body joint locations, however, do not fully constrain 3D human pose, leaving degrees of freedom (like rotation about a limb) undefined, which makes it hard to animate a realistic human from only the joints. Note that the 3D joints can be viewed as a sparse point cloud. Thus, the problem of human motion prediction can be seen as a problem of point cloud prediction. With this observation, we instead predict a sparse set of locations on the body surface that correspond to motion capture markers. Given such markers, we fit a parametric body model to recover the 3D body shape and pose of the person. These sparse surface markers also carry detailed information about human movement that is not present in the joints, increasing the naturalness of the predicted motions. Using the AMASS dataset, we train MOJO (More than Our JOints), which is a novel variational autoencoder with a latent DCT space that generates motions from latent frequencies. MOJO preserves the full temporal resolution of the input motion, and sampling from the latent frequencies explicitly introduces high-frequency components into the generated motion. We note that motion prediction methods accumulate errors over time, resulting in joints or markers that diverge from true human bodies. To address this, we exploit the body model and fit SMPL-X to the predictions at each time step, projecting the solution back onto the space of valid bodies. These valid markers are then propagated in time. Quantitative and qualitative experiments show that our approach produces state-of-the-art results and realistic 3D body animations.
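To make the idea of "frequencies along time" concrete, here is a minimal sketch of a discrete cosine transform (DCT) applied to a marker sequence. It is not MOJO's network: the helper names (`dct_matrix`, `to_frequencies`, `from_frequencies`), the sequence length, and the toy trajectories are all assumptions made for illustration. It only shows that the transform keeps one coefficient per frame, so the full temporal resolution is preserved, and that changing the upper frequency band changes the fine detail of the reconstructed motion.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis of size n x n; row k is the k-th frequency."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] /= np.sqrt(2.0)  # rescale the constant row so the basis is orthonormal
    return basis


def to_frequencies(motion: np.ndarray) -> np.ndarray:
    """Map a (T, D) motion to (T, D) DCT coefficients along the time axis."""
    return dct_matrix(motion.shape[0]) @ motion


def from_frequencies(coeffs: np.ndarray) -> np.ndarray:
    """Inverse transform; for an orthonormal basis this is the transpose."""
    return dct_matrix(coeffs.shape[0]).T @ coeffs


# Hypothetical toy data: 60 frames, 4 markers x 3 coordinates.
rng = np.random.default_rng(0)
T, D = 60, 12
t = np.linspace(0.0, 1.0, T)[:, None]
motion = np.sin(2 * np.pi * t * np.arange(1, D + 1) / D)  # smooth trajectories

coeffs = to_frequencies(motion)

# Perturb only the upper half of the frequency band, then go back to time.
noisy = coeffs.copy()
noisy[T // 2:] += rng.normal(scale=0.05, size=noisy[T // 2:].shape)
perturbed = from_frequencies(noisy)

print(coeffs.shape, perturbed.shape)
```

In this toy setting the random perturbation stands in for sampling; in a generative model the perturbation would instead come from a learned latent distribution.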

MOJO in a nutshell:

  • A body surface marker-based representation. Compared to joint locations, body surface markers carry richer information about body shape and constrain more of the body's degrees of freedom. Compared to joint rotations, markers lie in Euclidean space, which is easier for neural networks to learn.
  • A conditional VAE with latent frequencies. With latent frequencies, the generated motion has more high-frequency components and hence looks more realistic. With DLow as an advanced sampler in the latent space, MOJO produces highly diverse future motions from the same motion seed.
  • A recursive marker reprojection scheme. At test time, this scheme recovers the body meshes from the generated markers. Reprojecting the markers onto the mesh template at each time step keeps them in the valid body space, and hence can eliminate error accumulation in the recurrent network.
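The recursive reprojection idea can be sketched with a toy example. Everything here is a stand-in chosen for illustration: `predict_next` replaces the recurrent predictor with drift plus noise, and `project_to_valid_body` replaces fitting a body model with a much simpler constraint, namely that two markers on a rigid limb keep a fixed distance (`LIMB_LENGTH`, a hypothetical value). The point is only the loop structure: project after every predicted step, then feed the projected markers into the next step.

```python
import numpy as np

LIMB_LENGTH = 0.3  # hypothetical rigid distance between the two markers


def predict_next(markers: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Stand-in for a learned predictor: constant drift plus noise."""
    drift = np.array([0.01, 0.0, 0.0])
    return markers + drift + rng.normal(scale=0.01, size=markers.shape)


def project_to_valid_body(markers: np.ndarray) -> np.ndarray:
    """Toy stand-in for body fitting: restore the limb length, keep the midpoint."""
    mid = markers.mean(axis=0)
    direction = markers[1] - markers[0]
    direction = direction / np.linalg.norm(direction)
    half = 0.5 * LIMB_LENGTH * direction
    return np.stack([mid - half, mid + half])


def rollout(seed: np.ndarray, steps: int, reproject: bool,
            rng: np.random.Generator) -> np.ndarray:
    """Predict step by step; optionally project each step before reusing it."""
    frames = [seed]
    for _ in range(steps):
        nxt = predict_next(frames[-1], rng)
        if reproject:
            nxt = project_to_valid_body(nxt)
        frames.append(nxt)  # the (projected) markers drive the next step
    return np.stack(frames)


def limb_lengths(frames: np.ndarray) -> np.ndarray:
    """Distance between the two markers in every frame."""
    return np.linalg.norm(frames[:, 1] - frames[:, 0], axis=1)


seed = np.array([[0.0, 0.0, 0.0], [LIMB_LENGTH, 0.0, 0.0]])
free = rollout(seed, steps=100, reproject=False, rng=np.random.default_rng(0))
proj = rollout(seed, steps=100, reproject=True, rng=np.random.default_rng(0))

print("max length error without reprojection:",
      np.abs(limb_lengths(free) - LIMB_LENGTH).max())
print("max length error with reprojection:",
      np.abs(limb_lengths(proj) - LIMB_LENGTH).max())
```

Without the projection step the limb length wanders as noise accumulates; with it, every frame satisfies the constraint by construction. A real system would replace the toy projection with a body-model fit.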



Yan Zhang, Michael J. Black, Siyu Tang
We are More than Our Joints: Predicting how 3D Bodies Move
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021

@inproceedings{zhang2021mojo,
      title={We are More than Our Joints: Predicting how 3D Bodies Move},
      author={Zhang, Yan and Black, Michael J and Tang, Siyu},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      year={2021}
}

