Humans in 4D: Tracking People in 3D Over Time Using Transformers

New research from Berkeley presents an approach that, given any video in the wild, can jointly reconstruct the underlying humans in 3D and track them over time.

Image Credit: 4DHumans

A new paper from Berkeley researchers demonstrates a state-of-the-art approach for reconstructing detailed 3D models of humans and tracking them over time in video. The research was enabled by Stability AI, which provided critical compute resources to develop and train the models.

The paper details an approach that takes any monocular video and jointly reconstructs the humans in 3D while tracking their identities over time. The system reliably tracks multiple people even through challenges like occlusion and unusual poses.

HMR 2.0 is a fully “transformerized” version of a network for Human Mesh Recovery

At the core of the system is Human Mesh Recovery (HMR) 2.0, a fully "transformerized" neural network architecture for predicting 3D human meshes from single images. It leverages the powerful self-attention mechanisms of transformers to understand spatial context and estimate detailed body shape, pose, and clothing. The researchers show HMR 2.0 provides state-of-the-art results for 3D pose estimation, significantly improving on previous convolutional architectures like the original HMR model.
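To make the idea concrete, here is a minimal NumPy sketch of the general pattern a transformer-based mesh-recovery head follows: self-attention mixes information across image patch tokens, and a regression head maps the pooled features to a vector of body parameters. All names, dimensions, and weights here are hypothetical stand-ins, not the actual HMR 2.0 architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head self-attention over image patch tokens (N, d)."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (N, N): each patch attends to all others
    return attn @ v

rng = np.random.default_rng(0)
d = 32
tokens = rng.normal(size=(16, d))                # 16 hypothetical image patch embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
ctx = self_attention(tokens, Wq, Wk, Wv)         # context-aware patch features

# Regress SMPL-style body parameters from pooled features:
# 24 joint rotations (3 values each) + 10 shape coefficients = 82 numbers.
W_head = rng.normal(size=(d, 24 * 3 + 10))
params = ctx.mean(axis=0) @ W_head
print(params.shape)  # (82,)
```

The key point is that attention lets every patch condition on the full image, which is what helps resolve ambiguous limbs and occluded joints better than a purely local convolutional feature map.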

By integrating HMR 2.0 pose predictions into a 3D tracking framework, the researchers built 4DHumans, which jointly reconstructs and tracks people over time. It establishes correspondences between predicted 3D poses to link identities across frames. 4DHumans achieves top performance on the task of multi-person 3D tracking.
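The association step can be illustrated with a simple greedy matcher: detections in consecutive frames are linked when their predicted 3D joints are close in a shared space. This is an illustrative stand-in for 4DHumans' learned association, with hypothetical names and a hypothetical distance threshold.

```python
import numpy as np

def link_identities(prev_poses, curr_poses, max_dist=0.5):
    """Greedily match detections across frames by mean 3D joint distance.

    prev_poses, curr_poses: lists of (J, 3) joint arrays.
    Returns a list of (prev_idx, curr_idx) matches; detections farther
    than max_dist from every candidate are left unmatched.
    """
    pairs = []
    for i, p in enumerate(prev_poses):
        for j, c in enumerate(curr_poses):
            dist = np.linalg.norm(p - c, axis=-1).mean()
            pairs.append((dist, i, j))
    matched_prev, matched_curr, links = set(), set(), []
    for dist, i, j in sorted(pairs):  # closest pairs claim each other first
        if dist > max_dist or i in matched_prev or j in matched_curr:
            continue
        matched_prev.add(i)
        matched_curr.add(j)
        links.append((i, j))
    return links

# Two people: detections arrive in a different order in the next frame.
a, b = np.zeros((24, 3)), np.ones((24, 3))
links = link_identities([a, b], [b + 0.05, a + 0.02])
print(sorted(links))  # [(0, 1), (1, 0)] — identities re-linked despite the reordering
```

Matching on 3D pose rather than 2D appearance is what lets the tracker recover identities after occlusions, since a person's 3D body configuration changes smoothly even when their image-space appearance does not.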

HMR 2.0 is the backbone of the 4DHumans system used to reconstruct and track humans in 4D

Beyond tracking, the team shows HMR 2.0's high-fidelity poses also provide benefits for action recognition. By using pose features from HMR 2.0, they attain state-of-the-art results on a challenging benchmark for detecting human actions in video.
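As a rough sketch of how pose features can feed a downstream recognizer, the snippet below pools per-frame pose vectors over a clip and applies a linear classifier. This is a hypothetical minimal pipeline, not the paper's actual action detector, and every dimension and weight below is made up for illustration.

```python
import numpy as np

def classify_action(pose_feats, W, b):
    """Temporal-average per-frame pose features, then score action classes."""
    clip = pose_feats.mean(axis=0)   # (d,) average pooling over frames
    logits = clip @ W + b            # (num_actions,) class scores
    return int(np.argmax(logits))

rng = np.random.default_rng(1)
T, d, num_actions = 30, 82, 5        # hypothetical: 30 frames, 82-dim pose vector, 5 classes
pose_feats = rng.normal(size=(T, d)) # stand-in for per-frame pose features
W = rng.normal(size=(d, num_actions))
b = np.zeros(num_actions)
action = classify_action(pose_feats, W, b)
```

The intuition is that more accurate 3D poses give the classifier cleaner input, which is why improvements in mesh recovery carry over to action recognition.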

The work was enabled by Stability AI's generous compute grant, which provided access to high-performance cloud infrastructure. With continued advancements in transformer architectures and increases in compute, techniques like 4DHumans could unlock new applications in robotics, graphics, biomechanics, and other fields that require fine-grained 4D understanding of humans in motion.

The team has shared the code for 4DHumans, as well as demos on Hugging Face and Google Colab.
