Humanoid reaction synthesis is pivotal for creating highly interactive and empathetic robots that can seamlessly integrate into human environments, enhancing the way we live, work, and communicate. However, it is difficult to learn the diverse interaction patterns of multiple humans and to generate physically plausible reactions. Kinematics-based approaches suffer from artifacts such as floating feet, foot sliding, body penetration, and other violations of physical plausibility. Existing physics-based methods often rely on kinematics-based methods to generate reference states, and therefore struggle with kinematic noise during action execution; constrained by their reliance on diffusion models, they are also unable to achieve real-time inference. In this work, we propose a Forward Dynamics Guided 4D Imitation method that generates physically plausible, human-like reactions. The learned policy produces such reactions in real time, improving inference speed by roughly 33x and improving reaction quality compared with the existing method. Our experiments on the InterHuman and Chi3D datasets, along with ablation studies, demonstrate the effectiveness of our approach.
To enhance imitation learning, transforming motion-capture data into state-action pairs is crucial. However, deriving precise actions from motion data is typically difficult, requiring high-precision force sensors or advanced motion-tracking techniques. Our goal is to generate accurate state-action pairs from sequences of joint positions; to model state-action relationships, each state must be associated with the action taken in it. During the demonstration-generation phase, we therefore employ a universal motion tracker to seamlessly convert motion-capture data for use in the simulation environment.
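The demonstration-generation step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `track_fn` is a hypothetical stand-in for the universal motion tracker, assumed to take the current simulated state and the next reference pose and return the action it applied together with the resulting simulated state.

```python
import numpy as np

def make_state_action_pairs(joint_pos, track_fn):
    """Convert a mocap joint-position sequence into (state, action) pairs.

    joint_pos : (T, J, 3) array of reference joint positions.
    track_fn  : hypothetical motion tracker; given the current simulated state
                and the next reference pose, returns (action, next_state).
    """
    pairs = []
    state = joint_pos[0]  # initialize the simulated state from the first frame
    for t in range(1, len(joint_pos)):
        action, next_state = track_fn(state, joint_pos[t])
        pairs.append((state.copy(), action))  # associate each state with its action
        state = next_state
    return pairs
```

In practice the tracker rolls the reference motion through the physics simulator, so the recorded states are physically consistent rather than raw mocap frames.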
We believe that an effective policy should foresee the consequences of its actions. After obtaining state-action pairs, we first train two Variational Autoencoders as feature extractors for states and actions, termed the state VAE and the action VAE. A forward dynamics model is then trained to estimate the upcoming state from the current state and action. Because we consider the forward dynamics to be stochastic rather than deterministic, we train the model in the feature space using a contrastive loss.
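The feature-space contrastive objective can be sketched with an InfoNCE-style loss; this is one common choice, assumed here for illustration, since the exact contrastive form is not spelled out above. Matching (predicted, ground-truth) feature rows in a batch act as positives, and all other rows as negatives.

```python
import numpy as np

def info_nce(pred_feat, target_feat, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of feature vectors.

    pred_feat   : (B, D) next-state features predicted by the dynamics model.
    target_feat : (B, D) ground-truth next-state features from the state VAE encoder.
    """
    p = pred_feat / np.linalg.norm(pred_feat, axis=1, keepdims=True)
    t = target_feat / np.linalg.norm(target_feat, axis=1, keepdims=True)
    logits = p @ t.T / temperature                 # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positives lie on the diagonal
```

Training in feature space lets the model spread probability mass over several plausible next states instead of regressing to a single mean state, which is why a contrastive loss suits a stochastic dynamics model better than an L2 loss.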
With the forward dynamics model in place, we advance to the 4D imitation learning phase. The term "4D imitation learning" reflects the incorporation of temporal data in our input states and the dynamics network's ability to forecast future states; the approach transcends basic one-to-one state-action mapping by modeling temporal progression. Given the previous states of both the actor and the reactor, along with the actor's current state, our model forecasts the action for the reactor to execute. The dynamics model then anticipates the reactor's next state from the reactor's preceding state and the proposed action. After encoding the forecasted action and state through the VAE encoders, we apply a contrastive loss to facilitate gradient back-propagation.
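The data flow of one 4D imitation step can be sketched as below. The networks, feature sizes, and history length are all placeholders (random linear maps), not the paper's architecture; the point is the wiring: policy proposes an action, the frozen dynamics model anticipates the next reactor state, and both predictions are compared to ground truth in feature space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: state, action, feature, history length.
D_S, D_A, D_F, H = 6, 3, 8, 4
W_pi  = rng.normal(size=(2 * H * D_S + D_S, D_A))  # stand-in policy weights
W_dyn = rng.normal(size=(D_S + D_A, D_S))          # stand-in forward dynamics
W_enc = rng.normal(size=(D_S, D_F))                # stand-in state-VAE encoder

def policy(actor_hist, reactor_hist, actor_now):
    # actor/reactor histories plus the actor's current state -> reactor action
    x = np.concatenate([actor_hist.ravel(), reactor_hist.ravel(), actor_now])
    return x @ W_pi

def dynamics(reactor_state, action):
    # preceding reactor state + proposed action -> anticipated next state
    return np.concatenate([reactor_state, action]) @ W_dyn

def encode_state(s):
    return s @ W_enc

actor_hist   = rng.normal(size=(H, D_S))
reactor_hist = rng.normal(size=(H, D_S))
actor_now        = rng.normal(size=D_S)
reactor_next_gt  = rng.normal(size=D_S)

action    = policy(actor_hist, reactor_hist, actor_now)
pred_next = dynamics(reactor_hist[-1], action)
# In training, a contrastive loss between these two feature vectors
# back-propagates through the dynamics model into the policy.
feat_pred, feat_gt = encode_state(pred_next), encode_state(reactor_next_gt)
```

Because the dynamics model is differentiable, the policy receives gradient signal about the future its action leads to, which is what makes the supervision "4D" rather than per-frame.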
Given the complexity of imitating a diverse array of tasks with a single network, we utilize an Iterative Generalist-Specialist Learning Strategy during the imitation learning phase. We begin by clustering the dataset motions into ten subsets based on state features from the state encoder. A Generalist model is first trained on the entire dataset, after which it is duplicated ten times, with each copy specializing in one subset, creating ten Specialists. We then apply a distillation technique to transfer the knowledge from these Specialists back to the Generalist. This iterative process enhances our policy's ability to handle a broad spectrum of interactive tasks, enabling the generation of diverse reactions.
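One round of the generalist-specialist loop can be sketched as follows. The clustering here is plain k-means on state features (assumed for illustration), and `train_fn` / `distill_fn` are hypothetical stand-ins for the actual training and distillation procedures.

```python
import numpy as np

def kmeans_labels(feats, k=10, iters=20, seed=0):
    """Plain k-means on state features, splitting the dataset into k subsets."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(feats[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = feats[labels == j].mean(axis=0)
    return labels

def generalist_specialist_round(feats, train_fn, distill_fn, k=10):
    """One Generalist -> Specialists -> distilled Generalist round.

    train_fn / distill_fn are hypothetical stand-ins for the real loops.
    """
    labels = kmeans_labels(feats, k)
    generalist = train_fn(feats)                    # train on the whole dataset
    specialists = [                                  # one specialized copy per subset
        train_fn(feats[labels == j]) if (labels == j).any() else generalist
        for j in range(k)
    ]
    return distill_fn(generalist, specialists)      # knowledge flows back
```

Iterating this round lets each Specialist master a narrow motion cluster while the distilled Generalist retains coverage of the full interaction distribution.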
Qualitative comparisons (Baseline: InsActor vs. Ours) on the following text prompts:

- Two people greet each other by shaking hands.
- Both people stand straight, extend their arms up, and wave their hands towards each other.
- The first person places the right arm on the shoulder of the other person.
- The first one approaches the second one from behind, gently touches the shoulder of the second one, and starts a conversation with the person.
- One person approaches the other person; the other person raises the right hand, then greets with the left hand, and then they embrace each other.
- The other person slaps one side of the first person's face with their right hand and steps back.
- The two persons walk forward side by side.
- The first one stomps their foot and claps their thigh, the second one strikes the first one with their right hand, and the first one takes a step back.
- One and the other person acknowledge each other's presence and face each other.