Speech-driven holistic motion generation requires synchronized facial expressions and diverse gestures for lifelike virtual agents. Existing methods struggle to model the complex relationship between gestures and facial expressions, often relying on fixed loss-weighting schemes or predefined inter-task constraints that cannot capture the optimal balance between the two modalities.
We introduce AdaptiveDiffuseMotion, a novel diffusion-based framework that adaptively balances deterministic facial expression synthesis and non-deterministic gesture generation through uncertainty-based multi-task learning. Our approach dynamically adjusts loss weights without manual tuning, achieving superior synchronization and diversity simultaneously.
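To make the uncertainty-based weighting concrete, the sketch below shows one common formulation of such adaptive multi-task loss balancing: homoscedastic-uncertainty weighting in the style of Kendall et al. (2018), where each task loss is scaled by a learned precision. This is a minimal illustrative example, not the paper's actual implementation; the class name `UncertaintyWeightedLoss` and the placeholder `face_loss`/`gesture_loss` values are assumptions for demonstration.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Homoscedastic-uncertainty multi-task weighting (Kendall et al., 2018).

    Each task loss L_i is scaled by a learned precision exp(-s_i), where
    s_i = log(sigma_i^2); the additive s_i term keeps the optimizer from
    driving the weights to zero, so the balance is learned, not hand-tuned.
    """
    def __init__(self, num_tasks: int = 2):
        super().__init__()
        # One learnable log-variance per task, initialized to 0 (sigma = 1).
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, *task_losses: torch.Tensor) -> torch.Tensor:
        total = torch.zeros((), device=self.log_vars.device)
        for loss, log_var in zip(task_losses, self.log_vars):
            total = total + torch.exp(-log_var) * loss + log_var
        return total


# Hypothetical usage: balance a deterministic expression loss against a
# stochastic gesture-diffusion loss without manual weight tuning.
if __name__ == "__main__":
    criterion = UncertaintyWeightedLoss(num_tasks=2)
    face_loss = torch.tensor(0.8)     # placeholder expression-reconstruction loss
    gesture_loss = torch.tensor(1.5)  # placeholder gesture-diffusion loss
    print(criterion(face_loss, gesture_loss).item())
```

In this formulation, a task whose loss is inherently noisier (here, diverse gesture generation) learns a larger variance and thus a smaller effective weight, while the more deterministic facial task retains a larger one, which is the kind of adaptive trade-off the abstract describes.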
Experimental results demonstrate significant improvements in facial precision, gesture diversity, and overall realism across multiple metrics and user studies.