AdaptiveDiffuseMotion: Adaptive Multi-Task Diffusion Model for Speech-Driven Holistic Motion Generation

ICASSP 2026

Enyun Xuan1,2 You Li1,2 Ziwei Li1,2 Mengmeng Yao1 Renzhong Guo1,2
1 Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
2 Shenzhen University, Shenzhen, China

Abstract

Figure: AdaptiveDiffuseMotion overview

Speech-driven holistic motion generation must produce synchronized facial expressions and diverse gestures to make virtual agents lifelike. Existing methods struggle to model the complex relationship between gestures and facial expressions. They often rely on fixed weighting schemes or predefined inter-task constraints, which fail to capture how the two tasks should interact.

We introduce AdaptiveDiffuseMotion, a diffusion-based framework that adaptively balances deterministic facial expression synthesis and non-deterministic gesture generation through uncertainty-based multi-task learning. Our approach adjusts the loss weights of the two tasks dynamically during training, without manual tuning, and improves synchronization and diversity at the same time.

Experimental results show improvements in facial precision, gesture diversity, and overall realism, as measured by multiple quantitative metrics and by user studies.

Pipeline

Figure: AdaptiveDiffuseMotion pipeline

Our framework encodes audio features and the speaker ID and uses them as conditions for the diffusion model. Cross-attention and self-attention layers decode and fuse these features to capture temporal relationships. An uncertainty-based module dynamically adjusts the loss weights of the gesture and expression tasks.
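To make the loss-weighting idea concrete, here is a minimal sketch of one common form of uncertainty-based multi-task weighting. It is an illustration under stated assumptions, not the implementation used in this work: it assumes one learnable log-variance s_i per task and the simplified objective sum_i exp(-s_i) * L_i + s_i, and the function names and loss values are hypothetical. The task losses are held fixed here so that the behavior of the weights is easy to see. In real training the losses change every step and the s_i would be updated by the optimizer together with the network parameters.

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses as sum_i exp(-s_i) * L_i + s_i.

    s_i is a learnable log-variance for task i (an assumed formulation).
    exp(-s_i) acts as the task weight, and the + s_i term keeps the
    weights from collapsing to zero.
    """
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))


def log_var_gradients(task_losses, log_vars):
    """Analytic gradient of the combined loss with respect to each s_i."""
    return [1.0 - math.exp(-s) * loss for loss, s in zip(task_losses, log_vars)]


def fit_log_vars(task_losses, steps=2000, lr=0.05):
    """Plain gradient descent on the log-variances for fixed task losses."""
    log_vars = [0.0] * len(task_losses)
    for _ in range(steps):
        grads = log_var_gradients(task_losses, log_vars)
        log_vars = [s - lr * g for s, g in zip(log_vars, grads)]
    return log_vars


# Hypothetical loss magnitudes for the two tasks (not values from the paper).
expression_loss = 0.02
gesture_loss = 0.5

log_vars = fit_log_vars([expression_loss, gesture_loss])
weights = [math.exp(-s) for s in log_vars]
total = uncertainty_weighted_loss([expression_loss, gesture_loss], log_vars)

print("learned task weights:", weights)
print("combined loss:", total)
```

Under this formulation each weight settles near the inverse of its task loss, so the task with the smaller loss magnitude receives the larger weight and neither task dominates the combined objective. This is the sense in which no hand-tuned weighting is required.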

Results

Our method generates more diverse gestures and more accurate facial expressions than existing approaches. The video comparisons below show the ground truth, our results, and DiffSHEG side by side for different speakers.

Visual Comparisons

Speaker 1: Scott - Facial Expression

Ground Truth

Ours

DiffSHEG

Speaker 1: Scott - Full-body Motion

Ground Truth

Ours

DiffSHEG

Speaker 2: Lawrence - Facial Expression

Ground Truth

Ours

DiffSHEG

Speaker 2: Lawrence - Full-body Motion

Ground Truth

Ours

DiffSHEG

Additional Sample: Carla

Ground Truth

Ours

DiffSHEG

View Code & More Details on GitHub