Introduction: This talk presents a systematic study of balanced multimodal representation learning, focusing on the critical issue of modality imbalance, which arises when different modalities contribute unequally during training. We begin by analyzing how imbalance manifests across the entire multimodal learning pipeline: at the data, model, and learning levels. To address these challenges, we introduce a series of methods: at the data level, a balance-aware sequence sampling strategy to regulate training difficulty; at the model level, adaptive unimodal gradient boosting and modality-aware subnet masking to strengthen weaker modalities; and at the learning level, a unified framework that dynamically integrates label fitting and cross-modal alignment. Extensive experiments demonstrate that these approaches effectively mitigate modality imbalance and consistently improve performance across diverse multimodal benchmarks.
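To make the model-level idea concrete, here is a minimal sketch of one common way to implement adaptive unimodal gradient modulation: per-modality gradient scaling driven by each modality's relative confidence. This is an illustrative toy (the function `boost_coefficients`, the `alpha` parameter, and the confidence scores are all assumptions for this example), not the exact algorithm presented in the talk.

```python
import math

def boost_coefficients(scores, alpha=0.5):
    """Compute a gradient-scaling coefficient per modality (toy scheme).

    `scores` maps modality name -> a confidence/contribution estimate
    (e.g., the softmax probability of the correct class from that
    modality's unimodal head). Dominant modalities (score above the
    mean) get their gradients damped, so weaker modalities are
    relatively boosted during joint training.
    """
    mean = sum(scores.values()) / len(scores)
    coeffs = {}
    for modality, score in scores.items():
        ratio = score / mean
        if ratio > 1.0:
            # Dominant modality: attenuate its gradient smoothly.
            coeffs[modality] = 1.0 - math.tanh(alpha * (ratio - 1.0))
        else:
            # Lagging modality: keep its full gradient.
            coeffs[modality] = 1.0
    return coeffs

# Example: audio dominates, so its gradient is scaled down while
# video keeps its full update.
coeffs = boost_coefficients({"audio": 0.9, "video": 0.3})
```

In a training loop, each coefficient would multiply the corresponding modality encoder's gradients (or its loss term) before the optimizer step, so the under-optimized modality catches up rather than being drowned out by the stronger one.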
Slide: [pdf]
Relevant Publications: