Attention is not Enough: Mitigating the Distribution Discrepancy in Asynchronous Multimodal Sequence Fusion

Tao Liang, Guosheng Lin, Lei Feng, Yan Zhang, Fengmao Lv

Research output: Chapter in Book/Report/Conference proceedingConference Paper published in Proceedingspeer-review

44 Citations (Scopus)

Abstract

Videos flow as the mixture of language, acoustic, and vision modalities. A thorough video understanding needs to fuse time-series data of different modalities for prediction. Due to the variable receiving frequency for sequences from each modality, there usually exists inherent asynchrony across the collected multimodal streams. Towards an efficient multimodal fusion from asynchronous multimodal streams, we need to model the correlations between elements from different modalities. The recent Multimodal Transformer (MulT) approach extends the self-attention mechanism of the original Transformer network to learn the crossmodal dependencies between elements. However, the direct replication of self-attention will suffer from the distribution mismatch across different modality features. As a result, the learnt crossmodal dependencies can be unreliable. Motivated by this observation, this work proposes the Modality-Invariant Crossmodal Attention (MICA) approach towards learning crossmodal interactions over modality-invariant space in which the distribution mismatch between different modalities is well bridged. To this end, both the marginal distribution and the elements with high-confidence correlations are aligned over the common space of the query and key vectors which are computed from different modalities. Experiments on three standard benchmarks of multimodal video understanding clearly validate the superiority of our approach.

Original languageEnglish
Title of host publicationProceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021
PublisherIEEE, Institute of Electrical and Electronics Engineers
Pages8128-8136
Number of pages9
ISBN (Electronic)9781665428125
DOIs
Publication statusPublished - 2021
Externally publishedYes
Event18th IEEE/CVF International Conference on Computer Vision, ICCV 2021 - Virtual, Online, Canada
Duration: 11 Oct 202117 Oct 2021

Publication series

NameProceedings of the IEEE International Conference on Computer Vision
ISSN (Print)1550-5499

Conference

Conference18th IEEE/CVF International Conference on Computer Vision, ICCV 2021
Country/TerritoryCanada
CityVirtual, Online
Period11/10/2117/10/21

Bibliographical note

Publisher Copyright:
© 2021 IEEE

Fingerprint

Dive into the research topics of 'Attention is not Enough: Mitigating the Distribution Discrepancy in Asynchronous Multimodal Sequence Fusion'. Together they form a unique fingerprint.

Cite this