Abstract

Video-text retrieval, the task of retrieving videos based on a textual query or vice versa, is of paramount importance for video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and textual features and often ignore audio, although it helps enhance overall comprehension of video content. Moreover, models that do incorporate audio blindly utilize the audio input regardless of whether it is useful, resulting in suboptimal video representations. To address these limitations, we propose a novel video-text retrieval framework, Audio-guided VIdeo representation learning with GATEd attention (AVIGATE), which effectively leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals. In addition, we propose an adaptive margin-based contrastive loss to deal with the inherently unclear positive-negative relationship between video and text, which facilitates learning better video-text alignment. Our extensive experiments demonstrate that AVIGATE achieves state-of-the-art performance on all public benchmarks.
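
As a concrete reference for the retrieval setup described above, the snippet below sketches the standard contrastive video-text objective that such frameworks build on: video and text embeddings are L2-normalized, compared by cosine similarity, and trained with a symmetric InfoNCE loss. The function names and temperature value are illustrative assumptions, not the exact objective used in the paper (which additionally employs the adaptive margin described below).

```python
# A minimal PyTorch sketch of contrastive video-text retrieval training.
# Matching (video, text) pairs sit on the diagonal of the similarity matrix;
# the symmetric cross-entropy pulls them together and pushes negatives apart.
import torch
import torch.nn.functional as F

def retrieval_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                   temperature: float = 0.05) -> torch.Tensor:
    v = F.normalize(video_emb, dim=-1)               # (B, D) video representations
    t = F.normalize(text_emb, dim=-1)                # (B, D) text representations
    logits = v @ t.t() / temperature                 # (B, B) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```

At test time, retrieval reduces to ranking the same cosine similarities: for a text query, videos are sorted by their similarity to the query embedding, and vice versa.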

Overall Architecture

Figure 1. (Left) The overall architecture of AVIGATE. Audio input is processed through an Audio Spectrogram Transformer (AST) and further refined by an audio resampler to generate fixed-size audio embeddings. Frame embeddings are derived from the video using a CLIP Image Encoder, while the text embedding is extracted by the CLIP Text Encoder. These audio and frame embeddings are fused by a gated fusion transformer, which dynamically determines the contribution of audio. The final video representation is aligned with the text embedding using a multi-grained alignment scheme, facilitating an effective video-text retrieval process. (Right) The gated fusion transformer consists of a gated fusion block and a gating function.
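
To make the figure concrete, below is a hedged PyTorch sketch of one gated fusion block as we read the right panel: frame embeddings cross-attend to the fixed-size audio embeddings, and a small gating function predicts scalar scores g_mha and g_ffn that scale the attention and feed-forward branches, so uninformative audio can be suppressed (gates near zero). The module names, dimensions, pooling, and the design of the gating function are our own assumptions, not the released implementation.

```python
# Sketch of a gated fusion block: audio is injected into frame embeddings only
# to the extent that the learned gates allow.
import torch
import torch.nn as nn

class GatedFusionBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        # Gating function (assumed form): pools frame and audio tokens and
        # predicts two scores in [0, 1], one per branch.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                  nn.Linear(dim, 2), nn.Sigmoid())

    def forward(self, frames: torch.Tensor, audio: torch.Tensor):
        # frames: (B, Nf, D) CLIP frame embeddings; audio: (B, Na, D) resampled audio.
        pooled = torch.cat([frames.mean(dim=1), audio.mean(dim=1)], dim=-1)
        g_mha, g_ffn = self.gate(pooled).unbind(dim=-1)             # each (B,)
        g_mha, g_ffn = g_mha.view(-1, 1, 1), g_ffn.view(-1, 1, 1)

        attn_out, _ = self.attn(self.norm_q(frames),
                                self.norm_kv(audio), self.norm_kv(audio))
        frames = frames + g_mha * attn_out                          # gated cross-attention
        frames = frames + g_ffn * self.ffn(self.norm_ffn(frames))   # gated FFN
        return frames, (g_mha.flatten(), g_ffn.flatten())
```

Stacking several such blocks yields the gated fusion transformer; when the gates collapse toward zero, the block approximately reduces to an identity mapping over the frame embeddings, which is the desired behavior for uninformative audio.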

Adaptive Margin-based Contrastive Loss

Figure 2. The adaptive margin and the proposed loss function that dynamically adjusts margins for each negative pair based on their intra-modal semantic similarities.

Figure 3. A conceptual illustration of how our adaptive margin operates in the embedding space for video-to-text alignment.
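
Figures 2 and 3 describe the loss at a conceptual level; the snippet below gives one plausible PyTorch instantiation under our own assumptions. The margin added to each negative logit shrinks as the intra-modal (here, text-text) similarity to the positive grows, so semantically close "negatives" are pushed less aggressively. The exact margin schedule, the hyperparameter delta, and the choice of intra-modal similarity used in the paper may differ.

```python
# Sketch of an adaptive margin-based contrastive loss: negatives receive a
# per-pair margin derived from intra-modal semantic similarity.
import torch
import torch.nn.functional as F

def adaptive_margin_loss(video_emb, text_emb, delta: float = 0.1,
                         temperature: float = 0.05):
    v = F.normalize(video_emb, dim=-1)           # (B, D)
    t = F.normalize(text_emb, dim=-1)            # (B, D)
    sim = v @ t.t()                              # (B, B) cross-modal similarities

    # Intra-modal (text-text) similarities, detached so the margin acts as a
    # per-batch constant rather than a trainable quantity.
    intra = (t @ t.t()).detach()                 # (B, B), diagonal == 1
    margin = delta * (1.0 - intra)               # smaller margin for closer texts
    margin.fill_diagonal_(0.0)                   # no margin on positive pairs

    logits = (sim + margin) / temperature        # negatives must clear the margin
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)
    # Simplification: the same (symmetric) margin is reused for text -> video;
    # the paper may instead derive it from video-video similarities.
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```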

Quantitative Results

Table 1. Text-to-video and video-to-text retrieval results on the MSR-VTT 9k split. Bold denotes the best performance. † denotes the use of post-processing techniques.

Figure 4. Text-to-video retrieval results (R@1) on all benchmarks with different visual backbones.

Ablation Study

Figure 5. Ablation studies on key components of our method.

Qualitative Results

Figure 6. Top-1 text-to-video retrieval results of our method on MSR-VTT; all retrieved videos are true matches. g_mha^(l) and g_ffn^(l) denote the gating scores of the l-th layer of the gated fusion transformer. In (a), the audio provides informative cues for accurate retrieval, since "a man is talking" in the query text is not visible. In (b), the irrelevant audio is filtered out by the gated fusion transformer, leading to an accurate retrieval result.

Latency Analysis

Figure 7. Analysis of the efficiency of our method compared to previous arts. t_sim and t_ex denote the latency of the similarity calculation and the query embedding extraction, respectively. Note that t_ex is the same for all methods since they use the same text encoder. Latency is measured on a single RTX 3090 GPU.
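
As a rough guide to how such latency numbers can be reproduced, the snippet below sketches GPU timing with explicit synchronization (CUDA kernels launch asynchronously, so the clock must be read only after queued work finishes). The encoder and similarity callables in the usage comment are placeholders, not the paper's benchmarking code.

```python
# Minimal GPU latency measurement sketch (milliseconds per call).
import time
import torch

def gpu_latency_ms(fn, *args, warmup: int = 10, iters: int = 100) -> float:
    for _ in range(warmup):                      # warm up kernels and caches
        fn(*args)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()                     # wait for all queued GPU work
    return (time.perf_counter() - start) * 1000.0 / iters

# Hypothetical usage:
#   t_ex  = gpu_latency_ms(text_encoder, query_tokens)            # query embedding extraction
#   t_sim = gpu_latency_ms(similarity_fn, query_emb, video_embs)  # similarity calculation
```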

Paper

Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
Boseung Jeong, Jicheol Park, Sungyeon Kim, Suha Kwak
CVPR, 2025
(Oral presentation)
[Paper] [Supple] [ArXiv] [Bibtex]

Code

AVIGATE GitHub Repository