Text-to-video retrieval has made remarkable progress recently. However, current systems are designed primarily for very short videos, whereas many real-world videos capture complex human actions that can last several minutes or even hours.

Image Credits: Yan-Bo Lin, Jie Lei, Mohit Bansal, Gedas Bertasius
A scientific paper published on arXiv.org addresses this limitation by proposing an efficient audio-visual text-to-video retrieval system focused on long-range videos.
The researchers observe that most of the relevant visual information can be captured in just a few video frames, while temporal dynamics can be concisely encoded in the audio stream. Therefore, instead of processing many densely extracted frames from a long video, the proposed framework operates on sparsely sampled video frames paired with dense audio.
The paper demonstrates that, compared with long-range video-only approaches, the new framework achieves better video retrieval results at lower computational cost.
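The sampling idea can be illustrated with a minimal sketch. It assumes uniform sampling at segment centers and one contiguous audio window per sampled frame; the function names and the sampling scheme are hypothetical choices made for illustration, not the procedure specified in the paper.

```python
from typing import List, Tuple


def sparse_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick a few frame indices spread uniformly over the video.

    The video is split into num_samples equal segments and the frame
    at the center of each segment is kept (an illustrative assumption).
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames)
    segment = num_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]


def audio_windows(duration_s: float, num_samples: int) -> List[Tuple[float, float]]:
    """Cover the whole audio track with contiguous (start, end) windows in seconds.

    Unlike the frames, no part of the audio is skipped: each sampled
    frame is paired with the audio that surrounds it.
    """
    if duration_s <= 0 or num_samples <= 0:
        return []
    step = duration_s / num_samples
    return [(i * step, (i + 1) * step) for i in range(num_samples)]


# A 2-minute video at 25 fps has 3000 frames; keep only 4 of them,
# but keep all 120 seconds of audio, split into 4 windows.
frames = sparse_frame_indices(num_frames=3000, num_samples=4)
windows = audio_windows(duration_s=120.0, num_samples=4)
```

In this sketch the visual encoder would see 4 frames instead of 3000, while the audio windows still span the full duration of the video.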
We introduce an audio-visual method for long-range text-to-video retrieval. In contrast to previous approaches designed for short video retrieval (e.g., 5–15 seconds in duration), our approach aims to retrieve minutes-long videos that capture complex human actions. One challenge of standard video-only approaches is the large computational cost associated with processing hundreds of densely extracted frames from such long videos. To address this problem, we propose to replace parts of the video with compact audio cues that concisely summarize dynamic audio events and are cheap to process. Our method, named ECLIPSE (Efficient CLIP with Sound Encoding), adapts the popular CLIP model to an audio-visual video setting by adding a unified audiovisual transformer block that captures complementary cues from the video and audio streams. In addition to being 2.92x faster and 2.34x more memory-efficient than long-range video-only approaches, our method also achieves superior text-to-video retrieval accuracy on several diverse long-range video datasets such as ActivityNet, QVHighlights, YouCook2, DiDeMo, and Charades.
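To make the idea of capturing complementary cues from two streams more concrete, here is a generic sketch of cross-modal attention followed by similarity-based retrieval. It is not the ECLIPSE architecture: it uses single-head attention with no learned projections, and the function names (`cross_attention`, `fuse`, `rank_videos`) and token shapes are assumptions made for illustration only.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention from one modality to the other.

    queries: (n_q, d) tokens of one modality; context: (n_c, d) tokens
    of the other. No learned weights are used in this sketch.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return weights @ context


def fuse(video_tokens: np.ndarray, audio_tokens: np.ndarray):
    """Let each stream attend to the other, with residual connections."""
    video_out = video_tokens + cross_attention(video_tokens, audio_tokens)
    audio_out = audio_tokens + cross_attention(audio_tokens, video_tokens)
    return video_out, audio_out


def rank_videos(text_emb: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Return video indices sorted by cosine similarity to the text query."""
    t = text_emb / np.linalg.norm(text_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    return np.argsort(-(v @ t))


# Hypothetical shapes: 4 frame tokens and 6 audio tokens, each of dimension 8.
rng = np.random.default_rng(0)
video_out, audio_out = fuse(rng.normal(size=(4, 8)), rng.normal(size=(6, 8)))

# Toy retrieval over three 2-D video embeddings for one text query.
order = rank_videos(np.array([1.0, 0.0]),
                    np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]))
```

In a real system the fused tokens would be pooled into a single video embedding and compared with a text embedding from a text encoder such as CLIP's; the ranking step above only shows the comparison itself.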
Research Article: Lin, Y.-B., Lei, J., Bansal, M., and Bertasius, G., “ECLIPSE: Efficient Long-Range Video Retrieval Using Sight and Sound”, 2022. Link to paper: https://arxiv.org/abs/2204.02874
Project Page: https://yanbo.ml/project_page/eclipse/