OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation

TL;DR. We train a unified audio-video-text encoder with fusion-as-teacher distillation and Tuple-InfoNCE, surpassing the closed-source Gemini Embedding 2 on a new 12-direction AVT retrieval benchmark.

Abstract

Unified multimodal embedding spaces have become the standard interface for cross-modal retrieval and multimodal RAG, and recent audio-video-text (AVT) encoders extend this setting to three modalities. Such encoders can produce a joint \((T,V,A)\) embedding whenever all three modalities are available, but standard pairwise InfoNCE objectives leave this signal unused during training. We close this gap with fusion-as-teacher distillation, which treats a stop-gradient copy of the fused embedding as a teacher signal for the single-modal embeddings, paired with a Tuple-InfoNCE term that supervises the fused embedding directly. We instantiate this objective as OmniRetriever-7B. Across six zero-shot retrieval benchmarks, OmniRetriever-7B surpasses the closed-source Gemini Embedding 2 by 13.3–18.0 R@1 on Clotho and SoundDescs, and reaches the contemporary zero-shot specialist band of open video–text encoders on MSR-VTT and MSVD. To stress-test joint representations, we further release OmniRetriever-Bench, a 12-direction AVT retrieval benchmark totaling 3,782 triples; on it OmniRetriever-7B attains AVG-all 34.84, improving over Gemini Embedding 2 by +1.72 and over the best prior open-source AVT method by +8.03. Model weights, datasets, and code will be released.

Method

A unified AVT encoder \(f_\theta\) already computes a joint embedding \(\mathbf{z}_{TVA}\) on every \((T,V,A)\) forward, but pairwise InfoNCE leaves it unsupervised. We turn \(\mathbf{z}_{TVA}\) into a training signal with two complementary losses on top of the standard pairwise alignment \(\mathcal{L}_A\):

Fusion-as-Teacher Distillation (\(\mathcal{L}_D\)). Distill a stop-gradient copy of \(\mathbf{z}_{TVA}\) into each single-modal embedding via symmetric InfoNCE, so the \(T\), \(V\), and \(A\) streams each see the joint geometry that an \(A{+}T{+}V\) query reaches at inference. No extra parameters; one extra joint forward.
Tuple-InfoNCE (\(\mathcal{L}_T\)). Supervise \(\mathbf{z}_{TVA}\) directly against the in-batch tuple grid plus a modality-cycled hard negative \(\mathbf{z}_{\tilde{T}\tilde{V}\tilde{A}}\) that perturbs one of \(T,V,A\) on a period-3 schedule.
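The two terms above can be illustrated with a minimal pure-Python sketch. It is not the released implementation. The temperature value, the summation over the three student streams, the T, V, A cycling order, and the "swap with the next sample in the batch" perturbation are all assumptions made for illustration, and the helper names (`distillation_loss`, `hard_negative_tuples`, and so on) are hypothetical. The sketch covers \(\mathcal{L}_D\) and only the negative-construction schedule of \(\mathcal{L}_T\), since the text does not spell out the full tuple-grid loss.

```python
import math

MODALITIES = ("T", "V", "A")  # assumed cycling order for the period-3 schedule


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """One-directional InfoNCE: queries[i] should match keys[i]."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


def symmetric_info_nce(a, b, temperature=0.07):
    """Average of the a->b and b->a InfoNCE terms."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def distillation_loss(z_t, z_v, z_a, z_tva, temperature=0.07):
    """Sketch of L_D: the fused embedding teaches each single-modal stream.

    Pure Python has no gradients, so the copy below only marks the spot where
    an autodiff framework would detach the teacher (for example
    jax.lax.stop_gradient in JAX or Tensor.detach() in PyTorch).
    """
    teacher = [list(v) for v in z_tva]
    # Summing the three student terms is an assumption; a mean would also work.
    return sum(symmetric_info_nce(s, teacher, temperature) for s in (z_t, z_v, z_a))


def perturbed_modality(step):
    """Period-3 schedule: steps 0, 1, 2, 3, ... perturb T, V, A, T, ..."""
    return MODALITIES[step % 3]


def hard_negative_tuples(batch, step):
    """Build one hard-negative tuple per sample by replacing a single modality.

    batch is a list of dicts with keys "T", "V", "A". Swapping in the next
    sample's input is an assumed stand-in for the unspecified perturbation.
    """
    m = perturbed_modality(step)
    n = len(batch)
    return [{**item, m: batch[(i + 1) % n][m]} for i, item in enumerate(batch)]
```

In a real training loop the hard-negative tuples would be encoded into \(\mathbf{z}_{\tilde{T}\tilde{V}\tilde{A}}\) and appended to the negatives that \(\mathbf{z}_{TVA}\) is contrasted against.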

Both losses reuse the same backbone forward and apply to any unified retriever whose forward pass produces a joint multi-stream embedding. A cross-backbone replication on Omni-Embed-Nemotron-3B reproduces the dominant \(\mathcal{L}_D\) contribution.

Results

Clotho T→A: +18.0 R@1 vs. Gemini Embedding 2

SoundDescs T→A: +13.3 R@1 vs. Gemini Embedding 2

OmniRetriever-Bench: 34.84 AVG-all (+1.72 vs. Gemini Embedding 2; +8.03 vs. best open AVT)

OmniRetriever-7B reaches the zero-shot audio–text specialist band on Clotho and narrows the gap left by prior open omni-modal encoders on \(A\!\to\!T\) directions. Gains concentrate on the eight audio-anchored routes of OmniRetriever-Bench (e.g., \(A\!\to\!T\) 11.92 vs. 1.48, \(V\!\to\!A\) 25.46 vs. 13.80 against Gemini Embedding 2). On the four visually saturated \(T\!\leftrightarrow\!V\) and \(T\!\leftrightarrow\!A{+}V\) routes, Gemini Embedding 2 still leads by 6–10 R@1, consistent with its larger closed video–text training corpus.
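The quoted figures can be cross-checked with a few lines of arithmetic. The sketch below assumes the reported deltas are absolute differences in AVG-all points, which the text implies but does not state; the variable names are illustrative only.

```python
# Figures quoted in the results above.
ours_avg_all = 34.84
delta_vs_gemini = 1.72
delta_vs_best_open = 8.03

# Implied baseline AVG-all scores, assuming the deltas are absolute differences.
implied_gemini = round(ours_avg_all - delta_vs_gemini, 2)
implied_best_open = round(ours_avg_all - delta_vs_best_open, 2)

# Per-route scores as (OmniRetriever-7B, Gemini Embedding 2).
route_scores = {"A->T": (11.92, 1.48), "V->A": (25.46, 13.80)}
route_gaps = {r: round(ours - gemini, 2) for r, (ours, gemini) in route_scores.items()}

print(implied_gemini, implied_best_open, route_gaps)
```

Under that assumption the implied AVG-all baselines are 33.12 for Gemini Embedding 2 and 26.81 for the best prior open AVT method, and the two quoted audio-anchored routes show gaps of 10.44 and 11.66 points.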

OmniRetriever-Bench

Standard audio–text and video–text benchmarks evaluate only single-modal directions and never pool both video and audio in the same gallery. OmniRetriever-Bench is, to our knowledge, the first benchmark to score retrieval across all 12 single- and dual-modal directions (\(T\!\leftrightarrow\!V\), \(T\!\leftrightarrow\!A\), \(V\!\leftrightarrow\!A\), \(T\!\leftrightarrow\!AV\), \(A\!\leftrightarrow\!TV\), \(V\!\leftrightarrow\!AT\)) on a shared 3,782-triple held-out gallery. All captions start from a Gemini 3.0 Pro draft and are then reviewed and corrected by trained human annotators.
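To make the 12-direction protocol concrete, the sketch below enumerates the directions from the six route pairs listed above and scores a toy similarity matrix. It is an illustration rather than the benchmark's evaluation code: the text does not define AVG-all, so treating it as the unweighted mean of per-direction scores is an assumption, as are the function names and the convention that query i matches gallery item i.

```python
# Six route pairs as listed in the text; each is scored in both directions.
ROUTE_PAIRS = [("T", "V"), ("T", "A"), ("V", "A"), ("T", "AV"), ("A", "TV"), ("V", "AT")]


def directions():
    """Return all (query, gallery) directions, two per route pair."""
    return [d for a, b in ROUTE_PAIRS for d in ((a, b), (b, a))]


def recall_at_1(sim):
    """Percentage of queries whose top-scoring gallery item is the ground truth.

    sim[i][j] is the similarity between query i and gallery item j, and the
    ground-truth match for query i is assumed to be gallery item i.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return 100.0 * hits / len(sim)


def avg_all(scores_by_direction):
    """Unweighted mean over directions (assumed definition of AVG-all)."""
    return sum(scores_by_direction.values()) / len(scores_by_direction)
```

With a shared gallery, each direction reuses the same triples and differs only in which modalities form the query and which form the gallery entries.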

Six sample cards from OmniRetriever-Bench: everyday user-generated content covering cooking, scenery, pets, narration, ambient music, and casual dialogue.

BibTeX

@article{liu2026omniretriever,
  title         = {OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation},
  author        = {Liu, Yunze and Wu, Chi-Hao and Zhou, Enmin and Shen, Junxiao},
  year          = {2026},
  eprint        = {2605.26641},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi           = {10.48550/arXiv.2605.26641}
}