PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association

IRIS

We study the task of learning associations between faces and voices, which has recently been gaining interest in the multimodal community. Existing methods suffer from the deliberate crafting of negative mining procedures as well as reliance on a distance margin parameter. These issues are addressed by learning a joint embedding space in which orthogonality constraints are applied to the fused embeddings of faces and voices. However, the embedding spaces of faces and voices possess different characteristics and need to be aligned before they are fused. To this end, we propose a method that accurately aligns the embedding spaces and fuses them with an enhanced gated fusion, thereby improving the performance of face-voice association. Extensive experiments on the VoxCeleb dataset reveal the merits of the proposed approach.
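The fusion step mentioned in the abstract can be illustrated with a generic gated fusion, sketched below in plain Python. This is not the paper's enhanced variant, whose details are not given on this page: the function name `gated_fusion`, the single linear gate, the random weights, and the assumption that both embeddings have already been aligned to the same dimension are all illustrative choices.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(face, voice, weights, bias):
    """Fuse two same-sized embeddings with an element-wise sigmoid gate.

    Hypothetical helper, not taken from the paper.
    weights: d rows, each of length 2 * d, applied to [face; voice]
    bias: list of length d
    """
    if len(face) != len(voice):
        raise ValueError("embeddings must share a dimension before fusion")
    joint = face + voice  # concatenation [face; voice]
    fused = []
    for i, (f, v) in enumerate(zip(face, voice)):
        pre_activation = sum(w * x for w, x in zip(weights[i], joint)) + bias[i]
        g = sigmoid(pre_activation)  # gate value in (0, 1)
        # Convex combination: g weights the face, (1 - g) the voice.
        fused.append(g * f + (1.0 - g) * v)
    return fused


# Toy inputs standing in for aligned face and voice embeddings (assumption).
rng = random.Random(0)
d = 4
face = [rng.uniform(-1, 1) for _ in range(d)]
voice = [rng.uniform(-1, 1) for _ in range(d)]
weights = [[rng.uniform(-0.5, 0.5) for _ in range(2 * d)] for _ in range(d)]
bias = [0.0] * d

fused = gated_fusion(face, voice, weights, bias)
```

Because the gate lies strictly between 0 and 1, each fused component falls between the corresponding face and voice components. In a trained model the gate weights would be learned jointly with the rest of the network rather than sampled at random.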

PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association / Hannan, A., Manzoor Muhammad, A., Nawaz, S., Liaqat, M.I., Schedl, M., Noman, M. - (2025), pp. 2710-2714. [10.21437/interspeech.2025-268]

Hannan Abdul; Manzoor Muhammad Arslan; Nawaz Shah; Liaqat Muhammad Irzam; Schedl Markus; Noman Mubashir

2025

Year: 2025
OpenAlex ID: W4415433611
Keywords: Cross-modal verification & matching; Face-voice association; Hyperbolic space; Multimodal learning
Appears in types: 2.1 Contribution in volume (Chapter or Essay)

Files in this item:

File	Size	Format
hannan25_interspeech.pdf (open access; publisher's version, PDF; license: not specified)	1.02 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11771/40201
