Chameleon: a multimodal learning framework robust to missing modalities / Liaqat, Muhammad Irzam; Nawaz, Shah; Zaheer, Muhammad Zaigham; Saeed, Muhammad Saad; Sajjad, Hassan; De Schepper, Tom; Nandakumar, Karthik; Khan, Muhammad Haris; Gallo, Ignazio; Schedl, Markus. - In: INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL. - ISSN 2192-6611. - 14:2(2025). [10.1007/s13735-025-00370-y]
Chameleon: a multimodal learning framework robust to missing modalities
Liaqat, Muhammad Irzam
2025
Abstract
Multimodal learning has demonstrated remarkable performance improvements over unimodal architectures. However, multimodal learning methods often exhibit degraded performance if one or more modalities are missing. This may be attributed to the commonly used multi-branch design containing modality-specific components, which makes such approaches reliant on the availability of a complete set of modalities. In this work, we propose a robust multimodal learning framework, Chameleon, that adapts a common-space visual learning network to align all input modalities. To enable this, we unify the input modalities into one format by encoding any non-visual modality into a visual representation, making the framework robust to missing modalities. Extensive experiments are performed on the multimodal classification task using four textual-visual (Hateful Memes, UPMC Food-101, MM-IMDb, and Ferramenta) and two audio-visual (avMNIST, VoxCeleb) datasets. Chameleon not only achieves superior performance when all modalities are present at train/test time but also demonstrates notable resilience in the case of missing modalities.
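The abstract does not specify how non-visual inputs are encoded into visual representations, so the sketch below is only one plausible reading of the idea: render a textual input onto an image canvas so that the same visual classifier processes both images and "visualized" text, with no modality-specific branches. It assumes PyTorch and Pillow; the names `text_to_visual` and `SharedVisualClassifier`, the rendering scheme, and the tiny CNN are all hypothetical illustrations, not the paper's implementation.

```python
# Minimal sketch of the modality-unification idea, assuming a text-rendering
# scheme (hypothetical; the paper's actual encoding is not given here).
import torch
import torch.nn as nn
from PIL import Image, ImageDraw


def text_to_visual(text: str, size: int = 224) -> torch.Tensor:
    """Render a text string onto a blank canvas, yielding a 3xHxW tensor."""
    canvas = Image.new("RGB", (size, size), color="white")
    draw = ImageDraw.Draw(canvas)
    # Naive line wrapping; a real system would control font, layout, etc.
    words, lines, line = text.split(), [], ""
    for w in words:
        if len(line) + len(w) + 1 > 30:
            lines.append(line)
            line = w
        else:
            line = (line + " " + w).strip()
    lines.append(line)
    for i, ln in enumerate(lines[: size // 12]):
        draw.text((4, 4 + 12 * i), ln, fill="black")
    # PIL raw bytes are H x W x C; convert to a normalized C x H x W tensor.
    arr = torch.frombuffer(bytearray(canvas.tobytes()), dtype=torch.uint8)
    return arr.float().view(size, size, 3).permute(2, 0, 1) / 255.0


class SharedVisualClassifier(nn.Module):
    """One common-space visual network shared by all (visualized) modalities."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))


# Both a real image and rendered text pass through the same network, so a
# missing modality simply means one fewer visual input, not a dead branch.
model = SharedVisualClassifier(num_classes=101)
text_img = text_to_visual("a plate of spaghetti carbonara").unsqueeze(0)
logits = model(text_img)
```

The design point this illustrates is that, once every modality is mapped into the same visual format, there is no modality-specific component left to fail when an input is absent; the shared encoder simply receives fewer inputs.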
| File | Description | Type | License | Size | Format |
|---|---|---|---|---|---|
| s13735-025-00370-y.pdf (open access) | Chameleon: A Multimodal Learning Framework Robust to Missing Modalities | Publisher's Version (PDF) | Creative Commons | 4.06 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

