Chameleon: a multimodal learning framework robust to missing modalities / Liaqat, Muhammad Irzam; Nawaz, Shah; Zaheer, Muhammad Zaigham; Saeed, Muhammad Saad; Sajjad, Hassan; De Schepper, Tom; Nandakumar, Karthik; Khan, Muhammad Haris; Gallo, Ignazio; Schedl, Markus. - In: INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL. - ISSN 2192-6611. - 14:2(2025). [10.1007/s13735-025-00370-y]
Chameleon: a multimodal learning framework robust to missing modalities
Liaqat, Muhammad Irzam
2025
Abstract
Multimodal learning has demonstrated remarkable performance improvements over unimodal architectures. However, multimodal learning methods often exhibit degraded performance if one or more modalities are missing. This may be attributed to the commonly used multi-branch design containing modality-specific components, which makes such approaches reliant on the availability of a complete set of modalities. In this work, we propose a robust multimodal learning framework, Chameleon, that adapts a common-space visual learning network to align all input modalities. To enable this, we unify the input modalities into one format by encoding any non-visual modality into a visual representation, making the framework robust to missing modalities. Extensive experiments are performed on the multimodal classification task using four textual-visual (Hateful Memes, UPMC Food-101, MM-IMDb, and Ferramenta) and two audio-visual (avMNIST, VoxCeleb) datasets. Chameleon not only achieves superior performance when all modalities are present at train/test time but also demonstrates notable resilience in the case of missing modalities.
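The abstract does not specify how non-visual inputs are encoded into visual representations, so the sketch below is only one plausible reading of the idea: render a textual input onto an image canvas so that the same visual classifier processes both images and "visualized" text, with no modality-specific branches. It assumes PyTorch and Pillow; the names `text_to_visual` and `SharedVisualClassifier`, the rendering scheme, and the tiny CNN are all hypothetical illustrations, not the paper's implementation.

```python
# Minimal sketch of the modality-unification idea, assuming a text-rendering
# scheme (hypothetical; the paper's actual encoding is not given here).
import torch
import torch.nn as nn
from PIL import Image, ImageDraw


def text_to_visual(text: str, size: int = 224) -> torch.Tensor:
    """Render a text string onto a blank canvas, yielding a 3xHxW tensor."""
    canvas = Image.new("RGB", (size, size), color="white")
    draw = ImageDraw.Draw(canvas)
    # Naive line wrapping; a real system would control font, layout, etc.
    words, lines, line = text.split(), [], ""
    for w in words:
        if len(line) + len(w) + 1 > 30:
            lines.append(line)
            line = w
        else:
            line = (line + " " + w).strip()
    lines.append(line)
    for i, ln in enumerate(lines[: size // 12]):
        draw.text((4, 4 + 12 * i), ln, fill="black")
    # PIL raw bytes are H x W x C; convert to a normalized C x H x W tensor.
    arr = torch.frombuffer(bytearray(canvas.tobytes()), dtype=torch.uint8)
    return arr.float().view(size, size, 3).permute(2, 0, 1) / 255.0


class SharedVisualClassifier(nn.Module):
    """One common-space visual network shared by all (visualized) modalities."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))


# Both a real image and rendered text pass through the same network, so a
# missing modality simply means one fewer visual input, not a dead branch.
model = SharedVisualClassifier(num_classes=101)
text_img = text_to_visual("a plate of spaghetti carbonara").unsqueeze(0)
logits = model(text_img)
```

The design point this illustrates is that, once every modality is mapped into the same visual format, there is no modality-specific component left to fail when an input is absent; the shared encoder simply receives fewer inputs.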
| File | Description | Type | License | Size | Format |
|---|---|---|---|---|---|
| s13735-025-00370-y.pdf (open access) | Chameleon: A Multimodal Learning Framework Robust to Missing Modalities | Publisher's Version (PDF) | Creative Commons | 4.06 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

