Multimodal learning leverages multiple and diverse modalities, such as images, text, or audio, to enable contextual understanding and reliable decision-making. These methods have achieved state-of-the-art results by integrating multiple modalities in areas such as medical imaging, autonomous driving, and visual surveillance. However, the effectiveness of multimodal learning in real-world scenarios remains limited by practical challenges: data from some modalities may be missing, corrupted, or poorly aligned due to sensor failures, environmental noise, or bandwidth constraints. While prior surveys have proposed taxonomies for multimodal learning and strategies for handling missing or corrupted data, these perspectives are often treated in isolation. This separation overlooks the fact that data imperfections are interconnected, and effective multimodal learning requires a unified understanding across architectural design and modality reliability. To address this gap, we present comprehensive taxonomies that cover and integrate three major aspects: (1) architectural design for multimodal learning, (2) learning under missing modalities, and (3) learning under corrupted modalities. By framing these aspects together, our study highlights their interdependencies and facilitates a comprehensive understanding of multimodal learning. Furthermore, we discuss benchmark datasets and real-world applications through this taxonomic lens and outline open challenges and future directions for developing resilient methods.
Liaqat, Muhammad Irzam; Abbas, Qaiser; Nawaz, Shah; Zaheer, Zaigham; Moscati, Marta; Hou, Yufang; Khan Muhammad, Haris; Khan, Salman; Andre, Elisabeth; Schedl, Markus. Multimodal learning under imperfect data conditions: a survey. 2025. DOI: 10.36227/techrxiv.176410566.65375877/v1
Multimodal learning under imperfect data conditions: a survey
Liaqat, Muhammad Irzam
2025
| File | Description | Type | License | Size | Format |
|---|---|---|---|---|---|
| v1.pdf (open access) | Multimodal Learning Under Imperfect Data Conditions: A Survey | Preprint | Creative Commons | 684.72 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

