Bridging continual learning of motion and self-supervised representations

Betti Alessandro;
2024

Abstract

Efficiently learning unsupervised pixel-wise visual representations is crucial for training agents that can perceive their environment without relying on heavy human supervision or abundant annotated data. Motivated by recent work that promotes motion as a key source of information in representation learning, we propose a novel instance of contrastive criteria over time and space. In our architecture, the pixel-wise motion field and the representations are extracted by neural models, trained from scratch in an integrated fashion. Learning proceeds online over time, also exploiting a momentum-based moving average to update the feature extractor, without replaying any large buffers of past data. Experiments on real-world videos and on a recently introduced benchmark, with photorealistic streams generated from a 3D environment, confirm that the proposed model can learn to estimate motion and jointly develop representations. Our model nicely encodes the variable appearance of the visual information in space and time, significantly outperforming a recent approach, and it also compares favourably with convolutional and Transformer-based networks pre-trained offline on large collections of supervised and unsupervised images.
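The abstract points at two ingredients: a momentum-based moving average (EMA) of the feature extractor and a pixel-wise contrastive criterion that links features over time through the estimated motion field. The sketch below shows how such a setup can be wired together in PyTorch; it is not the authors' code, and all names and hyperparameters (PixelEncoder, momentum=0.999, temperature=0.1, the toy warp and loss functions, the zero stand-in flow) are assumptions introduced only for illustration.

# Illustrative sketch (not the paper's implementation): an online convolutional encoder
# producing one feature vector per pixel, an EMA ("momentum") copy acting as a slowly
# moving target, and an InfoNCE-style contrastive term that pulls together features of
# pixels linked by the estimated motion field between consecutive frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelEncoder(nn.Module):
    """Toy fully convolutional feature extractor: one unit-norm feature per pixel."""
    def __init__(self, in_ch=3, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, padding=1),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=1)  # shape (B, D, H, W)

@torch.no_grad()
def ema_update(online, target, momentum=0.999):
    """Momentum-based moving average of the online encoder into the target encoder."""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.mul_(momentum).add_(p_o.detach(), alpha=1.0 - momentum)

def warp(feat, flow):
    """Backward-warp target features with a motion field of shape (B, 2, H, W)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device) + flow
    grid[:, 0] = 2.0 * grid[:, 0] / (W - 1) - 1.0  # normalize to [-1, 1] for grid_sample
    grid[:, 1] = 2.0 * grid[:, 1] / (H - 1) - 1.0
    return F.grid_sample(feat, grid.permute(0, 2, 3, 1), align_corners=True)

def pixel_contrastive_loss(f_online, f_target_warped, temperature=0.1, n_pix=256):
    """InfoNCE over a random subset of pixels: motion-linked pixels are the positives."""
    B, D, H, W = f_online.shape
    idx = torch.randint(0, H * W, (n_pix,), device=f_online.device)
    q = f_online.flatten(2)[:, :, idx].permute(0, 2, 1).reshape(-1, D)         # queries
    k = f_target_warped.flatten(2)[:, :, idx].permute(0, 2, 1).reshape(-1, D)  # keys
    logits = q @ k.t() / temperature
    labels = torch.arange(q.shape[0], device=q.device)
    return F.cross_entropy(logits, labels)

# One online step on two consecutive frames (no replay buffer): the target encoder is
# updated only through the EMA, never by gradients.
online, target = PixelEncoder(), PixelEncoder()
target.load_state_dict(online.state_dict())
for p in target.parameters():
    p.requires_grad_(False)
opt = torch.optim.SGD(online.parameters(), lr=1e-2)

frame_t, frame_t1 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)  # stand-in for the motion field predicted by a flow network
loss = pixel_contrastive_loss(online(frame_t), warp(target(frame_t1), flow))
loss.backward()
opt.step(); opt.zero_grad()
ema_update(online, target)

In an online setting of this kind, the EMA target provides temporally stable keys while the online encoder adapts to the current frames, which is what allows learning to proceed over the stream without storing large buffers of past data.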
ISBN: 9781643685489
Keywords: Adversarial machine learning, Contrastive Learning, Federated learning, Pixels, Semi-supervised learning, Unsupervised learning
Files in this item:

File: FAIA-392-FAIA240812.pdf (open access)
Description: Bridging Continual Learning of Motion and Self-Supervised Representations
Type: Publisher's version (PDF)
License: Creative Commons
Size: 941.51 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11771/34883
Citations
  • Scopus: 0