layout: true
.center.footer[Spyros GIDARIS and Andrei BURSUC | Advances in Self-Supervised Learning: What is next?] --- class: center, middle, title-slide count: false ## CVPR 2021 Tutorial ## .bold[Leave Those Nets Alone:
Advances in Self-Supervised Learning] # What is next?
.grid[ .kol-3-12[ .bold[Spyros Gidaris] ] .kol-3-12[ .bold[Andrei Bursuc] ] .kol-3-12[ .bold[Jean-Baptiste Alayrac] ] .kol-3-12[ .bold[Adrià Recasens] ] ] .grid[ .kol-3-12[ .bold[Mathilde Caron] ] .kol-3-12[ .bold[Olivier Hénaff] ] .kol-3-12[ .bold[Aäron van den Oord] ] .kol-3-12[ .bold[Relja Arandjelović] ] ] .foot[https://gidariss.github.io/self-supervised-learning-cvpr2021/] --- class: middle .center.width-60[![](images/next/star-trek-1.jpg)] The first generation of self-supervised approaches ($\texttt{RotNet}$, $\texttt{RelPatch}$, $\texttt{JigSaw}$, $\texttt{Colorization}$, $\texttt{Exemplar}$, $\texttt{DeepCluster}$) explored a new paradigm and achieved interesting results. --- class: middle .center.width-60[![](images/next/imagenet-eval-pirl.jpg)] .caption[ImageNet classification with linear models] However, their performance is still far from that of their supervised counterparts. .citation[I. Misra and L. van der Maaten, Self-Supervised Learning of Pretext-Invariant Representations, CVPR 2020] --- class: middle .bigger[.citet[Asano et al. (2020)] show that **as little as a single image is sufficient**, when combined with self-supervision and data augmentation, to learn the first few layers of standard deep networks as well as using millions of images and full supervision.] .citation[Y.M. Asano, A critical analysis of self-supervision, or what we can learn from a single image, ICLR 2020] --- class: middle .grid[ .kol-6-12[ .center[1 - Take a high-resolution image]
.center.width-80[![](images/next/asano-1.png)] ] .hidden[ .kol-6-12[ .center[2 - Generate 1M images of crops and augmentations]
.center.width-80[![](images/next/asano-3.png)] ] ] ] .citation[Y.M. Asano, A critical analysis of self-supervision, or what we can learn from a single image, ICLR 2020] --- count: false class: middle .grid[ .kol-6-12[ .center[1 - Take a high-resolution image]
.center.width-80[![](images/next/asano-1.png)] ] .kol-6-12[ .center[2 - Generate 1M images of crops and augmentations]
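As a rough illustration of step 2, a single-image pipeline can be sketched with standard torchvision transforms (the transform choices and parameters below are assumptions for illustration, not the exact recipe of Asano et al.):

```python
# Illustrative sketch only: an endless stream of random crops and augmentations
# of one source image (transform choices/parameters are assumptions, not the
# exact recipe of Asano et al.).
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class SingleImageDataset(Dataset):
    def __init__(self, path, length=1_000_000):
        self.image = Image.open(path).convert("RGB")
        self.length = length
        self.augment = transforms.Compose([
            transforms.RandomResizedCrop(224, scale=(0.05, 1.0)),  # aggressive crops
            transforms.RandomHorizontalFlip(),
            transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
            transforms.RandomGrayscale(p=0.2),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # a fresh random crop/augmentation of the same image at every call
        return self.augment(self.image)
```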
.center.width-80[![](images/next/asano-3.png)] ] ] .citation[Y.M. Asano, A critical analysis of self-supervision, or what we can learn from a single image, ICLR 2020] --- count: false class: middle .grid[ .kol-6-12[ .center.width-100[![](images/next/asano-4.png)] .caption[Accuracies of linear classifiers trained on the representations from intermediate layers of supervised and self-supervised networks ] ] .kol-6-12[
- With sufficient data augmentation, one image allows self-supervision to learn good and generalizable features - At deeper layers there is a gap with supervised methods, which is mitigated to some extent when self-supervised methods are trained on large datasets ] ] .citation[Y.M. Asano, A critical analysis of self-supervision, or what we can learn from a single image, ICLR 2020] --- class: middle, center .center.width-80[![](images/next/star-trek-2.jpg)] The recent line of approaches ($\texttt{contrastive}$, $\texttt{feature reconstruction}$, $\texttt{clustering}$, $\texttt{multi-modal supervision}$) has achieved remarkable results, outperforming supervised variants on several benchmarks. --- class: middle .grid[ .kol-6-12[ .center.width-90[![](images/next/imagenet-eval-byol.png)] .caption[ImageNet Top-1 accuracy of linear classifiers trained on representations learned with different self-supervised methods (pretrained on ImageNet). ] ] .kol-6-12[
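Many of these gains come from contrastive objectives; as a generic reference form only (not any single method's exact loss, and BYOL-style methods drop the negatives entirely), the InfoNCE objective with $K$ negatives is
$$\mathcal{L} = -\log \frac{\exp\left(\mathrm{sim}(z, z^{+})/\tau\right)}{\exp\left(\mathrm{sim}(z, z^{+})/\tau\right) + \sum_{k=1}^{K} \exp\left(\mathrm{sim}(z, z^{-}_{k})/\tau\right)}$$
where $z$ and $z^{+}$ are projections of two augmented views of the same image, $z^{-}_{k}$ are projections of other images, $\mathrm{sim}$ is a cosine similarity and $\tau$ a temperature.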
- Progress on this benchmark has accelerated strongly in the past year, closing the gap w.r.t. supervised methods - The main contributors: - *contrastive learning with more negatives* - *feature reconstruction* - *momentum update* - *output projection head* - *better designed and stronger data augmentation* - *longer training* ] ] .citation[J.B. Grill et al., Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, NeurIPS 2020] --- class: middle, center .center.width-80[![](images/next/star-trek-3.jpg)] What should we expect the next generation to solve? --- class: middle Although they achieve outstanding performance, contrastive methods require long training and complex setups, e.g., TPUs, large mini-batches. .hidden[Recent works have shown that reconstruction methods can achieve competitive performance by reconstructing features instead of inputs.] .hidden.center.width-40[![](images/next/obow-memory.png)] .hidden.caption[Time and memory consumption relative to supervised training.] .hidden.citation[S. Gidaris et al., Learning Representations by Predicting Bags of Visual Words, CVPR 2020
J. B. Grill et al., Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, NeurIPS 2020
M. Caron et al., Unsupervised Learning of Visual Features by Contrasting Cluster Assignments, NeurIPS 2020
S. Gidaris et al., Online Bag-of-Visual-Words Generation for Unsupervised Representation Learning, CVPR 2021] --- count: false class: middle Although they achieve outstanding performance, contrastive methods require long training and complex setups, e.g., TPUs, large mini-batches. Recent works have shown that reconstruction methods can achieve competitive performance by reconstructing features instead of inputs. .center.width-40[![](images/next/obow-memory.png)] .caption[Time and memory consumption relative to supervised training.] .citation[S. Gidaris et al., Learning Representations by Predicting Bags of Visual Words, CVPR 2020
J. B. Grill et al., Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, NeurIPS 2020
M. Caron et al., Unsupervised Learning of Visual Features by Contrasting Cluster Assignments, NeurIPS 2020
S. Gidaris et al., Online Bag-of-Visual-Words Generation for Unsupervised Representation Learning, CVPR 2021 ] --- class: middle, center .center.width-40[![](images/next/obow-memory.png)] .caption[Time and memory consumption relative to supervised training.] .center.Q[How to improve data and compute efficiency?] --- class: middle .grid[ .kol-7-12[ .center.width-100[![](images/next/nuscenes-2.jpg)] ] .kol-5-12[
- With few exceptions, most self-supervised methods deal with ImageNet-like data with one dominant object per image. - In the case of autonomous driving data, with HD images and large, complex scenes, these strategies might be cumbersome to apply. ] ] .hidden.center.Q[How to go beyond single object images?] .citation[H. Caesar et al., nuScenes: A multimodal dataset for autonomous driving, CVPR 2020 ] --- count: false class: middle .grid[ .kol-7-12[ .center.width-100[![](images/next/nuscenes-2.jpg)] ] .kol-5-12[
- With few exceptions, most self-supervised methods deal with ImageNet-like data with one dominant object per image. - In the case of autonomous driving data, with HD images and large, complex scenes, these strategies might be cumbersome to apply. ] ] .center.Q[How to go beyond single object images?] .citation[H. Caesar et al., nuScenes: A multimodal dataset for autonomous driving, CVPR 2020 ] --- class: middle ## .center[Targeting object detection in the self-supervised task] .grid[ .kol-6-12[
.center.width-70[![](images/next/pixpro-1.png)] .caption[ PixPro is based on a pixel-to-propagation consistency pretext task for pixel-level visual representation learning.] ] .kol-6-12[ .hidden.center.width-70[![](images/next/detcon-1.png)] .hidden.caption[DetCon: The contrastive detection objective pulls together pooled feature vectors from the same mask (across views) and pushes apart features from different masks and different images.] ] ] .citation[Z. Xie et al., Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning, CVPR 2021
.hidden[O. Hénaff et al., Efficient Visual Pretraining with Contrastive Detection, arXiv 2021]] --- count: false class: middle ## .center[Targeting object detection in the self-supervised task] .grid[ .kol-6-12[
.center.width-70[![](images/next/pixpro-1.png)] .caption[ PixPro is based on a pixel-to-propagation consistency pretext task for pixel-level visual representation learning.] ] .kol-6-12[ .center.width-70[![](images/next/detcon-1.png)] .caption[ DetCon: The contrastive detection objective then pulls together pooled feature vectors from the same mask (across views) and pushes apart features from different masks and different images.] ] ] .citation[Z. Xie et al., Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning, CVPR 2021
O. Hénaff et al., Efficient Visual Pretraining with Contrastive Detection, arXiv 2021] --- class: middle .bigger[Most self-supervised methods are pre-trained on ImageNet, which is a *curated* and *perfectly balanced* dataset.] .hidden.bigger[In self-supervised learning, **the dataset itself is a form of supervision**.] .hidden[
.center.Q[Shifting towards uncurated and "boring" data and better understanding its impact.] ] --- count: false class: middle .bigger[Most self-supervised methods are pre-trained on ImageNet, which is a *curated* and *perfectly balanced* dataset.] .bigger[In self-supervised learning, **the dataset itself is a form of supervision**.] .hidden[
.center.Q[Shifting towards uncurated and "boring" data and better understanding its impact.] ] --- count: false class: middle .bigger[Most self-supervised methods are pre-trained on ImageNet, which is a *curated* and *perfectly balanced* dataset.] .bigger[In self-supervised learning, **the dataset itself is a form of supervision**.]
.center.Q[Shifting towards uncurated and "boring" data and better understanding its impact.] --- class: middle ## .center[Self-supervision pre-training on Instagram images] .center.width-40[![](images/next/imagenet-eval-seer.png)] .caption[SEER is pre-trained on uncurated random images from Instagram] .citation[P. Goyal et al., Self-supervised Pretraining of Visual Features in the Wild, arXiv 2021] --- class: middle .grid[ .kol-6-12[ .center.width-90[![](images/next/imagenet-eval-byol.png)] ] .kol-6-12[ Most approaches compete on a few popular benchmarks (`ImageNet`, `Places205`, `VOC07+12`, `COCO14`): - `ImageNet` is also used for pre-training - fine-tuning on some downstream tasks can still be overly long, e.g., object detection - the number of datasets to evaluate on is still limited $\rightarrow$ risk of optimizing for specific datasets ] ] .hidden.center.Q[Finding better and more compelling evaluation strategies.] .citation[J.B. Grill et al., Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, NeurIPS 2020] --- count: false class: middle .grid[ .kol-6-12[ .center.width-90[![](images/next/imagenet-eval-byol.png)] ] .kol-6-12[ Most approaches compete on a few popular benchmarks (`ImageNet`, `Places205`, `VOC07+12`, `COCO14`): - `ImageNet` is also used for pre-training - fine-tuning on some downstream tasks can still be overly long, e.g., object detection - the number of datasets to evaluate on is still limited $\rightarrow$ risk of optimizing for specific datasets ] ] .center.Q[Finding better and more compelling evaluation strategies.] .citation[J.B. Grill et al., Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, NeurIPS 2020] --- class: middle ## .center[Evaluating self-supervised methods with few-shot protocols] .grid[ .kol-8-12[ .center.width-100[![](images/next/few-shot-eval.png)] ] .kol-4-12[
.center.width-100[![](images/next/few-shot-ssl.png)] ] ] .reset-columns[] Assess the quality of self-supervised representations through few-shot object recognition: .smaller[ - large pool of train-test sets ($2k$) - low-cost evaluation - can be easily extended to other datasets ] .citation[S. Gidaris et al., Boosting Few-Shot Visual Learning With Self-Supervision, ICCV 2019] --- class: middle .bigger[A few other works have recently proposed a variety of datasets and tasks to evaluate on (see below).] .citation[ X. Zhai et al., A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark, arXiv 2019
A. Newell and J. Deng, How Useful is Self-Supervised Pretraining for Visual Tasks?, CVPR 2020
B. Wallace and B. Hariharan, Extending and Analyzing Self-Supervised Learning Across Domains, ECCV 2020
L. Ericsson et al., How Well Do Self-Supervised Models Transfer?, CVPR 2021
G. Van Horn et al., Benchmarking Representation Learning for Natural World Image Collections, CVPR 2021] --- class: middle .center.width-70[![](images/next/avc.png)] .center[Traditionally, multi-modal self-supervised methods rely on cross-modal supervision and process each modality individually] .citation[R. Arandjelović and A. Zisserman, Look, listen and learn, ICCV 2017] --- class: middle .center.width-60[![](images/next/drive4u.jpg)] .center[However, most robots rely on a range of complementary sensors to understand their environment] .hidden.center.Q[How to better leverage information from different sensors and their interplay, in particular for robotics?] --- class: middle .center.width-60[![](images/next/drive4u.jpg)] .center[However, most robots rely on a range of complementary sensors to understand their environment] .center.Q[How to better leverage information from different sensors and their interplay, in particular for robotics?] --- class: middle ## .center[Self-supervision for other sensors and modalities] .grid[ .kol-7-12[
.center.width-100[![](images/next/pointcontrast.png)] .caption[PointContrast: recognize a point across views] ] .kol-5-12[ .hidden.center.width-90[![](images/next/flowe.png)] .hidden.caption[FlowE: Encourage features to obey the same transformation as the input image pairs.] ] ] .citation[S. Xie et al., PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding, ECCV 2020
.hidden[Y. Xiong et al., Self-Supervised Representation Learning from Flow Equivariance, arXiv 2021]] --- count: false class: middle ## .center[Self-supervision for other sensors and modalities] .grid[ .kol-7-12[
.center.width-100[![](images/next/pointcontrast.png)] .caption[PointContrast: recognize a point across views] ] .kol-5-12[ .center.width-90[![](images/next/flowe.png)] .caption[FlowE: Encourage features to obey the same transformation as the input image pairs.] ] ] .citation[S. Xie et al., PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding, ECCV 2020
Y. Xiong et al., Self-Supervised Representation Learning from Flow Equivariance, arXiv 2021] --- class: middle, center .bigger[In spite of the impressive progress in the past few years, the most common usage of self-supervised methods is still just for pre-training.] .hidden.center.Q[Finding new applications of self-supervised learning.] --- count: false class: middle, center .bigger[In spite of the impressive progress in the past few years, the most common usage of self-supervised methods is still just for pre-training.] .center.Q[Finding new applications of self-supervised learning.] --- class: middle ## .center[Cross-domain detection with test-time training] .center.width-55[![](images/next/oshot-1.png)] .caption[Using self-supervision and self-training for unsupervised domain adaptation over a single (test) image] .hidden[ .center.width-50[![](images/next/oshot-2.png)] .caption[The Social Bikes concept-dataset acquired from different social networks] ] .citation[A. d'Innocente et al., One-Shot Unsupervised Cross-Domain Detection, ECCV 2020
Y. Sun et al., Test-Time Training with Self-Supervision for Generalization under Distribution Shifts, ICML 2020] --- count: false class: middle ## .center[Cross-domain detection with test-time training] .center.width-55[![](images/next/oshot-1.png)] .caption[Using self-supervision and self-training for unsupervised domain adaptation over a single (test) image] .center.width-50[![](images/next/oshot-2.png)] .caption[The Social Bikes concept-dataset acquired from different social networks] .citation[A. d'Innocente et al., One-Shot Unsupervised Cross-Domain Detection, ECCV 2020
Y. Sun et al., Test-Time Training with Self-Supervision for Generalization under Distribution Shifts, ICML 2020] --- class: middle # What is next? ## Improving data and compute efficiency ## Going beyond single object images ## Going beyond curated datasets ## Towards better evaluation practices ## Multi-modal reasoning with self-supervised learning ## New applications of self-supervised learning --- layout: false class: end-slide, center count: false The end.