layout: true
.center.footer[Spyros GIDARIS and Andrei BURSUC | Advances in Self-Supervised Learning: Introduction]

---

class: center, middle, title-slide
count: false

## .bold[CVPR 2021 Tutorial]

# Leave Those Nets Alone:
Advances in Self-Supervised Learning
.grid[
.kol-3-12[ .bold[Spyros Gidaris] ]
.kol-3-12[ .bold[Andrei Bursuc] ]
.kol-3-12[ .bold[Jean-Baptiste Alayrac] ]
.kol-3-12[ .bold[Adrià Recasens] ]
]
.grid[
.kol-3-12[ .bold[Mathilde Caron] ]
.kol-3-12[ .bold[Olivier Hénaff] ]
.kol-3-12[ .bold[Aäron van den Oord] ]
.kol-3-12[ .bold[Relja Arandjelović] ]
]

.foot[https://gidariss.github.io/self-supervised-learning-cvpr2021/]

---

class: center, middle, title-slide
count: false

## CVPR 2021 Tutorial

## .bold[Leave Those Nets Alone:
Advances in Self-Supervised Learning]

# Introduction
.grid[
.kol-2-12[ ]
.kol-4-12[ .bold[Spyros Gidaris] ]
.kol-4-12[ .bold[Andrei Bursuc] ]
.kol-2-12[ ]
]
.reset-columns[ ]
.center.width-20[![](images/logos/valeoai.png)]

.foot[https://gidariss.github.io/self-supervised-learning-cvpr2021/]

---

class: middle, center

# Motivation

---

class: middle

.center.bigger[Deep Learning + Supervised Learning is a really cool and powerful combo.]

.center.width-90[![](images/deep_success.png)]

---

class: middle

## .center[Deep Learning: how does it work?]

.center.width-90[![](images/deep_supervised_train_pipeline.png)]

- Predefine the set of visual concepts to be learned
- Collect a large and diverse set of examples for each of them
- Train a deep model for several GPU hours or days

---

class: middle, center

.big[Meanwhile, in the real world ...]

---

class: middle

## Difficult to acquire and curate large human-annotated datasets

.center.width-60[![](images/imagenet.jpeg)]
.grid[
.kol-5-12[
- Requires intense human labor: annotating + cleaning raw data
- Time-consuming and expensive
- Error-prone (human mistakes)
]
.kol-7-12[
.center.width-90[![](images/city_seg.png)]
.caption[Annotating such an image: ~1.5h]
]
]

---

class: middle

## Difficult to keep pace with an ever-changing world

.center.width-80[![](images/men-fashion-80s.jpg)]
.caption[Men's fashion trends 1980-1989]

- Data distributions shift all the time, e.g., fashion trends, new Instagram filters
- Infeasible to launch large annotation campaigns each time

.citation[Image credit: [La Polo](https://www.lapolo.in/blog/100-years-mens-fashion/)]

---

class: middle

## Difficult to keep pace with an ever-changing world

.center.width-80[![](images/super-mario.png)]
.caption[Super Mario from 1981 to 2017]

- Sensor specs are frequently upgraded
- Infeasible to launch large annotation campaigns each time

---

class: middle, center

.big[Can we exploit anything from raw data?]

---

class: middle

## Exploiting raw unlabeled data

.center.width-90[![](images/raw_data.png)]
- Acquiring raw unlabeled data is usually easy
- However, typical supervised methods cannot exploit it

---

class: middle, center

.bigger[Deep Learning requires *large amounts* of *carefully labeled data*, which is **difficult to acquire** and **expensive to annotate**.]

---

class: middle, center

.bigger[Even with large amounts of data, supervised learning still has several blind spots in terms of **learning useful and rich representations**.]

---

class: middle

## .center[The supervision signal can bias the network in unexpected ways]

.center.width-90[![](images/gatys-1.png)]
.caption[VGG-16 predictions on original and artificially texturised images.]

.citation[L.A. Gatys et al., Texture and art with deep neural networks, Current Opinion in Neurobiology 2017]

---

class: middle

## .center[The supervision signal can bias the network in unexpected ways]

.center.width-85[![](images/texture-bias.png)]
.caption[Classification predictions of a ResNet-50 trained on ImageNet]

.citation[R. Geirhos et al., ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, ICLR 2019]

---

class: middle

## .center[Self-supervision to the rescue?]

.center.width-60[![](images/steering-pixels.png)]
.caption[Bottom images are transformed such that local statistics are preserved while global statistics are altered.]

.hidden[
Train a linear binary classifier (original vs. transformed images) over $\texttt{conv5}$ features from:
+ model pre-trained on ImageNet labels $\rightarrow$ accuracy of $78\%$
+ model pre-trained with self-supervision $\rightarrow$ accuracy of $85\%$
]

.citation[S. Jenni et al., Steering Self-Supervised Feature Learning Beyond Local Pixel Statistics, CVPR 2020]

---

count: false
class: middle

## .center[Self-supervision to the rescue?]

.center.width-60[![](images/steering-pixels.png)]
.caption[Bottom images are transformed such that local statistics are preserved while global statistics are altered.]

Train a linear binary classifier (original vs. transformed images) over $\texttt{conv5}$ features from:
+ model pre-trained on ImageNet labels $\rightarrow$ accuracy of $78\%$
+ model pre-trained with self-supervision $\rightarrow$ accuracy of $85\%$

.citation[S. Jenni et al., Steering Self-Supervised Feature Learning Beyond Local Pixel Statistics, CVPR 2020]

---

class: middle, center

.bigger[Improving representation learning requires features that are *not specialized for solving a particular supervised task*, but rather *encapsulate richer statistics for various downstream tasks*.]

---

class: middle

## .center[Inspiring success from self-supervision in NLP, e.g., **word2vec** and **BERT**]

.center.width-85[![](images/bert-1.png)]
.caption[Missing word prediction task.]

.center.width-85[![](images/bert-2.png)]
.caption[Next sentence prediction task.]

.citation[T. Mikolov et al., Efficient estimation of word representations in vector space, ArXiv 2013
T. Mikolov et al., Distributed representations of words and phrases and their compositionality, NeurIPS 2013
J. Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, ArXiv 2018]
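---

class: middle

## .center[Missing word prediction: a toy sketch]

A CBOW-style toy version of the missing-word objective above, in the spirit of word2vec (this is not the word2vec or BERT reference code); `corpus` (a list of token-id lists), the vocabulary size, and all hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

vocab_size, dim, window = 10_000, 128, 2
emb = torch.nn.Embedding(vocab_size, dim)  # word embeddings to be learned
out = torch.nn.Linear(dim, vocab_size)     # scores over the vocabulary
opt = torch.optim.Adam(list(emb.parameters()) + list(out.parameters()), lr=1e-3)

for tokens in corpus:
    for i in range(window, len(tokens) - window):
        # Withhold the center word; its neighbors are the input.
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        ctx = emb(torch.tensor(context)).mean(dim=0)
        loss = F.cross_entropy(out(ctx).unsqueeze(0), torch.tensor([tokens[i]]))
        opt.zero_grad(); loss.backward(); opt.step()
```

The text itself provides the labels: no human annotation is involved.

---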
class: middle

# What is self-supervision?


- .bigger[A form of unsupervised learning where the **data (not the human) provides the supervision signal**]

.hidden[
- .bigger[Usually, we *define a pretext task* whose solution forces the network to learn what we really care about]
]

.hidden[
- .bigger[For most pretext tasks, *a part of the data is withheld* and the network has to predict it]
]

.hidden[
- .bigger[The features/representations learned on the pretext task are subsequently used for a different *downstream task*, for which some annotations are usually available.]
]

---

count: false
class: middle

# What is self-supervision?
- .bigger[A form of unsupervised learning where the **data (not the human) provides the supervision signal**]

- .bigger[Usually, we *define a pretext task* whose solution forces the network to learn what we really care about]

.hidden[
- .bigger[For most pretext tasks, *a part of the data is withheld* and the network has to predict it]
]

.hidden[
- .bigger[The features/representations learned on the pretext task are subsequently used for a different *downstream task*, for which some annotations are usually available.]
]

---

count: false
class: middle

# What is self-supervision?
- .bigger[A form of unsupervised learning where the **data (not the human) provides the supervision signal**]

- .bigger[Usually, we *define a pretext task* whose solution forces the network to learn what we really care about]

- .bigger[For most pretext tasks, *a part of the data is withheld* and the network has to predict it]

.hidden[
- .bigger[The features/representations learned on the pretext task are subsequently used for a different *downstream task*, for which some annotations are usually available.]
]

---

count: false
class: middle

# What is self-supervision?
- .bigger[A form of unsupervised learning where the **data (not the human) provides the supervision signal**]

- .bigger[Usually, we *define a pretext task* whose solution forces the network to learn what we really care about]

- .bigger[For most pretext tasks, *a part of the data is withheld* and the network has to predict it]

- .bigger[The features/representations learned on the pretext task are subsequently used for a different *downstream task*, for which some annotations are usually available.]

---

class: middle

## .center[Example: Rotation prediction]

.center.width-70[![](images/rotnet_3.png)]

.center[Predict the orientation of the image (a code sketch follows a few slides ahead)]

.citation[S. Gidaris et al., Unsupervised Representation Learning by Predicting Image Rotations, ICLR 2018]

---

# Self-supervised learning pipeline

.center.bold.bigger[*Stage 1:* Train network on pretext task (without human labels)]
.center.width-90[![](images/self-sup-pipeline-step1.png)]

---

count: false

# Self-supervised learning pipeline

.center.bold.bigger[*Stage 1:* Train network on pretext task (without human labels)]
.center.width-90[![](images/self-sup-pipeline-step1.png)]

.center.bold.bigger[*Stage 2:* Train classifier on learned features for new task with fewer labels]
.center.width-90[![](images/self-sup-pipeline-step2-1.png)]

---

count: false

# Self-supervised learning pipeline

.center.bold.bigger[*Stage 1:* Train network on pretext task (without human labels)]
.center.width-90[![](images/self-sup-pipeline-step1.png)]

.center.bold.bigger[*Stage 2:* Fine-tune network for new task with fewer labels]
.center.width-90[![](images/self-sup-pipeline-step2-3.png)]

---

class: middle, black-slide

## .center[Karate Kid and Self-Supervised Learning]

.center.width-85[![](images/karate-kid-poster-wide.jpg)]
.caption[The Karate Kid (1984)]

---

class: middle, black-slide

## .center[Stage 1: Train .italic[muscle memory] on pretext tasks]

.grid[
.kol-4-12[ .center.width-100[![](images/karate-kid-clean.gif)] ]
.kol-4-12[ .center.width-100[![](images/karate-kid-paint.webp)] ]
.kol-4-12[ .center.width-100[![](images/karate-kid-wax.gif)] ]
]

.hidden[
.grid[
.kol-6-12[
$$\begin{aligned} \text{Mr. Miyagi} &= \text{Deep Learning Practitioner} \\\\ \text{Daniel LaRusso} &= \text{ConvNet}\end{aligned}$$
]
.kol-6-12[
$$\begin{aligned}\text{daily chores} &= \text{pretext tasks} \\\\ \text{learning karate} &= \text{downstream task}\end{aligned}$$
]
]
]

---

class: middle, black-slide

## .center[Stage 1: Train .italic[muscle memory] on pretext tasks]

.grid[
.kol-4-12[ .center.width-100[![](images/karate-kid-clean.gif)] ]
.kol-4-12[ .center.width-100[![](images/karate-kid-paint.webp)] ]
.kol-4-12[ .center.width-100[![](images/karate-kid-wax.gif)] ]
]

.grid[
.kol-6-12[
$$\begin{aligned} \text{Mr. Miyagi} &= \text{Deep Learning Practitioner} \\\\ \text{Daniel LaRusso} &= \text{ConvNet}\end{aligned}$$
]
.kol-6-12[
$$\begin{aligned}\text{daily chores} &= \text{pretext tasks} \\\\ \text{learning karate} &= \text{downstream task}\end{aligned}$$
]
]

---

class: middle, black-slide

## .center[Stage 2: Fine-tune skills rapidly]

.center.width-60[![](images/karate-kid-train.gif)]

---

class: middle, center

.Q[.big[Is this actually useful in practice?]]

---

class: middle

## .center[Transfer learning - object detection]

.grid[
.kol-7-12[
.center.width-100[![](images/progress_selfsup_detection.png)]
.caption[Object detection with Faster R-CNN fine-tuned on VOC $\texttt{trainval07+12}$ and evaluated on $\texttt{test07}$. Networks are pre-trained with self-supervision on ImageNet.]
]
.kol-5-12[
- Rapid progress in self-supervised learning
- Self-supervised methods are starting to outperform supervised methods
- This is a __key milestone for self-supervised methods__, as they are finally showing their effectiveness on complex downstream tasks.
]
]

---
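class: middle

## .center[Example: Rotation prediction in code]

A minimal sketch (not the authors' reference code) of the rotation-prediction pretext task shown a few slides back, assuming PyTorch/torchvision and an illustrative `unlabeled_loader` that yields batches of images of shape (B, 3, H, W):

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def rotate_batch(x):
    # 4 copies of each image, rotated by 0/90/180/270 degrees, plus the
    # rotation labels (0-3): the data itself provides the supervision.
    rotations = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(x.size(0))
    return torch.cat(rotations, dim=0), labels

model = resnet18(num_classes=4)  # 4-way head over the rotation classes
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for images in unlabeled_loader:  # raw images, no human annotations
    inputs, targets = rotate_batch(images)
    loss = F.cross_entropy(model(inputs), targets)
    opt.zero_grad(); loss.backward(); opt.step()
```

After this Stage 1 training, the rotation head is discarded and the backbone features are reused for the downstream task (Stage 2).

.citation[S. Gidaris et al., Unsupervised Representation Learning by Predicting Image Rotations, ICLR 2018]

---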
Loosely speaking, multiple old and new approaches could fit, at least partially, the definition of self-supervised learning:

- input or feature reconstruction: .cites[[Hinton and Salakhutdinov (2006); Vincent et al. (2008); Gidaris et al. (2020); Grill et al. (2020)]]
- generating data: .cites[[Goodfellow et al. (2014)]]
- training with paired signals: .cites[[V. De Sa (1994); Arandjelovic and Zisserman (2017)]]
- hiding data from the network: .cites[[Doersch et al. (2015); Zhang et al. (2017)]]
- instance discrimination: .cites[[Dosovitskiy et al. (2014); van den Oord et al. (2018)]]

...

.citation[G. Hinton and R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science 2006
P. Vincent et al., Extracting and Composing Robust Features with Denoising Autoencoders, ICML 2008
S. Gidaris et al., Learning Representations by Predicting Bags of Visual Words, CVPR 2020
J.B. Grill et al., Bootstrap your own latent: A new approach to self-supervised Learning, NeurIPS 2020
I. Goodfellow et al., Generative Adversarial Networks, NeurIPS 2014
V. De Sa, Learning classification with unlabeled data, NeurIPS 1994
R. Arandjelovic and A. Zisserman, Look, Listen and Learn, ICCV 2017
R. Zhang et al., Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction, CVPR 2017
C. Doersch et al., Unsupervised Visual Representation Learning by Context Prediction, ICCV 2015
A. Dosovitskiy et al., Discriminative Unsupervised Feature Learning with Convolutional Neural Networks, NeurIPS 2014
A. van den Oord et al., Representation Learning with Contrastive Predictive Coding, ArXiv 2018
]

---

class: middle

# .center[Scope]

.center.bigger[In this tutorial, we **focus** on self-supervised methods that lead
to *useful representations*, obtained by inventing *a pretext task* and/or by *hiding a part or view of the original data* from the network.]

---

class: middle, center

# Evaluating Self-Supervised Methods

---

class: middle

.bigger[Self-supervised methods are evaluated on a range of datasets and tasks .cites[[Goyal et al. (2019); Zhai et al. (2019)]]]

.bigger[In most benchmarks the model is *pre-trained on ImageNet* on a pretext task and *subsequently fine-tuned on other datasets* or protocols.]

.citation[P. Goyal et al., Scaling and Benchmarking Self-Supervised Visual Representation Learning, ICCV 2019
X. Zhai et al., A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark, ArXiv 2019]

---

class: middle

# Evaluation tasks
## Linear classification / probe
## Efficient learning
## Transfer learning

---

# Linear classification / probe

.center.width-70[![](images/self-sup-pipeline-step2-1-crop.png)]

- Simplest evaluation of the utility of learned representations: fit a linear classifier (FC layer, linear SVM); a code sketch follows at the end of this section
- .bold[Typical datasets:] ImageNet, Places205, Pascal VOC07 (image classification), COCO14 (image classification), iNat

.citation[P. Goyal et al., Scaling and Benchmarking Self-Supervised Visual Representation Learning, ICCV 2019]

---

class: middle

.center.width-60[![](images/imagenet-classif-1.png)]
.caption[ImageNet Top-1 accuracy of linear classifiers trained on representations learned with different self-supervised methods (pretrained on ImageNet).]

.citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ICML 2020]

---

class: middle

.center.width-50[![](images/byol-imagenet.png)]
.caption[ImageNet Top-1 accuracy of linear classifiers trained on representations learned with different self-supervised methods (pretrained on ImageNet).]

.citation[J. B. Grill et al., Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, NeurIPS 2020]

---

# Annotation-efficient classification

.center.width-70[![](images/self-sup-pipeline-step2-2-crop.png)]

- Fine-tune pre-trained network on a subset of labels $1\\%{-}100\\%$
- .bold[Datasets:] ImageNet, VTAB

???

- ImageNet is still the most popular choice, though new datasets are being proposed now, e.g., the VTAB benchmark

---

class: middle

.grid[
.kol-6-12[
.center.width-100[![](images/few-labels-bench-2.png)]
.caption[ImageNet accuracy of models trained with few labels: CPCv2 vs. supervised]
]
.kol-6-12[
- Supervised networks do not generalize well from only a few labeled examples
- Self-supervised networks reach significantly better accuracy in the low-data regime
]
]

.citation[O. Hénaff et al., Data-Efficient Image Recognition with Contrastive Predictive Coding, ArXiv 2019]

---

# Transfer learning

.center.width-70[![](images/self-sup-pipeline-step2-3-crop.png)]

- The pre-trained model is augmented with task-specific modules (e.g., decoders for semantic segmentation, RPN for object detection) and fine-tuned partially or completely
- .bold[Tasks and datasets:]
.smaller-x[
  + Object detection: VOC07, VOC12, COCO14
  + Semantic segmentation: Cityscapes, ADE20K
  + Other tasks: Surface Normal Estimation (NYUv2), Visual Navigation (Gibson)
]

.citation[P. Goyal et al., Scaling and Benchmarking Self-Supervised Visual Representation Learning, ICCV 2019]
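---

class: middle

## .center[Linear probe in code]

A minimal sketch of the linear-probe protocol described earlier in this section, assuming PyTorch, a pre-trained `backbone` that maps images to `feat_dim`-dimensional features, and an illustrative labeled `train_loader`; all names are placeholders, not a reference implementation:

```python
import torch
import torch.nn.functional as F

# Freeze the self-supervised backbone: only the probe is trained.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

probe = torch.nn.Linear(feat_dim, num_classes)  # the only trainable part
opt = torch.optim.SGD(probe.parameters(), lr=0.01, momentum=0.9)

for images, labels in train_loader:  # labeled downstream data
    with torch.no_grad():
        feats = backbone(images)     # fixed (B, feat_dim) representations
    loss = F.cross_entropy(probe(feats), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```

For the annotation-efficient and transfer-learning protocols, the backbone is instead unfrozen and fine-tuned together with the task-specific head.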
---

class: middle

## Leave Those Nets Alone:
Advances in Self-Supervised Learning

- 10:00 - 10:30 EDT (16:00 - 16:30 CET) Introduction _.italic.smaller-x[by Spyros and Andrei]_
- 10:35 - 11:00 EDT (16:35 - 17:00 CET) Contrastive learning _.italic.smaller-x[by Olivier and Aäron]_
- 11:05 - 12:00 EDT (17:05 - 18:00 CET) Teacher-student approaches _.italic.smaller-x[by Spyros and Andrei]_
- 12:05 - 12:50 EDT (18:05 - 18:50 CET) Clustering-style self-supervised learning _.italic.smaller-x[by Mathilde]_
- 12:55 - 13:50 EDT (18:55 - 19:50 CET) Multi-modal approaches _.italic.smaller-x[by Jean-Baptiste and Adrià]_
- 13:55 - 14:30 EDT (19:55 - 20:30 CET) What is next?

---

count: false
class: middle

## Leave Those Nets Alone:
Advances in Self-Supervised Learning

- .gray[10:00 - 10:30 EDT (16:00 - 16:30 CET) Introduction .italic.smaller-x[by Spyros and Andrei]]
- 10:35 - 11:00 EDT (16:35 - 17:00 CET) Contrastive learning _.italic.smaller-x[by Olivier and Aäron]_
- 11:05 - 12:00 EDT (17:05 - 18:00 CET) Teacher-student approaches _.italic.smaller-x[by Spyros and Andrei]_
- 12:05 - 12:50 EDT (18:05 - 18:50 CET) Clustering-style self-supervised learning _.italic.smaller-x[by Mathilde]_
- 12:55 - 13:50 EDT (18:55 - 19:50 CET) Multi-modal approaches _.italic.smaller-x[by Jean-Baptiste and Adrià]_
- 13:55 - 14:30 EDT (19:55 - 20:30 CET) What is next?

---

layout: false
class: end-slide, center
count: false

The end.