layout: true .center.footer[Andrei BURSUC and Relja ARANDJELOVIĆ | Self-Supervised Learning] --- class: center, middle, title-slide count: false ## .bold[CVPR 2020 Tutorial] # Towards Annotation-Efficient Learning # Self-Supervised Learning
.grid[ .kol-2-12[ ] .kol-4-12[ .center.big.bold[Andrei Bursuc]
.center[
] ] .kol-4-12[ .center.big.bold[Relja Arandjelović]
.center[
] ] .kol-2-12[ ] ] .foot[https://annotation-efficient-learning.github.io/]
--- class: middle # Self-Supervised Learning ## 1. Motivation ## 2. A tour of pretext tasks for Self-Supervised Learning ## 3. Practical considerations ## 4. Evaluation ## 5. Applications --- class: middle, center # 1. Motivation --- count: false class: middle, center .center.big[Deep Learning + Supervised Learning is a really cool and strong combo.] .hidden.center.big.italic[... when task and data permit it.] --- count: false class: middle .center.big[Deep Learning + Supervised Learning is a really cool and strong combo.] .center.big[... when task and data permit it.] --- class: middle .center.big[In addition to **data acquisition and annotation challenges**,
the _representation learning perspective_ provides additional reasons
to look for alternative or complementary solutions. ] --- class: middle ## .center[The type of supervision signal can bias the network in unexpected ways] .center.width-90[![](images/gatys-1.png)] .caption[VGG-16 predictions on original and artificially texturised images.] .citation[L.A. Gatys et al., Texture and art with deep neural networks, Neurobiology 2017] --- class: middle ## .center[The type of supervision signal can bias the network in unexpected ways] .center.width-85[![](images/texture-bias.png)] .caption[Classification predictions of a ResNet-50 trained on ImageNet] .citation[R. Geirhos et al., ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, ICLR 2019] --- class: middle, center .big[Improving representation learning requires features that are *not specialized for solving a particular supervised task*, but rather *encapsulate richer statistics for various downstream tasks*.] --- class: middle ## .center[The success of self-supervised methods in NLP, e.g. _word2vec_, is inspiring ] .center.width-85[![](images/bert-1.png)] .caption[Missing word prediction task.] .center.width-85[![](images/bert-2.png)] .caption[Next sentence prediction task.] .citation[T. Mikolov et al., Efficient estimation of word representations in vector space, ArXiv 2013
T. Mikolov et al., Distributed representations of words and phrases and their compositionality, NeurIPS 2013
J. Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, ArXiv 2018]
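--- class: middle ## .center[Pretext labels come for free: a toy sketch] As a concrete illustration of labels created from the data itself (a toy example written for this deck, not the word2vec/BERT training code), missing-word prediction pairs can be built from raw text with no human annotation:

```py
# Toy construction of (context, target) pairs for missing-word prediction.
# The raw text itself provides the supervision signal.
sentence = "the quick brown fox jumps over the lazy dog".split()

pairs = []
for i, word in enumerate(sentence):
    context = sentence[:i] + ["[MASK]"] + sentence[i + 1:]
    pairs.append((" ".join(context), word))  # input with one word hidden, target = the hidden word

print(pairs[3])  # ('the quick brown [MASK] jumps over the lazy dog', 'fox')
```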
- .bigger[A form of unsupervised learning where the **data (not the human) provides the supervision signal**] .hidden[ - .bigger[Usually, *define a pretext task* for which the network is forced to learn what we really care about] ] .hidden[ - .bigger[For most pretext tasks, *a part of the data is withheld* and the network has to predict it] ] .hidden[ - .bigger[The features/representations learned on the pretext task are subsequently used for a different *downstream task*, usually where some annotations are available.] ] --- count: false class: middle # .center[What is self-supervision?]
- .bigger[A form of unsupervised learning where the **data (not the human) provides the supervision signal**] - .bigger[Usually, *define a pretext task* for which the network is forced to learn what we really care about] .hidden[ - .bigger[For most pretext tasks, *a part of the data is withheld* and the network has to predict it] ] .hidden[ - .bigger[The features/representations learned on the pretext task are subsequently used for a different *downstream task*, usually where some annotations are available.] ] --- count: false class: middle # .center[What is self-supervision?]
- .bigger[A form of unsupervised learning where the **data (not the human) provides the supervision signal**] - .bigger[Usually, *define a pretext task* for which the network is forced to learn what we really care about] - .bigger[For most pretext tasks, *a part of the data is withheld* and the network has to predict it] .hidden[ - .bigger[The features/representations learned on the pretext task are subsequently used for a different *downstream task*, usually where some annotations are available.] ] --- count: false class: middle # .center[What is self-supervision?]
- .bigger[A form of unsupervised learning where the **data (not the human) provides the supervision signal**] - .bigger[Usually, *define a pretext task* for which the network is forced to learn what we really care about] - .bigger[For most pretext tasks, *a part of the data is withheld* and the network has to predict it] - .bigger[The features/representations learned on the pretext task are subsequently used for a different *downstream task*, usually where some annotations are available.] --- class: middle ## .center[Example: Rotation prediction] .center.width-70[![](images/rotnet_3.png)] .center[Predict the orientation of the image] .citation[S. Gidaris et al., Unsupervised Representation Learning by Predicting Image Rotations, ICLR 2018] --- # Self-supervised learning pipeline .center.bold.bigger[*Stage 1:* Train network on pretext task (without human labels)] .center.width-90[![](images/self-sup-pipeline-step1.png)] --- count:false # Self-supervised learning pipeline .center.bold.bigger[*Stage 1:* Train network on pretext task (without human labels) ] .center.width-90[![](images/self-sup-pipeline-step1.png)] .center.bold.bigger[*Stage 2:* Train classifier on learned features for new task with fewer labels] .center.width-90[![](images/self-sup-pipeline-step2-1.png)] --- count:false # Self-supervised learning pipeline .center.bold.bigger[*Stage 1:* Train network on pretext task (without human labels)] .center.width-90[![](images/self-sup-pipeline-step1.png)] .center.bold.bigger[*Stage 2:* Fine-tune network for new task with fewer labels] .center.width-90[![](images/self-sup-pipeline-step2-3.png)] --- class: middle, black-slide ## .center[Karate Kid and Self-Supervised Learning] .center.width-85[![](images/karate-kid-poster-wide.jpg)] .caption[The Karate Kid (1984)] --- class: middle, black-slide ## .center[Stage 1: Train .italic[muscle memory] on pretext tasks] .grid[ .kol-4-12[ .center.width-100[![](images/karate-kid-clean.gif)] ] .kol-4-12[ .center.width-100[![](images/karate-kid-paint.webp)] ] .kol-4-12[ .center.width-100[![](images/karate-kid-wax.gif)] ] ] .hidden[ .grid[ .kol-6-12[ $$\begin{aligned} \text{Mr. Miyagi} &= \text{Deep Learning Practitioner} \\\\ \text{Daniel LaRusso} &= \text{ConvNet}\end{aligned}$$ ] .kol-6-12[ $$\begin{aligned}\text{daily chores} &= \text{pretext tasks} \\\\ \text{learning karate} &= \text{downstream task}\end{aligned}$$ ] ] ] --- class: middle, black-slide ## .center[Stage 1: Train .italic[muscle memory] on pretext tasks] .grid[ .kol-4-12[ .center.width-100[![](images/karate-kid-clean.gif)] ] .kol-4-12[ .center.width-100[![](images/karate-kid-paint.webp)] ] .kol-4-12[ .center.width-100[![](images/karate-kid-wax.gif)] ] ] .grid[ .kol-6-12[ $$\begin{aligned} \text{Mr. Miyagi} &= \text{Deep Learning Practitioner} \\\\ \text{Daniel LaRusso} &= \text{ConvNet}\end{aligned}$$ ] .kol-6-12[ $$\begin{aligned}\text{daily chores} &= \text{pretext tasks} \\\\ \text{learning karate} &= \text{downstream task}\end{aligned}$$ ] ] --- class: middle, black-slide ## .center[Stage 2: Fine-tune skills rapidly] .center.width-60[![](images/karate-kid-train.gif)] --- class: middle, black-slide .center.width-60[![](images/karate-kid-fly.gif)] $$\text{Mr. 
Miyagi during Stage 1} = \text{catching up on ArXiv papers}$$ --- class: middle, center .Q[.big[Is this actually useful in practice?]] --- class: middle ## .center[Transfer learning - object detection] .grid[ .kol-6-12[ .center.width-90[![](images/voc-bench.png)] .caption[Object detection with Faster R-CNN fine-tuned on VOC $\texttt{trainval07+12}$ and evaluated on $\texttt{test07}$. Networks are pre-trained with self-supervision on ImageNet.] ] .kol-6-12[
- Self-supervised methods are starting to outperform supervised methods - This is a __key milestone for self-supervised methods__ as they are finally showing their effectiveness on more complex downstream tasks. ] ] .citation[S. Gidaris et al., Learning Representations by Predicting Bags of Visual Words, CVPR 2020] --- class: middle Loosely speaking, multiple old and new approaches could fit, at least partially, the definition of self-supervised learning: - input reconstruction: .cites[[Hinton and Salakhutdinov (2006); Vincent et al. (2008)]] - training with paired signals: .cites[[V. De Sa (1994); Arandjelovic and Zisserman (2017); C. Godard et al. (2017)]] - hiding data from the networks: .cites[[Doersch et al. (2015); Zhang et al. (2017)]] - instance discrimination: .cites[[Dosovitskiy et al. (2014); van der Oord et al. (2018)]] - etc.
.citation[G. Hinton and R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science 2006
P. Vincent et al., Extracting and Composing Robust Features with Denoising Autoencoders, ICML 2008
V. De Sa, Learning classification from unlabelled data, NeurIPS 1994
R. Arandjelovic and A. Zisserman, Look, Listen and Learn, ICCV 2017
C. Godard et al., Unsupervised Monocular Depth Estimation with Left-Right Consistency, CVPR 2017
R. Zhang et al., Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction, CVPR 2017
C. Doersch et al., Unsupervised Visual Representation Learning by Context Prediction, ICCV 2015
A. Dosovitskiy et al., Discriminative Unsupervised Feature Learning with Convolutional Neural Networks, NeurIPS 2014
A. van der Oord et al., Representation Learning with Contrastive Predictive Coding, ArXiv 2018 ] --- class: middle # .center[Scope] .center.big[In this tutorial, we **focus** on self-supervised methods that lead
to *useful representations*, obtained through the invention of *a pretext task* and/or *by hiding a part of the original data* to the network.] --- class: middle, center # 2. A tour of pretext tasks for Self-Supervised Learning --- layout: false class: end-slide, center count: false .bigger[Handing the mic to Relja ...] --- layout: true .center.footer[Andrei BURSUC and Relja ARANDJELOVIĆ | Self-Supervised Learning] --- class: middle, center # 3. Practical considerations --- class: middle # Practical considerations ## Noise Contrastive Estimation --- class: middle Let $x \in \mathcal{X}$ be input data and $y \in \\{1, \dots, L\\}$ and $f\_{\theta}(\cdot):\mathcal{X}\rightarrow\mathbb{R}^D$ a network generating an embedding vector $f\_{\theta}(x)$. We denote: - $q=f\_{\theta}(x)$ (*query*) - $\\{x^{i}\\}$ a set of samples from $\mathcal{X}$. - $k\_{i} = f\_{\theta}(x^{i})$ the embeddings of $\\{x^{i}\\}$ as keys (*representations*) --- class: middle ## Parametric Classifier - We are interested in instance discrimination, i.e. recognizing a particular query sample - We can formulate the instance-level classification objective using the softmax criterion - We consider each neuron from the last layer as a class-level weight vector $w\_{i}$ (with $b = 0$), i.e. _class prototype_ $$\mathcal{L}\_{\text{softmax}} (q, c(q)) = - \log\frac{\exp(q^\top w\_{c(q)})}{ \sum\_{c\in C}\exp(q^\top w\_{c})}$$ where: - $C$ is the number of classes - $c(q)$ is the class index of $q$ .citation[Z. Wu et al., Unsupervised Feature Learning via Non-Parametric Instance Discrimination, CVPR 2018] --- class: middle ## Non-Parametric Classifier - Using $w\_i$ as class prototype prevents explicit comparison between instances, i.e. individual samples - We can use instead a _non-parametric_ variant that replaces $q^\top w\_j$ with $q^\top k\_i$ $$\mathcal{L}\_{\text{non-param-softmax}}(q) = - \log \frac{\exp(q^\top k\_q )}{ \sum\_{i \in N}\exp(q ^\top k\_{i})}$$ where: - $N$ is the number of training samples - $k\_q \in \\{ k\_i\\}$ is the key of a positive sample for $q$ .citation[Z. Wu et al., Unsupervised Feature Learning via Non-Parametric Instance Discrimination, CVPR 2018] --- count: false class: middle ## Non-Parametric Classifier - Using $w\_i$ as class prototype prevents explicit comparison between instances, i.e. individual samples - We can use instead a _non-parametric_ variant that replaces $q^\top w\_j$ with $q^\top k\_i$ $$\mathcal{L}\_{\text{softmax}} (q, c(q)) = - \log\frac{\exp(q^\top w\_{c(q)})}{ \sum\_{c\in C}\exp(q^\top w\_{c})} \longrightarrow \mathcal{L}\_{\text{non-param-softmax}}(q) = - \log \frac{\exp(q^\top k\_q )}{ \sum\_{i \in N}\exp(q ^\top k\_{i})}$$ where: - $N$ is the number of training samples - $k\_q \in \\{ k\_i\\}$ is the key of a positive sample for $q$ .citation[Z. Wu et al., Unsupervised Feature Learning via Non-Parametric Instance Discrimination, CVPR 2018] --- class: middle ## Non-Parametric Classifier .center.width-60[![](images/npid-crop.png)] .caption[Self-supervised learning as image instance-level discrimination] $$ \mathcal{L}\_{\text{non-param-softmax}}(q) = - \log \frac{\exp(q^\top k\_q )}{ \sum\_{i \in N}\exp(q ^\top k\_{i})}$$ The learning objective focuses now entirely on feature representation, instead of class-specific representations .citation[Z. 
Wu et al., Unsupervised Feature Learning via Non-Parametric Instance Discrimination, CVPR 2018] --- class: middle ## Noise-Contrastive Estimation (NCE) - Computing the non-parametric softmax is expensive when the number of classes $N$ is very large, e.g. millions - Popular solutions: NCE, hierarchical softmax .hidden[ - .bold[NCE:] cast the multi-class classification into a set of binary classification problems, each discriminating between __data samples (positives)__ and __noise samples (negatives)__. $$\mathcal{L}\_{\text{NCE}}(q) = - \log \frac{\exp(q^\top k\_{q} )}{ Z\_q}$$ - The denominator $Z\_q$ is treated as constant and estimated via Monte Carlo approximation $$Z\_q = \sum\_{i \in N}\exp(q^\top k\_{i}) \simeq \frac{N}{K}\sum\_{j \in K}\exp(q^\top k\_{j})$$ where $K$ is the number Monte Carlo samples ] .citation[Z. Wu et al., Unsupervised Feature Learning via Non-Parametric Instance Discrimination, CVPR 2018
M. Gutmann and A. Hyvarinen, Noise-contrastive estimation: A new estimation principle for unnormalized statistical models, AISTATS 2010] --- count: false class: middle ## Noise-Contrastive Estimation (NCE) - Computing the non-parametric softmax is expensive when the number of classes $N$ is very large, e.g. millions - Popular solutions: NCE, hierarchical softmax - .bold[NCE:] casts the multi-class classification into a set of binary classification problems, each discriminating between __data samples (positives)__ and __noise samples (negatives)__. $$\mathcal{L}\_{\text{NCE}}(q) = - \log \frac{\exp(q^\top k\_{q})}{ Z\_q}$$ - The denominator $Z\_q$ is treated as a constant and estimated via Monte Carlo approximation $$Z\_q = \sum\_{i \in N}\exp(q^\top k\_{i}) \simeq \frac{N}{K}\sum\_{j \in K}\exp(q^\top k\_{j})$$ where $K$ is the number of Monte Carlo samples .citation[Z. Wu et al., Unsupervised Feature Learning via Non-Parametric Instance Discrimination, CVPR 2018
M. Gutmann and A. Hyvarinen, Noise-contrastive estimation: A new estimation principle for unnormalized statistical models, AISTATS 2010] --- class: middle ## The InfoNCE loss .citet[van der Oord et al. (2018)] proposed a loss based on NCE called __InfoNCE__: $$\mathcal{L}\_{\text{InfoNCE}}(q) = - \log \frac{\exp(q^\top k\_{q})}{\sum\_{i \in K}\exp(q^\top k\_{i})}$$ .hidden[ - Connection to _mutual information maximization_ between the variable $q$ and $k\_{q}$: $$I(q;k\_{q}) \geq \log{K} - \mathcal{L}\_{\text{InfoNCE}}(q) $$ - Minimizing the objective $\mathcal{L}$, maximizes the lower bound on the mutual information $I(q;k\_{q})$ - The $\log K$ term suggests that using more negative samples can lead to an improved representation ] .citation[A. van der Oord et al., Representation learning with contrastive predictive coding, ArXiv 2018
M. Gutmann and A. Hyvarinen, Noise-contrastive estimation: A new estimation principle for unnormalized statistical models, AISTATS 2010] --- count: false class: middle ## The InfoNCE loss .citet[van der Oord et al. (2018)] proposed a loss based on NCE called __InfoNCE__: $$\mathcal{L}\_{\text{InfoNCE}}(q) = - \log \frac{\exp(q^\top k\_{q})}{\sum\_{i \in K}\exp(q^\top k\_{i})}$$ - Connection to _mutual information maximization_ between the variable $q$ and $k\_{q}$: $$I(q;k\_{q}) \geq \log{K} - \mathcal{L}\_{\text{InfoNCE}}(q) $$ - Minimizing the objective $\mathcal{L}$, maximizes the lower bound on the mutual information $I(q;k\_{q})$ - The $\log K$ term suggests that using more negative samples can lead to an improved representation .citation[A. van der Oord et al., Representation learning with contrastive predictive coding, ArXiv 2018
M. Gutmann and A. Hyvarinen, Noise-contrastive estimation: A new estimation principle for unnormalized statistical models, AISTATS 2010] --- class: middle # Practical considerations ## Noise Contrastive Estimation - The Metric Learning perspective --- class: middle .center.width-65[![](images/n-pair-loss.png)] .caption[Deep metric learning with (left) triplet loss and (right) $(N+1)$-tuplet loss.
$(N+1)$-tuplet loss pushes $N-1$ negative examples all at once.] Similar findings have been reached when studying this problem from a *metric learning perspective* by .citet[Sohn (2016)]. Their method, which generalizes the *triplet loss* with an **N-pair loss**, exploits multiple negatives for each query and has been shown to improve convergence speed and performance. .citation[K. Sohn, Improved Deep Metric Learning with Multi-class N-pair Loss Objective, NeurIPS 2016] ??? Embedding vectors $f$ of deep networks are trained to satisfy the constraints of each loss. Triplet loss pulls the positive example while pushing one negative example at a time. On the other hand, $(N+1)$-tuplet loss pushes $N-1$ negative examples all at once, based on their similarity to the input example. --- class: middle - Contrastive loss (pair-wise) .cites[[Chopra et al. (2005), Hadsell et al. (2006)]] $$\mathcal{L}\_{\text{contrastive}}(q, k\_i) = \boldsymbol{1}\\{y\_q = y\_i\\} \\| q - k\_i\\|^{2}\_{2} +\boldsymbol{1}\\{y\_q \neq y\_i\\} \max(0, m - \\| q - k\_i\\|^{2}\_{2} )$$ where $m$ is the margin parameter - Triplet loss .cites[[M. Schultz and T. Joachims (2004); K. Weinberger et al. (2006); Schroff et al. (2015)]] $$\mathcal{L}\_{\text{triplet}}(q, k\_{q}, k\_{i}) = \max(0, \\| q - k\_{q}\\|^{2}\_{2} - \\| q - k\_{i}\\|^{2}\_{2} + m )$$ .hidden[ - N-pair loss .cites[[Sohn (2016)]] $$\mathcal{L}\_{\text{N-pair}}(q, k\_{q}, \\{k\_i\\}\_{i \in N}) = \log ( 1 +\sum\_{i=1}^{N-1}\exp(q^\top k\_{i} - q^\top k\_{q}))$$ ] .hidden[ - InfoNCE loss .cites[[van der Oord (2018)]] $$\mathcal{L}\_{\text{InfoNCE}}(q) = - \log \frac{\exp(q^\top k\_{q} )}{\sum\_{i \in K}\exp(q^\top k\_{i} )}$$
] .citation[S. Chopra et al., Learning a similarity metric discriminatively, with application to face verification, CVPR 2005; R. Hadsell et al., Dimensionality reduction by learning an invariant mapping, CVPR 2006; K. Weinberger et al., Distance metric learning for large margin nearest neighbor classification, NeurIPS 2006; M. Schultz and T. Joachims, Learning a distance metric from relative comparisons, NeurIPS 2004; F. Schroff et al., FaceNet: A unified embedding for face recognition and clustering, CVPR 2015; K. Sohn, Improved deep metric learning with multi-class N-pair loss objective, NeurIPS 2016 ] --- count: false class: middle - Contrastive loss (pair-wise) .cites[[Chopra et al. (2005), Hadsell et al. (2006)]] $$\mathcal{L}\_{\text{contrastive}}(q, k\_i) = \boldsymbol{1}\\{y\_q = y\_i\\} \\| q - k\_i\\|^{2}\_{2} +\boldsymbol{1}\\{y\_q \neq y\_i\\} \max(0, m - \\| q - k\_i\\|^{2}\_{2} )$$ where $m$ is the margin parameter - Triplet loss .cites[[M. Schultz and T. Joachims (2004); K. Weinberger et al. (2006); Schroff et al. (2015)]] $$\mathcal{L}\_{\text{triplet}}(q, k\_{q}, k\_{i}) = \max(0, \\| q - k\_{q}\\|^{2}\_{2} - \\| q - k\_{i}\\|^{2}\_{2} + m )$$ - N-pair loss .cites[[Sohn (2016)]] $$\mathcal{L}\_{\text{N-pair}}(q, k\_{q}, \\{k\_i\\}\_{i \in N}) = \log ( 1 +\sum\_{i=1}^{N-1}\exp(q^\top k\_{i} - q^\top k\_{q}))$$ .hidden[ - InfoNCE loss .cites[[van der Oord (2018)]] $$\mathcal{L}\_{\text{InfoNCE}}(q) = - \log \frac{\exp(q^\top k\_{q} )}{\sum\_{i \in K}\exp(q^\top k\_{i} )}$$ ]
.citation[S. Chopra et al., Learning a similarity metric discriminatively, with application to face verification, CVPR 2005; R. Hadsell et al., Dimensionality reduction by learning an invariant mapping, CVPR 2006; K. Weinberger et al., Distance metric learning for large margin nearest neighbor classification, NeurIPS 2006; M. Schultz and T. Joachims, Learning a distance metric from relative comparisons, NeurIPS 2004; F. Schroff et al., FaceNet: A unified embedding for face recognition and clustering, CVPR 2015; K. Sohn, Improved deep metric learning with multi-class N-pair loss objective, NeurIPS 2016 ] --- count: false class: middle - Contrastive loss (pair-wise) .cites[[Chopra et al. (2005), Hadsell et al. (2006)]] $$\mathcal{L}\_{\text{contrastive}}(q, k\_i) = \boldsymbol{1}\\{y\_q = y\_i\\} \\| q - k\_i\\|^{2}\_{2} +\boldsymbol{1}\\{y\_q \neq y\_i\\} \max(0, m - \\| q - k\_i\\|^{2}\_{2} )$$ where $m$ is the margin parameter - Triplet loss .cites[[M. Schultz and T. Joachims (2004); K. Weinberger et al. (2006); Schroff et al. (2015)]] $$\mathcal{L}\_{\text{triplet}}(q, k\_{q}, k\_{i}) = \max(0, \\| q - k\_{q}\\|^{2}\_{2} - \\| q - k\_{i}\\|^{2}\_{2} + m )$$ - N-pair loss .cites[[Sohn (2016)]] $$\mathcal{L}\_{\text{N-pair}}(q, k\_{q}, \\{k\_i\\}\_{i \in N}) = \log ( 1 +\sum\_{i=1}^{N-1}\exp(q^\top k\_{i} - q^\top k\_{q}))$$ - InfoNCE loss .cites[[van der Oord (2018)]] $$\mathcal{L}\_{\text{InfoNCE}}(q) = - \log \frac{\exp(q^\top k\_{q} )}{\sum\_{i \in K}\exp(q^\top k\_{i} )}$$
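As a concrete illustration, a minimal PyTorch-style sketch of the InfoNCE loss above (written for this deck, not code from the cited papers; the function name and toy shapes are ours): the positive key competes against a set of $K$ negative keys inside a cross-entropy.

```py
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, k_neg):
    # q: (B, D) queries, k_pos: (B, D) positive keys, k_neg: (K, D) negative keys
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)        # (B, 1)   q^T k_q
    l_neg = q @ k_neg.t()                                # (B, K)   q^T k_i for the negatives
    logits = torch.cat([l_pos, l_neg], dim=1)            # (B, 1+K)
    labels = torch.zeros(q.size(0), dtype=torch.long)    # the positive sits at index 0
    return F.cross_entropy(logits, labels)               # = -log softmax score of the positive

# toy check with random embeddings
q, k_pos, k_neg = torch.randn(8, 128), torch.randn(8, 128), torch.randn(1024, 128)
loss = info_nce(q, k_pos, k_neg)
```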
.citation[S. Chopra et al., Learning a similarity metric discriminatively, with application to face verification, CVPR 2005; R. Hadsell et al., Dimensionality reduction by learning an invariant mapping, CVPR 2006; K. Weinberger et al., Distance metric learning for large margin nearest neighbor classification, NeurIPS 2006; M. Schultz and T. Joachims, Learning a distance metric from relative comparisons, NeurIPS 2004; F. Schroff et al., FaceNet: A unified embedding for face recognition and clustering, CVPR 2015; K. Sohn, Improved deep metric learning with multi-class N-pair loss objective, NeurIPS 2016 ] --- class: middle .bigger[A key component in metric learning is related to negatives, in particular _hard negatives_.] .hidden.bigger[We have seen in InfoNCE that more negatives lead to improved representations.] .hidden.bigger[Bridging these two views can lead to a better understanding of contrastive learning and to transferring know-how from metric learning.] --- count: false class: middle .bigger[A key component in metric learning is related to negatives, in particular _hard negatives_.] .bigger[We have seen in InfoNCE that more negatives lead to improved representations.] .hidden.bigger[Bridging these two views can lead to a better understanding of contrastive learning and to transferring know-how from metric learning.] --- count: false class: middle .bigger[A key component in metric learning is related to negatives, in particular _hard negatives_.] .bigger[We have seen in InfoNCE that more negatives lead to improved representations.] .bigger[Bridging these two views can lead to a better understanding of contrastive learning and to transferring know-how from metric learning.] --- class: middle .bigger[In the recent literature, InfoNCE is the most popular and, so far, the best-performing loss for contrastive self-supervised learning.] .bigger[We will outline a few strategies to make it work effectively. ] --- class: middle # Practical considerations ## Dealing with negative samples --- class: middle .center.bigger[Multiple works have observed the influence of negative samples and proposed different heuristics to increase their number.] .grid[ .kol-4-12[ .caption[CMC .cites[[Tian et al. (2019)]]] .center.width-105[![](images/cmc-negatives.png)] ] .kol-4-12[ .caption[MoCo .cites[[He et al. (2020)]]] .center.width-105[![](images/moco-negatives.png)] ] .kol-4-12[ .caption[PiRL .cites[[Misra and van der Maaten (2020)]]] .center.width-105[![](images/pirl-negatives.png)] ] ] .citation[Y. Tian et al., Contrastive Multiview Coding, ArXiv 2019
K. He et al., Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020
I. Misra and L. van der Maaten, Self-Supervised Learning of Pretext-Invariant Representations, CVPR 2020] --- class: middle .center[## The InfoNCE loss might not work at all below a certain number of negatives.] .center.width-50[![](images/nce-loss-1.png)] .caption[Classification accuracies for DeepInfoMax on CIFAR10. Accuracies are averaged over the last 100 epochs for the _InfoNCE_, _JSD_, and _DV DIM_ losses. The x-axis is the log base-2 of the number of negative samples (0 means one negative sample per positive sample).] .citation[R.D. Hjelm et al., Learning deep representations by mutual information estimation and maximization, ICLR 2019] --- # End-to-End training .grid[ .kol-4-12[ .center.width-115[![](images/pairwise-1.png)] ] .kol-1-12[] .kol-7-12[
- Shared encoder for queries and keys ] ] .citation[R. Hadsell et al., Dimensionality reduction by learning an invariant mapping, CVPR 2006; F. Schroff et al., FaceNet: A unified embedding for face recognition and clustering, CVPR 2015] --- count: false # End-to-End training .grid[ .kol-4-12[ .center.width-115[![](images/pairwise-2.png)] ] .kol-1-12[] .kol-7-12[
- Shared encoder for queries and keys - The encoder is updated by backpropagation through all samples ] ] .citation[R. Hadsell et al., Dimensionality reduction by learning an invariant mapping, CVPR 2006; F. Schroff et al., FaceNet: A unified embedding for face recognition and clustering, CVPR 2015] --- count: false # End-to-End training .grid[ .kol-4-12[ .center.width-115[![](images/pairwise-2.png)] ] .kol-1-12[] .kol-7-12[
- Shared encoder for queries and keys - The encoder is updated by backpropagation through all samples .bigger.green[Pros] - Consistent $q$ and $\\{k\_{i}\\}$ ] ] .citation[R. Hadsell et al., Dimensionality reduction by learning an invariant mapping, CVPR 2006; F. Schroff et al., FaceNet: A unified embedding for face recognition and clustering, CVPR 2015] --- count: false # End-to-End training .grid[ .kol-4-12[ .center.width-115[![](images/pairwise-2.png)] ] .kol-1-12[] .kol-7-12[
- Shared encoder for queries and keys - The encoder is updated by backpropagation through all samples .bigger.green[Pros] - Consistent $q$ and $\\{k\_{i}\\}$ .bigger.red[Cons] - The number of negatives is limited by GPU memory ] ] .citation[R. Hadsell et al., Dimensionality reduction by learning an invariant mapping, CVPR 2006; F. Schroff et al., FaceNet: A unified embedding for face recognition and clustering, CVPR 2015] --- # End-to-End training .grid[ .kol-4-12[ .center.width-115[![](images/pairwise-3.png)] ] .kol-1-12[] .kol-7-12[
- Rely on large mini-batches ($2k-8k$ samples) to ensure plenty of negatives, e.g. $16382$ negatives from a batch of $8192$, i.e. $2(N-1)$ negatives - The loss is computed across all positive pairs in a mini-batch ] ] .citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020 ] ??? - _SimCLR_ .cites[[Chen et al. (2010)]] also takes an end-to-end approach --- count: false # End-to-End training .grid[ .kol-4-12[ .center.width-115[![](images/pairwise-3.png)] ] .kol-1-12[] .kol-7-12[
- Rely on large mini-batches ($2k-8k$ samples) to ensure plenty of negatives, e.g. $16382$ negatives from a batch of $8192$, i.e. $2(N-1)$ negatives - The loss is computed across all positive pairs in a mini-batch .bigger.green[Pros] - Lots of negatives ] ] .citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020 ] --- count: false # End-to-End training .grid[ .kol-4-12[ .center.width-115[![](images/pairwise-3.png)] ] .kol-1-12[] .kol-7-12[
- Rely on large mini-batches ($2k-8k$ samples) to ensure plenty of negatives, e.g. $16382$ negatives from a batch of $8192$, i.e. $2(N-1)$ negatives - The loss is computed across all positive pairs in a mini-batch .bigger.green[Pros] - Lots of negatives .bigger.red[Cons] - Lots of GPUs/TPUs ] ] .citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020 ] --- # Memory bank .grid[ .kol-4-12[ .center.width-115[![](images/memory-bank-1.png)] ] .kol-1-12[] .kol-7-12[
- Keys are randomly sampled from a _memory bank_ of cached features - The memory bank contains features of all images in the dataset - There is no backpropagation through the memory bank - Keys are updated via exponential moving average .hidden[ .bigger.green[Pros] - Many negatives ($K = 65536$) while GPU memory efficient ] ] ] .citation[Z. Wu et al., Unsupervised Feature Learning via Non-Parametric Instance Disc., CVPR 2018; I. Misra et al., Self-Sup. Learning of Pretext-Invariant Representations, CVPR 2020 ] --- count:false # Memory bank .grid[ .kol-4-12[ .center.width-115[![](images/memory-bank-1.png)] ] .kol-1-12[] .kol-7-12[
- Keys are randomly sampled from a _memory bank_ of cached features - The memory bank contains features of all images in the dataset - There is no backpropagation through the memory bank - Keys are updated via exponential moving average .bigger.green[Pros] - Many negatives ($K = 65536$) while GPU memory efficient ] ] .citation[Z. Wu et al., Unsupervised Feature Learning via Non-Parametric Instance Disc., CVPR 2018; I. Misra et al., Self-Sup. Learning of Pretext-Invariant Representations, CVPR 2020 ] --- count:false # Memory bank .grid[ .kol-4-12[ .center.width-115[![](images/memory-bank-1.png)] ] .kol-1-12[] .kol-7-12[
- Keys are randomly sampled from a _memory bank_ of cached features - The memory bank contains features or all images in the dataset - There is no backpropagation through the memory bank - Keys are updated via exponential moving average .bigger.green[Pros] - Many negatives ($K = 65536$) while GPU memory efficient .bigger.red[Cons] - Keys are .italic["older" ] than queries ] ] .citation[Z. Wu et al., Unsupervised Feature Learning via Non-Parametric Instance Disc., CVPR 2018; I. Misra et al., Self-Sup. Learning of Pretext-Invariant Representations, CVPR 2020 ] ??? - Sampled features correspond to encoders at different steps in the past epoch and are less consistent --- # Momentum encoder .grid[ .kol-4-12[ .center.width-115[![](images/moco-1.png)] ] .kol-1-12[] .kol-7-12[ - _Momentum encoder_ relies on a memory-bank with a different update scheme - The encoder $f\_{\psi}(\cdot)$ is updated instead of the keys themselves - New samples are continuously added to the memory bank - The momentum encoder $f\_{\psi}(\cdot)$ is slowly pursuing $f\_{\theta}(\cdot)$ via exponential moving average, i.e. momentum, update: $$\psi \leftarrow m \psi + (1-m)\theta $$ ] ] .citation[K. He et al., Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020] --- count: false # Momentum encoder .grid[ .kol-4-12[ .center.width-115[![](images/moco-1.png)] ] .kol-1-12[] .kol-7-12[ - _Momentum encoder_ relies on a memory-bank with a different update scheme - The encoder $f\_{\psi}(\cdot)$ is updated instead of the keys themselves - New samples are continuously added to the memory bank - The momentum encoder $f\_{\psi}(\cdot)$ is slowly pursuing $f\_{\theta}(\cdot)$ via exponential moving average, i.e. momentum, update: $$\psi \leftarrow m \psi + (1-m)\theta $$ ```py q = f_q.forward(x_q) # queries: NxC k = f_k.forward(x_k) # keys: NxC k = k.detach() # no gradient to keys ... # SGD update: query network loss.backward() update(f_q.params) # momentum update: key network f_k.params = m*f_k.params+(1-m)*f_q.params ``` ] ] .citation[K. He et al., Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020] --- count: false # Momentum encoder .grid[ .kol-4-12[ .center.width-115[![](images/moco-1.png)] ] .kol-1-12[] .kol-7-12[ - _Momentum encoder_ relies on a memory-bank with a different update scheme - The encoder $f\_{\psi}(\cdot)$ is updated instead of the keys themselves - New samples are continuously added to the memory bank - The momentum encoder $f\_{\psi}(\cdot)$ is slowly pursuing $f\_{\theta}(\cdot)$ via exponential moving average, i.e. momentum, update: $$\psi \leftarrow m \psi + (1-m)\theta $$ ```py q = f_q.forward(x_q) # queries: NxC k = f_k.forward(x_k) # keys: NxC k = k.detach() # no gradient to keys ... # SGD update: query network loss.backward() update(f_q.params) # momentum update: key network *f_k.params = m*f_k.params+(1-m)*f_q.params ``` ] ] .citation[K. He et al., Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020] --- count: false # Momentum encoder .grid[ .kol-4-12[ .center.width-115[![](images/moco-1.png)] ] .kol-1-12[] .kol-7-12[
.bigger.green[Pros] - Elegant and effective solution for large dictionaries .bigger.red[Cons] - Momentum requires tuning: $m \in [0.99, 0.9999]$ works well, while for $m \le 0.9$ accuracy drops considerably. - No backprop through the memory bank ] ] .citation[K. He et al., Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020] --- class: middle # Practical considerations ## Output projection --- .grid[ .kol-4-12[ .center.width-115[![](images/pairwise-3.png)] ] .kol-1-12[] .kol-7-12[
- Add a small neural network _projection head_ $g\_{\phi}(\cdot)$ right before contrastive loss ] ] .citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020 ] --- count: false .grid[ .kol-4-12[ .center.width-115[![](images/pairwise-4.png)] ] .kol-1-12[] .kol-7-12[
- Add a small neural network _projection head_ $g\_{\phi}(\cdot)$ right before contrastive loss - $g\_{\phi}(\cdot)$ can be an MLP with one hidden layer and ReLU - $g\_{\phi}(\cdot)$ is removed for the downstream task ] ] .citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020 ] --- # Output projection This cost-effective idea brings consistent gains across methods ($+6-10\\%$) .grid[ .kol-6-12[ .center.width-100[![](images/proj-head-1.png)] .caption[Linear evaluation of representations with different projection heads $g\_{\phi}(\cdot)$ and various output dimensions.] ] .kol-6-12[ ] ] .citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020
X. Chen et al., Improved Baselines with Momentum Contrastive Learning, ArXiv 2020] --- count: false # Output projection This cost-effective idea brings consistent gains across methods ($+6-10\\%$) .grid[ .kol-6-12[ .center.width-100[![](images/proj-head-1.png)] .caption[Linear evaluation of representations with different projection heads $g\_{\phi}(\cdot)$ and various output dimensions.] ] .kol-6-12[
.center.width-100[![](images/proj-head-3.png)] .caption[ImageNet linear classification accuracy with MoCo v2] ] ] .citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020
X. Chen et al., Improved Baselines with Momentum Contrastive Learning, ArXiv 2020] --- class: middle .bigger[Due to the loss of information induced by the contrastive loss, $g\_{\phi}(\cdot)$ can remove information that may be useful for the downstream task, e.g. the color or orientation of objects. ] .center.width-55[![](images/proj-head-2.png)] .caption[Accuracy of training additional MLPs on different representations to predict the transformation applied.] .citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020] --- class: middle # Practical considerations ## Hyperparameters --- class: middle .center.width-35[![](images/asterix-potion.png)] .center.bigger[Recent contrastive methods require getting a few key ingredients to work properly.] .credit[Source: https://favpng.com/] ---
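class: middle ## Key ingredients in practice Before detailing each ingredient, a minimal PyTorch-style sketch (written for this deck under our own naming, not code from the cited papers) of where $\ell\_2$-normalization and the temperature $\tau$ enter the InfoNCE computation with in-batch negatives:

```py
import torch
import torch.nn.functional as F

def scaled_info_nce(q, k, tau=0.1):
    # q, k: (B, D); k[i] is the positive key of q[i], the other keys act as negatives
    q = F.normalize(q, dim=1)          # l2-normalization: logits become cosine similarities
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau           # temperature controls how peaky the softmax is
    labels = torch.arange(q.size(0))   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```

---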
## Feature normalization - Without $\ell\_{2}$-normalization, the $\text{softmax}$ distribution can be made arbitrarily sharp by simply scaling all the features. --- count: false
## Feature normalization - Without $\ell\_{2}$-normalization, the $\text{softmax}$ distribution can be made arbitrarily sharp by simply scaling all the features. ## Temperature scaling - Mitigates peaky predictions from $\text{softmax}$ by increasing entropy
$$\mathcal{L}\_{\text{InfoNCE}}(q) = - \log \frac{\exp(q^\top k\_{q} )}{\sum\_{i \in K}\exp(q^\top k\_{i} )} \longrightarrow \mathcal{L}\_{\text{InfoNCE}}(q) = - \log \frac{\exp(q^\top k\_{q} / \tau )}{\sum\_{i \in K}\exp(q^\top k\_{i} /\tau )} $$ .citation[G. Hinton et al., Distilling the knowledge in a neural network, ArXiv 2015
T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020] ??? - The temperature parameter $\tau$ controls the concentration level of the $\text{softmax}$ distribution --- count: false
## Feature normalization - Without $\ell\_{2}$-normalization, the $\text{softmax}$ distribution can be made arbitrarily sharp by simply scaling all the features. ## Temperature scaling - Mitigates peaky predictions from $\text{softmax}$ by increasing entropy .center.width-45[![](images/impact-l2norm.png)] .caption[Linear evaluation for models trained with different choices of
$\ell\_2$-normalization and temperature $\tau$.] .citation[G. Hinton et al., Distilling the knowledge in a neural network, ArXiv 2015
T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020] ??? - The temperature parameter $\tau$ controls the concentration level of the $\text{softmax}$ distribution --- count: false
## Feature normalization - Without $\ell\_{2}$-normalization, the $\text{softmax}$ distribution can be made arbitrarily sharp by simply scaling all the features. ## Temperature scaling - Mitigates peaky predictions from $\text{softmax}$ by increasing entropy ## Downstream hyperparameters - Use hyperparameters specific to the self-supervised method, e.g. a larger learning rate, trainable BN layers .citation[G. Hinton et al., Distilling the knowledge in a neural network, ArXiv 2015
T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020
Y. Tian et al., Contrastive Multiview Coding, ArXiv 2019
K. He et al., Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020] --- class: middle # Practical considerations ## Data augmentation --- class: middle .center.bigger[Data augmentation is known to be useful for recent self-supervised works.
The rule of thumb was usually _.italic[whatever works well for supervised learning]_. ] --- class: middle .bigger[Impact of data augmentation tuning on self-supervised methods has been thoroughly studied only recently by .citet[Chen et al. (2020)]] .center.width-70[![](images/data-aug-1.png)] .caption[Typical data augmentation operators used for visual representation learning] .citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020] --- class: middle .grid[ .kol-6-12[ .center.width-100[![](images/data-aug-2.png)] .caption[Linear evaluation (ImageNet top-1 accuracy) under individual or composition of data augmentations. ] ] .kol-6-12[ ] ] .citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020] ??? Diagonal entries correspond to single transformation, and off-diagonals correspond to composition of two transformations (applied sequentially). --- class: middle count: false .grid[ .kol-6-12[ .center.width-100[![](images/data-aug-2.png)] .caption[Linear evaluation (ImageNet top-1 accuracy) under individual or composition of data augmentations. ] ] .kol-6-12[
- Composing augmentations makes the task harder and improves the quality of the representation - Contrastive learning needs stronger data augmentation than supervised learning - A well-tuned augmentation strategy can lead to improvements of $3-4\\%$ alone ] ] .citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020] ??? - Diagonal entries correspond to a single transformation, and off-diagonals correspond to the composition of two transformations (applied sequentially). --- class: middle, center # 4. Evaluating Self-Supervised methods --- class: middle # Evaluating Self-Supervised methods ## Tasks and protocols --- class: middle .bigger[Self-supervised methods are evaluated on a battery of datasets and tasks .cites[[Goyal et al. (2019),
Zhai et al. (2019)]]] .bigger[In most benchmarks, the model is *pre-trained on ImageNet* on a pretext task and *subsequently fine-tuned on other datasets* or protocols.] .citation[P. Goyal et al., Scaling and Benchmarking Self-Supervised Visual Representation Learning, ICCV 2019
X. Zhai et al., A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark, ArXiv 2019] --- class: middle # Evaluation tasks
## Linear classification
## Few-shot learning
## Efficient learning
## Transfer learning --- # Linear classification .center.width-70[![](images/self-sup-pipeline-step2-1-crop.png)] - Simplest evaluation of the utility of learned representations: fit a linear classifier (FC layer, linear SVM) - .bold[Typical datasets:] + ImageNet + Places205 + Pascal VOC07 (image classification) + COCO14 (image classification) .citation[P. Goyal et al., Scaling and Benchmarking Self-Supervised Visual Representation Learning, ICCV 2019] --- # Linear classification .grid[ .kol-6-12[ .center.width-100[![](images/imagenet-classif-1.png)] .caption[ImageNet Top-1 accuracy of linear classifiers trained on representations learned with different self-supervised methods (pretrained on ImageNet). ] ] .kol-6-12[
- The performance on this benchmark has accelerated strongly in the past year closing the gap w.r.t. supervised methods ] ] .citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020 ] --- count: false # Linear classification .grid[ .kol-6-12[ .center.width-100[![](images/imagenet-classif-2.png)] .caption[ImageNet Top-1 accuracy of linear classifiers trained on representations learned with different self-supervised methods (pretrained on ImageNet). ] ] .kol-6-12[
- The performance on this benchmark has accelerated strongly in the past year closing the gap w.r.t. supervised methods - The main contributors: - _contrastive learning with more negatives_ - _output projection head_ - _better designed and stronger data augmentation_ - _longer training_ ] ] .citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020 ] --- class: middle ## .center[Linear classification on other datasets] .grid[ .kol-6-12[ .center.width-90[![](images/linear-classif-bench.png)] ] .kol-6-12[
.caption[Image classification performance on ImageNet, VOC07, Places205, and iNaturalist datasets. Linear classifiers are trained on image representations obtained by self-supervised learners that were pre-trained on ImageNet.] ] ] .citation[I. Misra and L. van der Maaten, Self-Supervised Learning of Pretext-Invariant Representations, CVPR 2020 ] ??? - The ranking of the methods usually remains unchanged for contrastive methods --- # Few-shot learning .center.width-70[![](images/self-sup-pipeline-step2-1-crop.png)] - Very similar goals of few-shot and self-supervised learning $\longrightarrow$ *.italic[Let's use same evaluation settings!]* - Thousands of small _train_-_test_ splits are proposed, leading to robust evaluation statistics. - .bold[Datasets:] miniImageNet .citation[S. Gidaris et al., Boosting Few-Shot Learning with Self-Supervision, ICCV 2019] ??? - The utility of learned representations can be also evaluated using the few-shot learning evaluation protocol - This comparison method could be more meaningful for assessing different self-supervised features --- # Few-shot learning .grid[ .kol-6-12[
.center.width-100[![](images/few-shot-bench-2.png)] .caption[Average 5-way classification accuracies on the test set of MiniImageNet.] ] .kol-6-12[
- Self-supervision reaches relatively competitive performance against the supervised .italic[Cosine Classifier] ] ] .citation[S. Gidaris et al., Learning Representations by Predicting Bags of Visual Words, CVPR 2020] --- # Efficient classification .center.width-70[![](images/self-sup-pipeline-step2-2-crop.png)] - Fine-tune pre-trained network on a subset of labels $1\\%-100\\%$ - .bold[Datasets:] ImageNet, VTAB .citation[X. Zhai et al., S4L: Self-Supervised Semi-Supervised Learning, ICCV 2019
X. Zhai et al., A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark, ArXiv 2019 ] ??? - ImageNet is still the most popular choice, though new datasets are being proposed, e.g. the VTAB benchmark --- # Efficient classification .grid[ .kol-6-12[ .center.width-90[![](images/few-labels-bench.png)] .caption[ImageNet accuracy of models trained with few labels] ] .kol-6-12[
1. Sample $1\\%$ or $10\\%$ of ImageNet images in a class-balanced way (i.e. $12.8 / 128$ images per class) 2. Fine-tune the whole base network on the labeled data ] ] .citation[X. Zhai et al., S4L: Self-Supervised Semi-Supervised Learning, ICCV 2019
T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020 ] ??? - Recent self-supervised methods outperform the supervised baseline --- # Efficient classification .grid[ .kol-6-12[ .center.width-100[![](images/few-labels-bench-2.png)] .caption[ImageNet accuracy of models trained with few labels: CPCv2 vs. supervised] ] .kol-6-12[
- Supervised networks do not generalize well from few labeled data - Self-supervised networks reach significantly better accuracy in the low data regime ] ] .citation[O. Henaff et al., Data-Efficient Image Recognition with Contrastive Predictive Coding, ArXiv 2019 ] --- # Transfer learning .center.width-70[![](images/self-sup-pipeline-step2-3-crop.png)] - The pre-trained model is augmented with task specific modules (e.g. decoders for semantic segmentation, RPN for object detection) and fine-tuned partially or completely - .bold[Tasks and datasets:] + Object detection: VOC07, VOC12, COCO14 + Semantic segmentation: VOC07, Cityscapes + Surface Normal Estimation: NYUv2 + Visual Navigation: Gibson .citation[ P. Goyal et al., Scaling and Benchmarking Self-Supervised Visual Representation Learning, ICCV 2019 ] --- # Transfer learning - object detection
.center.width-50[![](images/voc-bench.png)] .caption[Object detection with Faster R-CNN fine-tuned on VOC $\texttt{trainval07+12}$ and evaluated on $\texttt{test07}$.
Networks are pre-trained with self-supervision on ImageNet.] .citation[S. Gidaris et al., Learning Representations by Predicting Bags of Visual Words, CVPR 2020] --- class: middle # Evaluating Self-Supervised methods ## Ablation studies --- class: middle .bigger[In the last year, multiple works have thoroughly evaluated self-supervised methods at scale on multiple downstream tasks and different pre-training datasets.] .bigger[We will cover only a fraction of their findings.] .citation[A. Kolesnikov et al., Revisiting Self-Supervised Visual Representation Learning, CVPR 2019
P. Goyal et al., Scaling and Benchmarking Self-Supervised Visual Representation Learning, ICCV 2019
X. Zhai et al., A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark, ArXiv 2019
A. Newell and J. Deng, How Useful is Self-Supervised Pretraining for Visual Tasks?, CVPR 2020
B. Wallace and B. Hariharan, Extending and Analyzing Self-Supervised Learning Across Domains, ArXiv 2020] --- count: false class: middle ## Key findings - .bigger[Large impact of CNN architecture .cites[[Kolesnikov et al. (2019)]]] .hidden[ - .bigger[Increasing the size of pre-training dataset benefits higher capacity networks .cites[[Goyal et al. (2019)]]] ] .hidden[ - .bigger[Varying benefits of increasing the pretext task difficulty .cites[[Goyal et al. (2019), Chen et al. (2020)]]] ] .hidden[ - .bigger[Currently greatest benefits are in low annotated data regimes .cites[[Henaff et al. (2019); Newell and Deng (2020)]]] ] .hidden[ - .bigger[In stage 2, linear classification achieves lower performance than finetuning .cites[[Zhai et al. (2019); Newell and Deng (2020)]]] ] ??? - e.g. for _Rotation_, RevNet50 $\gg$ ResNet50 --- count: false class: middle ## Key findings - .bigger[Large impact of CNN architecture .cites[[Kolesnikov et al. (2019)]]] - .bigger[Increasing the size of pre-training dataset benefits higher capacity networks .cites[[Goyal et al. (2019)]]] .hidden[ - .bigger[Varying benefits of increasing the pretext task difficulty .cites[[Goyal et al. (2019), Chen et al. (2020)]]] ] .hidden[ - .bigger[Currently greatest benefits are in low annotated data regimes .cites[[Henaff et al. (2019); Newell and Deng (2020)]]] ] .hidden[ - .bigger[In stage 2, linear classification achieves lower performance than finetuning .cites[[Zhai et al. (2019); Newell and Deng (2020)]]] ] ??? - ResNet50 $\gt$ AlexNet - _Jigsaw_ performs better than _Colorization_. - The performance of the _Jigsaw_ model saturates (log- linearly) as we increase the data scale from 1M to 100M. --- count: false class: middle ## Key findings - .bigger[Large impact of CNN architecture .cites[[Kolesnikov et al. (2019)]]] - .bigger[Increasing the size of pre-training dataset benefits higher capacity networks .cites[[Goyal et al. (2019)]]] - .bigger[Varying benefits of increasing the pretext task difficulty .cites[[Goyal et al. (2019), Chen et al. (2020)]]] .hidden[ - .bigger[Currently greatest benefits are in low annotated data regimes .cites[[Henaff et al. (2019); Newell and Deng (2020)]]] ] .hidden[ - .bigger[In stage 2, linear classification achieves lower performance than finetuning .cites[[Zhai et al. (2019); Newell and Deng (2020)]]] ] ??? - for _Jigsaw_ we see an improvement in transfer learning performance with increased complexity - _Colorization_ appears to be less sensitive to changes in problem complexity --- count: false class: middle ## Key findings - .bigger[Large impact of CNN architecture .cites[[Kolesnikov et al. (2019)]]] - .bigger[Increasing the size of pre-training dataset benefits higher capacity networks .cites[[Goyal et al. (2019)]]] - .bigger[Varying benefits of increasing the pretext task difficulty .cites[[Goyal et al. (2019), Chen et al. (2020)]]] - .bigger[Currently greatest benefits are in low annotated data regimes .cites[[Henaff et al. (2019); Newell and Deng (2020)]]] .hidden[ - .bigger[In stage 2, linear classification achieves lower performance than finetuning .cites[[Zhai et al. (2019); Newell and Deng (2020)]]] ] --- count: false class: middle ## Key findings - .bigger[Large impact of CNN architecture .cites[[Kolesnikov et al. (2019)]]] - .bigger[Increasing the size of pre-training dataset benefits higher capacity networks .cites[[Goyal et al. (2019)]]] - .bigger[Varying benefits of increasing the pretext task difficulty .cites[[Goyal et al. (2019), Chen et al. 
(2020)]]] - .bigger[Currently greatest benefits are in low annotated data regimes .cites[[Henaff et al. (2019); Newell and Deng (2020)]]] - .bigger[In stage 2, linear classification achieves lower performance than finetuning .cites[[Zhai et al. (2019); Newell and Deng (2020)]]] ??? - Linear classification achieves lower performance than finetuning - Overall linear evaluation is a poor proxy for overall reduced sample complexity --- class: middle ## .center[Representation learning stress test ] .center.width-70[![](images/vtab-specific.png)] .caption[Average top-1 accuracy across the $19$ tasks in VTAB. On the x-axis the methods are ordered according to their average accuracy across all tasks.] - Self-supervised outperforms .italic[from-scratch] training. - With self-supervision, strong benefits can be obtained with fewer labels, e.g. _Sup-Rot-10%_, or all labels, e.g. _Sup-Rot-100%_, _Sup-Exemplar-100%_ .citation[X. Zhai et al., A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark, ArXiv 2019] --- count: false class: middle ## .center[Representation learning stress test ] .center.width-70[![](images/vtab-specific.png)] .caption[Average top-1 accuracy across the $19$ tasks in VTAB. On the x-axis the methods are ordered according to their average accuracy across all tasks.] - Self-supervised outperforms .italic[from-scratch] training. - .highlight[With self-supervision, strong benefits can be obtained with fewer labels, e.g. _Sup-Rot-10%_, or all labels, e.g. _Sup-Rot-100%_, _Sup-Exemplar-100%_] .citation[X. Zhai et al., A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark, ArXiv 2019] --- class: middle, center # 5. Applications of Self-Supervision --- class: middle .bigger[Self-supervision has been successfully leveraged for several areas:] .hidden.bigger[ - Apply on downstream task with few(er) or no labels .cites[[Dwibedi et al. (2019); Wang et al. (2019); Henaff et al. (2019); Arandjelovic and Zisserman (2017)]] ] .hidden.bigger[ - Boosting performance through _an extra prediction head_ in addition to the main task: semi-supervised learning, domain generalization, etc. ] .hidden.citation[D. Dwibedi et al., Temporal Cycle-Consistency Learning, CVPR 2019
X. Wang et al., Learning Correspondence from the Cycle-Consistency of Time, CVPR 2019
O. Henaff et al., Data-Efficient Image Recognition with Contrastive Predictive Coding, ArXiv 2019
R. Arandjelovic and A. Zisserman, Look, Listen and Learn , ICCV 2017] --- count: false class: middle .bigger[Self-supervision has been successfully leveraged for several areas:] .bigger[ - Apply on downstream task with few(er) or no labels .cites[[Dwibedi et al. (2019); Wang et al. (2019); Henaff et al. (2019); Arandjelovic and Zisserman (2017)]] ] .hidden.bigger[ - Boosting performance through _an extra prediction head_ in addition to the main task: semi-supervised learning, domain generalization, etc. ] .citation[D. Dwibedi et al., Temporal Cycle-Consistency Learning, CVPR 2019
X. Wang et al., Learning Correspondence from the Cycle-Consistency of Time, CVPR 2019
O. Henaff et al., Data-Efficient Image Recognition with Contrastive Predictive Coding, ArXiv 2019
R. Arandjelovic and A. Zisserman, Look, Listen and Learn , ICCV 2017] --- count: false class: middle .bigger[Self-supervision has been successfully leveraged for several areas:] .bigger[ - Apply on downstream task with few(er) or no labels .cites[[Dwibedi et al. (2019); Wang et al. (2019); Henaff et al. (2019); Arandjelovic and Zisserman (2017)]] ] .bigger[ - Boosting performance through _an extra prediction head_ in addition to the main task: semi-supervised learning, domain generalization, etc. ] .citation[D. Dwibedi et al., Temporal Cycle-Consistency Learning, CVPR 2019
X. Wang et al., Learning Correspondence from the Cycle-Consistency of Time, CVPR 2019
O. Henaff et al., Data-Efficient Image Recognition with Contrastive Predictive Coding, ArXiv 2019
R. Arandjelovic and A. Zisserman, Look, Listen and Learn , ICCV 2017] --- count: false class: middle .inactive.bigger[Self-supervision has been successfully leveraged for several areas:] .inactive.bigger[ - Apply on downstream task with few(er) or no labels .cites[.inactive[[Dwibedi et al. (2019); Wang et al. (2019); Henaff et al. (2019); Arandjelovic and Zisserman (2017)]]] ] .bigger[ - Boosting performance through _an extra prediction head_ in addition to the main task: semi-supervised learning, domain generalization, etc. ] .citation[D. Dwibedi et al., Temporal Cycle-Consistency Learning, CVPR 2019
X. Wang et al., Learning Correspondence from the Cycle-Consistency of Time, CVPR 2019
O. Henaff et al., Data-Efficient Image Recognition with Contrastive Predictive Coding, ArXiv 2019
R. Arandjelovic and A. Zisserman, Look, Listen and Learn , ICCV 2017] --- class: middle ## .center[Bridging few-shot and self-supervised learning] .center.width-75[![](images/gidaris-boosting.png)] - Both paradigms aim to learn from few or no labeled images and consist of two stages - .bold[Self-supervision as auxiliary objective] at 1st learning stage of few-shot model: - More diverse features and better ability to adapt to novel classes with few data - Explored self-supervised methods: _Rotation_ and _Relative Patch Location_ .citation[S. Gidaris et al., Boosting Few-Shot Learning with Self-Supervision, ICCV 2019; J.C. Su et al., When Does Self-supervision Improve Few-shot Learning?, ArXiv 2019] --- class: middle ## .center[Towards better robustness and uncertainty] .grid[ .kol-4-12[
.center.width-80[![](images/hendrycks-4.png)] .caption[Accuracy under adversarial attacks of varying strengths $\varepsilon$] ] .kol-8-12[ .center.width-85[![](images/hendrycks-2.png)] .caption[Accuracy of standard training vs. training with auxiliary rotation self-supervision on the 19 CIFAR-10-C corruptions.] ] ] .hidden[ .center.width-25[![](images/hendrycks-3.png)] .caption[Out-of-Distribution detection performance] ] .citation[D. Hendrycks et al., Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty, NeurIPS 2019; F. Ahmed & A. Courville, Detecting semantic anomalies, AAAI 2020] --- count: false class: middle ## .center[Towards better robustness and uncertainty] .grid[ .kol-4-12[<br/>
.center.width-80[![](images/hendrycks-4.png)] .caption[Accuracy under adversarial attacks of varying strengths $\varepsilon$] ] .kol-8-12[ .center.width-85[![](images/hendrycks-2.png)] .caption[Accuracy of standard training vs. training with auxiliary rotation self-supervision on the 19 CIFAR-10-C corruptions.] ] ] .center.width-25[![](images/hendrycks-3.png)] .caption[Out-of-Distribution detection performance] .citation[D. Hendrycks et al., Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty, NeurIPS 2019; F. Ahmed & A. Courville, Detecting semantic anomalies, AAAI 2020] --- class: middle ## .center[Domain generalization and adaptation] .center.width-90[![](images/jigen-1.png)] - Recognizing objects across visual domains requires high generalization abilities. - Signals from pretext tasks help capture natural invariances and regularities in data. .citation[F.M. Carlucci et al., Domain Generalization by Solving Jigsaw Puzzles, CVPR 2019] ??? - Signals from self-supervised pretext tasks help capture natural invariances and regularities in data that could mitigate large style gaps. --- class: middle .center.width-60[![](images/jigen-2.png)] .caption[CAM activation maps: yellow corresponds to high values, while dark blue corresponds to low values.] The self-supervised pretext task enables localization of the most informative part of the image, useful for object class prediction regardless of the visual domain. .citation[F.M. Carlucci et al., Domain Generalization by Solving Jigsaw Puzzles, CVPR 2019] --- class: middle ## .center[Rotation for better GAN Discriminators] .grid[ .kol-6-12[ .center.width-100[![](images/rot-gan.png)] .caption[Discriminator with rotation-based self-supervision.] ] .kol-6-12[<br/>
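A minimal sketch of such an auxiliary rotation objective, complementing the points below (the toy two-head discriminator and all names here are illustrative, not the exact architecture of Chen et al.):
```
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy two-head discriminator (hypothetical): a real/fake logit and 4-way rotation logits
class Disc(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.real_fake = nn.Linear(16, 1)
        self.rotation = nn.Linear(16, 4)

    def forward(self, x):
        h = self.features(x)
        return self.real_fake(h), self.rotation(h)

def rotation_batch(x):
    # 4 rotated copies (0, 90, 180, 270 degrees) of each image, plus rotation labels
    rots = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(x.size(0))
    return torch.cat(rots, dim=0), labels

disc = Disc()
x = torch.randn(8, 3, 32, 32)                        # a batch of real or generated images
x_rot, rot_labels = rotation_batch(x)
_, rot_logits = disc(x_rot)
aux_loss = F.cross_entropy(rot_logits, rot_labels)   # added to the usual GAN losses
```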
- Add a self-supervised auxiliary task to the .italic[Discriminator] - Encourage the .italic[Discriminator] to learn meaningful feature representations which are not forgotten during training ] ] .citation[T. Chen et al., Self-Supervised GANs via Auxiliary Rotation Loss, CVPR 2019] ??? - Both the fake and real images are rotated by $0\degree$, $90\degree$, $180\degree$, and $270\degree$. The colored arrows indicate that only the upright images are considered for the true vs. fake classification loss. For the rotation loss, all images are classified by the discriminator according to their rotation degree. - The self-supervised unconditional GAN reaches performance similar to state-of-the-art conditional counterparts --- class: middle, center # 6. Open challenges --- class: middle .bigger[Although they achieve outstanding performance, contrastive methods require long training and complex setups, e.g. TPUs, large mini-batches.] .hidden.bigger[Recent works have shown that reconstruction methods can achieve competitive performance by reconstructing features instead of inputs.] .hidden.Q[Can we improve or go beyond contrastive learning?] .hidden.citation[S. Gidaris et al., Learning Representations by Predicting Bags of Visual Words, CVPR 2020<br/>
J. B. Grill et al., Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, ArXiv 2020 ] --- count: false class: middle .bigger[Although they achieve outstanding performance, contrastive methods require long training and complex setups, e.g. TPUs, large mini-batches.] .bigger[Recent works have shown that reconstruction methods can achieve competitive performance by reconstructing features instead of inputs.] .hidden.Q[Can we improve or go beyond contrastive learning?] .citation[S. Gidaris et al., Learning Representations by Predicting Bags of Visual Words, CVPR 2020<br/>
J. B. Grill et al., Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, ArXiv 2020 ] --- count: false class: middle .bigger[Although they achieve outstanding performance, contrastive methods require long training and complex setups, e.g. TPUs, large mini-batches.] .bigger[Recent works have shown that reconstruction methods can achieve competitive performance by reconstructing features instead of inputs.] .center.Q[Can we improve or go beyond contrastive learning?] .citation[S. Gidaris et al., Learning Representations by Predicting Bags of Visual Words, CVPR 2020<br/>
J. B. Grill et al., Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, ArXiv 2020 ] --- class: middle .bigger[.citet[Asano et al. (2020)] show that **as little as a single image is sufficient**, when combined with self-supervision and data augmentation, to learn the first few layers of standard deep networks as well as using millions of images and full supervision.] .citation[Y.M. Asano, A critical analysis of self-supervision, or what we can learn from a single image, ICLR 2020] --- class: middle .grid[ .kol-6-12[ .center[1 - Take a high-resolution image]
.center.width-80[![](images/asano-1.png)] ] .hidden[ .kol-6-12[ .center[2 - Generate 1M images of crops and augmentations]
.center.width-80[![](images/asano-3.png)] ] ] ] .hidden[.Q[How to better leverage data for extracting more sophisticated information for learning representations?]] .citation[Y.M. Asano, A critical analysis of self-supervision, or what we can learn from a single image, ICLR 2020] --- count: false class: middle .grid[ .kol-6-12[ .center[1 - Take a high-resolution image]
.center.width-80[![](images/asano-1.png)] ] .kol-6-12[ .center[2 - Generate 1M images of crops and augmentations]
.center.width-80[![](images/asano-3.png)] ] ] .hidden[.Q[How to better leverage data for extracting more sophisticated information for learning representations?]] .citation[Y.M. Asano, A critical analysis of self-supervision, or what we can learn from a single image, ICLR 2020] --- count: false class: middle .grid[ .kol-6-12[ .center[1 - Take a high-resolution image]
.center.width-80[![](images/asano-1.png)] ] .kol-6-12[ .center[2 - Generate 1M images of crops and augmentations]
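A rough sketch of this step with torchvision (the exact augmentation recipe of Asano et al. differs; the crop size, jitter parameters, and file name below are placeholders):
```
from PIL import Image
from torchvision import transforms

# Aggressive random crops + color/flip augmentations applied to a single source image
augment = transforms.Compose([
    transforms.RandomResizedCrop(128, scale=(0.05, 1.0)),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomHorizontalFlip(),
])

source = Image.open("single_image.jpg").convert("RGB")   # the one training image
dataset = [augment(source) for _ in range(10_000)]       # scale up towards ~1M crops
```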
.center.width-80[![](images/asano-3.png)] ] ] .center.Q[How to better leverage data for extracting more sophisticated information for learning representations?] .citation[Y.M. Asano, A critical analysis of self-supervision, or what we can learn from a single image, ICLR 2020] --- class: middle .grid[ .kol-7-12[ .center.width-100[![](images/nuscenes-2.jpg)] ] .kol-5-12[
- With few exceptions, e.g. MonoDepth, most self-supervised methods deal with ImageNet-like data with one dominant object per image. - In the case of autonomous driving data with HD images and large, complex scenes, these strategies are cumbersome to apply. ] ] .citation[H. Caesar et al., nuScenes: A multimodal dataset for autonomous driving, CVPR 2020 ] --- class: middle .center.width-100[![](images/pcam.jpg)] - In medical imaging, by contrast, there is frequently little structure in the images to exploit. .hidden.center.Q[How to go beyond single-object images?] .citation[E. Bejnordi et al., Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer, JAMA 2017] --- count: false class: middle .center.width-100[![](images/pcam.jpg)] - In medical imaging, by contrast, there is frequently little structure in the images to exploit. .center.Q[How to go beyond single-object images?] .citation[E. Bejnordi et al., Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer, JAMA 2017] --- class: middle # Open challenges ## Beyond contrastive learning ## Beyond learning simple image statistics ## Beyond single-object images --- class: middle .big.italic.purple[".bold[Unsupervised learning:] Think hard about the model, use whatever data fits.] .big.italic.purple[.bold[Self-supervised learning:] Think hard about the data, use whatever model fits."] .bold.bigger.pull-right[Misha Denil]<br/>
.center.width-30[![](images/thinker.jpg)] --- layout: false class: end-slide, center count: false The end. --- layout: true .center.footer[Andrei BURSUC and Relja ARANDJELOVIĆ | Self-Supervised Learning] --- count: false class: middle ## .center[Typical data augmentation composition (pytorch) for self-supervised learning ]
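In contrastive pipelines, the composition below is typically applied twice to each image to produce two views, whose embeddings are $\ell\_2$-normalized and compared with temperature-scaled logits. A minimal sketch (the wrapper and helper names here are illustrative, not from a specific codebase):
```
import torch
import torch.nn.functional as F

class TwoViews:
    """Apply the same random transform twice, yielding two views of one image."""
    def __init__(self, base_transform):       # e.g. transforms.Compose(augmentation) below
        self.base_transform = base_transform

    def __call__(self, img):
        return self.base_transform(img), self.base_transform(img)

def contrastive_logits(z1, z2, tau=0.1):
    """L2-normalize the two sets of embeddings and scale similarities by temperature."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    return z1 @ z2.t() / tau                   # positives lie on the diagonal

# InfoNCE-style loss (simplified): each view must identify its sibling within the batch
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)    # embeddings of the two views
loss = F.cross_entropy(contrastive_logits(z1, z2), torch.arange(8))
```
With $\ell\_2$-normalized embeddings, temperatures around $0.07$-$0.2$ are typical (see the temperature scaling slides later in this appendix).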
.grid[
.kol-3-12[ ]
.kol-6-12[
```
from torchvision import transforms
...
# NOTE: GaussianBlur is assumed to be a custom transform (as in the MoCo v2 code),
# blurring the PIL image with a sigma sampled from the given range.
augmentation = [
    transforms.RandomResizedCrop(224, scale=(0.2, 1.)),
    transforms.RandomApply([
        transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([GaussianBlur([0.1, 2.0])], p=0.5),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
]
```
]
.kol-3-12[ ]
]
.citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020<br/>
K. He et al., Improved Baselines with Momentum Contrastive Learning, ArXiv 2020] --- count: false class: middle # Practical considerations ## Feature normalization --- count: false # Feature Normalization - the $\text{softmax}$ function generalizes the sigmoid function and yields a vector of $k$ values in $[0, 1]$ by exponentiating and then normalizing $$\text{softmax}(a)= \frac{1}{\sum\_j \exp(a\_j)} [\exp(a\_1), \dots, \exp(a\_k)]$$ - the $\text{softmax}$ distribution can be made arbitrarily sharp by simply scaling all the features .center.width-100[![](images/softmax-npy-1.png)] .caption[$\text{softmax}$ output for different logit scales] --- count:false # Feature Normalization - the $\text{softmax}$ function generalizes the sigmoid function and yields a vector of $k$ values in $[0, 1]$ by exponentiating and then normalizing $$\text{softmax}(a)= \frac{1}{\sum\_j \exp(a\_j)} [\exp(a\_1), \dots, \exp(a\_k)]$$ - the $\text{softmax}$ distribution can be made arbitrarily sharp by simply scaling all the features .center.width-100[![](images/softmax-npy-2.png)] .caption[$\text{softmax}$ output for different logit scales] --- count:false # Feature Normalization - the $\text{softmax}$ function generalizes the sigmoid function and yields a vector of $k$ values in $[0, 1]$ by exponentiating and then normalizing $$\text{softmax}(a)= \frac{1}{\sum\_j \exp(a\_j)} [\exp(a\_1), \dots, \exp(a\_k)]$$ - the $\text{softmax}$ distribution can be made arbitrarily sharp by simply scaling all the features .center.width-100[![](images/softmax-npy-3.png)] .caption[$\text{softmax}$ output for different logit scales] --- count:false # Feature Normalization - the $\text{softmax}$ function generalizes the sigmoid function and yields a vector of $k$ values in $[0, 1]$ by exponentiating and then normalizing $$\text{softmax}(a)= \frac{1}{\sum\_j \exp(a\_j)} [\exp(a\_1), \dots, \exp(a\_k)]$$ - the $\text{softmax}$ distribution can be made arbitrarily sharp by simply scaling all the features .center.width-100[![](images/softmax-npy-4.png)] .caption[$\text{softmax}$ output for different logit scales] --- count:false # Feature Normalization - the $\text{softmax}$ function generalizes the sigmoid function and yields a vector of $k$ values in $[0, 1]$ by exponentiating and then normalizing $$\text{softmax}(a)= \frac{1}{\sum\_j \exp(a\_j)} [\exp(a\_1), \dots, \exp(a\_k)]$$ - the $\text{softmax}$ distribution can be made arbitrarily sharp by simply scaling all the features .center.width-100[![](images/softmax-npy-5.png)] .caption[$\text{softmax}$ output for different logit scales] --- count:false # Feature Normalization - the $\text{softmax}$ function generalizes the sigmoid function and yields a vector of $k$ values in $[0, 1]$ by exponentiating and then normalizing $$\text{softmax}(a)= \frac{1}{\sum\_j \exp(a\_j)} [\exp(a\_1), \dots, \exp(a\_k)]$$ - the $\text{softmax}$ distribution can be made arbitrarily sharp by simply scaling all the features .center.width-100[![](images/softmax-npy-6.png)] .caption[$\text{softmax}$ output for different logit scales] --- count:false # Feature Normalization - the $\text{softmax}$ function generalizes the sigmoid function and yields a vector of $k$ values in $[0, 1]$ by exponentiating and then normalizing $$\text{softmax}(a)= \frac{1}{\sum\_j \exp(a\_j)} [\exp(a\_1), \dots, \exp(a\_k)]$$ - the $\text{softmax}$ distribution can be made arbitrarily sharp by simply scaling all the features .center.width-100[![](images/softmax-npy-7.png)] .caption[$\text{softmax}$ output for different logit 
scales] --- count:false # Feature Normalization - the $\text{softmax}$ function generalizes the sigmoid function and yields a vector of $k$ values in $[0, 1]$ by exponentiating and then normalizing $$\text{softmax}(a)= \frac{1}{\sum\_j \exp(a\_j)} [\exp(a\_1), \dots, \exp(a\_k)]$$ - the $\text{softmax}$ distribution can be made arbitrarily sharp by simply scaling all the features .center.width-100[![](images/softmax-npy-8.png)] .caption[$\text{softmax}$ output for different logit scales] --- count: false class: middle .bigger[$\ell\_{2}$-normalization is frequently applied to the features used in the contrastive loss.] .bigger[Without $\ell\_{2}$-normalization, accuracy on the pretext task is usually better, $+2-5\\%$, but lower in the downstream task (even on the same dataset, e.g. ImageNet), $-7\\%$.] .citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020 ] --- count: false # Temperature scaling - The temperature parameter $\tau$ controls the concentration level of the $\text{softmax}$ distribution $$ \text{softmax}(a/\tau):= \frac{1}{\sum\_j \exp(a\_j/\tau)} [\exp(a\_1/\tau), \dots, \exp(a\_k/\tau)]$$ - It mitigates the peaky predictions from $\text{softmax}$ by increasing its entropy $$\tau=1$$ .center.width-90[![](images/softmax-scaled-npy-0.png)] .caption[Impact of temperature scaling on non-normalized logits] .citation[G. Hinton et al., Distilling the knowledge in a neural network, ArXiv 2015] --- count: false # Temperature scaling - The temperature parameter $\tau$ controls the concentration level of the $\text{softmax}$ distribution $$ \text{softmax}(a/\tau):= \frac{1}{\sum\_j \exp(a\_j/\tau)} [\exp(a\_1/\tau), \dots, \exp(a\_k/\tau)]$$ - It mitigates the peaky predictions from $\text{softmax}$ by increasing its entropy $$\tau=2$$ .center.width-90[![](images/softmax-scaled-npy-1.png)] .caption[Impact of temperature scaling on non-normalized logits] .citation[G. Hinton et al., Distilling the knowledge in a neural network, ArXiv 2015] --- count: false # Temperature scaling - The temperature parameter $\tau$ controls the concentration level of the $\text{softmax}$ distribution $$ \text{softmax}(a/\tau):= \frac{1}{\sum\_j \exp(a\_j/\tau)} [\exp(a\_1/\tau), \dots, \exp(a\_k/\tau)]$$ - It mitigates the peaky predictions from $\text{softmax}$ by increasing its entropy $$\tau=5$$ .center.width-90[![](images/softmax-scaled-npy-2.png)] .caption[Impact of temperature scaling on non-normalized logits] .citation[G. Hinton et al., Distilling the knowledge in a neural network, ArXiv 2015] --- count: false # Temperature scaling - The temperature parameter $\tau$ controls the concentration level of the $\text{softmax}$ distribution $$ \text{softmax}(a/\tau):= \frac{1}{\sum\_j \exp(a\_j/\tau)} [\exp(a\_1/\tau), \dots, \exp(a\_k/\tau)]$$ - It mitigates the peaky predictions from $\text{softmax}$ by increasing its entropy $$\tau=10$$ .center.width-90[![](images/softmax-scaled-npy-3.png)] .caption[Impact of temperature scaling on non-normalized logits] .citation[G. Hinton et al., Distilling the knowledge in a neural network, ArXiv 2015] --- count: false # Temperature scaling - The temperature parameter $\tau$ controls the concentration level of the $\text{softmax}$ distribution $$ \text{softmax}(a/\tau):= \frac{1}{\sum\_j \exp(a\_j/\tau)} [\exp(a\_1/\tau), \dots, \exp(a\_k/\tau)]$$ - It mitigates the peaky predictions from $\text{softmax}$ by increasing its entropy $$\tau=20$$ .center.width-90[![](images/softmax-scaled-npy-4.png)] .caption[Impact of temperature scaling on non-normalized logits] .citation[G. Hinton et al., Distilling the knowledge in a neural network, ArXiv 2015] --- count: false # Temperature scaling - The temperature parameter $\tau$ controls the concentration level of the $\text{softmax}$ distribution $$ \text{softmax}(a/\tau):= \frac{1}{\sum\_j \exp(a\_j/\tau)} [\exp(a\_1/\tau), \dots, \exp(a\_k/\tau)]$$ - It mitigates the peaky predictions from $\text{softmax}$ by increasing its entropy $$\tau=50$$ .center.width-90[![](images/softmax-scaled-npy-5.png)] .caption[Impact of temperature scaling on non-normalized logits] .citation[G. Hinton et al., Distilling the knowledge in a neural network, ArXiv 2015] --- count: false # Temperature scaling - The temperature parameter $\tau$ controls the concentration level of the $\text{softmax}$ distribution $$ \text{softmax}(a/\tau):= \frac{1}{\sum\_j \exp(a\_j/\tau)} [\exp(a\_1/\tau), \dots, \exp(a\_k/\tau)]$$ - It mitigates the peaky predictions from $\text{softmax}$ by increasing its entropy $$\tau=100$$ .center.width-90[![](images/softmax-scaled-npy-6.png)] .caption[Impact of temperature scaling on non-normalized logits] .citation[G. Hinton et al., Distilling the knowledge in a neural network, ArXiv 2015] --- count: false # Temperature scaling - The temperature parameter $\tau$ controls the concentration level of the $\text{softmax}$ distribution $$ \text{softmax}(a/\tau):= \frac{1}{\sum\_j \exp(a\_j/\tau)} [\exp(a\_1/\tau), \dots, \exp(a\_k/\tau)]$$ - It mitigates the peaky predictions from $\text{softmax}$ by increasing its entropy $$\tau=400$$ .center.width-90[![](images/softmax-scaled-npy-7.png)] .caption[Impact of temperature scaling on non-normalized logits] .citation[G. Hinton et al., Distilling the knowledge in a neural network, ArXiv 2015] --- count: false class: middle .bigger[ The optimum value of $\tau$ depends on whether the logits are $\ell\_2$-normalized and whether they are followed by a non-linear projection. Some .italic[rules-of-thumb] *(when extensive hyper-parameter search is not an option)*: - no $\ell\_2$-normalization: $\tau \in [10, 100]$ .cites[[Chen et al. (2020)]] - $\ell\_2$-normalized vectors: $\tau \in [0.07, 0.1]$ .cites[[Wu et al. (2018), He et al. (2020), Chen et al. (2020)]] - $\ell\_2$-normalized vectors + non-linear projection: $\tau = 0.2$ .cites[[He et al. (2020)]] ] .citation[Z. Wu et al., Unsupervised Feature Learning via Non-Parametric Instance Discrimination, CVPR 2018<br/>
K. He et al., Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020
T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020
K. He et al., Improved Baselines with Momentum Contrastive Learning, ArXiv 2020 ] --- count: false class: middle .center.width-55[![](images/impact-l2norm.png)] .caption[Linear evaluation for models trained with different choices of
$\ell\_2$-normalization and temperature $\tau$ for contrastive loss with 4096 samples.] .citation[T. Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ArXiv 2020 ] --- count: false class: middle # Practical considerations ## Hyperparameters for pretext and downstream tasks --- count: false class: middle .grid[ .kol-6-12[ .center.bold[*Stage 1:* Train network on pretext task (without human labels) ] .center.width-100[![](images/self-sup-pipeline-step1.png)] .center.bold[*Stage 2:* Train classifier on learned features for new task with fewer labels] .center.width-100[![](images/self-sup-pipeline-step2.png)] ] .kol-6-12[ .hidden[
- For the _Linear Classification Protocol_, hyperparameter tuning was limited in order to emphasize the utility of the learned features. - .citet[Tian et al. (2019)] and .citet[He et al. (2020)] observed that feature distributions from pretext tasks can be substantially different from supervised ones. - This leads to different hyperparameter values: __weight decay = $0$__, __initial learning rate = $30$__ ] ] ] .hidden[ .citation[Y. Tian et al., Contrastive Multiview Coding, ArXiv 2019<br/>
K. He et al., Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020 ] ] --- count: false class: middle .grid[ .kol-6-12[ .center.bold[*Stage 1:* Train network on pretext task (without human labels) ] .center.width-100[![](images/self-sup-pipeline-step1.png)] .center.bold[*Stage 2:* Train classifier on learned features for new task with fewer labels] .center.width-100[![](images/self-sup-pipeline-step2-1.png)] ] .kol-6-12[ .hidden[
- For the _Linear Classification Protocol_, hyperparameter tuning was limited in order to emphasize the utility of the learned features. - .citet[Tian et al. (2019)] and .citet[He et al. (2020)] observed that feature distributions from pretext tasks can be substantially different from supervised ones. - This leads to different hyperparameter values: __weight decay = $0$__, __initial learning rate = $30$__ ] ] ] .hidden[ .citation[Y. Tian et al., Contrastive Multiview Coding, ArXiv 2019<br/>
K. He et al., Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020 ] ] --- count: false class: middle .grid[ .kol-6-12[ .center.bold[*Stage 1:* Train network on pretext task (without human labels) ] .center.width-100[![](images/self-sup-pipeline-step1.png)] .center.bold[*Stage 2:* Train classifier on learned features for new task with fewer labels] .center.width-100[![](images/self-sup-pipeline-step2-1.png)] ] .kol-6-12[
- For the _Linear Classification Protocol_, hyperparameter tuning was limited in order to emphasize the utility of the learned features. - .citet[Tian et al. (2019)] and .citet[He et al. (2020)] observed that feature distributions from pretext tasks can be substantially different from supervised ones. - This leads to different hyperparameter values: __weight decay = $0$__, __initial learning rate = $30$__ ] ] .citation[Y. Tian et al., Contrastive Multiview Coding, ArXiv 2019<br/>
K. He et al., Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020 ] --- count: false class: middle .grid[ .kol-6-12[ .center.bold[*Stage 1:* Train network on pretext task (without human labels) ] .center.width-100[![](images/self-sup-pipeline-step1.png)] .center.bold[*Stage 2:* Train classifier on learned features for new task with fewer labels] .center.width-100[![](images/self-sup-pipeline-step2-3.png)] ] .kol-6-12[
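A minimal sketch of the fine-tuning choice discussed in the points below (keeping BatchNorm layers trainable instead of freezing them into fixed affine layers); `backbone` stands for any pretrained torchvision-style model:
```
import torch.nn as nn
from torchvision import models

backbone = models.resnet50()   # assume weights loaded from the pretext task

def set_batchnorm_trainable(model, trainable=True):
    """Keep BatchNorm layers learnable during fine-tuning, or freeze them
    into fixed affine layers (the common alternative this slide argues against)."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.train(trainable)                 # update running statistics or not
            for p in m.parameters():
                p.requires_grad_(trainable)    # update affine parameters or not

set_batchnorm_trainable(backbone, trainable=True)
```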
- Similarly for fine-tuning, we can train BatchNorm layers instead of converting them into frozen affine layers. - This leads to performance boosts on different tasks: object detection, semantic segmentation. ] ] .citation[K. He et al., Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020 ] --- count: false class:middle ## .center[Impact of the architecture] .grid[ .kol-6-12[ .center.width-100[![](images/kolesnikov-impact-architecture.png)] ] .kol-6-12[
The quality of visual representations learned by various self-supervised learning techniques depends significantly on the CNN architecture used for the pretext task. ] ] .citation[A. Kolesnikov et al., Revisiting Self-Supervised Visual Representation Learning, CVPR 2019] --- count: false class:middle ## .center[Using more data for pre-training] .grid[ .kol-6-12[ .center.width-100[![](images/goyal-impact-data.png)] .caption[Transfer learning performance of self-supervised methods on the VOC07 dataset for AlexNet and ResNet-50 when varying the size of the pre-training dataset (ImageNet + YFCC-100M).] ] .kol-6-12[ - Increasing the size of pre-training data improves the transfer learning performance for both the _Jigsaw_ and _Colorization_ methods on ResNet-50 and AlexNet. - _Jigsaw_ performs better than _Colorization_. - The performance of the _Jigsaw_ model saturates (log-linearly) as we increase the data scale from 1M to 100M. - The performance gap between AlexNet and ResNet-50 keeps increasing, suggesting that higher-capacity models are needed to take full advantage of the larger pre-training datasets. ] ] .citation[P. Goyal et al., Scaling and Benchmarking Self-Supervised Visual Representation Learning, ICCV 2019] --- count: false class:middle ## .center[Increasing problem complexity] .grid[ .kol-6-12[ .center.width-100[![](images/goyal-impact-problem.png)] .caption[Transfer learning performance of _Jigsaw_ and _Colorization_ on the VOC07 dataset for AlexNet and ResNet-50 when varying pretext problem complexity.] ] .kol-6-12[<br/>
- For _Jigsaw_, we see an improvement in transfer learning performance with increased complexity. - _Colorization_ appears to be less sensitive to changes in problem complexity. ] ] .citation[P. Goyal et al., Scaling and Benchmarking Self-Supervised Visual Representation Learning, ICCV 2019] --- count: false class: middle .bigger[.citet[Newell and Deng (2020)] show that the __greatest benefits of pretraining are currently in low data regimes__, and its utility approaches zero well before task performance plateaus with additional labels.] .bigger[This finding underpins the usefulness of self-supervised learning when limited annotated data is available.] .citation[A. Newell and J. Deng, How Useful is Self-Supervised Pretraining for Visual Tasks?, CVPR 2020] --- count: false class: middle ## .center[Changing domains for both pre-training and downstream datasets] .grid[ .kol-6-12[ .center.width-100[![](images/wallace-bench.png)] .caption[List of 16 datasets evaluated in .cites[[Wallace and Hariharan (2020)]]. They are grouped into 4 categories: .bold[Internet], .bold[Symbolic], .bold[Scenes & Textures], and .bold[Biological].] ] .kol-6-12[<br/>
- Self-supervised techniques: _Rotation_, _Jigsaw_, _NPID_, _Autoencoders_ - _Rotation_ is the most semantically meaningful task - Much of the performance of _Jigsaw_ and _NPID_ comes from the nature of their induced distribution rather than semantic understanding - There are still several areas, e.g. fine-grained classification, where all tasks underperform. ] ] .citation[ B. Wallace and B. Hariharan, Extending and Analyzing Self-Supervised Learning Across Domains, ArXiv 2020] --- count: false class: middle ## .center[Representation learning stress test ] .center.width-60[![](images/vtab.png)] .caption[The 19 datasets of the Visual Task Adaptation Benchmark (VTAB).
They are grouped into three sets: .bold[Natural], .bold[Specialized], .bold[Structured]. ] .citation[X. Zhai et al., A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark, ArXiv 2019] --- count: false class: middle ## .center[Representation learning stress test ] .center.width-60[![](images/vtab.png)] .caption[The 19 datasets of the Visual Task Adaptation Benchmark (VTAB).
They are grouped into three sets: .bold[Natural], .bold[Specialized], .bold[Structured]. ] .citation[X. Zhai et al., A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark, ArXiv 2019] --- count: false class: middle ## .center[Representation learning stress test ] .center.width-70[![](images/vtab-overall.png)] .caption[Aggregated performance of each method group: .bold[generative] (VAE, WAE, BigGAN, BigBiGAN), .bold[training from-scratch], .bold[self-supervised], .bold[semi-supervised] (*Rotation*, *Exemplar*, *Relative Patch Location*, *Jigsaw*) (using $10\\%$ ImageNet labels), .bold[supervised] (using $100\\%$ ImageNet labels).
.bold[All]: using all labeled images from the downstream task; .bold[1000]: using 1000 annotated images] - Self-supervised outperforms .italic[from-scratch] training. - Interestingly, on .bold[Structured] tasks self-supervised methods perform slightly better than supervised ones. .citation[X. Zhai et al., A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark, ArXiv 2019] ??? - This indicates that supervised ImageNet models are invariant to features required for structured understanding, while self-supervised methods can capture these to some degree --- count: false class: middle ## .center[Linear classification vs. fine-tuning] .grid[ .kol-6-12[ .center.width-100[![](images/vtab-finetune.png)] .caption[Kendall’s correlation between fine-tuning and linear evaluation on each dataset.] ] .kol-6-12[<br/>
- Linear classification achieves significantly lower performance than fine-tuning - Overall, linear evaluation is a poor proxy for reduced sample complexity - A similar result was found by .citet[Newell and Deng (2020)] ] ] .citation[X. Zhai et al., A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark, ArXiv 2019<br/>
A. Newell and J. Deng, How Useful is Self-Supervised Pretraining for Visual Tasks?, CVPR 2020] --- class: middle .grid[ .kol-6-12[ .center.width-100[![](images/asano-4.png)] .caption[Accuracies of linear classifiers trained on the representations from intermediate layers of supervised and self-supervised networks.] ] .kol-6-12[<br/>
- With sufficient data augmentation, one image allows self-supervision to learn good and generalizable features - At deeper layers there is a gap with supervised methods, which is mitigated to some extent by large datasets for self-supervised methods ] ] .hidden[.Q[How to better leverage data for extracting more sophisticated information for learning representations?]] .citation[Y.M. Asano, A critical analysis of self-supervision, or what we can learn from a single image, ICLR 2020]
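--- count: false class: middle
For reference, a minimal sketch of the linear classification protocol used in these evaluations (frozen backbone, a single linear classifier, and the hyperparameters discussed earlier: weight decay $0$, large initial learning rate); the module names below are illustrative:
```
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Stage 1 is assumed done: a backbone pretrained on the pretext task.
backbone = models.resnet50()
backbone.fc = nn.Identity()            # drop the original head, keep 2048-d features
for p in backbone.parameters():
    p.requires_grad_(False)            # features stay frozen
backbone.eval()

linear = nn.Linear(2048, 1000)         # the only trainable part (downstream classes)

# Hyperparameters that work for features from pretext tasks (cf. He et al. (2020)):
optimizer = optim.SGD(linear.parameters(), lr=30.0, momentum=0.9, weight_decay=0.0)
criterion = nn.CrossEntropyLoss()
# training loop: logits = linear(backbone(images)); loss = criterion(logits, labels); ...
```
.citation[K. He et al., Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020]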