class: center, middle, title-slide
count: false

### Deep Learning - MAP583 2019-2020

# Part 5: Transfer learning and Self-supervised learning

.bold[Andrei Bursuc]
.width-10[![](images/logo_valeo.png)]

url: https://abursuc.github.io/slides/polytechnique/05_learning_tasks.html

.citation[With slides from A. Karpathy, F. Fleuret, G. Louppe, C. Ollion, O. Grisel, Y. Avrithis ...]

---

# Outline

## Transfer learning
## Off-the-shelf networks
## Fine-tuning
## (Task) transfer learning
## Multi-task learning
## Domain adaptation
## Self-supervised learning

---

class: middle, center

# Transfer Learning

---

# Transfer learning

- Assume two datasets $S$ and $T$
- Dataset $S$ is fully annotated, has plenty of images, and we can train a model $CNN_S$ on it
- Dataset $T$ has fewer annotations and/or fewer images
  + annotations of $T$ do not necessarily overlap with $S$
- We can use the model $CNN_S$ to learn a better $CNN_T$
- This is transfer learning

---

# Transfer learning

- Even if our dataset $T$ is not large, we can train a CNN for it
- Pre-train a CNN on the dataset $S$
- Then we can either:
  + fine-tune it
  + use it as a feature extractor

---

# Transfer learning

Many flavors of transfer learning

.center[
]

---
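count: false

# Transfer learning

Whatever the flavor, the starting point is a backbone pre-trained on $S$. A minimal sketch with `torchvision` (the ResNet-50 backbone and the number of target classes are illustrative, not prescribed by the slides):

```python
import torch.nn as nn
import torchvision.models as models

# flavor 1: frozen feature extractor (drop the 1000-way ImageNet classifier)
cnn_s = models.resnet50(pretrained=True)
feature_extractor = nn.Sequential(*list(cnn_s.children())[:-1])
for p in feature_extractor.parameters():
    p.requires_grad = False

# flavor 2: fine-tuning, after swapping the classifier for the target task T
cnn_t = models.resnet50(pretrained=True)
num_target_classes = 10  # illustrative
cnn_t.fc = nn.Linear(cnn_t.fc.in_features, num_target_classes)
```

---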
class: middle

# Why would using a pre-trained CNN (.italic[off-the-shelf]) be a good idea?

---

# Image ranking by CNN features

.center.width-100[![](images/part5/cnn_features_1.png)]

- $3$-channel RGB input, $224 \times 224$
- AlexNet pre-trained on ImageNet for classification

.citation[A. Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012]

---

count: false

# Image ranking by CNN features

.center.width-100[![](images/part5/cnn_features_2.png)]

- $3$-channel RGB input, $224 \times 224$
- AlexNet pre-trained on ImageNet for classification
- last fully connected layer ($fc\_6$): _global descriptor_ of dimension $k=4096$

.citation[A. Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012]

---
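count: false

# Image ranking by CNN features

A minimal sketch of such a ranking, assuming `torchvision`'s AlexNet and taking the $4096$-d activation of the penultimate fully connected layer as the global descriptor:

```python
import torch
import torchvision.models as models

alexnet = models.alexnet(pretrained=True).eval()

def global_descriptor(x):
    """x: images [N, 3, 224, 224] -> global descriptors [N, 4096]."""
    with torch.no_grad():
        feats = alexnet.features(x)                  # conv layers
        feats = alexnet.avgpool(feats).flatten(1)    # [N, 256*6*6]
        return alexnet.classifier[:-1](feats)        # stop before the 1000-way classifier

# rank database images by Euclidean distance to the query descriptor
query, database = torch.randn(1, 3, 224, 224), torch.randn(100, 3, 224, 224)
d_q, d_db = global_descriptor(query), global_descriptor(database)
ranking = torch.cdist(d_q, d_db).argsort(dim=1)      # nearest neighbors first
```

---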
# VGG-16

.center[
]

.citation[K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, ICLR 2015]

---

# VGG-16

```md
              Activation maps            Parameters
INPUT:     [224x224x3]   = 150K     0
CONV3-64:  [224x224x64]  = 3.2M     (3x3x3)x64    =       1,728
CONV3-64:  [224x224x64]  = 3.2M     (3x3x64)x64   =      36,864
POOL2:     [112x112x64]  = 800K     0
CONV3-128: [112x112x128] = 1.6M     (3x3x64)x128  =      73,728
CONV3-128: [112x112x128] = 1.6M     (3x3x128)x128 =     147,456
POOL2:     [56x56x128]   = 400K     0
CONV3-256: [56x56x256]   = 800K     (3x3x128)x256 =     294,912
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
POOL2:     [28x28x256]   = 200K     0
CONV3-512: [28x28x512]   = 400K     (3x3x256)x512 =   1,179,648
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
POOL2:     [14x14x512]   = 100K     0
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
POOL2:     [7x7x512]     = 25K      0
FC:        [1x1x4096]    = 4096     7x7x512x4096  = 102,760,448
FC:        [1x1x4096]    = 4096     4096x4096     =  16,777,216
FC:        [1x1x1000]    = 1000     4096x1000     =   4,096,000

TOTAL activations: 24M x 4 bytes ~= 93MB / image (x2 for backward)
TOTAL parameters: 138M x 4 bytes ~= 552MB (x2 for plain SGD, x4 for Adam)
```

.credit[Slide credit: C. Ollion & O. Grisel]
.citation[K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, ICLR 2015]

---

count: false

# VGG-16

```md
              Activation maps            Parameters
INPUT:     [224x224x3]   = 150K     0
CONV3-64:  [224x224x64]  = 3.2M     (3x3x3)x64    =       1,728
CONV3-64:  [224x224x64]  = 3.2M     (3x3x64)x64   =      36,864
POOL2:     [112x112x64]  = 800K     0
CONV3-128: [112x112x128] = 1.6M     (3x3x64)x128  =      73,728
CONV3-128: [112x112x128] = 1.6M     (3x3x128)x128 =     147,456
POOL2:     [56x56x128]   = 400K     0
CONV3-256: [56x56x256]   = 800K     (3x3x128)x256 =     294,912
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
POOL2:     [28x28x256]   = 200K     0
CONV3-512: [28x28x512]   = 400K     (3x3x256)x512 =   1,179,648
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
POOL2:     [14x14x512]   = 100K     0
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
POOL2:     [7x7x512]     = 25K      0
FC:        [1x1x4096]    = 4096     7x7x512x4096  = 102,760,448
*FC:       [1x1x4096]    = 4096     4096x4096     =  16,777,216
FC:        [1x1x1000]    = 1000     4096x1000     =   4,096,000

TOTAL activations: 24M x 4 bytes ~= 93MB / image (x2 for backward)
TOTAL parameters: 138M x 4 bytes ~= 552MB (x2 for plain SGD, x4 for Adam)
```

.credit[Slide credit: C. Ollion & O. Grisel]
.citation[K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, ICLR 2015]

---

# Image ranking by CNN features

.center.width-70[![](images/part5/cnn_features_3.png)]

- query images

.citation[A. Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012]

---

count: false

# Image ranking by CNN features

.center.width-70[![](images/part5/cnn_features_4.png)]

- query images: nearest neighbors in ImageNet according to Euclidean distance

.citation[A. Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012]

---

# Sampling information from feature maps

.center.width-90[![](images/part5/mac_1.png)]

- VGG-16 last convolutional layer, $k=512$

.citation[G. Tolias et al., Particular object retrieval with integral max-pooling, ICLR 2016]
---

count: false

# Sampling information from feature maps

.center.width-90[![](images/part5/mac_2.png)]

- VGG-16 last convolutional layer, $k=512$
- global spatial max-pooling/sum

.citation[G. Tolias et al., Particular object retrieval with integral max-pooling, ICLR 2016]

---

count: false

# Sampling information from feature maps

.center.width-90[![](images/part5/mac_3.png)]

- VGG-16 last convolutional layer, $k=512$
- global spatial max-pooling/sum
- $\ell\_2$ normalization, PCA-whitening, $\ell\_2$ normalization
- _MAC_: maximum activation from convolutions

.citation[G. Tolias et al., Particular object retrieval with integral max-pooling, ICLR 2016]

---

count: false

# VGG-16

```md
              Activation maps            Parameters
INPUT:     [224x224x3]   = 150K     0
CONV3-64:  [224x224x64]  = 3.2M     (3x3x3)x64    =       1,728
CONV3-64:  [224x224x64]  = 3.2M     (3x3x64)x64   =      36,864
POOL2:     [112x112x64]  = 800K     0
CONV3-128: [112x112x128] = 1.6M     (3x3x64)x128  =      73,728
CONV3-128: [112x112x128] = 1.6M     (3x3x128)x128 =     147,456
POOL2:     [56x56x128]   = 400K     0
CONV3-256: [56x56x256]   = 800K     (3x3x128)x256 =     294,912
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
POOL2:     [28x28x256]   = 200K     0
CONV3-512: [28x28x512]   = 400K     (3x3x256)x512 =   1,179,648
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
POOL2:     [14x14x512]   = 100K     0
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
*POOL2:    [7x7x512]     = 25K      0
FC:        [1x1x4096]    = 4096     7x7x512x4096  = 102,760,448
FC:        [1x1x4096]    = 4096     4096x4096     =  16,777,216
FC:        [1x1x1000]    = 1000     4096x1000     =   4,096,000

TOTAL activations: 24M x 4 bytes ~= 93MB / image (x2 for backward)
TOTAL parameters: 138M x 4 bytes ~= 552MB (x2 for plain SGD, x4 for Adam)
```

.credit[Slide credit: C. Ollion & O. Grisel]
.citation[K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, ICLR 2015]

---

# Global max-pooling: matching

.center.width-50[![](images/part5/mac_results_1.png)]

- receptive fields of the 5 components of MAC vectors that contribute most to image similarity

.citation[G. Tolias et al., Particular object retrieval with integral max-pooling, ICLR 2016]

---

# Global max-pooling: matching

.center.width-90[![](images/part5/mac_results_2.png)]

- receptive fields of the 5 components of MAC vectors that contribute most to image similarity

.citation[G. Tolias et al., Particular object retrieval with integral max-pooling, ICLR 2016]

---

# Global max-pooling: matching

.center.width-60[![](images/part5/mac_results_3.png)]

- receptive fields of the 5 components of MAC vectors that contribute most to image similarity

.citation[G. Tolias et al., Particular object retrieval with integral max-pooling, ICLR 2016]

---

# Regional max-pooling (R-MAC)

.center.width-90[![](images/part5/r-mac_1.png)]

- VGG-16 last convolutional layer, $k=512$
- fixed multiscale overlapping regions

.citation[G. Tolias et al., Particular object retrieval with integral max-pooling, ICLR 2016]

---

count: false

# Regional max-pooling (R-MAC)

.center.width-90[![](images/part5/r-mac_2.png)]

- VGG-16 last convolutional layer, $k=512$
- fixed multiscale overlapping regions, spatial _max_-pooling

.citation[G. Tolias et al., Particular object retrieval with integral max-pooling, ICLR 2016]

---

count: false

# Regional max-pooling (R-MAC)

.center.width-90[![](images/part5/r-mac_3.png)]

- VGG-16 last convolutional layer, $k=512$
- fixed multiscale overlapping regions, spatial _max_-pooling
- $\ell\_2$ normalization, PCA-whitening, $\ell\_2$ normalization

.citation[G. Tolias et al., Particular object retrieval with integral max-pooling, ICLR 2016]

---

count: false

# Regional max-pooling (R-MAC)

.center.width-90[![](images/part5/r-mac_4.png)]

- VGG-16 last convolutional layer, $k=512$
- fixed multiscale overlapping regions, spatial _max_-pooling
- $\ell\_2$ normalization, PCA-whitening, $\ell\_2$ normalization
- _sum_-pooling over all descriptors, $\ell\_2$ normalization

.citation[G. Tolias et al., Particular object retrieval with integral max-pooling, ICLR 2016]

---
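# MAC descriptors in practice

A minimal sketch of the global MAC descriptor on top of a pre-trained VGG-16 (regional pooling and PCA-whitening are omitted; the conv trunk follows `torchvision`'s layer layout, which is an assumption, not the paper's exact setup):

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# keep only the convolutional trunk of an ImageNet-pretrained VGG-16
vgg = models.vgg16(pretrained=True).features.eval()

def mac_descriptor(x):
    """x: images [N, 3, H, W] -> l2-normalized MAC descriptors [N, 512]."""
    with torch.no_grad():
        fmap = vgg(x)                                        # [N, 512, h, w]
        mac = F.adaptive_max_pool2d(fmap, 1).flatten(1)      # global spatial max-pooling
        return F.normalize(mac, dim=1)                       # l2 normalization

# ranking with l2-normalized descriptors (dot product = cosine similarity)
desc = mac_descriptor(torch.randn(4, 3, 224, 224))
similarity = desc @ desc.t()
```

---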
class: middle

# Can we do more with a bit of training?

---

class: middle

Training a task-specific linear classifier on top of CNN features led to SoTA or near-SoTA results

.center.width-80[![](images/part5/razavian.png)]

.citation[A. Razavian et al., CNN Features off-the-shelf: an Astounding Baseline for Recognition, arXiv 2014]

---

# Tasks and datasets

### Object classification and detection
.center.width-80[![](images/part5/pascal_voc.png)]
.caption[Pascal VOC]

### Fine-grained classification
.center.width-40[![](images/part5/cub.png)]
.caption[CUB-200-2011]

---

# Tasks and datasets

### Instance retrieval
.center.width-100[![](images/part5/oxford_buildings.png)]
.caption[Oxford Buildings and Paris Buildings]

---
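# Off-the-shelf features + a linear classifier

A minimal sketch of this baseline: frozen descriptors from a pre-trained backbone, with only a linear classifier trained on the target task (the ResNet-50 backbone and hyper-parameters are illustrative; the original work used other networks and linear SVMs):

```python
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet50(pretrained=True).eval()
backbone.fc = nn.Identity()                 # expose the 2048-d pooled features
for p in backbone.parameters():
    p.requires_grad = False

num_classes = 20                            # e.g. Pascal VOC, illustrative
clf = nn.Linear(2048, num_classes)          # the only trainable part
optimizer = torch.optim.SGD(clf.parameters(), lr=1e-2, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def training_step(images, labels):
    with torch.no_grad():                   # frozen off-the-shelf features
        feats = backbone(images)
    loss = criterion(clf(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

---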
.center.width-60[![](images/part5/carlsson.png)]

--
count: false

## .center[.red[2020 edit: Not quite there yet]]

---

class: middle, center

# Fine-tuning

---

# Fine-tuning

- Assume the parameters of $CNN_S$ are already a good starting point, near our final local optimum
- Use them as the initial parameters for our new CNN on the target dataset
- This is a good solution when the dataset $T$ is relatively big
  + e.g. $S$ is ImageNet with 1M images, $T$ has a few thousand images

---

# Fine-tuning

.center[
]

- Depending on the size of $T$, decide which layers to freeze and which to fine-tune/replace
- Use a lower learning rate when fine-tuning: about $\frac{1}{10}$ of the original learning rate
  + for the new layers use a more aggressive learning rate
- If $S$ and $T$ are very similar, fine-tune only the fully-connected layers
- If the datasets are different and you have enough data, fine-tune all layers

---
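count: false

# Fine-tuning

A minimal sketch of these heuristics in PyTorch (assuming a ResNet-50 pre-trained on $S$; the choice of frozen blocks and the learning rates are illustrative):

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet50(pretrained=True)

# replace the classification head for the target dataset T
num_target_classes = 10                                   # illustrative
model.fc = nn.Linear(model.fc.in_features, num_target_classes)

# freeze the early blocks, fine-tune the rest
for name, p in model.named_parameters():
    if name.startswith(("conv1", "bn1", "layer1", "layer2")):
        p.requires_grad = False

# lower learning rate for pre-trained layers, a more aggressive one for the new head
base_lr = 1e-3
optimizer = torch.optim.SGD([
    {"params": [p for n, p in model.named_parameters()
                if p.requires_grad and not n.startswith("fc")], "lr": base_lr / 10},
    {"params": model.fc.parameters(), "lr": base_lr},
], momentum=0.9)
```

---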
# Selecting which layers to freeze

.left-column[
.center[Visualize the patterns that most excite units across layers]
]
.right-column[.center[
]]
.reset-column[ ]

.center.width-70[![](images/part5/cnn_vis_6.png)]

.citation[M. Zeiler & R. Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014]

---

# Fine-tuning a ResNet?

.center.width-100[![](images/part5/resnet.png)]

---

count: false

# Fine-tuning a ResNet?

.center.width-100[![](images/part5/resnet.png)]

Preferably select the layers to freeze at the boundaries of the main blocks

.center.width-70[![](images/part5/resnet_3.png)]

---

class: middle

.center.width-60[![](images/part5/bread_butter.jpg)]

Pre-trained networks are the bread and butter of computer vision solutions (at large), both academic and industrial. Other communities are starting to follow this practice.

---

class: middle

Pre-trained networks are the bread and butter in computer vision: _semantic segmentation_

.grid[
.kol-6-12[
.center[![](images/part5/segnet.png)]
.caption[SegNet]
.center[![](images/part5/deeplabv3.png)]
.caption[DeepLabV3]
]
.kol-6-12[
.center[![](images/part5/unet.png)]
.caption[U-Net]
.center[![](images/part5/pspnet.png)]
.caption[PSPNet]
]
]

---

class: middle

Pre-trained networks are the bread and butter in computer vision: _object retrieval_

.center.width-90[![](images/part5/siamac.png)]

.citation[F. Radenovic, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, ECCV 2016]

---

class: middle

Pre-trained networks are the bread and butter in computer vision: _visual question answering_

.center.width-90[![](images/part5/vqa_1.png)]
.caption[VQA]

.citation[A. Agrawal et al., VQA: Visual Question Answering, ICCV 2015]

---

Pre-trained networks are the bread and butter in computer vision: _style transfer_

.center[
]

---

Pre-trained networks are the bread and butter in computer vision: _style transfer_

.center.width-80[![](images/part5/style_transfer.png)]

.citation[L. Gatys et al., A Neural Algorithm of Artistic Style, arXiv 2015]

---
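count: false

Pre-trained networks are the bread and butter in computer vision: _style transfer_

Here the pre-trained features are reused as a loss: style is compared through Gram matrices of VGG feature maps. A minimal sketch of the style term only (VGG-16 and the layer indices are simplifications and assumptions; the original work uses VGG-19):

```python
import torch
import torchvision.models as models

vgg = models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad = False

STYLE_LAYERS = {3, 8, 15, 22}          # illustrative ReLU indices in vgg

def gram_matrices(x):
    """Gram matrices of the feature maps at the selected layers."""
    grams = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in STYLE_LAYERS:
            n, c, h, w = x.shape
            f = x.view(n, c, h * w)
            grams.append(f @ f.transpose(1, 2) / (c * h * w))
    return grams

def style_loss(generated, style_image):
    return sum(((g - s) ** 2).mean()
               for g, s in zip(gram_matrices(generated), gram_matrices(style_image)))
```

---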
class: middle, center

# .green[Question?]

---

class: middle

Imagine you have to work with X-ray images and you don't have many of them. _What can you do?_

.center.width-70[![](images/part5/x-ray.png)]

---

class: middle

There is a popular belief that pre-training is always better than random initialization even after convergence.

---

count: false
class: middle

There is a popular belief that ~~pre-training is always better than random initialization even after convergence~~.

## .center[.red[False / Not anymore at least]]

---

# Rethinking pre-training

.left-column[
.center.width-80[![](images/part5/he_1.png)]
]
.right-column[
.center.width-100[![](images/part5/he_2.png)]
]

.citation[K. He et al., Rethinking ImageNet Pre-training, arXiv 2018]

---

class: middle, center

# (Task) Transfer learning

---

class: middle

# Taskonomy: Disentangling Task Transfer Learning

.citation[A. Zamir et al., Taskonomy: Disentangling Task Transfer Learning, CVPR 2018]

---

# Question

- Some vision tasks are more inter-related than others:
  + depth estimation could help surface normal estimation?
  + scene layout could help object detection?
- for some tasks, annotation data is easier to obtain than for others
- could we find a fully computational approach to model the structure of the space of visual tasks?

.citation[A. Zamir et al., Taskonomy: Disentangling Task Transfer Learning, CVPR 2018]

---

# Task bank/dictionary

.grid[
.kol-6-12[
- Task Bank
  + 26 semantic, 2d, 3d and other tasks
- Dataset
  + 4M real images
  + each image has the GT label for all tasks
- Task-specific networks
  + 26 (one per task)
]
.kol-6-12[
.center[
]
]
]

---

# Task bank/dictionary

.center[
]

.citation[A. Zamir et al., Taskonomy: Disentangling Task Transfer Learning, CVPR 2018]

---

# Common architecture

.center[
]

- encoder: ResNet-50
- transfer function: 2 conv layers
- decoder: 15 _fully convolutional_ layers / 2-3 _fully connected_ layers

.citation[A. Zamir et al., Taskonomy: Disentangling Task Transfer Learning, CVPR 2018]

---

# Common architecture

.center[
]

- _full training_ (gold standard): $120k$ training, $16k$ validation, $17k$ testing images
- _fine-tuning_: $1-16k$ images

.citation[A. Zamir et al., Taskonomy: Disentangling Task Transfer Learning, CVPR 2018]

---

# Computational model

.center.width-100[![](images/part5/taskonomy_overview.png)]

.citation[A. Zamir et al., Taskonomy: Disentangling Task Transfer Learning, CVPR 2018]

---

# Computed taxonomies

.center[
]

.citation[A. Zamir et al., Taskonomy: Disentangling Task Transfer Learning, CVPR 2018]

---

class: middle, center

# Multi-Task Learning

---

class: middle

.big[If we have different types of labels for a given dataset we can:
- train multiple models (one for each type of label)
- inject all knowledge in a single model]

---

count: false
class: middle

.big[If we have different types of labels for a given dataset we can:
- train multiple models (one for each type of label)
- __inject all knowledge in a single model__]

---

# Multi-Task Learning

.center.width-100[![](images/part5/mtl_1.png)]

---

count: false

# Multi-Task Learning

.center.width-80[![](images/part5/mtl_1.png)]

.big[$$ \mathcal{L}\_{total} = \sum\_i w\_i \mathcal{L}\_i$$]

.big[$$ \sum\_i w\_i = 1$$]

---
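count: false

# Multi-Task Learning

A minimal sketch of the weighted multi-task objective: a shared backbone with one head per label type (the tasks, heads and weights are illustrative):

```python
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet50(pretrained=True)
backbone.fc = nn.Identity()                         # shared 2048-d features

heads = nn.ModuleDict({
    "classification": nn.Linear(2048, 10),          # illustrative heads
    "attributes":     nn.Linear(2048, 40),
})
losses = {"classification": nn.CrossEntropyLoss(),
          "attributes":     nn.BCEWithLogitsLoss()}
weights = {"classification": 0.7, "attributes": 0.3}   # sum_i w_i = 1

def total_loss(images, labels):
    """labels: dict mapping each task name to its target tensor."""
    feats = backbone(images)
    return sum(weights[t] * losses[t](heads[t](feats), labels[t]) for t in heads)
```

---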
class: middle, center

# Domain Adaptation

---

# Example: Internet images $\rightarrow$ webcam

.center.width-80[![](images/part5/domain_adaptation_1.png)]

---

# Example: (semi-)synthetic $\rightarrow$ real

.left-column[
.center.width-70[![](images/part5/domain_adaptation_2.png)]
]
.right-column[
.center.width-70[![](images/part5/domain_adaptation_3.png)]
]

---

# Domain-adversarial networks

.center.width-100[![](images/part5/domain_adversarial.png)]

.citation[Y. Ganin et al., Domain-Adversarial Training of Neural Networks, JMLR 2016]

---

# Domain-adversarial networks

.center.width-60[![](images/part5/domain_adversarial.png)]

- build a network
- train .green[feature extractor] + .blue[class predictor] on source data
- train .green[feature extractor] + .pink[domain classifier] on source+target data
- use .green[feature extractor] + .blue[class predictor] at test time

.citation[Y. Ganin et al., Domain-Adversarial Training of Neural Networks, JMLR 2016]

---

# Domain-adversarial networks

.center.width-100[![](images/part5/domain_adversarial_2.png)]

- build a network
- train .green[feature extractor] + .blue[class predictor] on source data
- train .green[feature extractor] + .pink[domain classifier] on source+target data
- use .green[feature extractor] + .blue[class predictor] at test time

.citation[Y. Ganin et al., Domain-Adversarial Training of Neural Networks, JMLR 2016]

---
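count: false

# Domain-adversarial networks

The domain classifier is attached through a gradient reversal layer: identity in the forward pass, flipped gradient in the backward pass, so the feature extractor is pushed towards domain-invariant features. A minimal sketch (not the authors' code; the surrounding networks are assumed to be defined elsewhere):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity forward; gradient multiplied by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# inside a training step (feature_extractor, class_predictor, domain_classifier,
# class_criterion and domain_criterion assumed defined elsewhere):
#   feats = feature_extractor(images)
#   class_loss  = class_criterion(class_predictor(feats), labels)          # source only
#   domain_loss = domain_criterion(domain_classifier(grad_reverse(feats)),
#                                  domain_labels)                          # source + target
#   (class_loss + domain_loss).backward()
```

---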
# Large-gap domain adaptation

Using video games to generate training data

.center.width-70[![](images/part5/gtav.jpg)]
.caption[GTA V]

.citation[S. Richter et al., Playing for Data: Ground Truth from Computer Games, ECCV 2016]

---

# Unsupervised domain adaptation

.center.width-50[![](images/part5/advent_1.png)]
.caption[Directly testing on the target data is not quite optimal]

.citation[T. Vu et al., ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation, CVPR 2019]

---

# Unsupervised domain adaptation

.center.width-90[![](images/part5/advent_2.png)]
.caption[ADVENT - Adversarial Entropy Minimization for Domain Adaptation]

.citation[T. Vu et al., ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation, CVPR 2019]

---

# Unsupervised domain adaptation

.center.width-60[![](images/part5/advent_4.png)]
.caption[Gap w.r.t. the oracle, .italic[i.e.] when training on the target data]

.citation[T. Vu et al., ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation, CVPR 2019]

---

# Unsupervised domain adaptation

.center.width-80[![](images/part5/advent_6.png)]
.caption[Semantic segmentation: from GTA V $\rightarrow$ Cityscapes]

.citation[T. Vu et al., ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation, CVPR 2019]

---

# Unsupervised domain adaptation

.center.width-80[![](images/part5/advent_7.png)]
.caption[Object detection: from Cityscapes $\rightarrow$ Foggy Cityscapes (synthetic fog)]

.citation[T. Vu et al., ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation, CVPR 2019]

---

# Also domain adaptation

Instead of learning domain-invariant features, we can modify the synthetic data to "look" more realistic.

.center.width-60[![](images/part5/apple_1.png)]

.citation[A. Shrivastava et al., Learning from Simulated and Unsupervised Images through Adversarial Training, CVPR 2017]

---

# Also domain adaptation

.center.width-100[![](images/part5/apple_2.png)]

.citation[A. Shrivastava et al., Learning from Simulated and Unsupervised Images through Adversarial Training, CVPR 2017]

---

count: false

# Also domain adaptation

.center.width-100[![](images/part5/apple_2.png)]
.center.width-80[![](images/part5/apple_3.png)]

.citation[A. Shrivastava et al., Learning from Simulated and Unsupervised Images through Adversarial Training, CVPR 2017]

---

class: middle, center

# Self-supervised learning

---

# Self-supervised learning

Self-supervised learning consists of two learning stages:
- 1) Learn a feature extractor with a proxy annotation-free task
- 2) Transfer it to an end task:
  + end tasks: image classification, object detection, etc.
- __Goal__: learn good features at the 1st stage for the 2nd stage

.center.width-90[![](images/part5/rotnet_1.png)]

---

class: middle, center

## Self-supervised learning is the platypus of machine learning

.center.width-70[![](images/part5/platypus.jpg)]

---

class: middle

## .center[Self-supervised learning is the platypus of machine learning]

.center.width-50[![](images/part5/platypus.jpg)]

- the loss is supervised
- no manual labeling is needed since the labels come from the data itself
- it's somewhere in-between _supervised_ and _unsupervised learning_

---
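count: false

# Self-supervised learning

A minimal sketch of the two stages, assuming the pretext-task training happens elsewhere (the encoder, feature dimension and class counts are illustrative):

```python
import torch.nn as nn

class PretextModel(nn.Module):
    """Stage 1: feature extractor + head for the proxy task (e.g. rotation prediction)."""
    def __init__(self, encoder, feat_dim, num_pretext_classes):
        super().__init__()
        self.encoder = encoder
        self.pretext_head = nn.Linear(feat_dim, num_pretext_classes)

    def forward(self, x):
        return self.pretext_head(self.encoder(x))

# ... stage 1: train PretextModel on unlabeled images with automatically generated labels ...

def linear_probe(encoder, feat_dim, num_end_classes):
    """Stage 2: freeze the pre-trained encoder, train only a linear classifier on the end task."""
    for p in encoder.parameters():
        p.requires_grad = False
    return nn.Sequential(encoder, nn.Linear(feat_dim, num_end_classes))
```

---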
# Image colorization

.center.width-60[![](images/part5/colorization_2.jpg)]
.caption[Convert grayscale images to color]

.citation[R. Zhang et al., Colorful image colorization, ECCV 2016]

---

# Image colorization

.center.width-100[![](images/part5/colorization_1.jpg)]

.citation[R. Zhang et al., Colorful image colorization, ECCV 2016]

---

# Exemplar networks

- perturb image patches (color distortions, cropping, affine transformations)
- train the net to classify all perturbed variants of a patch as the same class

.center.width-80[![](images/part5/exemplar_nets.png)]

.citation[A. Dosovitskiy et al., Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks, PAMI 2015]

---

# Rotation prediction

.center.width-80[![](images/part5/rotnet_2.png)]

.citation[S. Gidaris et al., Unsupervised representation learning by predicting image rotations, ICLR 2018]

---

# Rotation prediction

.center.width-80[![](images/part5/rotnet_3.png)]

.citation[S. Gidaris et al., Unsupervised representation learning by predicting image rotations, ICLR 2018]

---
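count: false

# Rotation prediction

A minimal sketch of the pretext task: every image is rotated by 0, 90, 180 and 270 degrees, and the network must predict which rotation was applied (the ResNet-18 backbone is an assumption):

```python
import torch
import torch.nn as nn
import torchvision.models as models

net = models.resnet18(pretrained=False)          # trained from scratch on the pretext task
net.fc = nn.Linear(net.fc.in_features, 4)        # 4 rotation classes
criterion = nn.CrossEntropyLoss()

def rotation_batch(images):
    """images: [N, C, H, W] -> 4N rotated images with their rotation labels (0..3)."""
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return rotated, labels

def pretext_loss(images):
    rotated, labels = rotation_batch(images)
    return criterion(net(rotated), labels)       # no human annotation needed
```

---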
# Relative patch location

.center.width-100[![](images/part5/patch_location.png)]

.citation[C. Doersch et al., Unsupervised Visual Representation Learning by Context Prediction, ICCV 2015]

---

# Relative patch location

.center.width-100[![](images/part5/patch_location_2.png)]

.citation[C. Doersch et al., Unsupervised Visual Representation Learning by Context Prediction, ICCV 2015]

---

# Jigsaw puzzles

.center.width-100[![](images/part5/jigsaw_1.png)]

.citation[M. Noroozi and P. Favaro, Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles, ECCV 2016]

---

# Context encoders

.center.width-80[![](images/part5/context_encoder_1.png)]

.citation[D. Pathak, Context Encoders: Feature Learning by Inpainting, CVPR 2016]

---

# Context encoders

.center.width-80[![](images/part5/context_encoder_2.png)]

.citation[D. Pathak, Context Encoders: Feature Learning by Inpainting, CVPR 2016]

---

# Predicting the correct order of time

.center.width-50[![](images/part5/temporal_1.png)]

.citation[I. Misra et al., Shuffle and Learn: Unsupervised Learning using Temporal Order Verification, ECCV 2016]

---

# Predicting the correct order of time

.center.width-60[![](images/part5/temporal_2.png)]

.citation[I. Misra et al., Shuffle and Learn: Unsupervised Learning using Temporal Order Verification, ECCV 2016]

---

# Sorting video sequences

.center.width-70[![](images/part5/opn.png)]

.citation[H.Y. Lee et al., Unsupervised Representation Learning by Sorting Sequences, ICCV 2017]

---

# Sorting video sequences

.center.width-80[![](images/part5/opn_2.png)]

.citation[H.Y. Lee et al., Unsupervised Representation Learning by Sorting Sequences, ICCV 2017]

---
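# Temporal order as supervision

A minimal sketch of the temporal pretext: sample a few frames from a clip, break their order half of the time, and train a classifier to tell correct from shuffled order (the per-frame encoder is illustrative; shuffle-and-learn works with triplets of frames):

```python
import random
import torch
import torch.nn as nn

frame_encoder = nn.Sequential(                  # illustrative per-frame encoder
    nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())
order_head = nn.Linear(3 * 32, 2)               # correct vs shuffled order
criterion = nn.CrossEntropyLoss()

def order_example(clip):
    """clip: [T, C, H, W] -> (3 frames in correct or shuffled order, binary label)."""
    idx = sorted(random.sample(range(clip.size(0)), 3))
    label = random.randint(0, 1)                # 1 = correct order, 0 = shuffled
    if label == 0:
        idx[0], idx[1] = idx[1], idx[0]         # break the temporal order
    return clip[idx], label

def order_loss(clip):
    frames, label = order_example(clip)
    feats = torch.cat([frame_encoder(f.unsqueeze(0)) for f in frames], dim=1)
    return criterion(order_head(feats), torch.tensor([label]))
```

---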
# Audio-visual correspondence

## .center[Can audio and video learn from each other?]

.center.width-80[![](images/part5/look_listen_3.png)]

.citation[R. Arandjelovic, Look, Listen and Learn, ICCV 2017]
.credit[Slide credit: A. Zisserman]

---

# Audio-visual correspondence

.center.width-80[![](images/part5/look_listen_4.png)]

.citation[R. Arandjelovic, Look, Listen and Learn, ICCV 2017]
.credit[Slide credit: A. Zisserman]

---

# Audio-visual correspondence

.center.width-80[![](images/part5/look_listen_5.png)]

.citation[R. Arandjelovic, Look, Listen and Learn, ICCV 2017]
.credit[Slide credit: A. Zisserman]

---

# Audio-visual correspondence

.center.width-80[![](images/part5/look_listen_6.png)]

.citation[R. Arandjelovic, Look, Listen and Learn, ICCV 2017]
.credit[Slide credit: A. Zisserman]

---

# Audio-visual correspondence

.center.width-90[![](images/part5/look_listen_7.png)]

.citation[R. Arandjelovic, Look, Listen and Learn, ICCV 2017]
.credit[Slide credit: A. Zisserman]

---

# Audio-visual correspondence

.center.width-80[![](images/part5/look_listen_1.png)]

.citation[R. Arandjelovic, Look, Listen and Learn, ICCV 2017]

---

# Audio-visual correspondence

.center.width-80[![](images/part5/look_listen_2.png)]

.citation[R. Arandjelovic, Look, Listen and Learn, ICCV 2017]

---

# Contrastive representations

.grid[
.kol-6-12[
.center.width-80[![](images/part5/pirl_1.png)]
].kol-6-12[
]
]

Learn to recognize perturbed variants of the same image

.citation[I. Misra et al., Self-Supervised Learning of Pretext-Invariant Representations, arXiv 2019]

---

count: false

# Contrastive representations

.grid[
.kol-6-12[
.center.width-80[![](images/part5/pirl_1.png)]
].kol-6-12[
.center.width-80[![](images/part5/pirl_2.png)]
]
]

Learn to recognize perturbed variants of the same image. We're starting to finally outperform ImageNet supervised pre-trained models.

.citation[I. Misra et al., Self-Supervised Learning of Pretext-Invariant Representations, arXiv 2019]

---

class: middle, center

.center.width-40[![](images/part5/lecun_photo.jpg)]

---

class: middle, center

.center.width-90[![](images/part5/lecun_scholar.png)]

---

class: middle, center

.center.width-90[![](images/part5/lecun_cake.png)]

---

# Practical utilisation of self-supervision

Extract knowledge from unlabeled data using self-supervision in a multi-task learning pipeline

.center.width-100[![](images/part5/boosting.png)]

.citation[S. Gidaris et al., Boosting Few-Shot Visual Learning with Self-Supervision, ICCV 2019]

---

# Practical utilisation of self-supervision

Extract knowledge from unlabeled data using self-supervision in a multi-task learning pipeline

.center.width-100[![](images/part5/jigen_2.png)]

.citation[F.M. Carlucci, Domain Generalization by Solving Jigsaw Puzzles, CVPR 2019]

---

# Today

## Transfer learning
## Off-the-shelf networks
## Fine-tuning
## (Task) transfer learning
## Multi-task learning
## Domain adaptation
## Self-supervised learning

---

class: end-slide, center
count: false

The end.