class: center, middle, title-slide count: false # Part 3: Convolutional layers. Siamese Networks .bold[Andrei Bursuc ]
.width-10[![](images/logo_valeo.png)] url: https://abursuc.github.io//slides/polytechnique/03_cnns_siamese.html .citation[ With slides from A. Karpathy, F. Fleuret, J. Johnson, S. Yeung, G. Louppe, Y. Avrithis ...] --- # Previously ## Neural networks ## Backpropagation --- # The neuron Inspired by neuroscience and the human brain, but the resemblance does not go very far .center[
] .credit[Slide credit: A. Karpathy] --- # Multi-layer perceptron Layers can be composed *in series*, such that: $$\begin{aligned} \mathbf{h}\_0 &= \mathbf{x} \\\\ \mathbf{h}\_1 &= \sigma(\mathbf{W}\_1^T \mathbf{h}\_0 + \mathbf{b}\_1) \\\\ ... \\\\ \mathbf{h}\_L &= \sigma(\mathbf{W}\_L^T \mathbf{h}\_{L-1} + \mathbf{b}\_L) \\\\ f(\mathbf{x}; \theta) &= \mathbf{h}\_L \end{aligned}$$ where $\theta$ denotes the model parameters $\\{ \mathbf{W}\_k, \mathbf{b}\_k, ... | k=1, ..., L\\}$. - This model is the **multi-layer perceptron**, also known as the _fully connected feedforward network_. - Optionally, the last activation $\sigma$ can be skipped to produce unbounded output values $\hat{y} \in \mathbb{R}$. --- class: center, middle .width-100[
] ## Computational graph view .credit[Figure credit: G. Louppe] --- class: center, middle .center.width-70[![](images/part3/mlp_alternative_1.png)] ## More canonical view (with units and connections) --- class: center, middle .center.width-70[![](images/part3/mlp_alternative_2.jpg)] ## More canonical view (with units and connections) --- # Element-wise activation functions - Each neuron/unit is followed by a dedicated activation function - Historically the _sigmoid_ and _tanh_ have been the most popular .center.width-50[![](images/part3/activation_functions.png)] - [Many other activation functions available](https://dashee87.github.io/data%20science/deep%20learning/visualising-activation-functions-in-neural-networks/) --- # Softmax - the $\text{softmax}$ function generalizes the sigmoid function and yields a vector of $k$ values in $[0, 1]$ by exponentiating and then normalizing $$\sigma(a) := \text{softmax}(a):= \frac{1}{\sum\_j e^{a\_j}} (e^{a\_1}, \dots, e^{a\_k})$$ - as activation values increase $\text{softmax}$ tends to focus on the maximum .center.width-80[![](images/part3/softmax_1.png)] .credit[Figure credit: Y. Avrithis] --- count:false # Softmax - the $\text{softmax}$ function generalizes the sigmoid function and yields a vector of $k$ values in $[0, 1]$ by exponentiating and then normalizing $$\sigma(a) := \text{softmax}(a):= \frac{1}{\sum\_j e^{a\_j}} (e^{a\_1}, \dots, e^{a\_k})$$ - as activation values increase $\text{softmax}$ tends to focus on the maximum .center.width-80[![](images/part3/softmax_2.png)] .credit[Figure credit: Y. Avrithis] --- count:false # Softmax - the $\text{softmax}$ function generalizes the sigmoid function and yields a vector of $k$ values in $[0, 1]$ by exponentiating and then normalizing $$\sigma(a) := \text{softmax}(a):= \frac{1}{\sum\_j e^{a\_j}} (e^{a\_1}, \dots, e^{a\_k})$$ - as activation values increase $\text{softmax}$ tends to focus on the maximum .center.width-80[![](images/part3/softmax_3.png)] .credit[Figure credit: Y. Avrithis] --- count:false # Softmax - the $\text{softmax}$ function generalizes the sigmoid function and yields a vector of $k$ values in $[0, 1]$ by exponentiating and then normalizing $$\sigma(a) := \text{softmax}(a):= \frac{1}{\sum\_j e^{a\_j}} (e^{a\_1}, \dots, e^{a\_k})$$ - as activation values increase $\text{softmax}$ tends to focus on the maximum .center.width-80[![](images/part3/softmax_4.png)] .credit[Figure credit: Y. Avrithis] --- count:false # Softmax - the $\text{softmax}$ function generalizes the sigmoid function and yields a vector of $k$ values in $[0, 1]$ by exponentiating and then normalizing $$\sigma(a) := \text{softmax}(a):= \frac{1}{\sum\_j e^{a\_j}} (e^{a\_1}, \dots, e^{a\_k})$$ - as activation values increase $\text{softmax}$ tends to focus on the maximum .center.width-80[![](images/part3/softmax_5.png)] .credit[Figure credit: Y. Avrithis] --- count:false # Softmax - the $\text{softmax}$ function generalizes the sigmoid function and yields a vector of $k$ values in $[0, 1]$ by exponentiating and then normalizing $$\sigma(a) := \text{softmax}(a):= \frac{1}{\sum\_j e^{a\_j}} (e^{a\_1}, \dots, e^{a\_k})$$ - as activation values increase $\text{softmax}$ tends to focus on the maximum .center.width-80[![](images/part3/softmax_6.png)] .credit[Figure credit: Y. 
Avrithis] --- count:false # Softmax - the $\text{softmax}$ function generalizes the sigmoid function and yields a vector of $k$ values in $[0, 1]$ by exponentiating and then normalizing $$\sigma(a) := \text{softmax}(a):= \frac{1}{\sum\_j e^{a\_j}} (e^{a\_1}, \dots, e^{a\_k})$$ - as activation values increase $\text{softmax}$ tends to focus on the maximum .center.width-80[![](images/part3/softmax_7.png)] .credit[Figure credit: Y. Avrithis] --- class: middle .center.width-50[![](images/part3/add_more_layers.jpg)] --- count:false # Why stacking layers is a good idea? .grid[ .kol-4-12[ .center.width-100[![](images/part3/spiral_raw.png)] .caption[The toy spiral data consists of three classes (blue, red, yellow) that are not linearly separable.] ] .kol-4-12[] .kol-4-12[] ] --- # Why stacking layers is a good idea? .grid[ .kol-4-12[ .center.width-100[![](images/part3/spiral_raw.png)] .caption[The toy spiral data consists of three classes (blue, red, yellow) that are not linearly separable.] ] .kol-4-12[ .center.width-100[![](images/part3/spiral_linear.png)] .caption[Linear classifier fails to learn the toy spiral dataset.] ] .kol-4-12[] ] --- # Why stacking layers is a good idea? .grid[ .kol-4-12[ .center.width-100[![](images/part3/spiral_raw.png)] .caption[The toy spiral data consists of three classes (blue, red, yellow) that are not linearly separable.] ] .kol-4-12[ .center.width-100[![](images/part3/spiral_linear.png)] .caption[Linear classifier fails to learn the toy spiral dataset.] ] .kol-4-12[ .center.width-100[![](images/part3/spiral_net.png)] .caption[2 layer neural network manages to solve it] ] ] --- # Backpropagation Consider a 1-dimensional output composition $f \circ g$, such that $$\begin{aligned} y &= f(\mathbf{u}) \\\\ \mathbf{u} &= g(x) = (g\_1(x), ..., g\_m(x)). \end{aligned}$$ The **chain rule** of total derivatives states that $$\frac{\text{d} y}{\text{d} x} = \sum\_{k=1}^m \frac{\partial y}{\partial u\_k} \underbrace{\frac{\text{d} u\_k}{\text{d} x}}\_{\text{recursive case}}$$ - Since a neural network is a composition of differentiable functions, the total derivatives of the loss can be evaluated by applying the chain rule recursively over its computational graph. - The implementation of this procedure is called (reverse) **automatic differentiation** (AD). .credit[Slide credit: G. Louppe] --- As a guiding example, let us consider a simplified 2-layer MLP and the following loss function: $$\begin{aligned} f(\mathbf{x}; \mathbf{W}\_1, \mathbf{W}\_2) &= \sigma\left( \mathbf{W}\_2^T \sigma\left( \mathbf{W}\_1^T \mathbf{x} \right)\right) \\\\ \mathcal{\ell}(y, \hat{y}; \mathbf{W}\_1, \mathbf{W}\_2) &= \text{cross\\_entropy}(y, \hat{y}) + \lambda \left( ||\mathbf{W}_1||\_2 + ||\mathbf{W}\_2||\_2 \right) \end{aligned}$$ for $\mathbf{x} \in \mathbb{R^p}$, $y \in \mathbb{R}$, $\mathbf{W}\_1 \in \mathbb{R}^{p \times q}$ and $\mathbf{W}\_2 \in \mathbb{R}^q$. .credit[Slide credit: G. Louppe] -- .center.width-60[
] --- The total derivative $\frac{\text{d} \ell}{\text{d} \mathbf{W}\_1}$ can be computed **backward**, by walking through all paths from $\ell$ to $\mathbf{W}\_1$ in the computational graph and accumulating the terms: $$\begin{aligned} \frac{\text{d} \ell}{\text{d} \mathbf{W}\_1} &= \frac{\partial \ell}{\partial u\_8}\frac{\text{d} u\_8}{\text{d} \mathbf{W}\_1} + \frac{\partial \ell}{\partial u\_4}\frac{\text{d} u\_4}{\text{d} \mathbf{W}\_1} \\\\ \frac{\text{d} u\_8}{\text{d} \mathbf{W}\_1} &= ... \end{aligned}$$ .center.width-60[
] .credit[Slide credit: G. Louppe] --- # Today ## Convolutional networks ## Siamese and triplet networks --- class: center, middle # Convolutional layers --- # Why would we need them? If they were handled as normal "unstructured" vectors, large-dimension signals such as sound samples or images would require models of intractable size. For instance a linear layer taking a $256 \times 256$ RGB image as input, and producing an image of same size would require: $$ (256 \times 256 \times 3)ˆ2 \simeq 3.87e+10$$ parameters, with the corresponding memory footprint ($\simeq$150Gb !), and excess of capacity. -- .center.width-40[![](images/part3/mlp_problem_1.png)] --- # Why would we need them? Moreover, this requirement is inconsistent with the intuition that such large signals have some "invariance in translation". __A representation meaningful at a certain location can / should be used everywhere.__ .credit[Slide credit: F. Fleuret] --- # Why would we need them? Moreover, this requirement is inconsistent with the intuition that such large signals have some "invariance in translation". __A representation meaningful at a certain location can / should be used everywhere.__ A convolutional layer embodies this idea. It applies the same linear transformation locally, everywhere, and preserves the signal structure. .credit[Slide credit: F. Fleuret] --- # Why would we need them? - One neuron gets specialized for detecting a full-image pattern, while being sensible to translations .center.width-70[![](images/part3/mlp_problem_1.png)] --- # Why would we need them? - Each neuron gets specialized for detecting a full-image pattern. - Neurons from later layer work similarly - This is a big waste of parameters without good performance. .center.width-40[![](images/part3/mlp_problem_2.png)] --- # Why would we need them? .center.width-100[![](images/part3/sound_wave.png)] --- # Why would we need them? .center.width-70[![](images/part3/cat_image.jpg)] --- class: middle, black-slide .center.width-100[![](images/part3/cat_image_zoom.png)] --- # Fully connected layer .center.width-30[![](images/part3/fc.png)] In a **fully connected layer**, each hidden unit $h\_j = \sigma(\mathbf{w}\_j^T \mathbf{x}+b\_j)$ is connected to the entire image. - Looking for activations that depend on pixels that are spatially far away is supposedly a waste of time and resources. - Long range correlations can be dealt with in the higher layers. --- # Locally connected layer .center.width-30[![](images/part3/lc.png)] In a **locally connected layer**, each hidden unit $h\_j$ is connected to only a patch of the image. - Weights are specialized locally and functionally. - Reduce the number of parameters. - What if the object in the image shifts a little? --- # Convolutional layer .center.width-30[![](images/part3/conv-layer.png)] In a **convolutional layer**, each hidden unit $h\_j$ is connected to only a patch of the image, and **share** its weights with the other units $h\_i$. - Weights are specialized functionally, regardless of spatial location. - Reduce the number of parameters. --- # Convolution For one-dimensional tensors, given an input vector $\mathbf{x} \in \mathbb{R}^W$ and a convolutional kernel $\mathbf{u} \in \mathbb{R}^w$, the discrete **convolution** $\mathbf{x} \star \mathbf{u}$ is a vector of size $W - w + 1$ such that $$\begin{aligned} (\mathbf{x} \star \mathbf{u})\_i &= \sum\_{m=0}^{w-1} x\_{i+m} u\_m . \end{aligned} $$ Technically, $\star$ denotes the cross-correlation operator. 
However, most machine learning libraries call it convolution. --- # Convolution 1d .center.width-60[![](images/part3/conv1d_1.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 1d .center.width-60[![](images/part3/conv1d_2.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 1d .center.width-60[![](images/part3/conv1d_3.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 1d .center.width-60[![](images/part3/conv1d_4.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 1d .center.width-60[![](images/part3/conv1d_5.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 1d .center.width-60[![](images/part3/conv1d_6.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 1d .center.width-60[![](images/part3/conv1d_7.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 1d .center.width-60[![](images/part3/conv1d_8.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 1d .center.width-60[![](images/part3/conv1d_9.png)] .credit[Slide credit: F. Fleuret] --- # Convolution Convolutions generalize to multi-dimensional tensors: - In its most usual form, a convolution takes as input a 3D tensor $\mathbf{x} \in \mathbb{R}^{C \times H \times W}$, called the input feature map. - A kernel $\mathbf{u} \in \mathbb{R}^{C \times h \times w}$ slides across the input feature map, along its height and width. The size $h \times w$ is called the receptive field. - At each location, the element-wise product between the kernel and the input elements it overlaps is computed and the results are summed up. --- # Convolution - The final output $\mathbf{o}$ is a 2D tensor of size $(H-h+1) \times (W-w+1)$ called the output feature map and such that: $$\begin{aligned} \mathbf{o}\_{j,i} &= \mathbf{b}\_{j,i} + \sum\_{c=0}^{C-1} (\mathbf{x}\_c \star \mathbf{u}\_c)\_{j,i} = \mathbf{b}\_{j,i} + \sum\_{c=0}^{C-1} \sum\_{n=0}^{h-1} \sum\_{m=0}^{w-1} \mathbf{x}\_{c,j+n,i+m} \mathbf{u}\_{c,n,m} \end{aligned}$$ where $\mathbf{u}$ and $\mathbf{b}$ are shared parameters to learn. - $D$ convolutions can be applied in the same way to produce a $D \times (H-h+1) \times (W-w+1)$ feature map, where $D$ is the depth. --- # Convolution 2d .center.width-60[![](images/part3/conv2d_1.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 2d .center.width-60[![](images/part3/conv2d_2.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 2d .center.width-60[![](images/part3/conv2d_3.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 2d .center.width-60[![](images/part3/conv2d_4.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 2d .center.width-60[![](images/part3/conv2d_5.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 2d .center.width-60[![](images/part3/conv2d_6.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 2d .center.width-60[![](images/part3/conv2d_7.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 2d .center.width-60[![](images/part3/conv2d_8.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 2d .center.width-60[![](images/part3/conv2d_9.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 2d .center.width-60[![](images/part3/conv2d_10.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 2d .center.width-60[![](images/part3/conv2d_11.png)] .credit[Slide credit: F. 
Fleuret] --- count: false # Convolution 2d .center.width-60[![](images/part3/conv2d_12.png)] .credit[Slide credit: F. Fleuret] --- count: false # Convolution 2d .center.width-60[![](images/part3/conv2d_13.png)] .credit[Slide credit: F. Fleuret] --- # Image kernels explained visually .center.width-80[![](images/part3/conv_illustrated.png)] .citation[url: http://setosa.io/ev/image-kernels/] --- # Channels Colored image = tensor of shape `(height, width, channels)` -- Convolutions can be computed across channels: .center[
] -- $$ (im \star k)\_{j,i} = \sum\limits\_{c=0}^2 \sum\limits\_{n=0}^4 \sum\limits\_{m=0}^4 im\_{j + n - 2, i + m - 2, c} k\_{n, m, c} $$ --- # Channels - For the first layer, the RGB channels of the input image can be easily visualized - The number of channels is typically increased at deeper levels of the network .center[
] --- # Convolutions as neurons - Since convolutions output one scalar at a time, they can be seen as an individual neuron from an MLP with a receptive field limited to the dimensions of the kernel - The same neuron is "fired" over multiple areas of the input. .center[
] --- # Convolutions as neurons - Since convolutions output one scalar at a time, they can be seen as an individual neuron from an MLP with a receptive field limited to the dimensions of the kernel - The same neuron is "fired" over multiple areas of the input. .left-column[
.center[
] ] .right-column[ .center[.green[Remember this?]] .center[
] ] --- # Convolutions as neurons - Since convolutions output one scalar at a time, they can be seen as an individual neuron from an MLP with a receptive field limited to the dimensions of the kernel - The same neuron is "fired" over multiple areas of the input. .left-column[
.center[
] ] .right-column[ .center[.green[Remember this?]] .center[
] ] --- class: middle We usually refer to one of the channels generated by a convolution layer as an __activation map__, and to the sub-area of an input map that influences a component of the output as the __receptive field__ of the latter. In the context of convolutional networks, a standard linear layer is called a __fully connected layer__ since every input influences every output. --- # Strides - Strides: increment step size for the convolution operator - Reduces the size of the output map .center[
] .caption[ Example with kernel size $3 \times 3$ and a stride of $2$ (image in blue) ] --- # Padding - Padding: artificially fill the borders of the image - Useful to keep the spatial dimensions constant across filters - Useful with strides and large receptive fields - Usually: fill with 0s .center[
] --- count: false # Padding - Example: input $C \times 3 \times 5$ .center[
] .credit[Figure credit: F. Fleuret] --- count: false # Padding - Example: input $C \times 3 \times 5$, padding of $(2,1)$ .center[
] .credit[Figure credit: F. Fleuret] --- count: false # Padding - Example: input $C \times 3 \times 5$, padding of $(2,1)$, a stride of $(2,2)$ .center[
] .credit[Figure credit: F. Fleuret] --- count: false # Padding - Example: input $C\times3\times5$, padding of $(2,1)$, a stride of $(2,2)$, kernel of size $C\times3\times5$ .center[
] .credit[Figure credit: F. Fleuret] --- count: false # Padding - Example: input $C\times3\times5$, padding of $(2,1)$, a stride of $(2,2)$, kernel of size $C\times3\times5$ .center[
] .credit[Figure credit: F. Fleuret] --- count: false # Padding - Example: input $C\times3\times5$, padding of $(2,1)$, a stride of $(2,2)$, kernel of size $C\times3\times5$ .center[
] .credit[Figure credit: F. Fleuret] --- count: false # Padding - Example: input $C\times3\times5$, padding of $(2,1)$, a stride of $(2,2)$, kernel of size $C\times3\times5$ .center[
] .credit[Figure credit: F. Fleuret] --- count: false # Padding - Example: input $C\times3\times5$, padding of $(2,1)$, a stride of $(2,2)$, kernel of size $C\times3\times5$ .center[
] .credit[Figure credit: F. Fleuret] --- count: false # Padding - Example: input $C\times3\times5$, padding of $(2,1)$, a stride of $(2,2)$, kernel of size $C\times3\times5$ .center[
] .credit[Figure credit: F. Fleuret] --- count: false # Padding - Example: input $C\times3\times5$, padding of $(2,1)$, a stride of $(2,2)$, kernel of size $C\times3\times5$ .center[
] .credit[Figure credit: F. Fleuret] --- count: false # Padding - Example: input $C\times3\times5$, padding of $(2,1)$, a stride of $(2,2)$, kernel of size $C\times3\times5$ .center[
] .credit[Figure credit: F. Fleuret] --- count: false # Padding - Example: input $C\times3\times5$, padding of $(2,1)$, a stride of $(2,2)$, kernel of size $C\times3\times5$ .center[
] .credit[Figure credit: F. Fleuret] --- count: false # Padding - Example: input $C\times3\times5$, padding of $(2,1)$, a stride of $(2,2)$, kernel of size $C\times3\times5$ .center[
] .credit[Figure credit: F. Fleuret] --- count: false # Padding - Example: input $C\times3\times5$, padding of $(2,1)$, a stride of $(2,2)$, kernel of size $C\times3\times5$ .center[
] - Pooling operations have a default stride equal to their kernel size, and convolutions have a default stride of 1. - Padding can be useful to generate an output of the same size as the input. .credit[Figure credit: F. Fleuret] ---
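# Padding and strides: checking the shapes

A minimal sketch to sanity-check the walkthrough above, assuming the usual output-size formula $\lfloor (n + 2p - k)/s \rfloor + 1$ and using `torch.nn.functional.conv2d` (detailed on the next slide). The tensors `x` and `u` are random placeholders.

```py
import torch
import torch.nn.functional as F

def conv_output_size(n, k, p, s):
    # floor((n + 2*p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

C = 4                               # arbitrary number of channels
x = torch.randn(1, C, 3, 5)         # one input of size C x 3 x 5
u = torch.randn(1, C, 3, 5)         # a single kernel of size C x 3 x 5

y = F.conv2d(x, u, stride=(2, 2), padding=(2, 1))
print(y.shape)                      # torch.Size([1, 1, 3, 2])

print(conv_output_size(3, 3, 2, 2),   # 3
      conv_output_size(5, 5, 1, 2))   # 2
```

---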
.big[
```py
torch.nn.functional.conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1)
```
]

Implements a 2d convolution, where _weight_ contains the kernels and is of dimension $D \times C \times h \times w$, _bias_ is of dimension $D$, _input_ is of dimension $$N \times C \times H \times W $$ and the result is of dimension: $$ N \times D \times (H-h+1) \times (W-w+1) $$

```
>>> weight = torch.empty(5, 4, 2, 3).normal_()
>>> bias = torch.empty(5).normal_()
>>> input = torch.empty(117, 4, 10, 3).normal_()
>>> output = torch.nn.functional.conv2d(input, weight, bias)
>>> output.size()
torch.Size([117, 5, 9, 1])
```

--

Similar functions implement 1d and 3d convolutions. ---
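# Convolution from the definition

A naive reference sketch of the output-map formula from the earlier slide, checked against `torch.nn.functional.conv2d` (single sample, no padding or stride); `conv2d_naive` is a hypothetical helper written only for this illustration.

```py
import torch
import torch.nn.functional as F

def conv2d_naive(x, u, b):
    # x: (C, H, W) input, u: (D, C, h, w) kernels, b: (D,) biases
    C, H, W = x.shape
    D, _, h, w = u.shape
    out = torch.empty(D, H - h + 1, W - w + 1)
    for d in range(D):
        for j in range(H - h + 1):
            for i in range(W - w + 1):
                # o[d, j, i] = b[d] + sum_c sum_n sum_m x[c, j+n, i+m] * u[d, c, n, m]
                out[d, j, i] = b[d] + (x[:, j:j+h, i:i+w] * u[d]).sum()
    return out

x = torch.randn(3, 8, 10)
u = torch.randn(5, 3, 2, 3)
b = torch.randn(5)

ref = F.conv2d(x[None], u, b)[0]   # add, then drop, the batch dimension
print(torch.allclose(conv2d_naive(x, u, b), ref, atol=1e-5))  # True
```

---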
.big[
```py
class torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)
```
]

Wraps the convolution into a _Module_, with the kernel and bias as _Parameters_, properly randomized at creation. The kernel size is either a pair $(h,w)$ or a single value $k$ interpreted as $(k,k)$.

--

```py
>>> f = nn.Conv2d(in_channels = 4, out_channels = 5, kernel_size = (2, 3))
>>> for n, p in f.named_parameters(): print(n, p.size())
...
weight torch.Size([5, 4, 2, 3])
bias torch.Size([5])
>>> x = torch.empty(117, 4, 10, 3).normal_()
>>> y = f(x)
>>> y.size()
torch.Size([117, 5, 9, 1])
```

--- # Convolutions explained visually .center.width-30[![](images/part3/conv_viz.png)] .citation[url: https://ezyang.github.io/convolution-visualizer/index.html] --- # Dealing with shapes Kernel shape $(F, F, C^i, C^o)$ .left-column[ - $F \times F$ kernel size, - $C^i$ input channels - $C^o$ output channels ] .right-column[ .center[
] ] -- .reset-column[ ] Number of parameters: $(F \times F \times C^i + 1) \times C^o$ -- Activation shapes: - Input $(W^i, H^i, C^i)$ - Output $(W^o, H^o, C^o)$ -- $W^o = (W^i - F + 2P) / S + 1$ .credit[Slide credit: C. Ollion & O. Grisel] --- # Convolutions 1x1 convolution layers: aggregating pixel information from all feature maps
.center[
] --- # Convolutions - A bank of 256 filters (learned from data) - Each filter is 1d (it applies to a grayscale image) - Each filter is 16 x 16 pixels .center[
] --- # Convolutions - A bank of 256 filters (learned from data) - 3D filters for RGB inputs .center[
] --- # Downsampling - Downsampling by a factor $S$ amounts to keeping only one out of every $S$ pixels, discarding the others - Filter banks often incorporate or are followed by __2x__ output downsampling - Downsampling is often matched with an increase in the number of feature channels - Overall, the volume of the tensors decreases slowly .center[
] --- # Spatial pooling .center[
] --- # Pooling - Spatial dimension reduction - Local invariance - No parameters: max or average of 2x2 units .center[
] --- # Pooling - Spatial dimension reduction - Local invariance - No parameters: max or average of 2x2 units .center[
] --- # Max-Pooling 1d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 1d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 1d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 1d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 1d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 1d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 1d .center[
] .credit[Slide credit: F. Fleuret] --- # Max-Pooling 2d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 2d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 2d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 2d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 2d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 2d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 2d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 2d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 2d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 2d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 2d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 2d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 2d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 2d .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Max-Pooling 2d .center[
] .credit[Slide credit: F. Fleuret] --- # Translation invariance from pooling .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Translation invariance from pooling .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Translation invariance from pooling .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Translation invariance from pooling .center[
] .credit[Slide credit: F. Fleuret] --- count: false # Translation invariance from pooling .center[
] .credit[Slide credit: F. Fleuret] ---
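# Translation invariance: a toy check

A minimal sketch of the effect illustrated above, using `torch.nn.functional.max_pool2d` (detailed on the next slides). Shifting this toy input by one pixel leaves the pooled output unchanged; the invariance is only approximate and only holds for small translations.

```py
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 1, 8)     # a 1d signal viewed as a 1 x 8 image
x[..., 2] = 1.0                 # a single activation at position 2

x_shifted = torch.roll(x, shifts=1, dims=-1)   # shift it by one pixel

print(F.max_pool2d(x, (1, 2)))          # tensor([[[[0., 1., 0., 0.]]]])
print(F.max_pool2d(x_shifted, (1, 2)))  # tensor([[[[0., 1., 0., 0.]]]])
```

---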
.big[
```py
torch.nn.functional.max_pool2d(input, kernel_size, stride=None, padding=0, dilation=1, ceil_mode=False, return_indices=False)
```
]

Takes as input an $N \times C \times H \times W$ tensor and a kernel size $(h,w)$ or $k$ interpreted as $(k,k)$. It applies the max-pooling on each channel of each sample separately and produces, if the padding is $0$, an $N \times C \times \lfloor H/h \rfloor \times \lfloor W/w \rfloor$ output.

--

```py
>>> x = torch.empty(2, 2, 6).random_(3)
>>> x
tensor([[[ 1., 2., 2., 1., 2., 1.],
         [ 2., 0., 0., 0., 1., 0.]],

        [[ 2., 0., 2., 1., 1., 1.],
         [ 0., 0., 0., 1., 2., 1.]]])
>>> F.max_pool2d(x, (1, 2))
tensor([[[ 2., 2., 2.],
         [ 2., 0., 1.]],

        [[ 2., 2., 1.],
         [ 0., 1., 2.]]])
```

Similar functions implement 1d and 3d max-pooling, as well as average pooling. --- class: middle As for convolution, pooling operations can be modulated through their stride and padding. While for convolution the default stride is 1, for pooling it is equal to the kernel size, but this is not obligatory. Default padding is zero. ---
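# Default strides: convolution vs. pooling

A small sketch of the previous point, on an assumed $16 \times 16$ random input: with default settings, a $2 \times 2$ convolution slides with stride 1 while a $2 \times 2$ max-pooling slides with stride 2, and both can be overridden explicitly.

```py
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 16, 16)
w = torch.randn(8, 3, 2, 2)

# convolution: default stride is 1
print(F.conv2d(x, w).shape)                 # torch.Size([1, 8, 15, 15])

# pooling: default stride equals the kernel size
print(F.max_pool2d(x, 2).shape)             # torch.Size([1, 3, 8, 8])

# the stride can be set explicitly, e.g. overlapping pooling
print(F.max_pool2d(x, 2, stride=1).shape)   # torch.Size([1, 3, 15, 15])
```

---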
.big[ ```py class torch.nn.MaxPool2d(kernel_size, stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=False) ``` ] Wraps the max-pooling operation into a _Module_. As for convolutions, the kernel size is either a pair $(h,w)$ or a single value $k$ interpreted as $(k,k)$. --- # Layer patterns A **convolutional network** can often be defined as a composition of convolutional layers ($\texttt{CONV}$), pooling layers ($\texttt{POOL}$), linear rectifiers ($\texttt{RELU}$) and fully connected layers ($\texttt{FC}$).
.center.width-70[![](images/part3/convnet-pattern.png)] --- class: middle The most common convolutional network architecture follows the pattern: $$\texttt{INPUT} \to [[\texttt{CONV} \to \texttt{RELU}]\texttt{\*}N \to \texttt{POOL?}]\texttt{\*}M \to [\texttt{FC} \to \texttt{RELU}]\texttt{\*}K \to \texttt{FC}$$ where: - $\texttt{\*}$ indicates repetition; - $\texttt{POOL?}$ indicates an optional pooling layer; - $N \geq 0$ (and usually $N \leq 3$), $M \geq 0$, $K \geq 0$ (and usually $K < 3$); - the last fully connected layer holds the output (e.g., the class scores). --- class: middle Some common architectures for convolutional networks following this pattern include: - $\texttt{INPUT} \to \texttt{FC}$, which implements a linear classifier ($N=M=K=0$). - $\texttt{INPUT} \to [\texttt{FC} \to \texttt{RELU}]{\*K} \to \texttt{FC}$, which implements a $K$-layer MLP. - $\texttt{INPUT} \to \texttt{CONV} \to \texttt{RELU} \to \texttt{FC}$. - $\texttt{INPUT} \to [\texttt{CONV} \to \texttt{RELU} \to \texttt{POOL}]\texttt{\*2} \to \texttt{FC} \to \texttt{RELU} \to \texttt{FC}$. - $\texttt{INPUT} \to [[\texttt{CONV} \to \texttt{RELU}]\texttt{\*2} \to \texttt{POOL}]\texttt{\*3} \to [\texttt{FC} \to \texttt{RELU}]\texttt{\*2} \to \texttt{FC}$. Note that for the last architecture, two $\texttt{CONV}$ layers are stacked before every $\texttt{POOL}$ layer. This is generally a good idea for larger and deeper networks, because multiple stacked $\texttt{CONV}$ layers can develop more complex features of the input volume before the destructive pooling operation. --- # ConvNet - Neural network with specialized connectivity structure - Stack multiple stage of feature extractors - Higher stages compute more global, more invariant features - Classification layer at the end
.center.width-90[![](images/part3/lenet5.png)] .citation[LeCun, LeNet-5, 1998] --- # ConvNet .center[.Q[Remember this?]]
.width-100[
] .credit[Figure credit: G. Louppe] --- # ConvNet Just like multi-layer perceptorns, convolutional layers can be also composed *in series*, such that: $$\begin{aligned} \mathbf{h}\_0 &= \mathbf{x} \\\\ \mathbf{h}\_1 &= \sigma(\mathbf{W}\_1^T \mathbf{h}\_0 + \mathbf{b}\_1) \\\\ ... \\\\ \mathbf{h}\_L &= \sigma(\mathbf{W}\_L^T \mathbf{h}\_{L-1} + \mathbf{b}\_L) \\\\ f(\mathbf{x}; \theta) &= \mathbf{h}\_L \end{aligned}$$ where $\theta$ denotes the model parameters $\\{ \mathbf{W}\_k, \mathbf{b}\_k, ... | k=1, ..., L\\}$. - Hidden states $\mathbf{h}\_k$ have $2D$ layout and are called _feature maps_ (for MLPs they are $1D$) - Each filter has its own trainable parameters $\mathbf{W}\_k$. - Weights from all filters and layers are trained jointly with backpropagation and gradient descent --- # ConvNet A convolutional layer is composed of convolution, activation and downsampling layers. .center[
] --- class: center, middle .width-100[![](images/part3/convnet.gif)] --- class: center, middle .width-100[![](images/part3/cnn_cs231n.png)] --- # Receptive field - The receptive field is defined as the region in the input space that a particular CNN feature is looking at (_i.e._, is affected by). - The receptive field of a feature can be fully described by its center location and its size - Example: $k = 3\times3; p = 1\times1; s = 2\times2; input = 3\times3$ .center[
] .left-column[ .tiny[A common way to visualize a CNN feature map.] ] .right-column[ .tiny[Fixed-sized CNN feature map visualization, where the size of each feature map is fixed, and the feature is located at the center of its receptive field.] ] --- # Receptive field - The receptive field is defined as the region in the input space that a particular CNN feature is looking at (_i.e._, is affected by). - The receptive field of a feature can be fully described by its center location and its size - Example: $k = 3\times3; p = 1\times1; s = 2\times2; input = 7\times7$ .center[
] --- # Receptive field - The receptive field is defined as the region in the input space that a particular CNN feature is looking at (_i.e._, is affected by). - The receptive field of a feature can be fully described by its center location and its size .center[
] .center[.tiny[Receptive fields for convolutional and pooling layers of VGG-16]] --- class: center, middle # Siamese Networks --- # Siamese networks .center[
] - **Recognition:** given a face, classify among K possible persons - **Verification:** verify that two faces belong to the same person. A verification system can be implemented as a similarity measure. If it is really good, it is also useful for recognition. --- # Siamese architecture .center.width-60[![](images/part3/siamese_1.png)] - an input sample is a _pair_ $( \mathbf{x}\_i, \mathbf{x}\_j)$ .citation[ S. Chopra et al., Learning a similarity metric discriminatively, with application to face verification, CVPR 2005] --- # Siamese architecture .center.width-60[![](images/part3/siamese_2.png)] - an input sample is a _pair_ $( \mathbf{x}\_i, \mathbf{x}\_j)$ - both $\mathbf{x}\_i$, $\mathbf{x}\_j$ go through the _same_ function $f$ with _shared_ parameters $\theta$ .citation[ S. Chopra et al., Learning a similarity metric discriminatively, with application to face verification, CVPR 2005] --- # Siamese architecture .center.width-60[![](images/part3/siamese_3.png)] - an input sample is a _pair_ $( \mathbf{x}\_i, \mathbf{x}\_j)$ - both $\mathbf{x}\_i$, $\mathbf{x}\_j$ go through the _same_ function $f$ with _shared_ parameters $\theta$ - loss $\ell\_{ij}$ is measured on output pair $( \mathbf{y}\_i, \mathbf{y}\_j)$ and target $t\_{ij}$ .citation[ S. Chopra et al., Learning a similarity metric discriminatively, with application to face verification, CVPR 2005] --- # Contrastive loss .center.width-60[![](images/part3/contrastive_1.png)] - input samples $\mathbf{x}\_i$, output vectors $\mathbf{y}\_i= f(\mathbf{x}\_i; \theta)$, target variables $t\_{ij}=\mathbb{1}[\text{sim}(\mathbf{x}\_i, \mathbf{x}\_j)]$ - _contrastive loss_ is a function of distance $\Vert \mathbf{y}\_i - \mathbf{y}\_j \Vert$ only $$\ell\_{ij} = L((\mathbf{y}\_i,\mathbf{y}\_j), t\_{ij} ) = \ell (\Vert \mathbf{y}\_i - \mathbf{y}\_j \Vert, t\_{ij})$$ .citation[ R. Hadsell et al., Dimensionality reduction by learning an invariant mapping, CVPR 2006] --- # Contrastive loss .center.width-60[![](images/part3/contrastive_2.png)] - input samples $\mathbf{x}\_i$, output vectors $\mathbf{y}\_i= f(\mathbf{x}\_i; \theta)$, target variables $t\_{ij}=\mathbb{1}[\text{sim}(\mathbf{x}\_i, \mathbf{x}\_j)]$ - _contrastive loss_ is a function of distance $\Vert \mathbf{y}\_i - \mathbf{y}\_j \Vert$ only $$\ell\_{ij} = L((\mathbf{y}\_i,\mathbf{y}\_j), t\_{ij} ) = \ell (\Vert \mathbf{y}\_i - \mathbf{y}\_j \Vert, t\_{ij})$$ - _similar_ samples are _attracted_ $$\ell(x,t)=\textcolor{red}{t\ell^{+}(x)} + (1-t)\ell^{-}(x) = \textcolor{red}{t x^2} + (1-t)[m-x]^2\_+$$ .citation[ R. Hadsell et al., Dimensionality reduction by learning an invariant mapping, CVPR 2006] --- # Contrastive loss .center.width-60[![](images/part3/contrastive_3.png)] - input samples $\mathbf{x}\_i$, output vectors $\mathbf{y}\_i= f(\mathbf{x}\_i; \theta)$, target variables $t\_{ij}=\mathbb{1}[\text{sim}(\mathbf{x}\_i, \mathbf{x}\_j)]$ - _contrastive loss_ is a function of distance $\Vert \mathbf{y}\_i - \mathbf{y}\_j \Vert$ only $$\ell\_{ij} = L((\mathbf{y}\_i,\mathbf{y}\_j), t\_{ij} ) = \ell (\Vert \mathbf{y}\_i - \mathbf{y}\_j \Vert, t\_{ij})$$ - _dissimilar_ samples are _repelled_ if closer than margin $m$ $$\ell(x,t)=t\ell^{+}(x) + (1-t)\textcolor{red}{\ell^{-}(x)} = t x^2 + (1-t)\textcolor{red}{[m-x]^2\_+}$$ .citation[ R.
Hadsell et al., Dimensionality reduction by learning an invariant mapping, CVPR 2006] --- # Triplet architecture .grid[ .kol-6-12[ .center.width-100[![](images/part3/triplet_1.png)] ] .kol-6-12[ - an input sample is a _triple_ $(\mathbf{x}\_i, \mathbf{x}\_i^+, \mathbf{x}\_i^-)$ ] ] .citation[ Wang et al., Learning fine-grained image similarity with deep ranking, CVPR 2014] --- # Triplet architecture .grid[ .kol-6-12[ .center.width-100[![](images/part3/triplet_2.png)] ] .kol-6-12[ - an input sample is a _triple_ $(\mathbf{x}\_i, \mathbf{x}\_i^+, \mathbf{x}\_i^-)$ - $\mathbf{x}\_i, \mathbf{x}\_i^+, \mathbf{x}\_i^-$ go through the _same_ function $f$ with _shared_ parameters $\theta$ ] ] .citation[ Wang et al., Learning fine-grained image similarity with deep ranking, CVPR 2014] --- # Triplet architecture .grid[ .kol-6-12[ .center.width-100[![](images/part3/triplet_3.png)] ] .kol-6-12[ - an input sample is a _triple_ $(\mathbf{x}\_i, \mathbf{x}\_i^+, \mathbf{x}\_i^-)$ - $\mathbf{x}\_i, \mathbf{x}\_i^+, \mathbf{x}\_i^-$ go through the _same_ function $f$ with _shared_ parameters $\theta$ - loss $\ell\_i$ is measured on output triple $(\mathbf{y}\_i, \mathbf{y}\_i^+, \mathbf{y}\_i^-)$ ] ] .citation[ Wang et al., Learning fine-grained image similarity with deep ranking, CVPR 2014] --- # Triplet loss - input _anchor_ $\mathbf{x}\_i$, output vector $\mathbf{y}\_i = f(\mathbf{x}\_i; \theta)$ - positive $\mathbf{y}\_i^+ = f(\mathbf{x}\_i^+; \theta)$, negative $\mathbf{y}\_i^- = f(\mathbf{x}\_i^-; \theta)$ - _triplet loss_ is a function of distances $\Vert \mathbf{y}\_i - \mathbf{y}\_i^+ \Vert$, $\Vert \mathbf{y}\_i - \mathbf{y}\_i^- \Vert$ only $$\ell\_i = L(\mathbf{y}\_i, \mathbf{y}\_i^+, \mathbf{y}\_i^-) =\ell(\Vert \mathbf{y}\_i - \mathbf{y}\_i^+ \Vert, \Vert \mathbf{y}\_i - \mathbf{y}\_i^- \Vert)$$ $$\ell(x^+, x^-) = [m + (x^+)^2 - (x^-)^2]\_+$$ so distance $\Vert \mathbf{y}\_i - \mathbf{y}\_i^+ \Vert$ should be less than $\Vert \mathbf{y}\_i - \mathbf{y}\_i^- \Vert$ by _margin_ $m$ .citation[ Wang et al., Learning fine-grained image similarity with deep ranking, CVPR 2014] --- # Triplet loss - input _anchor_ $\mathbf{x}\_i$, output vector $\mathbf{y}\_i = f(\mathbf{x}\_i; \theta)$ - positive $\mathbf{y}\_i^+ = f(\mathbf{x}\_i^+; \theta)$, negative $\mathbf{y}\_i^- = f(\mathbf{x}\_i^-; \theta)$ - _triplet loss_ is a function of distances $\Vert \mathbf{y}\_i - \mathbf{y}\_i^+ \Vert$, $\Vert \mathbf{y}\_i - \mathbf{y}\_i^- \Vert$ only $$\ell\_i = L(\mathbf{y}\_i, \mathbf{y}\_i^+, \mathbf{y}\_i^-) =\ell(\Vert \mathbf{y}\_i - \mathbf{y}\_i^+ \Vert, \Vert \mathbf{y}\_i - \mathbf{y}\_i^- \Vert)$$ $$\ell(x^+, x^-) = [m + (x^+)^2 - (x^-)^2]\_+$$ so distance $\Vert \mathbf{y}\_i - \mathbf{y}\_i^+ \Vert$ should be less than $\Vert \mathbf{y}\_i - \mathbf{y}\_i^- \Vert$ by _margin_ $m$ - by taking _two pairs_ $(\mathbf{x}\_i, \mathbf{x}\_i^+)$ and $(\mathbf{x}\_i, \mathbf{x}\_i^-)$ at a time with targets $1$, $0$ respectively, the _contrastive loss_ can be written similarly $$\ell(x^+, x^-) = (x^+)^2 + [m-x^-]^2\_+$$ so distance $\Vert \mathbf{y}\_i - \mathbf{y}\_i^+ \Vert$ should be small and $\Vert \mathbf{y}\_i - \mathbf{y}\_i^- \Vert$ larger than $m$ .citation[ Wang et al., Learning fine-grained image similarity with deep ranking, CVPR 2014] --- # Siamese networks .grid[ .kol-6-12[
```
class SiameseNet(nn.Module):
    def __init__(self, embedding_net):
        super(SiameseNet, self).__init__()
        self.embedding_net = embedding_net

    def forward(self, x1, x2):
        output1 = self.embedding_net(x1)
        output2 = self.embedding_net(x2)
        return output1, output2

    def get_embedding(self, x):
        return self.embedding_net(x)
```
] .kol-6-12[
```
class TripletNet(nn.Module):
    def __init__(self, embedding_net):
        super(TripletNet, self).__init__()
        self.embedding_net = embedding_net

    def forward(self, x1, x2, x3):
        output1 = self.embedding_net(x1)
        output2 = self.embedding_net(x2)
        output3 = self.embedding_net(x3)
        return output1, output2, output3

    def get_embedding(self, x):
        return self.embedding_net(x)
```
] ] --- count:false # Siamese networks .grid[ .kol-6-12[
```
class SiameseNet(nn.Module):
    def __init__(self, embedding_net):
        super(SiameseNet, self).__init__()
        self.embedding_net = embedding_net

    def forward(self, x1, x2):
*        output1 = self.embedding_net(x1)
*        output2 = self.embedding_net(x2)
        return output1, output2

    def get_embedding(self, x):
        return self.embedding_net(x)
```
] .kol-6-12[
```
class TripletNet(nn.Module):
    def __init__(self, embedding_net):
        super(TripletNet, self).__init__()
        self.embedding_net = embedding_net

    def forward(self, x1, x2, x3):
*        output1 = self.embedding_net(x1)
*        output2 = self.embedding_net(x2)
*        output3 = self.embedding_net(x3)
        return output1, output2, output3

    def get_embedding(self, x):
        return self.embedding_net(x)
```
] ] --- # Hard negative sampling After a few epochs, if $(x\_i, x\_i^{+}, x\_i^{-})$ are chosen randomly, it will be easy to satisfy the inequality in the loss. -- Gradients in a batch quickly become almost $0$ except for **hard cases**. Random sampling is inefficient at finding these hard cases -- - **Hard triplet sampling:** sample $x\_i^{-}$ such that: $||f(x\_i) - f(x\_i^{+})||_2 > ||f(x\_i) - f(x\_i^{-})||_2 + \alpha$ - **Semi-hard triplet sampling:** sample $x\_i^{-}$ such that: $||f(x\_i) - f(x\_i^{+})||_2 > ||f(x\_i) - f(x\_i^{-})||_2$ --- # Face recognition .center.width-30[![](images/part3/facenet.png)] - A threshold is computed on the test set ($1.2$) - Best model achieves 99.6% verification accuracy on LFW - Face alignment is critical! .citation[F. Schroff et al., Facenet: A unified embedding for face recognition and clustering, CVPR 2015 ] --- # Deep image retrieval .center.width-70[![](images/part3/dir_1.png)] .citation[A. Gordo et al., Deep Image Retrieval: Learning Global Representations for Image Search, ECCV 2016 ] - query $\mathbf{x}\_i$, relevant $\mathbf{x}\_i^+$ (same building), irrelevant $\mathbf{x}\_i^-$ (other building) --- # Deep image retrieval .center.width-70[![](images/part3/dir_2.png)] .citation[A. Gordo et al., Deep Image Retrieval: Learning Global Representations for Image Search, ECCV 2016 ] - query $\mathbf{x}\_i$, relevant $\mathbf{x}\_i^+$ (same building), irrelevant $\mathbf{x}\_i^-$ (other building) --- # Deep image retrieval .center.width-70[![](images/part3/dir_3.png)] .citation[A. Gordo et al., Deep Image Retrieval: Learning Global Representations for Image Search, ECCV 2016 ] - query $\mathbf{x}\_i$, relevant $\mathbf{x}\_i^+$ (same building), irrelevant $\mathbf{x}\_i^-$ (other building) - triplet loss is evaluated on output $(\mathbf{y}\_i, \mathbf{y}\_i^+, \mathbf{y}\_i^-)$ --- # Patch matching .center.width-50[![](images/part3/phototour.png)] --- # Patch matching .center.width-70[![](images/part3/local_descriptors_2.png)] .credit[Figure credit: A. Vedaldi] --- # Patch matching .center.width-60[![](images/part3/local_descriptors_8.png)] .credit[Figure credit: A. Vedaldi] --- # Image reconstruction .center.width-70[![](images/part3/sfm.png)] .caption[Structure from motion] .citation[F.
Radenovic et al., CNN Image Retrieval Learns From BoW: Unsupervised Fine-Tuning with Hard Examples, ECCV 2016
Schonberger et al., From Single Image Query to Detailed 3D Reconstruction, CVPR 2015.] --- # Person re-identification .grid[ .kol-6-12[ .center.width-70[![](images/part3/reid1.png)] ] .kol-6-12[ ] ] .citation[J. Almazan et al., Re-ID Done Right: towards Good Practices for Person re-Identification, arXiv 2018] --- count: false # Person re-identification .grid[ .kol-6-12[ .center.width-70[![](images/part3/reid1.png)] ] .kol-6-12[ .center.width-70[![](images/part3/reid2.png)] ] ] .citation[J. Almazan et al., Re-ID Done Right: towards Good Practices for Person re-Identification, arXiv 2018] --- class: middle, center ## Few classes, few labels --- # Prototypical networks Learn to extract class prototype vectors: - class prototype vector = mean feature vector of training examples - classify a test example to the class with the closest prototype ($L\_2$ distance) - prototype vectors are similar to classification weights of networks .grid[ .kol-6-12[ Prototype vector of $k$-th class: $$c\_k=\frac{1}{|S\_k|} \sum\_{(x\_i, y\_i) \in S\_k} f\_{\theta}(x\_i)$$ Classification for example $x$: $$p\_{\theta}(y=k|x) = \frac{\exp(-d(f\_{\theta}(x), c\_k))}{\sum\_{k'}\exp(-d(f\_{\theta}(x), c\_{k'}))}$$ ] .kol-6-12[ .center.width-70[![](images/part3/pn.png)] ] ] .citation[J. Snell et al., Prototypical Networks for Few-shot Learning, NeurIPS 2017] --- class: middle, center ## Mix of (Few classes, few labels) + (random classes, many labels) --- # Imprinting .left-column[ .center.width-100[![](images/part3/imprinting_overview.png)] ] .right-column[ ] .citation[H. Qi et al., Low-Shot Learning with Imprinted Weights, CVPR 2018] --- count: false # Imprinting .left-column[ .center.width-100[![](images/part3/imprinting_overview.png)] ] .right-column[ .center.width-100[![](images/part3/imprinting_principle.png)] .caption[ _left_: before imprinting; _right_: after imprinting] ] .citation[H. Qi et al., Low-Shot Learning with Imprinted Weights, CVPR 2018] --- count: false # Imprinting .left-column[ .center.width-100[![](images/part3/imprinting_overview.png)] ] .right-column[ - use the network as a feature extractor, train as usual - for a new class, compute the average embedding vector - use the embedding vector as the new class proxy/template - can also be fine-tuned later on ] .citation[H. Qi et al., Low-Shot Learning with Imprinted Weights, CVPR 2018] --- # Imprinting ## Results .grid[ .kol-6-12[ .center[
] ] .kol-6-12[ .center[
] ] ] .citation[H. Qi et al., Low-Shot Learning with Imprinted Weights, CVPR 2018] --- # Recap ## Convolutional layer ## Siamese networks and metric learning --- class: end-slide, center count: false The end.