class: center, middle, title-slide count: false ### Deep Learning - MAP583 2019-2020 # Part 4: Going Deeper .bold[Andrei Bursuc ]
url: https://abursuc.github.io/slides/polytechnique/04_deeper.html .citation[ With slides from A. Karpathy, F. Fleuret, G. Louppe, C. Ollion, O. Grisel, Y. Avrithis ...] --- # Reputation of Deep Learning .grid[ .kol-6-12[ .center[![](images/part4/add_more_layers.jpg)] ] .kol-6-12[ ] ] --- count: false # Reputation of Deep Learning .grid[ .kol-6-12[ .center[![](images/part4/add_more_layers.jpg)] ] .kol-6-12[ - .Q[Why would it be a good idea to stack more layers?] - .Q[Are there any theoretical insights for doing his? Empirical ones?] ] ] --- # Outline ## Universal approximation theorem ## Why going deeper? ## Regularization in deep networks ### classic regularization: $L\_2$ regularization ### implicit regularization: Dropout, Batch Normalization ## Residual networks --- class: center, middle # Going deeper --- # Universal function approximation .bold[Theorem.] ( Hornik et al, 1991) Let $\sigma$ be a nonconstant, bounded, and monotonically-increasing continuous function. For any $f \in C([0, 1]^{d})$ and $\varepsilon > 0$, there exists $h \in \mathbb{N}$ real constants $v\_i, b\_i \in \mathbb{R}$ and real vectors $w_i \in \mathbb{R}^d$ such that: $$ | \sum\_i^h v\_i \sigma(w\_i^Tx + b\_i) - f (x) | < \varepsilon $$ that is: neural nets are dense in $C([0, 1]^{d})$. .credit[Slide credit: G. Louppe] .citation[K. Hornik et al., Approximation Capabilities of Multilayer Feedforward Networks, 1991] -- count: false - It guarantees that even a single hidden-layer network can represent any classification problem in which the boundary is locally linear (smooth); - It does not inform about good/bad architectures, nor how they relate to the optimization procedure. - The universal approximation theorem generalizes to any non-polynomial (possibly unbounded) activation function, including the ReLU (Leshno, 1993). --- .bold[Theorem] (Barron, 1992) The mean integrated square error between the estimated network $\hat{F}$ and the target function $f$ is bounded by $$O\left(\frac{C^2\_f}{q} + \frac{qp}{N}\log N\right)$$ where $N$ is the number of training points, $q$ is the number of neurons, $p$ is the input dimension, and $C\_f$ measures the global smoothness of $f$. .credit[Slide credit: G. Louppe] -- __.bold[tl;dr:]__ Provided enough data, it guarantees that adding more neurons will result in a better approximation. --- # Problem solved? UFA theorems **do not tell us**: - The number $h$ of hidden units small enough to have the network fit in RAM. - The optimal function parameters can be found in finite time by minimizing the Empirical Risk with SGD and the usual random initialization schemes. --- # Approximation with ReLU nets .left-column[ ```python import numpy as np import matplotlib.pyplot as plt def relu(x): return np.maximum(x, 0) def rect(x, a, b, h, eps=1e-7): return h / eps * ( relu(x - a) - relu(x - (a + eps)) - relu(x - b) + relu(x - (b + eps))) x = np.linspace(-3, 3, 1000) *y = ( rect(x, -1, 0, 0.4)) plt.plot(x, y) ``` ] .right-column[
]
.reset-columns[
]

.citation[Conner Davis, [Quora: Is a single layered ReLU network still a universal approximator?](https://www.quora.com/Is-a-single-layered-ReLu-network-still-a-universal-approximator)]

---

# Approximation with ReLU nets

.left-column[
```python
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(x, 0)

def rect(x, a, b, h, eps=1e-7):
    return h / eps * (
        relu(x - a) - relu(x - (a + eps))
        - relu(x - b) + relu(x - (b + eps)))

x = np.linspace(-3, 3, 1000)
*y = (  rect(x, -1, 0, 0.4)
*     + rect(x, 0, 1, 1.3)
*     + rect(x, 1, 2, 0.8))
plt.plot(x, y)
```
]
.right-column[

]
.reset-columns[
]

.citation[Conner Davis, [Quora: Is a single layered ReLU network still a universal approximator?](https://www.quora.com/Is-a-single-layered-ReLu-network-still-a-universal-approximator)]

---

# Approximation with ReLU nets

.left-column[
```python
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(x, 0)

def rect(x, a, b, h, eps=1e-7):
    return h / eps * (
        relu(x - a) - relu(x - (a + eps))
        - relu(x - b) + relu(x - (b + eps)))

*x = np.arange(0, 5, 0.5)  # 10 points
z = np.arange(0, 5, 0.001)
sin_approx = np.zeros_like(z)
*for i in range(2, x.size - 1):
*    sin_approx = sin_approx + rect(z, (x[i] + x[i-1]) / 2,
*                                   (x[i] + x[i+1]) / 2, np.sin(x[i]), 1e-7)
plt.plot(z, sin_approx)
```
]
.right-column[

]
.reset-columns[
]

.citation[Conner Davis, [Quora: Is a single layered ReLU network still a universal approximator?](https://www.quora.com/Is-a-single-layered-ReLu-network-still-a-universal-approximator)]

---

# Approximation with ReLU nets

.left-column[
```python
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(x, 0)

def rect(x, a, b, h, eps=1e-7):
    return h / eps * (
        relu(x - a) - relu(x - (a + eps))
        - relu(x - b) + relu(x - (b + eps)))

*x = np.arange(0, 5, 0.25)  # 20 points
z = np.arange(0, 5, 0.001)
sin_approx = np.zeros_like(z)
*for i in range(2, x.size - 1):
*    sin_approx = sin_approx + rect(z, (x[i] + x[i-1]) / 2,
*                                   (x[i] + x[i+1]) / 2, np.sin(x[i]), 1e-7)
plt.plot(z, sin_approx)
```
]
.right-column[

]
.reset-columns[
]

.citation[Conner Davis, [Quora: Is a single layered ReLU network still a universal approximator?](https://www.quora.com/Is-a-single-layered-ReLu-network-still-a-universal-approximator)]

---

# Approximation with ReLU nets

.left-column[
```python
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(x, 0)

def rect(x, a, b, h, eps=1e-7):
    return h / eps * (
        relu(x - a) - relu(x - (a + eps))
        - relu(x - b) + relu(x - (b + eps)))

*x = np.arange(0, 5, 0.1)  # 50 points
z = np.arange(0, 5, 0.001)
sin_approx = np.zeros_like(z)
*for i in range(2, x.size - 1):
*    sin_approx = sin_approx + rect(z, (x[i] + x[i-1]) / 2,
*                                   (x[i] + x[i+1]) / 2, np.sin(x[i]), 1e-7)
plt.plot(z, sin_approx)
```
]
.right-column[

]
.reset-columns[
]

.citation[Conner Davis, [Quora: Is a single layered ReLU network still a universal approximator?](https://www.quora.com/Is-a-single-layered-ReLu-network-still-a-universal-approximator)]

---

# Approximation with ReLU nets

.left-column[
```python
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(x, 0)

def rect(x, a, b, h, eps=1e-7):
    return h / eps * (
        relu(x - a) - relu(x - (a + eps))
        - relu(x - b) + relu(x - (b + eps)))

*x = np.arange(0, 5, 0.01)  # 500 points
z = np.arange(0, 5, 0.001)
sin_approx = np.zeros_like(z)
*for i in range(2, x.size - 1):
*    sin_approx = sin_approx + rect(z, (x[i] + x[i-1]) / 2,
*                                   (x[i] + x[i+1]) / 2, np.sin(x[i]), 1e-7)
plt.plot(z, sin_approx)
```
]
.right-column[

] .reset-columns[ ] .citation[Conner Davis, [Quora: Is a single layered ReLU network still a universal approximator?](https://www.quora.com/Is-a-single-layered-ReLu-network-still-a-universal-approximator)] --- # Approximation with ReLU nets Consider the 1-layer MLP: $f(x) = \sum w\_i \text{ReLU}(x + b_i).$
This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1)$$ .center[
] .credit[Figure credit: G. Louppe] --- class: middle count: false Consider the 1-layer MLP: $f(x) = \sum w\_i \text{ReLU}(x + b_i).$
This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2)$$ .center[
] .credit[Figure credit: G. Louppe] --- class: middle count: false Consider the 1-layer MLP: $f(x) = \sum w\_i \text{ReLU}(x + b_i).$
This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3)$$ .center[
] .credit[Figure credit: G. Louppe] --- class: middle count: false Consider the 1-layer MLP: $f(x) = \sum w\_i \text{ReLU}(x + b_i).$
This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe] --- class: middle count: false Consider the 1-layer MLP: $f(x) = \sum w\_i \text{ReLU}(x + b_i).$
This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe] --- class: middle count: false Consider the 1-layer MLP: $f(x) = \sum w\_i \text{ReLU}(x + b_i).$
This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe] --- class: middle count: false Consider the 1-layer MLP: $f(x) = \sum w\_i \text{ReLU}(x + b_i).$
This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe] --- class: middle count: false Consider the 1-layer MLP: $f(x) = \sum w\_i \text{ReLU}(x + b_i).$
This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe] --- class: middle count: false Consider the 1-layer MLP: $f(x) = \sum w\_i \text{ReLU}(x + b_i).$
This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe] --- class: middle count: false Consider the 1-layer MLP: $f(x) = \sum w\_i \text{ReLU}(x + b_i).$
This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe] --- class: middle count: false Consider the 1-layer MLP: $f(x) = \sum w\_i \text{ReLU}(x + b_i).$
This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe] --- class: middle count: false Consider the 1-layer MLP: $f(x) = \sum w\_i \text{ReLU}(x + b_i).$
This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
] .credit[Figure credit: G. Louppe] --- class: middle count: false Consider the 1-layer MLP: $f(x) = \sum w\_i \text{ReLU}(x + b_i).$
This model can approximate any smooth 1D function with a linear combination of translated/scaled ReLU functions. $$f(x) = \sigma(w\_1 x + b_1) + \sigma(w\_2 x + b_2) + \sigma(w\_3 x + b_3) + \dots$$ .center[
]

.credit[Figure credit: G. Louppe]

---

# Universal approximation

Even if the MLP is able to represent the function, learning can fail for two different reasons:

- the optimization algorithm may not be able to find the value of the parameters that correspond to the desired function
- the training algorithm might choose the wrong function as a result of overfitting

---

# Universal approximation

.grid[
.kol-4-12[
.center[![](images/part4/toyset.png)]
]
.kol-4-12[
.center[![](images/part4/toyset_logistic_regression.png)]
]
.kol-4-12[
.center[![](images/part4/toyset_nn3.png)]
]
]

---

# Universal approximation

Adding more neurons

.left-column[
.center.width-90[![](images/part4/toyset_nn_varying_1.png)]
]
.right-column[
.center.width-90[![](images/part4/toyset_nn_varying_2.png)]
]

---

# Overparametrization and optimization

## Folklore experiment

.left-column[
.center.width-60[![](images/part4/mlp_2.png)]

_Step 1:_ Generate labeled data by feeding random input vectors into a depth-$2$ net with a hidden layer of size $n$.
]
.right-column[
.center.width-60[![](images/part4/mlp_2.png)]

_Step 2:_ It is difficult to train a new network on this labeled data with the same number of hidden nodes.
]

.citation[Livni et al.; 2014]

---
count: false

# Overparametrization and optimization

## Folklore experiment

.left-column[
.center.width-60[![](images/part4/mlp_2.png)]

_Step 1:_ Generate labeled data by feeding random input vectors into a depth-$2$ net with a hidden layer of size $n$.
]
.right-column[
.center.width-60[![](images/part4/mlp_2.png)]

_Step 2:_ It is difficult to train a new network on this labeled data with the same number of hidden nodes.

.red[It is much easier to train a new net with a bigger hidden layer or an additional layer.]
]

.citation[Livni et al.; 2014]

---
class: middle

# Benefits of depth

---

# The benefits of depth

.caption[GINN: Geometric Illustrations for Neural Networks (http://www.bayeswatch.com/2018/09/17/GINN/)]

.center.width-40[![](images/part4/ginn.png)]

---

# Efficient Oscillations with Composition

.left-column[
```python
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(x, 0)

def tri(x):
    return relu(
        relu(2 * x) - relu(4 * x - 2))

x = np.linspace(-.3, 1.3, 1000)
y = tri(x)
plt.plot(x, y)
```
]
.right-column[

] .citation[M. Telgarsky, [Benefits of depth in neural networks](https://www.youtube.com/watch?v=ssaXJqG9Dz4), COLT 2016 ] --- # Efficient Oscillations with Composition .left-column[ ```python import numpy as np import matplotlib.pyplot as plt def relu(x): return np.maximum(x, 0) def tri(x): return relu( relu(2 * x) - relu(4 * x - 2)) x = np.linspace(-.3, 1.3, 1000) *y = tri(tri(x)) plt.plot(x, y) ``` ] .right-column[
] .citation[M. Telgarsky, [Benefits of depth in neural networks](https://www.youtube.com/watch?v=ssaXJqG9Dz4), COLT 2016 ] --- # Efficient Oscillations with Composition .left-column[ ```python import numpy as np import matplotlib.pyplot as plt def relu(x): return np.maximum(x, 0) def tri(x): return relu( relu(2 * x) - relu(4 * x - 2)) x = np.linspace(-.3, 1.3, 1000) *y = tri(tri(tri(x))) plt.plot(x, y) ``` ] .right-column[
] .center[ 1 more layer → 2x more oscillations ] .citation[M. Telgarsky, [Benefits of depth in neural networks](https://www.youtube.com/watch?v=ssaXJqG9Dz4), COLT 2016 ] --- # Efficient Oscillations with Composition .left-column[ ```python import numpy as np import matplotlib.pyplot as plt def relu(x): return np.maximum(x, 0) def tri(x): return relu( relu(2 * x) - relu(4 * x - 2)) x = np.linspace(-.3, 1.3, 1000) *y = tri(tri(tri(tri(x)))) plt.plot(x, y) ``` ] .right-column[
]

.center[
1 more layer → 2x more oscillations
]

.citation[M. Telgarsky, [Benefits of depth in neural networks](https://www.youtube.com/watch?v=ssaXJqG9Dz4), COLT 2016]

---

# Efficient Oscillations with Composition

.left-column[
- Adding the parameters required for **one new layer** can **multiply by two the number of local oscillations** in the decision function of the model.

- This is to be contrasted with the approach of adding parameters **on the same layer** (as in the rectangle example), which can only contribute an **additive number of new local oscillations**.
]
.right-column[

]

.citation[M. Telgarsky, [Benefits of depth in neural networks](https://www.youtube.com/watch?v=ssaXJqG9Dz4), COLT 2016]

---
class: middle

.center[For a fixed parameter budget, deeper is better]

.center[

]

.citation.tiny[Montufar et al.; On the number of linear regions of deep neural networks; 2014]

---
class: middle

# The problem with depth

---
class: middle

Although it was known that _deeper is better_, for decades training deep neural networks was highly challenging and unstable.

Besides limited hardware and data, there were a few algorithmic flaws that have been fixed/softened in the last decade.

---

# Vanishing gradients

Training deep MLPs with many layers was for a long time (pre-2011) very difficult due to the **vanishing gradient** problem.

- Small gradients slow down, and eventually block, stochastic gradient descent.
- This results in a limited learning capacity.

.center.width-60[

] .caption[Backpropagated gradients normalized histograms (Glorot and Bengio, 2010).
Gradients for layers far from the output vanish to zero. ] .citation[Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks; AISTAT 2010] --- Consider a simplified 3-layer MLP, with $x, w\_1, w\_2, w\_3 \in\mathbb{R}$, such that $$f(x; w\_1, w\_2, w\_3) = \sigma\left(w\_3\sigma\left( w\_2 \sigma\left( w\_1 x \right)\right)\right). $$ Under the hood, this would be evaluated as $$\begin{aligned} u\_1 &= w\_1 x \\\\ u\_2 &= \sigma(u\_1) \\\\ u\_3 &= w\_2 u\_2 \\\\ u\_4 &= \sigma(u\_3) \\\\ u\_5 &= w\_3 u\_4 \\\\ \hat{y} &= \sigma(u\_5) \end{aligned}$$ .credit[Slide credit: G. Louppe] -- count:false and its derivative $\frac{\text{d}\hat{y}}{\text{d}w\_1}$ as $$\begin{aligned}\frac{\text{d}\hat{y}}{\text{d}w\_1} &= \frac{\partial \hat{y}}{\partial u\_5} \frac{\partial u\_5}{\partial u\_4} \frac{\partial u\_4}{\partial u\_3} \frac{\partial u\_3}{\partial u\_2}\frac{\partial u\_2}{\partial u\_1}\frac{\partial u\_1}{\partial w\_1}\\\\ &= \frac{\partial \sigma(u\_5)}{\partial u\_5} w\_3 \frac{\partial \sigma(u\_3)}{\partial u\_3} w\_2 \frac{\partial \sigma(u\_1)}{\partial u\_1} x \end{aligned}$$ --- class: middle The derivative of the sigmoid activation function $\sigma$ is: .center[
]

$$\frac{\text{d} \sigma}{\text{d} x}(x) = \sigma(x)(1-\sigma(x))$$

Notice that $0 \leq \frac{\text{d} \sigma}{\text{d} x}(x) \leq \frac{1}{4}$ for all $x$.

.credit[Slide credit: G. Louppe]

---
class: middle

Assume that weights $w\_1, w\_2, w\_3$ are initialized randomly from a Gaussian with zero mean and small variance, such that with high probability $-1 \leq w\_i \leq 1$.

Then,

$$\frac{\text{d}\hat{y}}{\text{d}w\_1} = \underbrace{\frac{\partial \sigma(u\_5)}{\partial u\_5}}\_{\leq \frac{1}{4}} \underbrace{w\_3}\_{\leq 1} \underbrace{\frac{\partial \sigma(u\_3)}{\partial u\_3}}\_{\leq \frac{1}{4}} \underbrace{w\_2}\_{\leq 1} \underbrace{\frac{\partial \sigma(u\_1)}{\partial u\_1}}\_{\leq \frac{1}{4}} x$$

This implies that the gradient $\frac{\text{d}\hat{y}}{\text{d}w\_1}$ **exponentially** shrinks to zero as the number of layers in the network increases. Hence the vanishing gradient problem.

- In general, bounded activation functions (sigmoid, tanh, etc.) are prone to the vanishing gradient problem.
- Note the importance of a proper initialization scheme.

.credit[Slide credit: G. Louppe]

---

# Rectified linear units

Instead of the sigmoid activation function, modern neural networks are mostly based on **rectified linear units** (ReLU) (Glorot et al., 2011):

$$\text{ReLU}(x) = \max(0, x)$$

.center[

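]

A rough numerical comparison of the per-layer factors $w\_i \, \sigma'(u\_i)$ entering the chain rule, for the sigmoid and for the ReLU (a sketch, not part of the original slides; the pre-activations are drawn at random for simplicity):

```py
import torch

torch.manual_seed(0)
depth = 20
w = 0.5 * torch.randn(depth)   # small random weights, |w_i| <= 1 with high probability
u = torch.randn(depth)         # arbitrary pre-activations, one per layer

# per-layer factors w_i * sigma'(u_i) appearing in the chain rule
sigmoid_factors = w * torch.sigmoid(u) * (1 - torch.sigmoid(u))  # each |factor| <= 1/4
relu_factors    = w * (u > 0).float()                            # each factor is 0 or w_i

print(sigmoid_factors.abs().prod())  # shrinks by an extra factor <= 1/4 per layer
print(relu_factors.abs().prod())     # only the weights remain (0 only if some unit is dead)
```

.center[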
] .credit[Slide credit: G. Louppe] --- class: middle Note that the derivative of the ReLU function is $$\frac{\text{d}}{\text{d}x} \text{ReLU}(x) = \begin{cases} 0 &\text{if } x \leq 0 \\\\ 1 &\text{otherwise} \end{cases}$$ .center[
] For $x=0$, the derivative is undefined. In practice, it is set to zero. .credit[Slide credit: G. Louppe] --- class: middle Therefore, $$\frac{\text{d}\hat{y}}{\text{d}w\_1} = \underbrace{\frac{\partial \sigma(u\_5)}{\partial u\_5}}\_{= 1} w\_3 \underbrace{\frac{\partial \sigma(u\_3)}{\partial u\_3}}\_{= 1} w\_2 \underbrace{\frac{\partial \sigma(u\_1)}{\partial u\_1}}\_{= 1} x$$ This **solves** the vanishing gradient problem, even for deep networks! (provided proper initialization) Note that: - The ReLU unit dies when its input is negative, which might block gradient descent. - This is actually a useful property to induce *sparsity*. - This issue can also be solved using **leaky** ReLUs, defined as $$\text{LeakyReLU}(x) = \max(\alpha x, x)$$ for a small $\alpha \in \mathbb{R}^+$ (e.g., $\alpha=0.1$). --- class: middle The steeper slope in the loss surface speeds up the training. .center.width-30[![](images/part4/relu_impact.png)] .citation[A. Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks; NIPS 2012] --- # Other activation functions - Several $\text{ReLU}$-like alternatives have been proposed in recent years: $\text{PReLU}$, $\text{ELU}$, $\text{SeLU}$, $\text{SReLU}$ .center.width-50[![](images/part4/activation_functions.png)] .citation[[Visualising Activation Functions in Neural Networks](https://dashee87.github.io/data%20science/deep%20learning/visualising-activation-functions-in-neural-networks/)] --- class: center, middle # Regularization --- # Under-fitting and over-fitting What if we consider a hypothesis space $\mathcal{F}$ in which candidate functions $f$ are either too "simple" or too "complex" with respect to the true data generating process? .center[![](images/part4/underoverfitting.png)] --- class: middle .center.width-40[![](images/part4/poly_overfit_1.png)] Our goal is to find a function $f$ that makes good predictions on average from given data. We can consider $f$ as a polynomial of degree $D$ defined through its parameters $\mathbf{w} \in \mathbb{R}^D$ such that $$\hat{y} = f(x; \mathbf{w}) = \sum\_{d=0}^D w\_d x^d$$ --- class: middle count: false .center.width-60[![](images/part4/poly_overfit_2.png)] .credit[Slide credit: F. Fleuret] --- class: middle count: false .center.width-60[![](images/part4/poly_overfit_3.png)] .credit[Slide credit: F. Fleuret] --- class: middle count: false .center.width-60[![](images/part4/poly_overfit_4.png)] .credit[Slide credit: F. Fleuret] --- class: middle count: false .center.width-60[![](images/part4/poly_overfit_5.png)] .credit[Slide credit: F. Fleuret] --- class: middle count: false .center.width-60[![](images/part4/poly_overfit_6.png)] .credit[Slide credit: F. Fleuret] --- class: middle count: false .center.width-60[![](images/part4/poly_overfit_7.png)] .credit[Slide credit: F. Fleuret] --- class: middle count: false .center.width-60[![](images/part4/poly_overfit_8.png)] .credit[Slide credit: F. Fleuret] --- class: middle count: false .center.width-60[![](images/part4/poly_overfit_9.png)] .credit[Slide credit: F. Fleuret] --- class: middle count: false .center.width-60[![](images/part4/poly_overfit_10.png)] .credit[Slide credit: F. Fleuret] --- class: middle count: false .center.width-60[![](images/part4/poly_overfit_11.png)] .credit[Slide credit: F. Fleuret] --- class: middle .center.width-60[![](images/part4/poly_overfit_12.png)] .credit[Slide credit: F. 
Fleuret]

---
class: middle

Although model capacity is difficult to define precisely, it is quite clear in practice how to increase or decrease it for a given class of models. For example:

- The degree of polynomials;
- The number of layers in a neural network;
- The number of training iterations;
- Regularization terms.

---
class: middle

# Regularization

---

.center.width-100[![](images/part4/regularization_recht.png)]

---

# Regularization

We can reformulate the previously used squared error loss

$$\ell(y, f(x;\mathbf{w})) = (y - f(x;\mathbf{w}))^2$$

to

$$\ell(y, f(x;\mathbf{w})) = (y - f(x;\mathbf{w}))^2 + \rho \sum\_{d=0}^{D} w\_d^2$$

.Q.big.center[What will happen now?]

---

# Regularization

We can reformulate the previously used squared error loss

$$\ell(y, f(x;\mathbf{w})) = (y - f(x;\mathbf{w}))^2$$

to

$$\ell(y, f(x;\mathbf{w})) = (y - f(x;\mathbf{w}))^2 + \rho \sum\_{d=0}^{D} w\_d^2$$

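In code, this simply adds the sum of squared weights to the loss; a minimal PyTorch sketch (the model, data and value of $\rho$ below are illustrative, not from the slides):

```py
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
rho = 1e-3                                    # regularization strength (hypothetical value)

mse = ((y - model(x)) ** 2).mean()            # data-fitting term
l2_penalty = sum((w ** 2).sum() for w in model.parameters())
loss = mse + rho * l2_penalty                 # penalized loss
loss.backward()
```
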
.center.big[This is called __$L\_2$ regularization__.]

--- class: middle .center.width-60[![](images/part4/poly_l2_1.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_2.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_3.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_4.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_5.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_6.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_7.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_8.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_9.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_10.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_11.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_12.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_13.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_14.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_15.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_16.png)] .credit[Slide credit: F. Fleuret]
--- class: middle count: false .center.width-60[![](images/part4/poly_l2_17.png)] .credit[Slide credit: F. Fleuret]

---

# $L\_2$ regularization

- In deep learning it is often referred to as __weight decay__

```py
torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0,
                weight_decay=0, nesterov=False)
```

```py
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08,
                 weight_decay=0, amsgrad=False)
```

---
class: center, middle

# Deep regularization

---
class: middle

.width-40[![](images/part4/recht_1.png)]

.citation[C. Zhang et al., Understanding deep learning requires rethinking generalization, ICLR 2017]

---
count: false
class: middle

However ...

.center.width-100[![](images/part4/recht_2.png)]

.citation[C. Zhang et al., Understanding deep learning requires rethinking generalization, ICLR 2017]

---
class: middle

However ...

.center.width-60[![](images/part4/recht_3.png)]

.citation[C. Zhang et al., Understanding deep learning requires rethinking generalization, ICLR 2017]

---
class: middle, center

### Most of the weights of the network are grouped in the final layers.

.center.width-80[![](images/part4/lenet5.png)]

---

# VGG-16

.center[

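]

As a quick sanity check on the parameter count discussed on the next slide (a sketch, assuming `torchvision` is installed):

```py
import torchvision

vgg16 = torchvision.models.vgg16()               # randomly initialized VGG-16
n_params = sum(p.numel() for p in vgg16.parameters())
print(f"{n_params / 1e6:.1f}M parameters")       # roughly 138M
```

.center[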
] .citation[K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, NIPS 2014] --- # Memory and Parameters ```md Activation maps Parameters INPUT: [224x224x3] = 150K 0 CONV3-64: [224x224x64] = 3.2M (3x3x3)x64 = 1,728 CONV3-64: [224x224x64] = 3.2M (3x3x64)x64 = 36,864 POOL2: [112x112x64] = 800K 0 CONV3-128: [112x112x128] = 1.6M (3x3x64)x128 = 73,728 CONV3-128: [112x112x128] = 1.6M (3x3x128)x128 = 147,456 POOL2: [56x56x128] = 400K 0 CONV3-256: [56x56x256] = 800K (3x3x128)x256 = 294,912 CONV3-256: [56x56x256] = 800K (3x3x256)x256 = 589,824 CONV3-256: [56x56x256] = 800K (3x3x256)x256 = 589,824 POOL2: [28x28x256] = 200K 0 CONV3-512: [28x28x512] = 400K (3x3x256)x512 = 1,179,648 CONV3-512: [28x28x512] = 400K (3x3x512)x512 = 2,359,296 CONV3-512: [28x28x512] = 400K (3x3x512)x512 = 2,359,296 POOL2: [14x14x512] = 100K 0 CONV3-512: [14x14x512] = 100K (3x3x512)x512 = 2,359,296 CONV3-512: [14x14x512] = 100K (3x3x512)x512 = 2,359,296 CONV3-512: [14x14x512] = 100K (3x3x512)x512 = 2,359,296 POOL2: [7x7x512] = 25K 0 FC: [1x1x4096] = 4096 7x7x512x4096 = 102,760,448 FC: [1x1x4096] = 4096 4096x4096 = 16,777,216 FC: [1x1x1000] = 1000 4096x1000 = 4,096,000 TOTAL activations: 24M x 4 bytes ~= 93MB / image (x2 for backward) TOTAL parameters: 138M x 4 bytes ~= 552MB (x2 for plain SGD, x4 for Adam) ``` .credit[Slide credit: C. Ollion & O. Grisel] --- # Memory and Parameters ```md Activation maps Parameters INPUT: [224x224x3] = 150K 0 *CONV3-64: [224x224x64] = 3.2M (3x3x3)x64 = 1,728 *CONV3-64: [224x224x64] = 3.2M (3x3x64)x64 = 36,864 POOL2: [112x112x64] = 800K 0 CONV3-128: [112x112x128] = 1.6M (3x3x64)x128 = 73,728 CONV3-128: [112x112x128] = 1.6M (3x3x128)x128 = 147,456 POOL2: [56x56x128] = 400K 0 CONV3-256: [56x56x256] = 800K (3x3x128)x256 = 294,912 CONV3-256: [56x56x256] = 800K (3x3x256)x256 = 589,824 CONV3-256: [56x56x256] = 800K (3x3x256)x256 = 589,824 POOL2: [28x28x256] = 200K 0 CONV3-512: [28x28x512] = 400K (3x3x256)x512 = 1,179,648 CONV3-512: [28x28x512] = 400K (3x3x512)x512 = 2,359,296 CONV3-512: [28x28x512] = 400K (3x3x512)x512 = 2,359,296 POOL2: [14x14x512] = 100K 0 CONV3-512: [14x14x512] = 100K (3x3x512)x512 = 2,359,296 CONV3-512: [14x14x512] = 100K (3x3x512)x512 = 2,359,296 CONV3-512: [14x14x512] = 100K (3x3x512)x512 = 2,359,296 POOL2: [7x7x512] = 25K 0 *FC: [1x1x4096] = 4096 7x7x512x4096 = 102,760,448 FC: [1x1x4096] = 4096 4096x4096 = 16,777,216 FC: [1x1x1000] = 1000 4096x1000 = 4,096,000 TOTAL activations: 24M x 4 bytes ~= 93MB / image (x2 for backward) TOTAL parameters: 138M x 4 bytes ~= 552MB (x2 for plain SGD, x4 for Adam) ``` .credit[Slide credit: C. Ollion & O. Grisel] --- class: middle # Dropout --- # Dropout - First "deep" regularization technique - Remove units at random during the forward pass on each sample - Put them all back during test .center[
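]

What "removing units at random" amounts to, as a minimal sketch (not the `nn.Dropout` implementation; the inverted scaling is discussed a few slides later):

```py
import torch

def dropout_train(x, p=0.5):
    # sample a binary mask: each unit is kept with probability 1 - p
    mask = (torch.rand_like(x) > p).float()
    return mask * x / (1 - p)   # rescale so that the expected activation is unchanged

def dropout_test(x, p=0.5):
    return x                    # all units are put back at test time
```

.center[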
] .citation[Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014] --- # Dropout ## Interpretation - Reduces the network dependency to individual neurons and distributes representation - More redundant representation of data ## Ensemble interpretation - Equivalent to training a large ensemble of shared-parameters, binary-masked models - Each model is only trained on a single data point - __A network with dropout can be interpreted as an ensemble of $2^N$ models with heavy weight sharing__ (Goodfellow et al., 2013) --- # Dropout .center[
] - One has to decide on which units/layers to use dropout, and with what probability $p$ units are dropped. - During training, for each sample, as many Bernoulli variables as units are sampled independently to select units to remove. - To keep the means of the inputs to layers unchanged, the initial version of dropout was multiplying activations by $p$ during test. - The standard variant is the "inverted dropout": multiply activations by $\frac{1}{1-p}$ during training and keep the network untouched during test. --- # Dropout Overfitting noise .center[
] .credit[Slide credit: C. Ollion & O. Grisel] --- count: false # Dropout A bit of Dropout .center[
] .credit[Slide credit: C. Ollion & O. Grisel] --- count: false # Dropout Too much: underfitting .center[
] .credit[Slide credit: C. Ollion & O. Grisel] --- count: false # Dropout .center.width-40[![](images/part4/dropout_curves_4.png)] .citation[Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014] --- # Dropout ```py >>> x = torch.full((3, 5), 1.0).requires_grad_() >>> x tensor([[ 1., 1., 1., 1., 1.], [ 1., 1., 1., 1., 1.], [ 1., 1., 1., 1., 1.]]) >>> dropout = nn.Dropout(p = 0.75) >>> y = dropout(x) >>> y tensor([[ 0., 0., 4., 0., 4.], [ 0., 4., 4., 4., 0.], [ 0., 0., 4., 0., 0.]]) >>> l = y.norm(2, 1).sum() >>> l.backward() >>> x.grad tensor([[ 0.0000, 0.0000, 2.8284, 0.0000, 2.8284] [ 0.0000, 2.3094, 2.3094, 2.3094, 0.0000] [ 0.0000, 0.0000, 4.0000, 0.0000, 0.0000]]) ``` --- count: false # Dropout ```py >>> x = torch.full((3, 5), 1.0).requires_grad_() >>> x tensor([[ 1., 1., 1., 1., 1.], [ 1., 1., 1., 1., 1.], [ 1., 1., 1., 1., 1.]]) >>> dropout = nn.Dropout(p = 0.75) >>> y = dropout(x) *>>> y *tensor([[ 0., 0., 4., 0., 4.], * [ 0., 4., 4., 4., 0.], * [ 0., 0., 4., 0., 0.]]) >>> l = y.norm(2, 1).sum() >>> l.backward() >>> x.grad tensor([[ 0.0000, 0.0000, 2.8284, 0.0000, 2.8284] [ 0.0000, 2.3094, 2.3094, 2.3094, 0.0000] [ 0.0000, 0.0000, 4.0000, 0.0000, 0.0000]]) ``` --- # Dropout For a given network ```py model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(), nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 2)); ``` -- count: false we can simply add dropout layers ```py model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(), * nn.Dropout(), nn.Linear(100, 50), nn.ReLU(), * nn.Dropout(), nn.Linear(50, 2)); ``` --- # Dropout A model using dropout has to be set in __train__ or __test__ mode --- count: false # Dropout A model using dropout has to be set in __train__ or __test__ mode The method `nn.Module.train(mode)` recursively sets the flag `training` to all sub-modules. ```py >>> dropout = nn.Dropout() >>> model = nn.Sequential(nn.Linear(3, 10), dropout, nn.Linear(10, 3)) >>> dropout.training True >>> model.train(False) Sequential ( (0): Linear (3 -> 10) (1): Dropout (p = 0.5) (2): Linear (10 -> 3) ) >>> dropout.training False ``` --- # Dropout A model using dropout has to be set in __train__ or __test__ mode ```py >>> dropout = nn.Dropout() >>> model = nn.Sequential(nn.Linear(3, 10), dropout, nn.Linear(10, 3)) >>> x = torch.full((1, 3), 1.0) *>>> model.train() Sequential ( (0): Linear (3 -> 10) (1): Dropout (p = 0.5) (2): Linear (10 -> 3) ) >>> model(x) *tensor([[ 0.5360, -0.5225, -0.5129]], grad_fn=
) >>> model(x) *tensor([[ 0.6134, -0.6130, -0.5161]], grad_fn=
) ``` --- count: false # Dropout A model using dropout has to be set in __train__ or __test__ mode ```py >>> dropout = nn.Dropout() >>> model = nn.Sequential(nn.Linear(3, 10), dropout, nn.Linear(10, 3)) >>> x = torch.full((1, 3), 1.0) >>> model.train() Sequential ( (0): Linear (3 -> 10) (1): Dropout (p = 0.5) (2): Linear (10 -> 3) ) >>> model(x) tensor([[ 0.5360, -0.5225, -0.5129]], grad_fn=
) >>> model(x) tensor([[ 0.6134, -0.6130, -0.5161]], grad_fn=
) >>> *>>> model.eval() Sequential ( (0): Linear (3 -> 10) (1): Dropout (p = 0.5) (2): Linear (10 -> 3) ) >>> model(x) *tensor([[ 0.5772, -0.0944, -0.1168]], grad_fn=
) >>> model(x) *tensor([[ 0.5772, -0.0944, -0.1168]], grad_fn=
) ```

---

# Spatial Dropout

.grid[
.kol-6-12[
- The original Dropout is less compatible with convolutional layers.
- Units in a 2d activation map are generally locally correlated, and dropout has virtually no effect.
- An alternative is to drop channels instead of individual units.
]
.kol-6-12[
]
]

---
count: false

# Spatial Dropout

.grid[
.kol-6-12[
- The original Dropout is less compatible with convolutional layers.
- Units in a 2d activation map are generally locally correlated, and dropout has virtually no effect.
- An alternative is to drop channels instead of individual units.
]
.kol-6-12[
```py
>>> dropout2d = nn.Dropout2d()
>>> x = Variable(Tensor(2, 3, 2, 2).fill_(1.0))
>>> dropout2d(x)
Variable containing:
(0 ,0 ,.,.) =
  0  0
  0  0

(0 ,1 ,.,.) =
  0  0
  0  0

(0 ,2 ,.,.) =
  2  2
  2  2

(1 ,0 ,.,.) =
  2  2
  2  2

(1 ,1 ,.,.) =
  0  0
  0  0

(1 ,2 ,.,.) =
  2  2
  2  2
[torch.FloatTensor of size 2x3x2x2]
```
]
]

---

# DropBlock

.center.width-70[![](images/part4/drop_block.png)]

.center[Masking out contiguous regions in feature maps]

.citation[G. Ghiasi et al., DropBlock: A regularization method for convolutional networks, NIPS 2018]

---

# Gaussian Dropout

- we can generalize Bernoulli sampling over neurons to sampling from other distributions
- multiplying activations and feature maps by a random variable drawn from $\mathcal{N}(1,1)$ works just as well, or perhaps better, than using Bernoulli noise
- this new form of dropout amounts to adding a Gaussian distributed random variable with zero mean and standard deviation equal to the activation of the unit
- each hidden activation $h\_i$ is perturbed to:
.center[$h\_i+h\_i r$ where $r \sim \mathcal{N}(0,1)$]
or
.center[$h\_i r'$ where $r' \sim \mathcal{N}(1,1)$]
- we can generalize this to $r' \sim \mathcal{N}(1,\sigma^2)$, where $\sigma$ becomes an additional hyperparameter to tune, just like $p$ in standard Bernoulli dropout

.citation[Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014]

---
class: middle

# Batch Normalization

---

# Batch normalization

We saw that maintaining proper statistics of the activations and derivatives was a critical issue to allow the training of deep architectures.

It is the main motivation behind weight initialization rules (we'll cover them later).

---
count: false

# Batch normalization

We saw that maintaining proper statistics of the activations and derivatives was a critical issue to allow the training of deep architectures.

It is the main motivation behind weight initialization rules (we'll cover them later).

A different approach consists of explicitly forcing the activation statistics during the forward pass by re-normalizing them.

__Batch normalization__, proposed by Ioffe and Szegedy (2015), was the first method introducing this idea.

---

"Training Deep Neural Networks is complicated by the fact that the __distribution of each layer's inputs changes during training, as the parameters of the previous layers change__. This slows down the training by requiring lower learning rates and careful parameter initialization ..." .pull-right[(Ioffe and Szegedy, 2015)] .citation[S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift; arXiv 2015] --- count: false
"Training Deep Neural Networks is complicated by the fact that the __distribution of each layer's inputs changes during training, as the parameters of the previous layers change__. This slows down the training by requiring lower learning rates and careful parameter initialization ..." .pull-right[(Ioffe and Szegedy, 2015)] .reset-column[ ] .center.width-60[![](images/part4/lenet5.png)] .citation[S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift; arXiv 2015] ---
Consider a simplified 3-layer MLP, with $x, w\_1, w\_2, w\_3 \in\mathbb{R}$, such that $$f(x; w\_1, w\_2, w\_3) = \sigma\left(w\_3\sigma\left( w\_2 \sigma\left( w\_1 x \right)\right)\right). $$ Under the hood, this would be evaluated as $$\begin{aligned} u\_1 &= w\_1 x \\\\ u\_2 &= \sigma(u\_1) \\\\ u\_3 &= w\_2 u\_2 \\\\ u\_4 &= \sigma(u\_3) \\\\ u\_5 &= w\_3 u\_4 \\\\ \hat{y} &= \sigma(u\_5) \end{aligned}$$ .citation[S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift; arXiv 2015] -- count: false Batch normalization can be done anywhere in a deep architecture, and forces the activations' first and second order moments, so that the following layers do not need to adapt to their drift. --- # Batch normalization Normalize activations in each **mini-batch** before activation function: **speeds up** and **stabilizes** training (less dependent on init) .citation[S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift; arXiv 2015] -- count: false .center.width-50[![](images/part4/batchnorm.png)]
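---

# Batch normalization

A minimal sketch of the training-time computation (not the actual `nn.BatchNorm1d` implementation; `x` is assumed to be a `(batch, features)` tensor and `gamma`, `beta` are the learnable scale and shift):

```py
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=0, keepdim=True)                 # per-feature batch mean
    var = x.var(dim=0, unbiased=False, keepdim=True)   # per-feature batch variance
    x_hat = (x - mean) / torch.sqrt(var + eps)         # normalize
    return gamma * x_hat + beta                        # scale and shift

# at test time, running averages of mean/var collected during training are used instead
```
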
---

# Batch normalization

During training, batch normalization __shifts and rescales according to the mean and variance estimated on the batch__.

.center.width-50[![](images/part4/batchnorm.png)]

.citation[S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift; arXiv 2015]

---

# Batch normalization

As with dropout, the model behaves differently during __train__ and __test__.

At **test time**, use the average and standard deviation computed on **the whole dataset** instead of the batch statistics.

Widely used in **ConvNets**, but requires the mini-batch to be large enough for the batch statistics to be meaningful.

---

# Batch normalization

As with dropout, batch normalization is implemented as a separate module, `nn.BatchNorm1d`, that processes the input components separately.

```py
>>> x = torch.Tensor(10000, 3).normal_()
>>> x = x * torch.Tensor([2., 5., 10.]) + torch.Tensor([-10., 25., 3.])
>>> x.mean(0)
tensor([-9.9898, 24.9165,  2.8945])
>>> x.std(0)
tensor([2.0006, 5.0146, 9.9501])
>>> bn = nn.BatchNorm1d(3)
>>> with torch.no_grad():
...     bn.bias.copy_(torch.tensor([2., 4., 8.]))
...     bn.weight.copy_(torch.tensor([1., 2., 3.]))
...
Parameter containing:
tensor([2., 4., 8.])
Parameter containing:
tensor([1., 2., 3.])
>>> y = bn(x)
>>> y.mean(0)
tensor([2.0000, 4.0000, 8.0000])
>>> y.std(0)
tensor([1.0005, 2.0010, 3.0015])
```

---

# Batch normalization

`BatchNorm2d` example

```py
>>> x = Variable(torch.randn(20, 100, 35, 45))
>>> bn2d = nn.BatchNorm2d(100)
>>> y = bn2d(x)
>>> x.size()
torch.Size([20, 100, 35, 45])
>>> bn2d.weight.data.size()
torch.Size([100])
>>> bn2d.bias.data.size()
torch.Size([100])
```

---

# Batch normalization

Results on ImageNet LSVRC 2012:

.center.width-70[![](images/part4/batchnorm_1.png)]

.citation[S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift; arXiv 2015]

--
count: false

- learning rate can be greater
- dropout and local normalization are not necessary
- $L^2$ regularization influence should be reduced

---

The position of batch normalization relative to the non-linearity is not clear.

.center.width-40[![](images/part4/activation-relu.png) ]

--

In the original paper, Ioffe and Szegedy added BN right before the nonlinearity:

$$\dots \to \texttt{LINEAR} \to \texttt{BN} \to \texttt{ReLU} \to \dots$$

--

However, there are arguments for both ways: activations after the non-linearity are less "naturally normalized" and benefit more from batch normalization. Experiments are generally in favor of this solution, which is the current default.

$$\dots \to \texttt{LINEAR} \to \texttt{ReLU} \to \texttt{BN} \to \dots$$

---

# Multiple variants

.center[
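]

In PyTorch, these variants are available as separate modules; a quick sketch of their instantiation for a `(N, C, H, W)` activation tensor:

```py
import torch
import torch.nn as nn

x = torch.randn(20, 64, 32, 32)     # (N, C, H, W)

bn = nn.BatchNorm2d(64)             # normalize over N, H, W per channel
ln = nn.LayerNorm([64, 32, 32])     # normalize over C, H, W per sample
inorm = nn.InstanceNorm2d(64)       # normalize over H, W per sample and channel
gn = nn.GroupNorm(8, 64)            # normalize over groups of channels per sample

y = gn(x)
```

.center[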
] .citation[Y. Wu and K. He, Group Normalization, ECCV 2018] --- class: middle # How deep can we go now? --- # A saturation point If we continue stacking more layers on a CNN: .center.width-90[![](images/part4/depth_problem.png)] -- count: false .center[.red[Deeper models are harder to optimize]] --- .left-column[ # ResNet ] .right-column[ .center.width-50[![](images/part4/resnet.png)] ] A block learns the residual w.r.t. identity .center.width-40[![](images/part4/residualblock.png)] .citation[K. He et al., Deep residual learning for image recognition, CVPR 2016.] -- count: false - Good optimization properties --- .left-column[ # ResNet ] .right-column[ .center.width-50[![](images/part4/resnet.png)] ] Even deeper models: 34, 50, 101, 152 layers .citation[K. He et al., Deep residual learning for image recognition, CVPR 2016.] --- .left-column[ # ResNet ] .right-column[ .center.width-50[![](images/part4/resnet.png)] ] ResNet50 Compared to VGG: - Superior accuracy in all vision tasks
**5.25%** top-5 error vs 7.1%

.citation[K. He et al., Deep residual learning for image recognition, CVPR 2016.]

--
count: false

- Fewer parameters

**25M** vs 138M -- count: false - Computational complexity
**3.8B Flops** vs 15.3B Flops

--
count: false

- Fully Convolutional until the last layer

---

# ResNet

## Performance on ImageNet

.center.width-90[![](images/part4/resnet_1.png)]

---

# ResNet

## Results

.center.width-100[![](images/part4/resnet_3.png)]

---

# ResNet

## Results

.center.width-60[![](images/part4/resnet_8.png)]

---

# ResNet

In PyTorch:

```py
def make_resnet_block(num_feature_maps, kernel_size=3):
    return nn.Sequential(
        nn.Conv2d(num_feature_maps, num_feature_maps,
                  kernel_size=kernel_size,
                  padding=(kernel_size - 1) // 2),
        nn.BatchNorm2d(num_feature_maps),
        nn.ReLU(inplace=True),
        nn.Conv2d(num_feature_maps, num_feature_maps,
                  kernel_size=kernel_size,
                  padding=(kernel_size - 1) // 2),
        nn.BatchNorm2d(num_feature_maps),
    )
```

---

# ResNet

In PyTorch:

```py
def __init__(self, num_residual_blocks, num_feature_maps):
    ...
    self.resnet_blocks = nn.ModuleList()
    for k in range(num_residual_blocks):
        self.resnet_blocks.append(make_resnet_block(num_feature_maps, 3))
    ...
```

```py
def forward(self, x):
    ...
    for b in self.resnet_blocks:
*        x = x + b(x)
    ...
    return x
```

---

For ResNet-50 and deeper models, some additional modifications are needed to keep the number of parameters and computations manageable.

.center.width-70[![](images/part4/resnet_bottleneck_1.png)]

Such a block requires $2 \times (3 \times 3 \times 256 +1) \times 256 \simeq 1.2M$ parameters

.credit[Slide credit: F. Fleuret]

--
count: false

Address this problem using a __bottleneck__ block

.center.width-100[![](images/part4/resnet_bottleneck_2.png)]

.center[$256 \times 64 + (3 \times 3 \times 64 +1) \times 64 + 64 \times 256 \simeq 70K$ parameters]

---

# Stochastic Depth Networks

- Dropout at layer level
- Allows training up to 1K layers

.center.width-70[![](images/part4/stochastic_depth_resnet.png)]

.citation[Huang et al., Deep Networks with Stochastic Depth, ECCV 2016]

---

# DenseNet

- Copying feature maps to upper layers via skip-connections
- Better reuse of parameters and redundancy avoidance

.center.width-30[![](images/part4/densenet_1.png)]
.center.width-70[![](images/part4/densenet_2.png)]

.citation[Huang et al., Densely Connected Convolutional Networks, CVPR 2017]

---

# Visualizing loss surfaces

.center.width-70[![](images/part4/loss_surf_1.png)]

.citation[H. Li et al., Visualizing the Loss Landscape of Neural Nets, ICLR workshop 2018]

---

# Visualizing loss surfaces

.left-column[
.center.width-100[![](images/part4/loss_surf_2.png)]
]
.right-column[

- ResNet-20/56/110 : vanilla - ResNet-*-noshort: no skip connections - ResNet-18/34/50 : wide ] .citation[H. Li et al., Visualizing the Loss Landscape of Neural Nets, ICLR workshop 2018] --- # Visualizing loss surfaces .center.width-50[![](images/part4/loss_surf_3.png)] .citation[H. Li et al., Visualizing the Loss Landscape of Neural Nets, ICLR workshop 2018] --- # Outline ## Universal approximation theorem ## Why going deeper? ## Regularization in deep networks ### classic regularization: $L\_2$ regularization ### implicit regularization: Dropout, Batch Normalization ## Residual networks --- class: end-slide, center count: false The end.