layout: true

.center.footer[Marc LELARGE and Andrei BURSUC | Deep Learning Do It Yourself | 13. Siamese Networks]

---

class: center, middle, title-slide
count: false

# 13. Siamese Networks and Representation Learning
.bold[Andrei Bursuc ]
url: https://dataflowr.github.io/website/

.citation[With slides from F. Fleuret, O. Grisel, E. Oyallon, G. Louppe, Y. Avrithis ...]

---

# Siamese networks

.center.width-60[]

- **Recognition:** given a face, classify among K possible persons
- **Verification:** verify that two faces belong to the same person

A verification system can be implemented as a similarity measure. If the similarity measure is good enough, it can also be used for recognition.

---

class: middle, center

.big[Training a classifier with so many classes and so few samples per class is challenging.]

.hidden.big[In addition, for such use-cases, new classes can appear on the fly anytime.]

---

count: false
class: middle, center

.big[Training a classifier with so many classes and so few samples per class is challenging.]

.big[In addition, for such use-cases, new classes can appear on the fly anytime.]

---

class: middle, center

.big[Instead of training a network to do classification, we can train it to compute useful features from images, allowing us to measure similarities/dissimilarities between images.]

---

# Siamese architecture

.center.width-60[]

- an input sample is a _pair_ $( \mathbf{x}\_i, \mathbf{x}\_j)$

.citation[S. Chopra et al., Learning a similarity metric discriminatively, with application to face verification, CVPR 2005]

---

# Siamese architecture

.center.width-60[]

- an input sample is a _pair_ $( \mathbf{x}\_i, \mathbf{x}\_j)$
- both $\mathbf{x}\_i$, $\mathbf{x}\_j$ go through the _same_ function $f$ with _shared_ parameters $\theta$

.citation[S. Chopra et al., Learning a similarity metric discriminatively, with application to face verification, CVPR 2005]

---

# Siamese architecture

.center.width-60[]

- an input sample is a _pair_ $( \mathbf{x}\_i, \mathbf{x}\_j)$
- both $\mathbf{x}\_i$, $\mathbf{x}\_j$ go through the _same_ function $f$ with _shared_ parameters $\theta$
- loss $\ell\_{ij}$ is measured on output pair $( \mathbf{y}\_i, \mathbf{y}\_j)$ and target $t\_{ij}$

.citation[S. Chopra et al., Learning a similarity metric discriminatively, with application to face verification, CVPR 2005]

---

# Contrastive loss

.center.width-60[]

- input samples $\mathbf{x}\_i$, output vectors $\mathbf{y}\_i= f(\mathbf{x}\_i; \theta)$, target variables $t\_{ij}=\mathbb{1}[\text{sim}(\mathbf{x}\_i, \mathbf{x}\_j)]$
- _contrastive loss_ is a function of distance $\Vert \mathbf{y}\_i - \mathbf{y}\_j \Vert$ only
$$\ell\_{ij} = L((\mathbf{y}\_i,\mathbf{y}\_j), t\_{ij} ) = \ell (\Vert \mathbf{y}\_i - \mathbf{y}\_j \Vert, t\_{ij})$$

.citation[R. Hadsell et al., Dimensionality reduction by learning an invariant mapping, CVPR 2006]

---

# Contrastive loss

.center.width-60[]

- input samples $\mathbf{x}\_i$, output vectors $\mathbf{y}\_i= f(\mathbf{x}\_i; \theta)$, target variables $t\_{ij}=\mathbb{1}[\text{sim}(\mathbf{x}\_i, \mathbf{x}\_j)]$
- _contrastive loss_ is a function of distance $\Vert \mathbf{y}\_i - \mathbf{y}\_j \Vert$ only
$$\ell\_{ij} = L((\mathbf{y}\_i,\mathbf{y}\_j), t\_{ij} ) = \ell (\Vert \mathbf{y}\_i - \mathbf{y}\_j \Vert, t\_{ij})$$
- _similar_ samples are _attracted_
$$\ell(x,t)=\textcolor{red}{t\ell^{+}(x)} + (1-t)\ell^{-}(x) = \textcolor{red}{t x^2} + (1-t)[m-x]^2\_+$$

.citation[R. Hadsell et al., Dimensionality reduction by learning an invariant mapping, CVPR 2006]
---

# Contrastive loss

.center.width-60[]

- input samples $\mathbf{x}\_i$, output vectors $\mathbf{y}\_i= f(\mathbf{x}\_i; \theta)$, target variables $t\_{ij}=\mathbb{1}[\text{sim}(\mathbf{x}\_i, \mathbf{x}\_j)]$
- _contrastive loss_ is a function of distance $\Vert \mathbf{y}\_i - \mathbf{y}\_j \Vert$ only
$$\ell\_{ij} = L((\mathbf{y}\_i,\mathbf{y}\_j), t\_{ij} ) = \ell (\Vert \mathbf{y}\_i - \mathbf{y}\_j \Vert, t\_{ij})$$
- _dissimilar_ samples are _repelled_ if closer than margin $m$
$$\ell(x,t)=t\ell^{+}(x) + (1-t)\textcolor{red}{\ell^{-}(x)} = t x^2 + (1-t)\textcolor{red}{[m-x]^2\_+}$$

.citation[R. Hadsell et al., Dimensionality reduction by learning an invariant mapping, CVPR 2006]

---

# Training siamese networks

## Data collection and loading

- sample positive pairs $( \mathbf{x}\_i, \mathbf{x}\_j)$, with samples coming from the same class
- sample negative pairs $( \mathbf{x}\_i, \mathbf{x}\_j)$, with samples of different classes
- combine pairs of samples into larger mini-batches
- __it's highly important to properly tune mini-batches__: balance positives and negatives, discard easy negatives and positives and focus on more difficult ones.

.hidden[
## Learning

- pass all images through the two networks with shared parameters (in fact the same network)
- backpropagate through the two networks and sum gradients from the two samples
]

---

count: false

# Training siamese networks

## Data collection and loading

- sample positive pairs $( \mathbf{x}\_i, \mathbf{x}\_j)$, with samples coming from the same class
- sample negative pairs $( \mathbf{x}\_i, \mathbf{x}\_j)$, with samples of different classes
- combine pairs of samples into larger mini-batches
- __it's highly important to properly tune mini-batches__: balance positives and negatives, discard easy negatives and positives and focus on more difficult ones.

## Learning

- pass all images through the two networks with shared parameters (in fact the same network)
- backpropagate through the two networks and sum gradients from the two samples
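---

# Contrastive loss in PyTorch

A minimal sketch of the contrastive loss above (a toy illustration, not the original authors' code); `output1` and `output2` are assumed to be the two embeddings of a pair and `target` is $1$ for similar pairs, $0$ for dissimilar ones:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    """l(x, t) = t * x^2 + (1 - t) * max(m - x, 0)^2, with x the pairwise distance."""
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, output1, output2, target):
        dist = F.pairwise_distance(output1, output2)                 # Euclidean distance per pair
        loss_pos = target * dist.pow(2)                              # attract similar pairs
        loss_neg = (1 - target) * F.relu(self.margin - dist).pow(2)  # repel dissimilar pairs closer than m
        return (loss_pos + loss_neg).mean()

# usage on a toy batch of 32 embedding pairs
out1, out2 = torch.randn(32, 128), torch.randn(32, 128)
target = torch.randint(0, 2, (32,)).float()
loss = ContrastiveLoss(margin=1.0)(out1, out2, target)
```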
---

# Triplet architecture

.grid[
.kol-6-12[
.center.width-100[]
]
.kol-6-12[
- an input sample is a _triple_ $(\mathbf{x}\_i, \mathbf{x}\_i^+, \mathbf{x}\_i^-)$
]
]

.citation[Wang et al., Learning fine-grained image similarity with deep ranking, CVPR 2014]

---

# Triplet architecture

.grid[
.kol-6-12[
.center.width-100[]
]
.kol-6-12[
- an input sample is a _triple_ $(\mathbf{x}\_i, \mathbf{x}\_i^+, \mathbf{x}\_i^-)$
- $\mathbf{x}\_i, \mathbf{x}\_i^+, \mathbf{x}\_i^-$ go through the _same_ function $f$ with _shared_ parameters $\theta$
]
]

.citation[Wang et al., Learning fine-grained image similarity with deep ranking, CVPR 2014]

---

# Triplet architecture

.grid[
.kol-6-12[
.center.width-100[]
]
.kol-6-12[
- an input sample is a _triple_ $(\mathbf{x}\_i, \mathbf{x}\_i^+, \mathbf{x}\_i^-)$
- $\mathbf{x}\_i, \mathbf{x}\_i^+, \mathbf{x}\_i^-$ go through the _same_ function $f$ with _shared_ parameters $\theta$
- loss $\ell\_i$ is measured on output triple $(\mathbf{y}\_i, \mathbf{y}\_i^+, \mathbf{y}\_i^-)$
]
]

.citation[Wang et al., Learning fine-grained image similarity with deep ranking, CVPR 2014]

---

# Triplet loss

- input _anchor_ $\mathbf{x}\_i$, output $\mathbf{y}\_i = f(\mathbf{x}\_i; \theta)$; positive $\mathbf{y}\_i^+ = f(\mathbf{x}\_i^+; \theta)$, negative $\mathbf{y}\_i^- = f(\mathbf{x}\_i^-; \theta)$
- _triplet loss_ is a function of distances $\Vert \mathbf{y}\_i - \mathbf{y}\_i^+ \Vert$, $\Vert \mathbf{y}\_i - \mathbf{y}\_i^- \Vert$ only
$$\ell\_i = L(\mathbf{y}\_i, \mathbf{y}\_i^+, \mathbf{y}\_i^-) =\ell(\Vert \mathbf{y}\_i - \mathbf{y}\_i^+ \Vert, \Vert \mathbf{y}\_i - \mathbf{y}\_i^- \Vert)$$
$$\ell(x^+, x^-) = [m + (x^+)^2 - (x^-)^2]\_+$$
so distance $\Vert \mathbf{y}\_i - \mathbf{y}\_i^+ \Vert$ should be less than $\Vert \mathbf{y}\_i - \mathbf{y}\_i^- \Vert$ by _margin_ $m$

.citation[Wang et al., Learning fine-grained image similarity with deep ranking, CVPR 2014]

---

# Triplet loss

- input _anchor_ $\mathbf{x}\_i$, output $\mathbf{y}\_i = f(\mathbf{x}\_i; \theta)$; positive $\mathbf{y}\_i^+ = f(\mathbf{x}\_i^+; \theta)$, negative $\mathbf{y}\_i^- = f(\mathbf{x}\_i^-; \theta)$
- _triplet loss_ is a function of distances $\Vert \mathbf{y}\_i - \mathbf{y}\_i^+ \Vert$, $\Vert \mathbf{y}\_i - \mathbf{y}\_i^- \Vert$ only
$$\ell\_i = L(\mathbf{y}\_i, \mathbf{y}\_i^+, \mathbf{y}\_i^-) =\ell(\Vert \mathbf{y}\_i - \mathbf{y}\_i^+ \Vert, \Vert \mathbf{y}\_i - \mathbf{y}\_i^- \Vert)$$
$$\ell(x^+, x^-) = [m + (x^+)^2 - (x^-)^2]\_+$$
so distance $\Vert \mathbf{y}\_i - \mathbf{y}\_i^+ \Vert$ should be less than $\Vert \mathbf{y}\_i - \mathbf{y}\_i^- \Vert$ by _margin_ $m$
- by taking _two pairs_ $(\mathbf{x}\_i, \mathbf{x}\_i^+)$ and $(\mathbf{x}\_i, \mathbf{x}\_i^-)$ at a time with targets $1$, $0$ respectively, the _contrastive loss_ can be written similarly
$$\ell(x^+, x^-) = (x^+)^2 + [m-x^-]^2\_+$$
so distance $\Vert \mathbf{y}\_i - \mathbf{y}\_i^+ \Vert$ should be small and $\Vert \mathbf{y}\_i - \mathbf{y}\_i^- \Vert$ larger than $m$

.citation[Wang et al., Learning fine-grained image similarity with deep ranking, CVPR 2014]

---

count: false

# Training with the triplet loss

$$\ell\_i = L(\mathbf{y}\_i, \mathbf{y}\_i^+, \mathbf{y}\_i^-) =\ell(\Vert \mathbf{y}\_i - \mathbf{y}\_i^+ \Vert, \Vert \mathbf{y}\_i - \mathbf{y}\_i^- \Vert)$$

- sample a mini-batch of triplets $(\mathbf{x}\_i, \mathbf{x}\_i^+, \mathbf{x}\_i^-)$
- forward pass on all $3$ networks
- compute loss over all samples and sum gradients for updating weights
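---

# Triplet loss in PyTorch

A minimal, self-contained sketch of one such training step, using PyTorch's built-in `nn.TripletMarginLoss` (which uses distances rather than squared distances) and a toy embedding network; in practice the embedding network is a ConvNet, wrapped as on the next slide:

```python
import torch
import torch.nn as nn

# toy embedding network; in practice this is a ConvNet shared by the three branches
embedding_net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128))

criterion = nn.TripletMarginLoss(margin=1.0, p=2)
optimizer = torch.optim.SGD(embedding_net.parameters(), lr=1e-3, momentum=0.9)

# one training step on a random mini-batch of 32 triplets (anchor, positive, negative)
anchor, positive, negative = (torch.randn(32, 1, 28, 28) for _ in range(3))
y_a, y_p, y_n = embedding_net(anchor), embedding_net(positive), embedding_net(negative)

loss = criterion(y_a, y_p, y_n)
optimizer.zero_grad()
loss.backward()   # gradients from the three branches accumulate in the shared parameters
optimizer.step()
```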
---

# Siamese networks

.grid[
.kol-6-12[
```
class SiameseNet(nn.Module):
    def __init__(self, embedding_net):
        super(SiameseNet, self).__init__()
        self.embedding_net = embedding_net

    def forward(self, x1, x2):
        output1 = self.embedding_net(x1)
        output2 = self.embedding_net(x2)
        return output1, output2

    def get_embedding(self, x):
        return self.embedding_net(x)
```
]
.kol-6-12[
```
class TripletNet(nn.Module):
    def __init__(self, embedding_net):
        super(TripletNet, self).__init__()
        self.embedding_net = embedding_net

    def forward(self, x1, x2, x3):
        output1 = self.embedding_net(x1)
        output2 = self.embedding_net(x2)
        output3 = self.embedding_net(x3)
        return output1, output2, output3

    def get_embedding(self, x):
        return self.embedding_net(x)
```
]
]

---

count: false

# Siamese networks

.grid[
.kol-6-12[
```
class SiameseNet(nn.Module):
    def __init__(self, embedding_net):
        super(SiameseNet, self).__init__()
        self.embedding_net = embedding_net

    def forward(self, x1, x2):
*       output1 = self.embedding_net(x1)
*       output2 = self.embedding_net(x2)
        return output1, output2

    def get_embedding(self, x):
        return self.embedding_net(x)
```
]
.kol-6-12[
```
class TripletNet(nn.Module):
    def __init__(self, embedding_net):
        super(TripletNet, self).__init__()
        self.embedding_net = embedding_net

    def forward(self, x1, x2, x3):
*       output1 = self.embedding_net(x1)
*       output2 = self.embedding_net(x2)
*       output3 = self.embedding_net(x3)
        return output1, output2, output3

    def get_embedding(self, x):
        return self.embedding_net(x)
```
]
]

---

# Hard negative sampling

After a few epochs, if $(\mathbf{x}\_i, \mathbf{x}\_i^{+}, \mathbf{x}\_i^{-})$ are chosen randomly, it will be easy to satisfy the inequality in the loss.

--

Gradients in one batch quickly become almost $0$ except for **hard cases**. Random sampling is inefficient at finding these hard cases.

--
count: false

- **Hard triplet sampling:** sample $\mathbf{x}\_i^{-}$ such that:
$$||\mathbf{y} - \mathbf{y}^{+}|| > ||\mathbf{y} - \mathbf{y}^{-}|| + m$$

- **Semi-hard triplet sampling:** sample $\mathbf{x}\_i^{-}$ such that:
$$||\mathbf{y} - \mathbf{y}^{+}|| > ||\mathbf{y} - \mathbf{y}^{-}||$$

---

class: middle, center

# Applications

---

# Face recognition

.center.width-25[]

- A threshold is computed on the test set to decide which faces are the same
- Best model achieves 99.6% verification accuracy on the Labeled Faces in the Wild dataset
- Works well even with non-camera-facing faces

.citation[F. Schroff et al., FaceNet: A unified embedding for face recognition and clustering, CVPR 2015]

---

# Deep image retrieval

.center.width-70[]

.citation[A. Gordo et al., Deep Image Retrieval: Learning Global Representations for Image Search, ECCV 2016]

- query $\mathbf{x}\_i$, relevant $\mathbf{x}\_i^+$ (same building), irrelevant $\mathbf{x}\_i^-$ (other building)

---

count: false

# Deep image retrieval

.center.width-70[]

.citation[A. Gordo et al., Deep Image Retrieval: Learning Global Representations for Image Search, ECCV 2016]

- query $\mathbf{x}\_i$, relevant $\mathbf{x}\_i^+$ (same building), irrelevant $\mathbf{x}\_i^-$ (other building)
---

count: false

# Deep image retrieval

.center.width-70[]

.citation[A. Gordo et al., Deep Image Retrieval: Learning Global Representations for Image Search, ECCV 2016]

- query $\mathbf{x}\_i$, relevant $\mathbf{x}\_i^+$ (same building), irrelevant $\mathbf{x}\_i^-$ (other building)
- triplet loss is evaluated on output $(\mathbf{y}\_i, \mathbf{y}\_i^+, \mathbf{y}\_i^-)$

---

# Patch matching

.center.width-45[]

---

# Patch matching

.center.width-70[]

.credit[Figure credit: A. Vedaldi]

---

# Patch matching

.center.width-60[]

.credit[Figure credit: A. Vedaldi]

---

# Image reconstruction

.center.width-70[]

.caption[Structure from motion]

.citation[F. Radenovic et al., CNN Image Retrieval Learns From BoW: Unsupervised Fine-Tuning with Hard Examples, ECCV 2016
Schonberger et al., From Single Image Query to Detailed 3D Reconstruction, CVPR 2015]

---

# Person re-identification

.grid[
.kol-6-12[
.center.width-70[]
]
.kol-6-12[
]
]

.citation[J. Almazan et al., Re-ID Done Right: towards Good Practices for Person re-Identification, arXiv 2018]

---

count: false

# Person re-identification

.grid[
.kol-6-12[
.center.width-70[]
]
.kol-6-12[
.center.width-70[]
]
]

.citation[J. Almazan et al., Re-ID Done Right: towards Good Practices for Person re-Identification, arXiv 2018]

---

class: middle, center

.big[Going beyond pairs and triplets ...]

---

# N-pair loss

.center.width-70[]

.caption[Deep metric learning with (left) triplet loss and (right) $(N+1)$-tuplet loss.
$(N+1)$-tuplet loss pushes $N-1$ negative examples all at once.]

- generalize the *triplet loss* with the **N-pair loss**
- sample pairs of similar examples and use as negatives all the other samples in the mini-batch

.citation[K. Sohn, Improved Deep Metric Learning with Multi-class N-pair Loss Objective, NeurIPS 2016]

---

## Exemplar networks

.center.bigger.blue[The image on the left is a distorted crop extracted from an image; which of these crops has the same source image?]

.hidden.center.bigger.red[Easy if robust to the desired transformations (geometry and colour)]

.grid[
.kol-4-12[
.center.width-25[]
]
.kol-8-12[
.center.width-100[]
]
]

.citation[A. Dosovitskiy et al., Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks, PAMI 2015]

---

count: false

## Exemplar networks

.center.bigger.blue[The image on the left is a distorted crop extracted from an image; which of these crops has the same source image?]

.center.bigger.red[Easy if robust to the desired transformations (geometry and colour)]

.grid[
.kol-4-12[
.center.width-25[]
]
.kol-8-12[
.center.width-100[]
]
]

.citation[A. Dosovitskiy et al., Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks, PAMI 2015]

---

## Exemplar networks

.grid[
.kol-4-12[
.center.width-70[]
]
.kol-8-12[
.bigger.green[Pros]
- Representations are invariant to desired transformations
- Requires preservation of fine-grained information

.bigger.red[Cons]
- Choosing the augmentations is important
- Exemplar based: images of the same class or instance are negatives
- Nothing prevents it from focusing on the background
- Original formulation is not scalable (number of “classes” = dataset size)
]
]

.citation[A. Dosovitskiy et al., Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks, PAMI 2015]

---

class: middle

## Some notations

Let $\mathbf{x} \in \mathcal{X}$ be input data with label $y \in \\{1, \dots, L\\}$, and let $f\_{\theta}(\cdot):\mathcal{X}\rightarrow\mathbb{R}^D$ be a network generating an embedding vector $f\_{\theta}(\mathbf{x})$.

We denote:
- $q=f\_{\theta}(\mathbf{x})$ (*query*)
- $\\{\mathbf{x}^{i}\\}$ a set of samples from $\mathcal{X}$
- $k\_{i} = f\_{\theta}(\mathbf{x}^{i})$ the embeddings of $\\{\mathbf{x}^{i}\\}$ as keys (*representations*)

---

## Exemplar networks

.grid[
.kol-3-12[
.center.width-90[]
]
.kol-9-12[
Exemplar ConvNets are not scalable (number of “classes” = number of training images)

- Using $w\_i$ as class prototype prevents explicit comparison between instances, i.e. individual samples
- We can instead use a _non-parametric_ variant that replaces $q^\top w\_j$ with $q^\top k\_i$

$$\mathcal{L}\_{\text{softmax}} (q, c(q)) = - \log\frac{\exp(q^\top w\_{c(q)})}{ \sum\_{c\in C}\exp(q^\top w\_{c})}$$
$$\downarrow$$
$$\mathcal{L}\_{\text{non-param-softmax}}(q) = - \log \frac{\exp(q^\top k\_q )}{ \sum\_{i \in N}\exp(q ^\top k\_{i})}$$
]
]

.citation[A. Dosovitskiy et al., Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks, PAMI 2015]

---

count: false

## Exemplar networks

.grid[
.kol-3-12[
.center.width-90[]
]
.kol-9-12[
- $N$ is the number of training samples; $C$ is the number of classes
- $c(q)$ is the class index of $q$
- $k\_q \in \\{ k\_i\\}$ is the key of a positive sample for $q$

$$\mathcal{L}\_{\text{softmax}} (q, c(q)) = - \log\frac{\exp(q^\top w\_{c(q)})}{ \sum\_{c\in C}\exp(q^\top w\_{c})}$$
$$\downarrow$$
$$\mathcal{L}\_{\text{non-param-softmax}}(q) = - \log \frac{\exp(q^\top k\_q )}{ \sum\_{i \in N}\exp(q ^\top k\_{i})}$$
]
]

.citation[A. Dosovitskiy et al., Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks, PAMI 2015]

---

# Histogram loss

.center.width-70[]

- for a mini-batch, compute all pairwise positive similarities and negative similarities
- compile similarities into distributions (differentiable approximations of distributions)
- the loss aims to push the two distributions apart

.citation[E. Ustinova et al., Learning Deep Embeddings with Histogram Loss, NeurIPS 2016]
---

class: middle, center

## Few classes, few labels

---

# Prototypical networks

Learns to extract class prototype vectors:
- class prototype vector = mean feature vector of training examples
- classify test example to the class with the closest prototype ($L\_2$ distance)
- prototype vectors are similar to classification weights of networks

.grid[
.kol-8-12[
Prototype vector of $k$-th class: $c\_k=\frac{1}{|S\_k|} \sum\_{(\mathbf{x}\_i, y\_i) \in S\_k} f\_{\theta}(\mathbf{x}\_i)$

Classification of example $\mathbf{x}$:
$$p\_{\theta}(y=k|\mathbf{x}) = \frac{\exp(-d(f\_{\theta}(\mathbf{x}), c\_k))}{\sum\_{k'}\exp(-d(f\_{\theta}(\mathbf{x}), c\_{k'}))}$$
]
.kol-4-12[
.center.width-70[]
]
]

.citation[J. Snell et al., Prototypical Networks for Few-shot Learning, NeurIPS 2017]
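---

# Prototypical networks in PyTorch

A minimal sketch of prototype computation and nearest-prototype classification (a toy example with random tensors, not the authors' code); the support and query embeddings are assumed to be already produced by $f\_{\theta}$:

```python
import torch

# hypothetical few-shot episode: 5 classes, 10 support embeddings per class, dimension 64
support_feats = torch.randn(5, 10, 64)   # f_theta(x) for the labelled support examples
query_feats = torch.randn(20, 64)        # f_theta(x) for the queries to classify

# class prototypes: mean embedding of each class' support examples
prototypes = support_feats.mean(dim=1)   # [5, 64]

# squared Euclidean distance between each query and each prototype
dists = torch.cdist(query_feats, prototypes).pow(2)   # [20, 5]

# p(y = k | x) is a softmax over negative distances
probs = torch.softmax(-dists, dim=1)
pred = probs.argmax(dim=1)
```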
---

class: middle, center

## Some classes with labels and some without labels

---

# Imprinting

.grid[
.kol-6-12[
.center.width-90[]
]
.kol-6-12[
The cosine classifier
$$p\_{\theta}(y=k|\mathbf{x}) = \frac{\exp(\tau f\_{\theta}(\mathbf{x})^\top w\_k )} { \sum\_{k'}\exp(\tau f\_{\theta}(\mathbf{x})^\top w\_{k'} )}$$

All vectors and weights are $L\_2$-normalized before the last layer
]
]

.citation[H. Qi et al., Low-Shot Learning with Imprinted Weights, CVPR 2018
S. Gidaris et al., Dynamic Few-Shot Visual Learning without Forgetting, CVPR 2018]

---

count: false

# Imprinting

.grid[
.kol-6-12[
.center.width-90[]
]
.kol-6-12[
.center.width-100[]
.caption[_left_: before imprinting; _right_: after imprinting]
]
]

.citation[H. Qi et al., Low-Shot Learning with Imprinted Weights, CVPR 2018
S. Gidaris et al., Dynamic Few-Shot Visual Learning without Forgetting, CVPR 2018]

---

count: false

# Imprinting

.grid[
.kol-6-12[
.center.width-90[]
]
.kol-6-12[
- use the network as a feature extractor, train as usual
- for a new class, compute the average embedding vector
- use the averaged embedding as the new class proxy/template
- can also be fine-tuned later on
]
]

.citation[H. Qi et al., Low-Shot Learning with Imprinted Weights, CVPR 2018
S. Gidaris et al., Dynamic Few-Shot Visual Learning without Forgetting, CVPR 2018]
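---

# Imprinting in PyTorch

A minimal sketch of weight imprinting for the cosine classifier above (a toy example with random tensors; the temperature, shapes and class counts are made up for illustration):

```python
import torch
import torch.nn.functional as F

new_class_feats = torch.randn(5, 128)    # f_theta(x) for 5 examples of a novel class
W = F.normalize(torch.randn(100, 128))   # L2-normalized weights of the 100 base classes

# imprinting: the new class weight is the L2-normalized mean embedding
w_new = F.normalize(new_class_feats.mean(dim=0), dim=0)
W = torch.cat([W, w_new.unsqueeze(0)], dim=0)    # the classifier now covers 101 classes

# cosine classification of an L2-normalized query embedding q, with temperature tau
q = F.normalize(torch.randn(128), dim=0)
tau = 10.0
probs = torch.softmax(tau * W @ q, dim=0)
```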
---

# Imprinting

## Results

.grid[
.kol-6-12[
.center[
]
]
.kol-6-12[
.center[
]
]
]

.citation[H. Qi et al., Low-Shot Learning with Imprinted Weights, CVPR 2018
S. Gidaris et al., Dynamic Few-Shot Visual Learning without Forgetting, CVPR 2018]

---

# Take-aways

- .big[A specific type of architecture suitable for representation learning]
- .big[In case of many classes and class imbalance, siamese networks might be better, while for standard settings classification is preferred]
- .big[Modern methods leverage siamese networks for unsupervised learning, where the network must recognize a perturbed sample in a large pool of negatives.]

---

class: middle, center

# Multi-modal siamese networks

---

# CLIP

.center.width-50[]

.center[Training an image network and a text network together from $400M$ pairs of images and captions crawled from the Internet]

.citation[A. Radford et al., Learning Transferable Visual Models From Natural Language Supervision, arXiv 2021]

---

# CLIP

.center.width-55[]

.center[Use networks and embeddings in *zero-shot* classification]

.citation[A. Radford et al., Learning Transferable Visual Models From Natural Language Supervision, arXiv 2021]

---

class: middle

## Implementation - numpy-like pseudocode

```python
# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
```

.citation[A. Radford et al., Learning Transferable Visual Models From Natural Language Supervision, arXiv 2021]

---

class: middle

## Works very well in practice

.center.width-40[]

.caption[Zero-shot CLIP is competitive with a fully supervised baseline]

.citation[A. Radford et al., Learning Transferable Visual Models From Natural Language Supervision, arXiv 2021]

---

class: middle

Training CLIP is very expensive:
- ResNet (`RN50x64`): $18$ days on $592$ V100 GPUs
- ViT (Vision Transformer): $12$ days on $256$ V100 GPUs
- batch size: $32{,}768$

Such models, which are expensive to train but can be repurposed for other tasks, are referred to as *foundation models*.

.citation[R. Bommasani et al., On the Opportunities and Risks of Foundation Models, arXiv 2021]
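---

# Using a pretrained CLIP

A short usage sketch of zero-shot classification with the open-source `clip` package from github.com/openai/CLIP (the image path and class names here are made up for illustration):

```python
import torch
import clip                       # open-source package from github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# zero-shot classification: one text prompt per candidate class
class_names = ["dog", "cat", "car"]
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)  # scaled cosine similarities
    probs = logits_per_image.softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```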
---

# CLIPStyler

.center.width-65[]

.center[Change the style of an image using text only]

.citation[Kwon et al., CLIPstyler: Image Style Transfer with a Single Text Condition, CVPR 2022]

---

# CLIPStyler

.center.width-65[]

.center[Content is preserved thanks to alignment in CLIP space]

.citation[Kwon et al., CLIPstyler: Image Style Transfer with a Single Text Condition, CVPR 2022]

---

# CLIPStyler

.center.width-75[]

.center[Works similarly with image-based styling]

.citation[Kwon et al., CLIPstyler: Image Style Transfer with a Single Text Condition, CVPR 2022]

---

# FlexIT

.center.width-90[]

.center[Transform an image using a text indication]

.citation[Couairon et al., FlexIT: Towards Flexible Semantic Image Translation, CVPR 2022]

---

# FlexIT

.center.width-90[]

.center[Optimize only the latent variable of a pre-trained generative model to steer it towards generating what we want]

.citation[Couairon et al., FlexIT: Towards Flexible Semantic Image Translation, CVPR 2022]

---

# Prompt-driven Zero-shot Domain Adaptation

.center.width-90[]

.center[Use textual information to generate features for domains we have not seen yet]

.citation[Dunlap et al., Using Language to Extend to Unseen Domains, ICLR 2023
Fahes et al., PØDA: Prompt-driven Zero-shot Domain Adaptation, arXiv 2022]

---

# Prompt-driven Zero-shot Domain Adaptation

.center.width-90[]

.center[Use textual information to generate features for domains we have not seen yet]

.citation[Dunlap et al., Using Language to Extend to Unseen Domains, ICLR 2023
Fahes et al., PØDA: Prompt-driven Zero-shot Domain Adaptation, arXiv 2022]

---

# Prompt-driven Zero-shot Domain Adaptation

.center.width-100[]

.center[Use textual information to generate features for domains we have not seen yet]

.citation[Dunlap et al., Using Language to Extend to Unseen Domains, ICLR 2023
Fahes et al., PØDA: Prompt-driven Zero-shot Domain Adaptation, arXiv 2022]

---

# Stable Diffusion

.center.width-80[]

.center[Condition image generation on text embeddings from CLIP]

.citation[Rombach et al., High-Resolution Image Synthesis With Latent Diffusion Models, CVPR 2022]

---

# Stable Diffusion

.center.width-100[]

.center[Text-to-image synthesis examples -- starting to look very dated]

.citation[Rombach et al., High-Resolution Image Synthesis With Latent Diffusion Models, CVPR 2022]

---

class: middle, center

.bigger[CLIP is currently widely used in various computer vision and text tasks to kickstart or steer learning.]

.bigger[Intuitively, the structure learned in the feature space from language can steer visual representations more effectively than simple individual labels.]

---

class: end-slide, center
count: false

The end.