May 25, 2023
Created by
Neo Yin
Done ✨
Reading Notes

What problem is FaceNet trying to solve?

Facial recognition! The training dataset consists of a bunch of people (each person is a class or identity) and we have many pictures of the same person. Given a picture, you want the model to predict who it is.

How does FaceNet differ from standard supervised learning using convolutional neural networks?

Standard training consists of an image input and a categorical output that is the predicted class. FaceNet is a sort of pretraining/representation learning that is still supervised, because you need a labelled dataset. It is meant to produce an encoder whose output would be used in a downstream classification task.

What is the so-called triplet loss?

The FaceNet triplet loss can be thought of as a kind of class-level contrastive loss. You want your outputs to be similar for images of the same class and different for images of different classes. For FaceNet, similarity is measured by (squared) $L_2$ distance.

We want to learn a representation $f:\mathbb{R}^n \to \mathbb{R}^m$, where $m$ is the dimension of the learned representation and $n$ is the dimension of the input features. $f$ is trained to minimize the triplet loss, defined as follows. A triplet consists of three images:

  • $x_i^a$: the anchor.
  • $x_i^p$: the positive example, from the same class as $x_i^a$.
  • $x_i^n$: the negative example, from a different class.

where $i$ runs over the set of all triplets under consideration.

The triplet loss of $f$ evaluated on the triplet $(x_i^a, x_i^p, x_i^n)$ is defined as:

$$\Big[\|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \Big]_+$$

where $\alpha > 0$ is a margin constant and $[z]_+ = \max(z, 0)$. If the loss is zero, then the positive example is closer to the anchor than the negative by at least $\alpha$ in squared distance:

$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha \le \|f(x_i^a) - f(x_i^n)\|_2^2.$$
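As a concrete reading of the formula above, here is a minimal NumPy sketch that evaluates the loss on embeddings that have already been computed. The function name `triplet_loss`, the toy 2-dimensional embeddings, and the default margin of 0.2 are illustrative assumptions, not something taken from these notes.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Per-triplet hinge loss on precomputed embeddings.

    f_a, f_p, f_n: arrays of shape (batch, m) holding f(x^a), f(x^p), f(x^n).
    alpha: margin constant (0.2 is an assumed, illustrative value).
    """
    d_pos = np.sum((f_a - f_p) ** 2, axis=-1)  # squared L2, anchor to positive
    d_neg = np.sum((f_a - f_n) ** 2, axis=-1)  # squared L2, anchor to negative
    return np.maximum(d_pos - d_neg + alpha, 0.0)  # the [.]_+ operation

# Two toy triplets: the first satisfies the margin, the second violates it.
anchor = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[0.1, 0.0], [1.0, 0.0]])
negative = np.array([[1.0, 0.0], [0.5, 0.0]])
losses = triplet_loss(anchor, positive, negative, alpha=0.2)
```

The first triplet has its positive well inside the margin, so its loss is zero and it would contribute no gradient; the second has the negative closer than the positive and is penalized.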

Ideally, we should be training a model that minimizes

$$\sum_{i=1}^N \Big[\|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \Big]_+,$$

where $N$ is the total number of possible triplets. This is impractical, however, due to a combinatorial explosion, and, as the authors point out, it is also unnecessary and suboptimal. A better method is to train the model with stochastic gradient descent using an online triplet mining method, described further below.
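To get a feel for the combinatorial explosion, the sketch below counts ordered (anchor, positive, negative) triplets for a dataset described only by how many images each identity has. The helper `count_triplets` and the dataset sizes are hypothetical and chosen purely for illustration.

```python
def count_triplets(images_per_identity):
    """Count ordered (anchor, positive, negative) triplets.

    images_per_identity: list with the number of images of each identity.
    """
    total_images = sum(images_per_identity)
    total = 0
    for k in images_per_identity:
        anchor_positive_pairs = k * (k - 1)  # ordered pairs within one identity
        negatives = total_images - k         # every image of any other identity
        total += anchor_positive_pairs * negatives
    return total

small = count_triplets([40] * 10)    # 10 identities, 40 images each
large = count_triplets([40] * 1000)  # 1000 identities, 40 images each
print(small, large)  # 5616000 62337600000
```

Going from 10 to 1000 identities multiplies the number of images by 100 but the number of triplets by roughly 10,000, which is why enumerating all of them does not scale.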


What motivates the authors' specific training method?

The nuance of FaceNet boils down to the selection of effective triplets.

The authors point out that using all triplets (of a subset of all examples) is unnecessary and suboptimal, because triplets that already satisfy the margin contribute zero loss and slow down convergence. The immediate reaction is to use only the bad points, the ones for which the triplet loss is $> 0$: namely the hard positives, $\operatorname{argmax}_{x_i^p} \|f(x_i^a) - f(x_i^p)\|_2^2$, and the hard negatives, $\operatorname{argmin}_{x_i^n} \|f(x_i^a) - f(x_i^n)\|_2^2$. However, this is vulnerable to mislabelled and poorly imaged faces, which would dominate the chosen triplets in the later part of training, and it is also computationally inefficient.
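The argmax/argmin selection can be sketched for a single anchor as follows. This is only an illustration on toy data: the function name `hardest_examples` is made up, and it assumes the anchor has at least one other image of its own class and at least one image of another class.

```python
import numpy as np

def hardest_examples(embeddings, labels, anchor_idx):
    """Return (hard positive index, hard negative index) for one anchor."""
    labels = np.asarray(labels)
    # Squared L2 distance from the anchor to every embedding.
    d = np.sum((embeddings - embeddings[anchor_idx]) ** 2, axis=1)
    same = labels == labels[anchor_idx]
    same[anchor_idx] = False  # the anchor is not its own positive
    pos_candidates = np.flatnonzero(same)
    neg_candidates = np.flatnonzero(labels != labels[anchor_idx])
    hard_pos = pos_candidates[np.argmax(d[pos_candidates])]  # farthest same-class image
    hard_neg = neg_candidates[np.argmin(d[neg_candidates])]  # closest other-class image
    return int(hard_pos), int(hard_neg)

emb = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [0.5, 0.0], [3.0, 0.0]])
lab = [0, 0, 0, 1, 1]
hp, hn = hardest_examples(emb, lab, anchor_idx=0)
print(hp, hn)  # 2 3
```

A single mislabelled image would show up here as an extremely "hard" positive or negative, which is the failure mode described above.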

This hence motivates the authors' online triplet mining method.

Describe the so-called "online triplet mining method".

In each mini-batch, around 40 images are sampled per identity (that is, per person).

An anchor-positive pair is a pair $(x_i^a, x_i^p)$ from the same class. The authors use all of the anchor-positive pairs in the sampled images. That amounts to about $\binom{40}{2} \times \text{\# identities}$ such pairs in total.

The authors then compute the hard negatives, as defined before but restricted to the mini-batch, and add one to each of the anchor-positive pairs to form a big batch of triplets for the gradient computation.
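The procedure described above can be sketched on one mini-batch of precomputed embeddings. This is a simplified illustration, not the authors' implementation: `mine_triplets` is a hypothetical name, the hardest in-batch negative is chosen relative to the anchor, and the batch is assumed to contain at least two identities.

```python
from itertools import combinations
from math import comb
import numpy as np

def mine_triplets(embeddings, labels):
    """Build (anchor, positive, negative) index triplets from one mini-batch.

    Uses every anchor-positive pair and, for each anchor, the closest
    negative in the batch (the hard negative).
    """
    labels = np.asarray(labels)
    # Pairwise squared L2 distances, shape (batch, batch).
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    triplets = []
    for a, p in combinations(range(len(labels)), 2):
        if labels[a] != labels[p]:
            continue  # not an anchor-positive pair
        neg_candidates = np.flatnonzero(labels != labels[a])
        n = neg_candidates[np.argmin(dist[a, neg_candidates])]
        triplets.append((a, p, int(n)))
    return triplets

emb = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [0.5, 0.0], [3.0, 0.0]])
lab = [0, 0, 0, 1, 1]
triplets = mine_triplets(emb, lab)
pairs_per_identity = comb(40, 2)  # anchor-positive pairs from 40 images of one person
print(triplets)            # [(0, 1, 3), (0, 2, 3), (1, 2, 3), (3, 4, 1)]
print(pairs_per_identity)  # 780
```

With 40 images per identity, each identity alone contributes 780 anchor-positive pairs, so the batch of triplets is large even before negatives are attached. The paper also discusses relaxing the choice to semi-hard negatives; that variant is not shown here.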

What is the authors' model architecture?

A convolutional encoder.

How does FaceNet's architecture promote meaningful latent space clustering?

FaceNet promotes latent space clustering through its training objective rather than anything special in the encoder: the triplet loss literally penalizes distances between latent representations of images in the same class and rewards distances between those from different classes.

Class membership can be thought of as dividing the set of input features into equivalence classes, that is, as a partition of that set. Let $[x_i]$ denote the equivalence class containing the $i$-th example. We can define a simple-minded pseudo-metric as

$$\rho(x_i, x_j) = I([x_i] \neq [x_j]),$$

which is $0$ for two examples of the same class and $1$ otherwise.

We can think of FaceNet pre-training as trying to learn a map $f:(\mathbb{R}^n, \rho) \to (\mathbb{R}^m, \|\cdot\|_2)$ that is roughly an isometry. (This is not rigorous, and in many ways cannot even be made rigorous, but I nonetheless find it a good analogy.)
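To make the analogy a bit more tangible, the sketch below builds the class pseudo-metric as a matrix and compares it with pairwise $L_2$ distances between embeddings. It takes $\rho$ to be 0 within a class and 1 across classes, the convention under which it is a pseudo-metric. The function names and the toy embeddings are assumptions for illustration only.

```python
import numpy as np

def class_pseudo_metric(labels):
    """rho[i, j] = 0 if examples i and j share a class, 1 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(float)

def pairwise_l2(embeddings):
    """Matrix of L2 distances between all pairs of embeddings."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

labels = [0, 0, 1]
rho = class_pseudo_metric(labels)

# An idealized embedding: each class collapses to a point, classes sit 1 apart.
emb = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]])
gap = np.max(np.abs(pairwise_l2(emb) - rho))  # 0.0 means an exact isometry here
```

A trained encoder would not reach a gap of exactly zero, and with more than a few classes it could not keep all of them exactly one unit apart, which is one reason the isometry picture is only an analogy.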

What are some notable results by the authors?

The model outperforms the best published results at the time by a significant margin: the authors report cutting the error rate by about 30% on both LFW and YouTube Faces DB. Impressive!