Facial recognition! The training dataset consists of a bunch of people (each person is a class or identity) and we have many pictures of the same person. Given a picture, you want the model to predict who it is.

Standard training consists of an image input and a categorical output that is the predicted class. FaceNet is a sort of pretraining/representational training that is still supervised because it needs a labelled dataset. It is meant to produce an encoder whose output is then used in a downstream classification task.

The FaceNet triplet loss can be thought of as a kind of class-level contrastive loss. You want your outputs to be similar for images of the same class and different for images of different classes. For FaceNet, similarity is measured by $L_2$ distance.

We want to learn a representation $f:\mathbb{R}^n \to \mathbb{R}^m$, where $n$ is the dimension of the input features and $m$ is the dimension of the learned representation. $f$ is trained to minimize the triplet loss defined as follows. A *triplet* is a triple of images

- $x_i^a$ — the anchor
- $x_i^p$ — the positive example, from the same class as $x_i^a$.
- $x_i^n$ — the negative example, from a different class.

where $i$ runs over the set of all such triplets under consideration.

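The triplet loss of $f$ evaluated on the triplet $(x_i^a, x_i^p, x_i^n)$ is defined as:

$$
\max\left( ||f(x_i^a) - f(x_i^p)||_2^2 - ||f(x_i^a) - f(x_i^n)||_2^2 + \alpha,\ 0 \right)
$$

where $\alpha > 0$ is a margin constant. If the loss is zero, then the positive example is closer to the anchor than the negative example by at least $\alpha$, measured in squared $L_2$ distance.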
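To make the definition concrete, here is a minimal NumPy sketch of the per-triplet loss. It assumes the usual hinge form from the FaceNet paper (squared $L_2$ distances plus a margin) and that the embeddings $f(x)$ have already been computed. The function name `triplet_loss` and the default `alpha=0.2` are illustrative choices, not something fixed by these notes.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Hinge-style triplet loss on precomputed embeddings.

    f_a, f_p, f_n: arrays of shape (m,) or (batch, m) holding the
    anchor, positive and negative embeddings.
    alpha: margin (illustrative default).
    """
    d_pos = np.sum((f_a - f_p) ** 2, axis=-1)  # squared L2, anchor to positive
    d_neg = np.sum((f_a - f_n) ** 2, axis=-1)  # squared L2, anchor to negative
    return np.maximum(d_pos - d_neg + alpha, 0.0)

anchor = np.array([0.0, 0.0])
close_point = np.array([0.1, 0.0])
far_point = np.array([1.0, 0.0])

# Positive is close, negative is far: the margin is satisfied, loss is zero.
easy = triplet_loss(anchor, close_point, far_point)
# Roles swapped: the "positive" is far and the "negative" is close.
violating = triplet_loss(anchor, far_point, close_point)
```

The first triplet satisfies the margin and contributes nothing, while the second is penalized by roughly $1.0 - 0.01 + 0.2 = 1.19$.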

Ideally, we should be training a model that minimizes

$$
\sum_{i=1}^{N} \max\left( ||f(x_i^a) - f(x_i^p)||_2^2 - ||f(x_i^a) - f(x_i^n)||_2^2 + \alpha,\ 0 \right)
$$

where $N$ is the total number of possible triplets. This is impractical, however, due to a combinatorial explosion, and, as the authors point out, it is also unnecessary and suboptimal. A better method is to train the model with stochastic gradient descent using an online triplet mining method.

The nuance of FaceNet boils down to the selection of effective triplets.
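A quick back-of-the-envelope count shows how fast the number of possible triplets grows. The sketch below assumes a balanced dataset in which every identity has the same number of images, and it counts ordered (anchor, positive, negative) triplets. The helper `count_triplets` is hypothetical and only illustrates the arithmetic.

```python
def count_triplets(num_identities, images_per_identity):
    """Number of ordered (anchor, positive, negative) triplets in a
    balanced dataset (same number of images for every identity)."""
    k, n = num_identities, images_per_identity
    anchors = k * n                    # every image can serve as an anchor
    positives_per_anchor = n - 1       # other images of the same identity
    negatives_per_anchor = (k - 1) * n # all images of the other identities
    return anchors * positives_per_anchor * negatives_per_anchor

small = count_triplets(num_identities=10, images_per_identity=40)
large = count_triplets(num_identities=1000, images_per_identity=40)
print(small)  # 5616000
print(large)  # 62337600000
```

Under this assumption, even a toy dataset of 10 identities with 40 images each already yields millions of triplets, and the count grows quadratically in the number of identities.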

The authors point out that using all triplets (of a subset of all examples) is unnecessary. It is also suboptimal, because most triplets already satisfy the margin constraint, so they contribute nothing to the loss and slow down convergence.

The immediate reaction is to use only bad points: ones for which the triplet loss is $>0$, or the *hard positives*, which are the $\argmax_{x_i^p} || f(x_i^a) - f(x_i^p)||_2^2$, and the *hard negatives*, which are the $\argmin_{x_i^n} || f(x_i^a) - f(x_i^n)||_2^2$. However, this is vulnerable to mislabelled and poorly imaged faces, which would dominate the chosen triplets in the later part of training, and it is also computationally inefficient.
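The hard positive and hard negative for a single anchor can be sketched in a few lines of NumPy. This assumes the embeddings for a set of images are already computed and that the anchor has at least one positive and one negative in that set. The function name `hardest_pair_for_anchor` is made up for illustration.

```python
import numpy as np

def hardest_pair_for_anchor(embeddings, labels, anchor_idx):
    """Return (hard_positive_idx, hard_negative_idx) for one anchor.

    embeddings: array of shape (num_images, m)
    labels:     array of shape (num_images,) with class ids
    Assumes the anchor has at least one positive and one negative.
    """
    sq_dists = np.sum((embeddings - embeddings[anchor_idx]) ** 2, axis=1)

    same_class = labels == labels[anchor_idx]
    same_class[anchor_idx] = False          # the anchor is not its own positive
    other_class = labels != labels[anchor_idx]

    # Hard positive: farthest same-class image (argmax over positives).
    hard_pos = np.argmax(np.where(same_class, sq_dists, -np.inf))
    # Hard negative: closest other-class image (argmin over negatives).
    hard_neg = np.argmin(np.where(other_class, sq_dists, np.inf))
    return int(hard_pos), int(hard_neg)

embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [0.5, 0.0], [0.6, 0.0], [2.0, 0.0]])
labels = np.array([0, 0, 0, 1, 1])
pair = hardest_pair_for_anchor(embeddings, labels, anchor_idx=0)
print(pair)  # (2, 3)
```

For anchor 0, the farthest same-class point is index 2 and the nearest other-class point is index 3. A single mislabelled point would be picked by exactly this kind of argmax or argmin.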

This motivates the authors’ online triplet mining method.

In each batch, around 40 images are sampled per identity (that is, per person).

An anchor-positive pair is a pair $(x_i^a, x_i^p)$ from the same class. The authors use *all* of the anchor-positive pairs in the sampled images. That amounts to about $\binom{40}{2} \times \text{\# identities}$ such pairs in total.

The authors then compute the hard negative, as defined before, for each anchor-positive pair and add it to the pair, forming a big batch of triplets for the gradient computation.
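The mining step described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the batch embeddings are already computed, picks the single closest negative in the batch for each anchor, and treats $(a, p)$ and $(p, a)$ as separate anchor-positive pairs, so it yields twice the unordered $\binom{n}{2}$ count per identity. The name `mine_triplets` is hypothetical.

```python
import numpy as np

def mine_triplets(embeddings, labels):
    """Build (anchor, positive, negative) index triplets from one batch:
    all anchor-positive pairs, each completed with the anchor's
    closest in-batch negative."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)   # pairwise squared L2 distances

    triplets = []
    for a in range(len(labels)):
        neg_mask = labels != labels[a]
        if not neg_mask.any():
            continue                         # no negative available in the batch
        n = int(np.argmin(np.where(neg_mask, sq_dists[a], np.inf)))
        for p in np.flatnonzero(labels == labels[a]):
            if p != a:
                triplets.append((a, int(p), n))
    return triplets

embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [0.5, 0.0], [0.6, 0.0], [2.0, 0.0]])
labels = np.array([0, 0, 0, 1, 1])
triplets = mine_triplets(embeddings, labels)
print(len(triplets))  # 8
```

With three images of one identity and two of another, this gives $3 \cdot 2 + 2 \cdot 1 = 8$ triplets. At 40 images per identity, the unordered pair count is $\binom{40}{2} = 780$ per identity.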

A convolutional encoder.

FaceNet’s architecture promotes latent space clustering by literally penalizing distances between latent representations of images in the same class, and promoting distances between those from different classes.

Class membership can be thought of as partitioning the set of input features into equivalence classes. Let $[x_i]$ denote the equivalence class containing the $i$-th example. We can define a simple-minded pseudo-metric as

$$
\rho(x_i, x_j) = \begin{cases} 0 & \text{if } [x_i] = [x_j], \\ 1 & \text{otherwise.} \end{cases}
$$

We can think of FaceNet pre-training as trying to learn a map $f:(\mathbb{R}^n, \rho) \to (\mathbb{R}^m, ||\cdot||_2)$ that is roughly an isometry. (This is not rigorous and not even rigorizable in many ways, but nonetheless I find it a good analogy).
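One way to play with this analogy numerically is to compare pairwise embedding distances against a class-based pseudo-metric. The sketch below assumes a 0/1 pseudo-metric (0 within a class, 1 across classes). Both that choice and the helper names `rho` and `isometry_gap` are illustrative assumptions, and the resulting number is only a rough diagnostic, not something the paper defines.

```python
import numpy as np

def rho(labels):
    """Assumed 0/1 pseudo-metric: 0 for same class, 1 for different classes."""
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(float)

def isometry_gap(embeddings, labels):
    """Mean absolute difference between pairwise L2 distances in the
    embedding space and the class pseudo-metric rho."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=-1))
    return float(np.mean(np.abs(dists - rho(labels))))

labels = [0, 0, 1]
# Same-class points coincide, other class sits at distance 1: an exact match.
collapsed = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]])
# Points spread evenly on a line: class structure is only partly respected.
spread = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]])

gap_collapsed = isometry_gap(collapsed, labels)
gap_spread = isometry_gap(spread, labels)
```

Here `gap_collapsed` is exactly zero, while `gap_spread` is positive, matching the intuition that the embedding should pull each class together and keep different classes apart.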

The model outperforms the best published result at the time by a *significant* margin: the paper reports cutting the error rate by 30%. Impressive!