Facial recognition! The training dataset consists of a bunch of people (each person is a class or identity), with many pictures of each person. Given a picture, you want the model to predict who it is.
Standard training consists of an image input and a categorical output that is the predicted class. FaceNet is a sort of pretraining/representational training that is still supervised, because you need a labelled dataset. It is meant to produce an encoder whose output can be used in a downstream classification task.
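For instance, once the encoder is trained, recognition can be as simple as a nearest-neighbor lookup in the embedding space. Here is a minimal sketch in PyTorch (my own illustration, not the paper's code; `encoder`, `gallery_imgs`, and `gallery_ids` are hypothetical names):

```python
import torch

def identify(encoder, query_img, gallery_imgs, gallery_ids):
    """Predict an identity by nearest neighbor in the learned embedding space.

    encoder:      a trained FaceNet-style model (hypothetical interface).
    query_img:    (C, H, W) image tensor to identify.
    gallery_imgs: (N, C, H, W) reference images with known identities.
    gallery_ids:  list of N identity labels for the gallery.
    """
    with torch.no_grad():
        q = encoder(query_img.unsqueeze(0))    # (1, d) embedding of the query
        g = encoder(gallery_imgs)              # (N, d) embeddings of the gallery
    dists = torch.cdist(q, g).squeeze(0)       # (N,) distances to each reference
    return gallery_ids[dists.argmin().item()]  # closest embedding wins
```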
The FaceNet triplet loss can be thought of as a kind of class-level contrastive loss. You want your outputs to be similar for images of the same class and different for images of different classes. For FaceNet, similarity is measured by (squared) Euclidean distance between embeddings.
We want to learn a representation $f : \mathbb{R}^D \to \mathbb{R}^d$, where $d$ is the dimension of the learned representation and $D$ is the dimension of the input feature. $f$ is designed to minimize the triplet loss, defined as follows. A triplet $(x_i^a, x_i^p, x_i^n)$ is a triple of images:
- $x_i^a$: the anchor.
- $x_i^p$: the positive example, from the same class as $x_i^a$.
- $x_i^n$: the negative example, from a different class.
where the index $i$ runs over the set of all such triplets concerned.
The triplet loss of $f$ evaluated on the triplet $(x_i^a, x_i^p, x_i^n)$ is defined as:

$$L_i = \left[\, \|f(x_i^a) - f(x_i^p)\|_2^2 \;-\; \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \,\right]_+$$

where $\alpha$ is a margin constant and $[\cdot]_+ = \max(\cdot\,, 0)$. If the loss is zero, then the positive example is at least $\alpha$ closer to the anchor than the negative is, in squared distance.
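As a minimal sketch (my own, not the paper's code), this loss in PyTorch could look like the following, with `F.relu` playing the role of $[\cdot]_+$ and the default margin set to the paper's value of 0.2:

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss on batches of already-computed embeddings.

    f_a, f_p, f_n: (B, d) tensors holding f(x^a), f(x^p), f(x^n),
                   assumed L2-normalized as in the paper.
    alpha:         the margin constant.
    """
    pos_dist = (f_a - f_p).pow(2).sum(dim=1)           # ||f(x^a) - f(x^p)||^2
    neg_dist = (f_a - f_n).pow(2).sum(dim=1)           # ||f(x^a) - f(x^n)||^2
    return F.relu(pos_dist - neg_dist + alpha).mean()  # [.]_+ averaged over batch
```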
Ideally, we should be training a model that minimizes

$$L = \sum_{i=1}^{N} L_i$$

where $N$ is the total number of possible triplets. This is impractical, however, due to a combinatorial explosion, and, as the authors point out, also unnecessary and suboptimal. A better method is to train the model with stochastic gradient descent using an online triplet mining method, described below.
The nuance of FaceNet boils down to the selection of effective triplets.
The authors point out that using all triplets (of a subset of all examples) is unnecessary: most triplets already satisfy the margin and contribute nothing to the gradient, which only slows convergence. Conversely, always picking the very hardest triplets is suboptimal because it is vulnerable toward mislabelled and poorly imaged faces, which tend to dominate the hardest examples. This motivates the authors' online triplet mining method.
In each batch, around 40 images are sampled from each identity (i.e. each person).
An anchor-positive pair is a pair of images from the same class. The authors use all of the anchor-positive pairs in the sampled images; with around 40 images per identity, that amounts to about $\binom{40}{2} \approx 780$ such pairs per identity.
The authors then compute the hard negatives, as defined before, and attach one to each of the anchor-positive pairs to form a big batch of triplets for the gradient computation.
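A rough sketch of this in-batch mining (again my own illustration, building on the `triplet_loss` above): for each anchor, take the nearest in-batch embedding from a different identity as its hard negative, then attach it to every anchor-positive pair. Note the paper actually prefers "semi-hard" negatives, those farther than the positive but still within the margin, to avoid collapse early in training:

```python
import torch

def mine_triplets(emb, labels):
    """Form (anchor, positive, hard negative) index triplets within one batch.

    emb:    (B, d) embeddings from the encoder, assumed L2-normalized.
    labels: (B,) integer identity labels.
    Returns index tensors (a, p, n) usable with triplet_loss above.
    """
    dist = torch.cdist(emb, emb).pow(2)        # (B, B) squared L2 distances
    same = labels[:, None] == labels[None, :]  # (B, B) same-identity mask
    anchors, positives, negatives = [], [], []
    for a in range(emb.size(0)):
        # hardest negative for this anchor: nearest embedding of another identity
        neg_dists = dist[a].masked_fill(same[a], float("inf"))
        n = neg_dists.argmin().item()
        for p in torch.nonzero(same[a]).flatten().tolist():
            if p != a:                         # use every anchor-positive pair
                anchors.append(a); positives.append(p); negatives.append(n)
    return torch.tensor(anchors), torch.tensor(positives), torch.tensor(negatives)
```

Training then alternates embedding a batch, mining, and stepping on `triplet_loss(emb[a], emb[p], emb[n])`.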
The encoder $f$ itself is a deep convolutional network (the paper experiments with both a Zeiler-and-Fergus-style architecture and Inception variants).
FaceNet's training objective promotes latent space clustering by literally penalizing distances between latent representations of images in the same class, and rewarding large distances between those from different classes.
We can think of class belonging as inducing a pseudo-metric on the inputs. Class belonging partitions the set of input features into equivalence classes. Let $[x_i]$ denote the equivalence class containing the $i$-th example. We can define a simple-minded pseudo-metric $\delta$ as

$$\delta(x_i, x_j) = \begin{cases} 0 & \text{if } [x_i] = [x_j], \\ 1 & \text{otherwise.} \end{cases}$$
We can think of FaceNet pre-training as trying to learn a map $f$ that is roughly an isometry with respect to $\delta$. (This is not rigorous, and not even rigorizable in many ways, but I nonetheless find it a good analogy.)
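To make the analogy slightly more concrete (my own gloss, not the paper's): if the triplet loss were exactly zero on every triplet, the embedding would satisfy

$$\|f(x_i) - f(x_j)\|_2^2 + \alpha \;\le\; \|f(x_i) - f(x_k)\|_2^2 \quad \text{whenever } [x_i] = [x_j] \text{ and } [x_i] \neq [x_k],$$

i.e. same-class distances are uniformly separated from different-class distances by the margin, which is a relaxed version of $f$ preserving $\delta$.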
The model outperformed the best published results at the time by a significant margin: it cuts the error rate by about 30%, reaching 99.63% accuracy on LFW. Impressive!