How do you use t-SNE?
Step 1:
You have a fixed collection of high-dimensional points $x_1, \dots, x_N \in \mathbb{R}^d$. You want to learn a collection of two-dimensional points $y_1, \dots, y_N \in \mathbb{R}^2$ whose distances and cluster structure resemble those of their high-dimensional counterparts.
Fixing a point $x_i$, for each of the other points $x_j$ you calculate the Euclidean distance $\|x_i - x_j\|$, pass it through an unnormalized Gaussian with bandwidth $\sigma_i$, and normalize by the sum of all such quantities (ranging over $k \neq i$):
$$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}, \qquad p_{i|i} = 0.$$
Then we define the symmetrization, where $N$ is the number of data points (and the quantity $p_{j|i}$ is to be interpreted as the probability that point $x_i$ would choose point $x_j$ as its cluster neighbour):
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}.$$
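As a rough illustration of Step 1, the sketch below computes the conditional probabilities and their symmetrization in plain Python. The function names, the toy points, and the choice of a fixed bandwidth of 1.0 for every point are assumptions made for the example; in the actual method the bandwidths come from the perplexity search in Step 2.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def conditional_probabilities(points, sigmas):
    """Return cond[i][j] = p_{j|i}, using one Gaussian bandwidth per point."""
    n = len(points)
    cond = []
    for i in range(n):
        # Unnormalized Gaussian weights; a point is never its own neighbour.
        weights = [
            0.0 if j == i
            else math.exp(-squared_distance(points[i], points[j])
                          / (2.0 * sigmas[i] ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        cond.append([w / total for w in weights])
    return cond


def symmetrize(cond):
    """Return joint[i][j] = (p_{j|i} + p_{i|j}) / (2N)."""
    n = len(cond)
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]


# Toy data (hypothetical): two tight pairs of points in three dimensions.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
cond = conditional_probabilities(points, sigmas=[1.0] * len(points))
joint = symmetrize(cond)
```

Each row of `cond` sums to 1, while `joint` sums to 1 over all pairs, which is what the division by 2N achieves.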
Step 2:
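The bandwidths $\sigma_i$ need to be chosen, based on a user-specified perplexity parameter. The perplexity of a point $x_i$ is defined as the binary exponential of the Shannon entropy of the conditional distribution $P_i = (p_{j|i})_{j \neq i}$,
$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j \neq i} p_{j|i} \log_2 p_{j|i}.$$
The user chooses the value she wants $\mathrm{Perp}(P_i)$ to take for each $i$ (in practice, usually one value shared by all points). Then a binary search is performed to find the $\sigma_i$ that gives rise to this perplexity value.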
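The search in Step 2 might look like the following sketch for a single point. It relies on the fact that perplexity grows as the Gaussian bandwidth grows, so bisection applies. The function names, the search bounds, the iteration count, and the toy squared distances are all assumptions for illustration, and the target must lie strictly between 1 and the number of other points.

```python
import math


def row_probabilities(sq_dists, sigma):
    """Conditional probabilities for one point, given its squared distances
    to the other points and a Gaussian bandwidth sigma."""
    # Subtracting the smallest distance leaves the normalized result
    # unchanged but keeps the sum from underflowing to zero for tiny sigma.
    d_min = min(sq_dists)
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probs):
    """2 raised to the Shannon entropy (in bits) of a distribution."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy


def find_sigma(sq_dists, target, lo=1e-3, hi=1e3, iterations=100):
    """Bisect on a log scale for the sigma whose perplexity hits target."""
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)  # geometric midpoint of the bracket
        if perplexity(row_probabilities(sq_dists, mid)) > target:
            hi = mid  # distribution too flat: shrink the bandwidth
        else:
            lo = mid  # distribution too peaked: widen the bandwidth
    return math.sqrt(lo * hi)


# Toy squared distances (hypothetical) from one point to five others.
sq_dists = [0.5, 1.0, 2.0, 9.0, 16.0]
sigma = find_sigma(sq_dists, target=3.0)
achieved = perplexity(row_probabilities(sq_dists, sigma))
```

A perplexity of 3 here can be read loosely as asking each point to have about three effective neighbours.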
Step 3:
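For each pair of low-dimensional points $y_i, y_j$, we calculate a quantity similar to the one in Step 1, but with a Student t-distribution with 1 degree of freedom in place of the Gaussian:
$$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}, \qquad q_{ii} = 0.$$
You then compute a cost function as the Kullback-Leibler divergence between the two distributions,
$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},$$
which can then be minimized over the choice of low-dimensional point coordinates $y_1, \dots, y_N$, typically by gradient descent.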
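To make Step 3 concrete, here is a minimal sketch that minimizes the Kullback-Leibler cost with plain gradient descent. It uses the standard t-SNE gradient, $\partial C / \partial y_i = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}$, which is not derived in the text above. The hand-written matrix `P`, the learning rate, the step count, and the function names are assumptions for the example; practical implementations add refinements such as momentum and early exaggeration, which are omitted here.

```python
import math
import random


def low_dim_affinities(Y):
    """Return (Q, kernel), where kernel[i][j] = 1 / (1 + ||y_i - y_j||^2)
    and Q is the kernel normalized over all pairs with i != j."""
    n = len(Y)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                kernel[i][j] = 1.0 / (1.0 + d2)
    total = sum(map(sum, kernel))
    return [[k / total for k in row] for row in kernel], kernel


def kl_divergence(P, Q):
    """Sum of p_ij * log(p_ij / q_ij) over the entries with p_ij > 0."""
    return sum(p * math.log(p / q)
               for p_row, q_row in zip(P, Q)
               for p, q in zip(p_row, q_row) if p > 0.0)


def gradient(P, Q, kernel, Y):
    """Gradient of the cost with respect to every low-dimensional point."""
    n, dim = len(Y), len(Y[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            coeff = 4.0 * (P[i][j] - Q[i][j]) * kernel[i][j]
            for k in range(dim):
                grad[i][k] += coeff * (Y[i][k] - Y[j][k])
    return grad


def embed(P, steps=500, learning_rate=0.5, seed=0):
    """Plain gradient descent from a small random start; returns the final
    coordinates and the cost recorded before each step and at the end."""
    rng = random.Random(seed)
    Y = [[rng.gauss(0.0, 1e-2) for _ in range(2)] for _ in P]
    costs = []
    for _ in range(steps):
        Q, kernel = low_dim_affinities(Y)
        costs.append(kl_divergence(P, Q))
        grad = gradient(P, Q, kernel, Y)
        Y = [[y - learning_rate * g for y, g in zip(y_i, g_i)]
             for y_i, g_i in zip(Y, grad)]
    costs.append(kl_divergence(P, low_dim_affinities(Y)[0]))
    return Y, costs


# Toy joint probabilities (hypothetical): points 0 and 1 are strong
# neighbours, as are points 2 and 3. The entries sum to 1.
P = [[0.0, 0.24, 0.005, 0.005],
     [0.24, 0.0, 0.005, 0.005],
     [0.005, 0.005, 0.0, 0.24],
     [0.005, 0.005, 0.24, 0.0]]
Y, costs = embed(P)
```

Changing `seed` changes the starting coordinates and therefore the final layout, which is one source of the run-to-run variation t-SNE is known for.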
Strengths of using t-SNE for clustering:
- Visualization: t-SNE is particularly effective at creating a visual representation of high-dimensional data. It can provide intuitive insights about the structure of the data, including potential clusters.
- Preservation of Local Structure: t-SNE is designed to preserve the local structure of the data, making it effective at keeping similar instances close together in the reduced space.
Weaknesses of using t-SNE for clustering:
- Arbitrariness of Clusters: t-SNE does not provide explicit cluster assignments. Any clusters are visually interpreted and can be subjective or change with different runs of the algorithm, especially given t-SNE's non-deterministic nature.
- Difficulty Interpreting Distances and Densities: t-SNE is not designed to preserve distances between clusters or global structure. The distances between clusters or the relative sizes of clusters in a t-SNE plot may not hold any meaningful interpretation.
- Sensitivity to Hyperparameters: The output of t-SNE is significantly influenced by its hyperparameters, particularly the perplexity parameter. Different settings can lead to different visualizations, which can make the clustering interpretation challenging.
- Lack of Out-of-Sample Extension: t-SNE doesn't provide a straightforward way to map new, unseen data points to the reduced space, which is often needed in clustering tasks.