May 25, 2023
Created by
Neo Yin
Done โœจ
Reading Notes
My goal for this reading is only to have a high-level step-by-step understanding of the algorithm.

How do you use t-SNE?

Step 1:

You have a collection of high-dimensional points P1,...,PnP_1,...,P_n which are fixed. You want to learn a collection of two-dimensional points Q1,...,QnQ_1,...,Q_n that have similar metric and clustering behaviour as their higher dimensional counterpart.

Fixing a point PiP_i, for each of the other points PjP_j, you calculate the Euclidean distance โˆฃโˆฃPiโˆ’Pjโˆฃโˆฃ2||P_i-P_j||_2 which is then passed into a Gaussian distribution (unnormalized) and normalized by the sum of all such pairwise quantities (ranging over jj):

pjโˆฃi=expโก(โˆ’โˆฃโˆฃPiโˆ’Pjโˆฃโˆฃ22/2ฯƒi2)โˆ‘kโ‰ iexpโก(โˆ’โˆฃโˆฃPiโˆ’Pkโˆฃโˆฃ22/2ฯƒi2).p_{j|i} = \frac{\exp (-||P_i - P_j||_2^2 / 2\sigma_i^2)}{\sum_{k\neq i} \exp (-||P_i-P_k||_2^2 / 2\sigma_i^2)} .

Then we define the symmetrization, where NN is the ambient dimension, (where the quantity pjโˆฃip_{j|i} is to be interpreted as the probability that point jj would be chosen as cluster neighbour of point ii),

pij=piโˆฃj+pjโˆฃi2N.p_{ij} = \frac{p_{i|j} + p_{j|i}}{2N}.

Step 2:

The parameters ฯƒi\sigma_i need to be chosen, based on a user-specified perplexity parameter. The perplexity of a point PiP_i is defined as the binary exponential of the Shannon entropy,

Perp(Pi)=2โˆ’โˆ‘pjโˆฃilogโก2pjโˆฃi.Perp(P_i) = 2^{-\sum p_{j|i} \log_2 p_{j|i}}.

The user would choose what she wants to be Perp(Pi)Perp(P_i) for each ii. Then, a search would be performed to find the ฯƒi\sigma_i value that gives rise to this perplexity value.

Step 3:

For each of the QiQ_i, we calculate a similar quantity as in Step 1, but with a student t-distribution with 1 degree of freedom instead of Gaussian. You compute a cost function using a Kullback-Leibler divergence,

C=โˆ‘iโˆ‘jpjโˆฃilogโกpjโˆฃiqjโˆฃi.C = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}.

Which can then be optimized over the choice of low-dimensional point coordinates QiQ_{i}.

What are some strengths and weaknesses of using t-SNE (for clustering applications)?

Strengths of using t-SNE for clustering:

  1. Visualization: t-SNE is particularly effective at creating a visual representation of high-dimensional data. It can provide intuitive insights about the structure of the data, including potential clusters.
  2. Preservation of Local Structure: t-SNE is designed to preserve the local structure of the data, making it effective at keeping similar instances close together in the reduced space.

Weaknesses of using t-SNE for clustering:

  1. Arbitrariness of Clusters: t-SNE does not provide explicit cluster assignments. Any clusters are visually interpreted and can be subjective or change with different runs of the algorithm, especially given t-SNE's non-deterministic nature.
  2. Difficulty Interpreting Distances and Densities: t-SNE is not designed to preserve distances between clusters or global structure. The distances between clusters or the relative sizes of clusters in a t-SNE plot may not hold any meaningful interpretation.
  3. Sensitivity to Hyperparameters: The output of t-SNE is significantly influenced by its hyperparameters, particularly the perplexity parameter. Different settings can lead to different visualizations, which can make the clustering interpretation challenging.
  4. Lack of Out-of-Sample Extension: t-SNE doesn't provide a straightforward way to map new, unseen data points to the reduced space, which is often needed in clustering tasks.