Monday, 8 May 2017

t-distributed Stochastic Neighbor Embedding (tSNE)

Notes taken while following the video lecture on t-SNE, given by Laurens van der Maaten at a Google Tech Talk, available on YouTube.

  • t-SNE has the advantage over other dimensionality reduction methods that it tries to preserve local structure when mapping a high-dimensional space to 2D or 3D.
  • In the high-dimensional (hD) space, t-SNE uses Gaussian (normal) distributions to convert pairwise Euclidean distances between data points into pairwise probabilities.
  • In the low-dimensional (lD) space, t-SNE uses a heavy-tailed Student's t-distribution (with one degree of freedom) over pairwise Euclidean distances to assign pairwise probabilities.
  • In order to make the points in lD reflect their relationships in hD, the Kullback–Leibler divergence between the two distributions is minimised.
  • The reason to use Student's t statistic is its heavy tails: they allow dissimilar points to be placed far apart in lD, which counteracts the "crowding problem" (TODO: I have not yet completely understood why).
  • In real-life applications, the Barnes-Hut approximation, implemented with quadtrees, makes the algorithm efficient enough to handle reasonably large data sets with thousands of samples.
  • Relation to physics: the t-SNE gradient can be viewed as a simulation of an N-body problem, with an attractive spring-like term and a repulsive term, summed over all pairs of points. (TODO: the N-body problem is new to me.)
  • As an extension, multiple-maps t-SNE allows the same feature to be represented in several maps, capturing different contexts.
  • t-SNE can be used to cluster samples and to assist feature selection.
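The probability construction and KL objective from the bullets above can be sketched numerically. A minimal sketch in NumPy, with toy data and a single fixed Gaussian bandwidth (real t-SNE picks a per-point bandwidth via a perplexity search, which is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # toy hD data: 20 points in 5 dimensions
Y = rng.normal(size=(20, 2))  # a hypothetical 2D embedding to score

def squared_dists(Z):
    """Pairwise squared Euclidean distances."""
    sq = np.sum(Z**2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * Z @ Z.T

def p_matrix(X, sigma=1.0):
    """hD affinities: Gaussian kernel, symmetrised and normalised.
    (Fixed sigma for simplicity; t-SNE tunes it per point.)"""
    P = np.exp(-squared_dists(X) / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)          # a point is not its own neighbour
    P = (P + P.T) / 2
    return P / P.sum()

def q_matrix(Y):
    """lD affinities: Student's t with one degree of freedom."""
    Q = 1.0 / (1.0 + squared_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q):
    """KL(P || Q), the quantity t-SNE minimises."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

P, Q = p_matrix(X), q_matrix(Y)
print(kl_divergence(P, Q))  # non-negative; zero only if Q matches P
```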
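The N-body view of the gradient can likewise be sketched. In the standard t-SNE gradient, each point i feels a force from every other point j, attractive where p_ij > q_ij and repulsive where q_ij > p_ij; a small NumPy sketch (the affinity matrix P below is random and only illustrative):

```python
import numpy as np

def tsne_gradient(P, Y):
    """Gradient of KL(P || Q) with respect to the embedding Y.

    W_ij = (p_ij - q_ij) * (1 + ||y_i - y_j||^2)^-1 weights the
    pairwise force: positive entries pull points together (spring
    term), negative entries push them apart (repulsion term)."""
    diff = Y[:, None, :] - Y[None, :, :]         # y_i - y_j
    inv = 1.0 / (1.0 + np.sum(diff**2, axis=2))  # t-kernel weights
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                          # lD affinities
    W = (P - Q) * inv
    # Sum the pairwise forces acting on each point.
    return 4.0 * np.einsum('ij,ijd->id', W, diff)

rng = np.random.default_rng(1)
n = 15
# Hypothetical hD affinities: symmetric, zero diagonal, sums to 1.
P = rng.random((n, n))
P = P + P.T
np.fill_diagonal(P, 0.0)
P /= P.sum()
Y = rng.normal(size=(n, 2))  # current 2D embedding

G = tsne_gradient(P, Y)      # one force vector per point
```

A nice physics sanity check: because the pairwise forces come in equal and opposite pairs, the net force over all points is zero.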
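For practical use, scikit-learn ships a Barnes-Hut implementation of t-SNE (assuming scikit-learn is installed; the data and parameter values below are illustrative, not from the talk):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Two synthetic clusters in 10 dimensions.
X = np.vstack([rng.normal(0, 1, (40, 10)),
               rng.normal(5, 1, (40, 10))])

# method='barnes_hut' is the quadtree-accelerated variant.
emb = TSNE(n_components=2, perplexity=15, method='barnes_hut',
           init='pca', random_state=0).fit_transform(X)
print(emb.shape)  # (80, 2)
```

The resulting 2D coordinates can then be scatter-plotted to inspect cluster structure.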