Being able to visualize your dataset on a 2D scatter plot is probably something you want to do at the beginning of every machine learning project you tackle.
After dabbling in machine learning for a while and building your first models, you’re probably going to come to the conclusion that the method you’re using is not as important as the data you have at your disposal, with quantity and quality being equally important. If you’ve searched Google for the terms t-SNE and UMAP, it is very likely that you looked them up for a practical purpose: visualizing the dataset you’re working with. If you’ve come across this article then you’re in luck.
On the dimensionality of data
Data comes in many shapes and forms, from grids of pixels (digital images) to floating point arrays (audio), each of which can have many different representations depending on the situation. However, the dimensionality of this data is generally very large, which is usually inconvenient, especially when it comes to visualizing our dataset. This’ll make more sense in a second. Let’s say we have a dataset that consists of a number of data-points, each of which consists of 2 numerical features. This is easy to plot: we just treat the first feature as the X coordinate of the data-point, and the second feature as its Y coordinate. Now, assume we have a dataset with 3 features per data-point. Here we can simply add another axis, making it a 3 dimensional plot. Ok, but where do we go from here? How would you visualize 4 dimensional data? There might be some creative way to do it, but things tend to get messy here. The additional dimensions introduce an overhead of complexity in our plots and in the algorithms that process this data. R. Bellman called this ‘the curse of dimensionality’.
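To get a feel for why extra dimensions make life harder, here is a small sketch of one commonly cited symptom of the curse of dimensionality: as the number of dimensions grows, the nearest and the farthest point from a given query point end up almost equally far away. The function name `distance_contrast`, the uniform random data and the chosen dimensions are all assumptions made for this illustration.

```python
import numpy as np


def distance_contrast(n_dims, n_points=1000, seed=0):
    """Relative gap between the farthest and nearest point from a random query."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform points in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


# Compare a few dimensionalities, from a 2D plot up to 1000 features.
contrasts = {d: distance_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dimensions: relative contrast = {c:.2f}")
```

The contrast shrinks as the dimensionality goes up, which is one reason why distance-based methods become harder to use on raw high dimensional data.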
The good thing here is that most of the dimensions are often not really necessary, or even relevant to us. We might be able to find a subset of the features in our dataset, or a smaller set of new features derived from them, that is almost as representative of the data points. This is where dimensionality reduction comes into play, and it essentially does what we just described: it reduces the number of dimensions of a dataset in such a manner that the resulting lower dimensional representation still reflects the structure of the original feature set.
There are many dimensionality reduction techniques, each suited to a different purpose, but in this post we will have a look at two very specific algorithms that are primarily used for visualization of high dimensional data, namely t-SNE and UMAP.
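As a minimal sketch of the idea, the snippet below reduces a toy dataset from 10 features to 2 with PCA, computed by hand through a singular value decomposition. PCA is used here only because it is short to write down; it is a different, linear technique and not one of the two algorithms this post is about. The toy data and all variable names are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 points with 10 features that are really driven
# by only 2 underlying factors, plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# PCA via singular value decomposition of the centred data.
X_centred = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centred, full_matrices=False)

# Project every point onto the 2 directions with the largest variance.
X_reduced = X_centred @ components[:2].T

# Fraction of the total variance kept by those 2 directions.
variance_kept = (singular_values[:2] ** 2).sum() / (singular_values ** 2).sum()

print(X.shape, "->", X_reduced.shape)
print(f"variance kept: {variance_kept:.4f}")
```

Because the toy data only has 2 real degrees of freedom, almost nothing is lost by dropping the other 8 dimensions. Real datasets are rarely this clean, but the same intuition applies.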
Briefly about MNIST
Let’s use the age old MNIST dataset. If, for some reason, you are not familiar with MNIST yet, it’s essentially a dataset of grayscale images of handwritten digits. Each image in this dataset is a grid of 28 by 28 pixels, which gives 784 values per image. The images are in grayscale, which means that each pixel holds a single intensity value. In the raw data this is an integer from 0 to 255, often rescaled to a real value between 0.0 and 1.0, where higher values are closer to pure white.
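To make the dimensionality concrete, here is a small sketch of how a single 28 by 28 image turns into one data-point with 784 features. It uses a random array as a stand-in for a real digit so that nothing needs to be downloaded, and it assumes the raw intensities are stored as integers from 0 to 255. The variable names are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one MNIST digit: a 28 x 28 grid of integer intensities (0-255).
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Flatten the grid into a single feature vector, one feature per pixel.
features = image.reshape(-1)

# Optional rescaling of the intensities to the range 0.0 to 1.0.
scaled = features.astype(np.float32) / 255.0

# The flattening is lossless: reshaping gives back the original grid.
restored = features.reshape(28, 28)

print(image.shape, "->", features.shape)
```

So from the point of view of a dimensionality reduction algorithm, every image is simply a point in a 784 dimensional space.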
Intuition behind t-SNE
t-SNE stands for t-distributed stochastic neighbour embedding, and was introduced in the paper ‘Visualizing Data using t-SNE’ by Laurens van der Maaten and Geoffrey Hinton in 2008. More than a decade later it is still a very valid method. There have been some improvements to it over the years, and some newer methods outperform it, but for now we’ll just focus on t-SNE itself. Intuitively, what t-SNE tries to do is capture the similarities and dissimilarities of high dimensional data and represent them in a low dimensional 2D or 3D map, such that similar points end up close together and dissimilar points end up further apart. In this manner t-SNE preserves local structure, meaning that points that are in the vicinity of each other in the plot tend to be similar. However, the opposite is not always true: points that are far apart in the plot are not necessarily very dissimilar. t-SNE can incidentally reflect some global structure, but it isn’t something it was designed to do.
t-SNE can be used out of the box in most modern data science toolkits. We’ll use scikit-learn to achieve this.
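For the curious, the quantities t-SNE works with can be sketched in a few lines of NumPy. The snippet turns distances in the high dimensional space into similarities with a Gaussian kernel, turns distances in the low dimensional map into similarities with a heavy-tailed Student-t kernel, and measures the mismatch between the two with a KL divergence, which is the cost t-SNE minimizes. This is a simplified sketch and not what scikit-learn runs: it uses one fixed `sigma` for every point (the real algorithm tunes it per point from the perplexity setting), it does no optimization at all, and the function names are made up for this example.

```python
import numpy as np


def squared_distances(points):
    """Matrix of squared Euclidean distances between all pairs of points."""
    diff = points[:, None, :] - points[None, :, :]
    return (diff ** 2).sum(axis=-1)


def high_dim_affinities(X, sigma=1.0):
    """Symmetric similarities P from a Gaussian kernel (fixed sigma: an assumption)."""
    logits = -squared_distances(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)               # a point is not its own neighbour
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    conditional = np.exp(logits)
    conditional /= conditional.sum(axis=1, keepdims=True)
    n = X.shape[0]
    return (conditional + conditional.T) / (2.0 * n)


def low_dim_affinities(Y):
    """Similarities Q in the map, from a Student-t kernel with one degree of freedom."""
    kernel = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum()


def kl_divergence(P, Q):
    """Mismatch between P and Q, summed over pairs where P is non-zero."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))   # 50 hypothetical points with 20 features
Y = rng.normal(size=(50, 2))    # a random, not yet optimized, 2D map

P = high_dim_affinities(X)
Q = low_dim_affinities(Y)
cost = kl_divergence(P, Q)
print(f"KL(P || Q) for a random map: {cost:.3f}")
```

The full algorithm then moves the points in `Y` around by gradient descent until this cost stops decreasing.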
from sklearn.manifold import TSNE
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt
import numpy as np

# download the dataset (70,000 images, each flattened to 784 pixel values)
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# random selection from the dataset
idx = np.random.choice(np.arange(len(X)), 5000, replace=False)
X_sample = X[idx]
y_sample = y[idx]

# instantiate TSNE (by default it produces a 2D embedding)
tsne = TSNE(verbose=1)

# compute the 2D embedding of the sampled images
feat = tsne.fit_transform(X_sample)

# scatter plot of the embedding, coloured by digit label
plt.figure(figsize=(10, 10))
labels = [int(i) for i in y_sample]  # the labels are loaded as strings
plt.set_cmap('jet')
plt.scatter(feat[:, 0], feat[:, 1], c=labels)
plt.colorbar()  # the colorbar maps colours to digits, so no legend is needed
plt.show()