This repository proposes a Python implementation of Doubly Stochastic Neighbor Embedding on Spheres (DOSNES), published by Yao Lu, Zhirong Yang and Jukka Corander on arXiv in September 2016. It is based on the Matlab implementation available on GitHub.
The principle of this model is to embed high-dimensional data on a 3D sphere. The approach is the same as t-SNE, but at every iteration all embedded points are forced to lie on a sphere.
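To illustrate what "forced to lie on a sphere" can mean in practice, here is a minimal sketch. It assumes the projection step consists of centering the points and rescaling each one to a fixed radius; the function name `project_to_sphere` and the `radius` parameter are hypothetical and are not taken from this package, whose actual update may differ.

```python
import numpy as np


def project_to_sphere(Y, radius=1.0):
    """Center the points, then rescale each one to the given radius.

    Hypothetical helper for illustration only; not part of the dosnes package.
    """
    Y = np.asarray(Y, dtype=float)
    centered = Y - Y.mean(axis=0)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave points at the center untouched instead of dividing by zero
    return radius * centered / norms


# Five random 3D points standing in for an embedding after one gradient step
rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 3))
Y_sphere = project_to_sphere(Y)

# Every row now has norm equal to the radius
print(np.linalg.norm(Y_sphere, axis=1))
```

In a t-SNE-like optimization loop, a step of this kind would be applied after each update of the embedding, so that the points never leave the sphere.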
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
No prerequisites are needed. You can download the repository or clone it:
git clone https://github.com/Coni63/DOSNES.git
No pip installation is available; you just have to include the package in your project folder.
The model has been built to be similar to an sklearn model. A simple example is available below or in the test_XX.py files.
from sklearn import datasets
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from dosnes import dosnes
X, y = datasets.load_digits(return_X_y = True)
metric = "sqeuclidean"
model = dosnes.DOSNES(metric = metric, verbose = 1, random_state=42)
X_embedded = model.fit_transform(X)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # fig.gca(projection='3d') no longer works in recent matplotlib versions
ax.scatter(X_embedded[:, 0], X_embedded[:, 1], X_embedded[:, 2], c=y, cmap=plt.cm.Set1)
plt.title("Digits Dataset Embedded on a Sphere with metric {}".format(metric))
plt.show()
You can either provide a custom distance matrix, if you set the metric to pre-computed, or provide your dataset and the metric to use (it must be one of the distances available in the pdist function of scipy).
A very quick analysis of all parameters is available in the notebook in the "analysis" folder. You can see below the evolution of the cost function and a gif of the training on the Iris dataset. Values are taken from the quick analysis.
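As a sketch of the first option, the snippet below builds a square distance matrix with scipy, once from a metric name and once from a custom function. The data and the `manhattan` helper are made up for illustration. How the resulting matrix is handed to the model (in particular the exact metric string for the pre-computed mode) is not shown here and should be checked in the package itself.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Small random dataset used only for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# pdist returns the condensed (upper-triangle) distances;
# squareform expands them into a symmetric n x n matrix
D = squareform(pdist(X, metric="sqeuclidean"))


def manhattan(u, v):
    """Custom distance between two 1D vectors (illustrative helper)."""
    return np.abs(u - v).sum()


# pdist also accepts a callable, which is one way to build a custom distance matrix
D_custom = squareform(pdist(X, metric=manhattan))

print(D.shape, D_custom.shape)
```

Both matrices are symmetric with a zero diagonal, which is the shape a pre-computed distance input is normally expected to have.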
- Nicolas MINE - Initial work - Coni63
- btaba - Implementation of the Sinkhorn-Knopp algorithm - btaba
- Paul Panzer - Support on StackOverflow
There may still be some errors. For example, on the digits dataset, the result is not as good as the one from the official paper or the Matlab implementation. No checks or error handling are implemented yet, so you should not provide a dataset with missing values.
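Since the package does not validate its input, a check can be run on your side before fitting. The following is a minimal sketch, assuming that missing values are encoded as NaN; the function `check_no_missing` is hypothetical and not part of the package. It also rejects infinite values, which is an extra precaution not mentioned above.

```python
import numpy as np


def check_no_missing(X):
    """Return X as a 2D float array, or raise if it contains NaN or infinite values.

    Hypothetical helper for illustration only; not part of the dosnes package.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError("X must be a 2D array of shape (n_samples, n_features)")
    if not np.isfinite(X).all():
        raise ValueError("X contains NaN or infinite values")
    return X


# A clean dataset passes through unchanged
X_clean = check_no_missing([[1.0, 2.0], [3.0, 4.0]])

# A dataset with a missing value is rejected before it reaches the model
try:
    check_no_missing([[1.0, float("nan")], [3.0, 4.0]])
except ValueError as err:
    print("Rejected:", err)
```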