This component performs dimensionality reduction using an R wrapper around the C++ implementation of Barnes-Hut-SNE, as described at http://lvdmaaten.github.io/tsne/. Remember to cite Laurens van der Maaten's paper whenever you use this component.

Version 1.1
Categories Multivariate Statistics
Authors Julia Casado (julia.casado@helsinki.fi)
Requires R ; installer (bash)
Usage Example with default values


Name Type Mandatory Description
in CSV Mandatory A numeric matrix on which dimensionality reduction will be applied. Rows represent datapoints, e.g. cells/patients, and columns represent the dimensions, e.g. features or markers, that we want to reduce.
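
Because the input must be a purely numeric matrix, it can help to validate the file before handing it to the component. The following is a minimal sketch, not part of the component itself. It assumes a header row and a comma delimiter (pass a different `delimiter` if your files are tab-separated); the function name `check_numeric_matrix` is hypothetical. It also reports exact duplicate rows, which is the kind of upstream check suggested for the check_duplicates parameter.

```python
import csv
import io


def check_numeric_matrix(text, delimiter=","):
    """Parse delimited text with a header row.

    Returns (rows, duplicates): rows as tuples of floats, and the indices
    of rows that exactly repeat an earlier row.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    rows = []
    for line_no, record in enumerate(reader, start=2):
        if len(record) != len(header):
            raise ValueError(f"line {line_no}: expected {len(header)} fields")
        # float() raises ValueError on any non-numeric cell.
        rows.append(tuple(float(cell) for cell in record))

    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        if row in seen:
            duplicates.append(index)
        else:
            seen.add(row)
    return rows, duplicates


# Three datapoints (rows) described by three markers (columns).
sample = "m1,m2,m3\n0.1,2.0,3.5\n1.2,0.4,0.9\n0.1,2.0,3.5\n"
rows, duplicates = check_numeric_matrix(sample)
print(len(rows), duplicates)
```

Here the third datapoint repeats the first, so it is reported as a duplicate by its zero-based row index.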


Name Type Description
out CSV A numeric matrix with two or three columns, depending on the dims parameter.
entropy CSV A matrix of entropy estimates in natural-base units (nats) for each sample. The column "orig" is an estimate of entropy in the original space, while "lots" is an estimate of the Kullback-Leibler divergence from the output to the input space.
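
To make the units concrete, the sketch below computes Shannon entropy and Kullback-Leibler divergence in nats for small discrete distributions. It only illustrates the textbook definitions and the nats-to-bits conversion; it is not the estimator the component uses, and the function names are hypothetical.

```python
import math


def entropy_nats(p):
    """Shannon entropy H(p) = -sum_i p_i ln p_i, in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence_nats(p, q):
    """KL(p || q) = sum_i p_i ln(p_i / q_i), in nats."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


uniform = [0.25] * 4
h = entropy_nats(uniform)      # equals ln(4) nats
bits = h / math.log(2)         # the same entropy expressed in bits
kl_same = kl_divergence_nats(uniform, uniform)
kl_diff = kl_divergence_nats([0.5, 0.5], [0.9, 0.1])
```

Dividing a value in nats by ln 2 converts it to bits, and the divergence is zero only when the two distributions coincide.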


Name Type Default Description
check_duplicates boolean false Whether to check the input for duplicate datapoints. For big files this check takes a long time, so it is best to remove duplicates in a previous component instead.
cost_tol float 1.48e-8 Tolerance for cost function stall.
dims int 2 Output dimensionality. Possible values are 2 or 3 because the original method was developed for visualization purposes and not thoroughly tested for larger dimensionality.
entropy_fast boolean true Use fast and cheap entropy approximation.
initial_dims int 50 Number of dimensions in a preliminary step of dimensionality reduction using PCA. Only read if parameter pca is true.
is_distance boolean false Indicates whether the input is a distance matrix. At the time this component was created, the documentation of the underlying implementation warned that this is an experimental feature. Use at your own risk.
max_iter int 1000 Maximum number of iterations.
pca boolean false If true, a basic PCA is run first to reduce the number of dimensions. Recommended for big files (more than 5000 datapoints and 100 features). May perform poorly on small datasets.
perplexity int 30 A measure of information that, in this context, can be interpreted as the number of nearest neighbors k used in many manifold learners. If most of the points in the visualized output are clustered together like a ball, the perplexity was probably too high. The best value depends on the size and structure of the data.
seed int -1 Random seed, used to make test cases reproducible. If null, the system generates a new seed on every run.
theta float 0 Speed/accuracy trade-off. A higher theta means a shorter running time and less accurate results; in the Barnes-Hut implementation, 0 corresponds to exact t-SNE. Change only if the dataset is very large.
verbose boolean true Log the t-SNE progress to the terminal.
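
The reading of perplexity as an effective number of neighbors can be illustrated with a short sketch. Perplexity is the exponential of the Shannon entropy (in nats) of a datapoint's neighbor distribution, so a uniform distribution over k neighbors has perplexity exactly k. In t-SNE the Gaussian bandwidth of each datapoint is tuned so that its neighbor distribution reaches the requested perplexity. The code below is an assumption-laden illustration of that idea, not the component's implementation; all function names and the search bounds are hypothetical.

```python
import math


def perplexity(p):
    """Perplexity = exp(H(p)), with the entropy H measured in nats."""
    h = -sum(x * math.log(x) for x in p if x > 0)
    return math.exp(h)


def neighbor_probabilities(sq_dists, sigma):
    """Gaussian neighbor distribution for one datapoint.

    The smallest distance is subtracted before exponentiating so that at
    least one weight equals 1 and the normalizer never underflows to zero.
    """
    d_min = min(sq_dists)
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, steps=60):
    """Bisect on a log scale for the bandwidth that gives the target perplexity."""
    for _ in range(steps):
        mid = math.sqrt(lo * hi)
        if perplexity(neighbor_probabilities(sq_dists, mid)) < target:
            lo = mid   # distribution too peaked: widen the kernel
        else:
            hi = mid   # distribution too flat: narrow the kernel
    return math.sqrt(lo * hi)


# Squared distances from one datapoint to 100 others (made-up values).
sq_dists = [float(i * i) for i in range(1, 101)]
sigma = sigma_for_perplexity(sq_dists, target=30.0)
achieved = perplexity(neighbor_probabilities(sq_dists, sigma))
uniform_30 = perplexity([1.0 / 30] * 30)
```

A larger perplexity forces a wider kernel, so each datapoint treats more of the dataset as neighbors. This is why a value that is too high for the dataset tends to merge everything into one ball.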

