ugtm is a Python package that implements generative topographic mapping (GTM), a dimensionality reduction algorithm by Bishop, Svensén and Williams. Because of its probabilistic framework, GTM can also be used to build classification and regression models, and is an attractive alternative to t-distributed stochastic neighbour embedding (t-SNE) or other non-linear dimensionality reduction methods. The package is compatible with scikit-learn, and includes a GTM transformer (eGTM), a GTM classifier (eGTC) and a GTM regressor (eGTR). The inputs and outputs of these classes are numpy arrays. The package also implements supplementary functions for GTM visualization and kernel GTM (kGTM). The code is released under the MIT license and is available on GitHub (

ugtm (v2.0) is a package for multidimensional space analysis based on generative topographic mapping (GTM). Complete documentation with API reference and tutorials is available online (

GTM maps are similar to self-organizing maps [

Other implementations of the core algorithm of GTM are available online. The netlab package [

ugtm v2.0 is a package implemented in pure Python. The API reference is accessible online (

Graph of ugtm v2.0 modules: (1) ugtm_classes: classes for generative topographic mapping (GTM) models, (2) ugtm_core: kernel GTM (kGTM) and GTM core functions, (3) ugtm_gtm: expectation-maximization algorithm for GTM, (4) ugtm_kgtm: expectation-maximization algorithm for kGTM, (5) ugtm_landscape: functions for colouring maps, (6) ugtm_predictions: GTM-based prediction algorithms, (7) ugtm_sklearn: sklearn-compatible eGTM transformer, eGTC classifier, and eGTR regressor, (8) ugtm_preprocess: preprocessing functions for data scaling and PCA preprocessing, using sklearn, (9) ugtm_plot: plotting functions for GTM maps, using matplotlib and mpld3, (10) ugtm_crossvalidate: cross-validation workflows.

The ugtm package can be downloaded from PyPI using pip, by typing “pip install ugtm” in a terminal. In the Python console, the package can be imported by typing “import ugtm”.

The basic GTM dimensionality reduction is implemented in the scikit-learn-compatible eGTM class. The

The four GTM hyperparameters (

A data point can be represented in three ways using GTMs: responsibilities, means and modes.
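The relation between these three representations can be illustrated with a minimal numpy sketch. It is not ugtm's own code: it assumes a square map whose nodes lie on a regular grid in [-1, 1]², and the function names `node_grid`, `mean_positions` and `mode_positions` are hypothetical. Responsibilities are taken as a matrix with one row per data point and one column per node, each row summing to 1.

```python
import numpy as np


def node_grid(k):
    """Return the (k*k, 2) coordinates of a regular k x k grid on [-1, 1]^2.

    The grid layout is an assumption made for this illustration.
    """
    axis = np.linspace(-1.0, 1.0, k)
    xx, yy = np.meshgrid(axis, axis)
    return np.column_stack([xx.ravel(), yy.ravel()])


def mean_positions(responsibilities, nodes):
    """Responsibility-weighted average of node coordinates, one row per point."""
    return responsibilities @ nodes


def mode_positions(responsibilities, nodes):
    """Coordinates of the node with the highest responsibility for each point."""
    return nodes[np.argmax(responsibilities, axis=1)]


# Toy responsibilities for 3 points on a 2 x 2 map (each row sums to 1).
R = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.0, 0.0, 0.0, 1.0],
])
nodes = node_grid(2)
means = mean_positions(R, nodes)   # continuous 2D coordinates
modes = mode_positions(R, nodes)   # always coincide with a node
```

Means vary continuously over the map, whereas modes are restricted to node positions, so several data points can share the same mode.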

The eGTC and eGTR classes implement GTM-based classification and regression algorithms. eGTC and eGTR algorithms are based on class maps and landscapes, which are different ways of coloring a GTM. GTM class maps are constructed using discrete labels and GTM landscapes using continuous labels. New data can be projected onto these colored maps to predict labels. A GTM landscape for the S curve dataset is shown in Figure
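The idea of colouring a map and then projecting new data onto it can be sketched with numpy alone. The code below is an illustration under simplifying assumptions, not the eGTC or eGTR implementation: ugtm's own classes may differ in details such as class priors. All function names are hypothetical, and the responsibilities are toy values rather than the output of a trained GTM.

```python
import numpy as np


def class_map(responsibilities, labels, n_classes):
    """Per-node class probabilities, shape (n_nodes, n_classes)."""
    one_hot = np.eye(n_classes)[labels]              # (n_samples, n_classes)
    weights = responsibilities.T @ one_hot           # (n_nodes, n_classes)
    totals = weights.sum(axis=1, keepdims=True)
    # Assumption: nodes with no responsibility mass get a uniform distribution.
    probs = np.full(weights.shape, 1.0 / n_classes)
    occupied = totals[:, 0] > 0
    probs[occupied] = weights[occupied] / totals[occupied]
    return probs


def landscape(responsibilities, targets):
    """Per-node responsibility-weighted average of a continuous label."""
    totals = responsibilities.sum(axis=0)
    weighted = responsibilities.T @ targets
    # Assumption: empty nodes fall back to the global mean of the targets.
    values = np.full(totals.shape, targets.mean(), dtype=float)
    occupied = totals > 0
    values[occupied] = weighted[occupied] / totals[occupied]
    return values


def predict_classes(new_responsibilities, node_class_probs):
    """Predict the most probable class for each projected point."""
    return np.argmax(new_responsibilities @ node_class_probs, axis=1)


def predict_values(new_responsibilities, node_values):
    """Predict a continuous label as a responsibility-weighted node average."""
    return new_responsibilities @ node_values


# Toy example: 4 training points and 3 nodes; the third node is empty.
R_train = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 1.0, 0.0],
    [0.2, 0.8, 0.0],
])
y_class = np.array([0, 0, 1, 1])
y_value = np.array([1.0, 1.0, 3.0, 3.0])

node_probs = class_map(R_train, y_class, n_classes=2)
node_values = landscape(R_train, y_value)

# Responsibilities of two new points projected onto the same map.
R_new = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
])
predicted_classes = predict_classes(R_new, node_probs)
predicted_values = predict_values(R_new, node_values)
```

In both cases the coloured map is a per-node summary of the training labels, and a prediction is obtained by averaging that summary with the responsibilities of the new point.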

Generative topographic mapping (GTM) representations of the S curve dataset (downloaded from sklearn): mean positions, modes, and landscape for continuous labels. The code to reproduce this plot is accessible online (

GTM representations of the hand-written digits dataset (digits 0 to 5, from the UCI database): mean positions, modes, and class map for discrete labels. The code to reproduce this plot is accessible online (

The

The package uses the following external resources:

scikit-learn [

numpy and scipy [

matplotlib [

mpld3 (

Several examples are provided in the online documentation (

Core GTM algorithm (test_ugtm_gtm.py): the core GTM test checks matrix dimensions, convergence of the log likelihood function, and the projection of new data on the map. If the training set and test set are the same, responsibilities of the training and test sets should also be the same.

Scikit-learn-compatible classes (test_ugtm_sklearn.py): these tests check the output dimensions of the eGTM transformer, eGTC classifier and eGTR regressor. They also check that new data are projected correctly onto the map.

Workflow test (test_ugtm_workflow.py): the workflow test script was designed to test all possible workflows, for label-free data, categorical labels, and continuous labels.

Supplementary tests were implemented for plots (test_ugtm_plot.py), printing results (test_ugtm_write.py), classification models (test_ugtm_GTC.py), regression models (test_ugtm_GTR.py), and the kernel algorithm (test_ugtm_kgtm.py).

These tests were carried out with Python 2.7.14 and Python 3.4.6 on Scientific Linux 6.6 and macOS High Sierra 10.13.2.

ugtm was tested on Scientific Linux 6.6 and macOS High Sierra 10.13.2 but not on Windows. Because ugtm is written in pure Python, it should run on any operating system that supports Python.

Python >= 2.7 (tested on Python 3.4.6 and Python 2.7.14).

ugtm does not require any supplementary data. The amount of required active memory depends on the input data and on the map size (hyperparameters

scikit-learn >= 0.20

numpy >= 1.13.1

matplotlib >= 2.0.2

scipy >= 0.19.1

mpld3 >= 0.3

Héléna A. Gaspar

ugtm was developed in English.

Support for ugtm is available on GitHub (

The eGTM (data transformation), eGTC (classification) and eGTR (regression) classes implemented in ugtm are fully compatible with scikit-learn and can be used in scikit-learn pipelines for visualization, regression or classification. GTM hyperparameters can be optimized using scikit-learn grid search for regression and classification tasks – examples are provided in the online documentation (

With very large amounts of data now available, dimensionality reduction methods are becoming increasingly popular. It is often necessary to get an overview of a dataset that exists in a space of hundreds or thousands of dimensions (data features). GTM is an alternative to other dimensionality reduction algorithms such as t-SNE, MDS or LLE, and its probabilistic framework can be used to obtain a comprehensive overview of a dataset. For example, GTM can be used to visualize and cluster multidimensional data for health research, and to investigate very large datasets in chemistry or in genomics, such as single-cell data, genotypes, endophenotypes, or polygenic risk scores. t-SNE is currently very popular for these purposes but has a major drawback: new data cannot easily be projected onto a pre-trained t-SNE map. Another possible application for GTM is the visualization of feature vectors from deep neural networks. ugtm should make it easier to use GTM for these applications.

ugtm also provides opportunities for further developments. Future enhancements could include the implementation of a mini-batch version (to process data block by data block) – for now, the entire data matrix is processed in one batch. A multispace version of GTM (Stargate GTM [

Thanks to Prof. Alexandre Varnek and Prof. Igor I. Baskin for their help in designing the GTC and GTR algorithms.

The author has no competing interests to declare.