Embo: a Python package for empirical data analysis using the Information Bottleneck

We present embo, a Python package to analyze empirical data using the Information Bottleneck (IB) method and its variants, such as the Deterministic Information Bottleneck (DIB). Given two random variables X and Y, the IB finds the stochastic mapping M of X that encodes the most information about Y, subject to a constraint on the information that M is allowed to retain about X. Despite the popularity of the IB, an accessible implementation of the reference algorithm, oriented towards ease of use on empirical data, has been missing. Embo is optimized for the common case of discrete, low-dimensional data. Embo is fast, provides a standard data-processing pipeline, offers a parallel implementation of key computational steps, and includes reasonable defaults for the method parameters. Embo is broadly applicable to different problem domains, as it can be employed with any dataset consisting of joint observations of two discrete variables. It is available from the Python Package Index (PyPI), Zenodo and GitLab.


Introduction
The Information Bottleneck Method-In the Information Bottleneck (IB) framework [1], given two random variables X and Y, we are interested in extracting all the information that X may contain about Y and discarding the rest as irrelevant. To solve this problem, we seek a third random variable M that solves the following optimization problem:

min_{p(m|x)} I(M : X) − β I(M : Y)    (1)

where I(· : ·) is Shannon's mutual information [2], and M is constrained to be independent of Y conditional on X:

p(x, m, y) = p(x) p(m | x) p(y | x)    (2)

Intuitively, Equation (1) says that we are looking for a stochastic mapping of X to M that keeps as little information about X as possible while maximizing the information about Y; β is an arbitrary positive parameter quantifying the relative importance of these two competing goals. In the spirit of rate distortion theory [2], it can be shown [1] that the set of solutions to this problem across all possible values of β gives an upper bound on the amount of information one can encode about Y given a certain amount of information about X, or, conversely, the minimum amount of information about X needed to encode a certain amount of information about Y. These bounds are typically summarized by plotting a curve showing I(M : Y) versus I(M : X), obtained by computing these quantities for the solution of Equation (1) across many different values of β. This is known as the IB curve. Example IB curves, taken from one of the notebooks in embo's documentation, are shown in Figure 1.
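As a concrete reference for the quantities traded off in Equation (1), the snippet below (an illustration of ours, not part of embo) computes Shannon mutual information in bits from a discrete joint probability mass function. I(X : Y) computed this way is the ceiling that I(M : Y) can attain on the IB curve.

```python
import numpy as np

def mutual_information(pxy):
    """I(X : Y) in bits from a joint probability mass function pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x), shape (X, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, Y)
    nz = pxy > 0                          # skip zero-probability pairs
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))

# Example: X and Y perfectly correlated over two states -> I(X : Y) = 1 bit
pxy = np.array([[0.5, 0.0],
                [0.0, 0.5]])
print(mutual_information(pxy))  # 1.0
```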
Because of its appealing theoretical properties, since its inception the IB has enjoyed continued attention as a method for unsupervised [3] and supervised [4,5] learning, and it has more recently become a popular tool in the study of learning and generalization in deep neural networks [6,7] and in neuroscience [8,9,10,11].
The IB can be generalized [12] by replacing I(M : X) in Equation (1) with a weighted combination of the entropy of M and the conditional entropy of M given X:

min_{p(m|x)} H(M) − α H(M | X) − β I(M : Y)    (3)

where α ≥ 0. We call this the Generalized Information Bottleneck problem, or GIB (note that the same acronym is used in [13] with a different meaning). The GIB reduces to the standard IB as a special case for α = 1, since I(M : X) = H(M) − H(M | X).
If α = 0, the problem consists of finding the minimum-entropy bottleneck variable M that contains a certain amount of information about Y (or, equivalently, the M with the largest amount of information about Y among all Ms with a given entropy). This is called the Deterministic Information Bottleneck (DIB) by [12]. The term "deterministic" comes from the fact that solutions in the α = 0 case can be shown to be deterministic mappings from X to M, with H(M | X) = 0. A simple demonstration of the DIB, inspired by one of the examples given in [12], is illustrated in Figure 2.

IB for empirical data; comparison with other software-Despite the large body of existing work on the IB (and GIB), public, off-the-shelf implementations of its "reference" version based on the Blahut-Arimoto algorithm [1,12] have been lacking. The supplementary Python code associated with [12] implements the GIB, but it is rather tightly coupled to the specifics of that paper and is not distributed as a standard package (it does not contain tests or licensing information, and it is not available on the Python Package Index). To our knowledge, the only existing Python implementation offering a reasonably flexible and documented interface is the one contained in dit [14], a multipurpose information theory toolbox. By focusing narrowly on the IB, embo can offer greater ease of use for the most common applications (by removing the need to preprocess the data and reducing boilerplate code to a minimum), as well as support for specialized applications such as the past-future information bottleneck [15] (documented in more detail in the notebook located at examples/Markov-Chains.ipynb within the source distribution). Moreover, and very importantly for the application of IB methods to real-world research problems, embo is much more computationally efficient than dit: Figure 3 shows that embo offers a 1000x-10000x speedup over dit on a set of simple problems (embo can solve much larger problems, but these are not included in the comparison because they become prohibitively time-consuming with dit).
Taken together, the features discussed in this section streamline the path from raw empirical data to a finished IB curve, making the IB method more accessible to a broad, generalist audience.

Implementation and architecture
The main point of entry to the package is the InformationBottleneck class. In its constructor, InformationBottleneck takes as arguments an array of observations for X and an (equally long) array of observations for Y, together with other optional parameters (see the software documentation for details). Alternatively, a joint probability mass function p(x, y) can be directly specified. In the most basic use case, users can call the get_bottleneck method of an InformationBottleneck object. Embo will then solve the optimization problem in Equation (1) over a range of β values and return the quantities needed to trace the IB curve.

From the architectural standpoint, embo can parallelize the computation of the IB curve on multicore machines by breaking down the set of β values into k smaller subsets and running each subset in parallel. This functionality is implemented with the multiprocessing Python module and can be controlled by the user by setting an optional parameter specifying the number k of processes to use.
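To illustrate the basic use case, here is a minimal sketch computing an IB curve from two aligned sequences of discrete observations. The return signature assumed for get_bottleneck (arrays of I(M : X), I(M : Y) and H(M) values, plus the β grid) follows embo's documented examples; check the documentation of the installed version for the exact ordering.

```python
import numpy as np
from embo import InformationBottleneck

# Two equally long sequences of joint observations (x[i], y[i])
x = np.array([0, 0, 0, 1, 0, 1, 0, 1, 0, 1])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Solve Equation (1) over a range of beta values; the returned arrays
# trace the IB curve (I(M:X), I(M:Y)), with H(M) and the beta grid
# alongside (ordering assumed from embo's examples -- verify locally)
i_x, i_y, h_m, beta = InformationBottleneck(x, y).get_bottleneck()
```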
Embo has several other optional parameters. These allow the user to control precisely the range and number of β values to be considered, to tune finer aspects of the Blahut-Arimoto algorithm [1,2,12] used to solve the optimization problem (3) for a given β, and to automatically preprocess data for the application of the past-future bottleneck method [15]. These parameters are all described in the software's documentation, but embo comes with reasonable defaults, so users need to worry about such details only when necessary.
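As a hedged sketch of this finer-grained control, the snippet below passes optional parameters to the constructor. The keyword names (maxbeta, numbeta, alpha, processes) are hypothetical stand-ins for the documented options serving these purposes; consult embo's documentation for the exact spellings and defaults.

```python
import numpy as np
from embo import InformationBottleneck

rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=10_000)            # synthetic discrete data
y = (x + rng.integers(0, 2, size=10_000)) % 4  # noisy function of x

# NOTE: all keyword names below are hypothetical placeholders for the
# real optional parameters described in embo's documentation.
ib = InformationBottleneck(
    x, y,
    maxbeta=10,   # upper end of the beta range to sweep
    numbeta=100,  # number of beta values on the sweep
    alpha=0,      # generalization parameter of Equation (3); 0 gives the DIB
    processes=4,  # number of parallel worker processes
)
i_x, i_y, h_m, beta = ib.get_bottleneck()
```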

Quality control
Embo has a suite of unit tests to ensure basic functionality and prevent regressions.

Reuse potential
In [11], embo was used to assess the complexity of the strategies adopted by human subjects during cognitive tasks. In the computational cognitive science and neuroscience domain, the same approach can be used to analyze human or animal behavior in different tasks, as well as the statistical relationship between sensory stimuli and recorded neuronal activity [8,9]. More generally, the Information Bottleneck method is entirely domain agnostic, and embo can be used in any setting involving joint observations of two discrete, low-dimensional variables.
Embo may be extended in several ways. Possible technical upgrades include improving the software's performance, for instance by rewriting the Blahut-Arimoto algorithm implementation (or some critical paths of it) in C, or by using performance-oriented Python libraries such as Numba or Cython (a sketch of the latter approach is given below). Features that may be added include the estimation of finite-sample bounds for the IB [19]. Finally, embo may be coupled with analyses based on multipartite information decompositions [20,21] to study the mutual relationship of triplets of empirical variables, where one is hypothesized to act as a bottleneck between the other two. This scenario is highly relevant for the study of neural activity recorded concomitantly with sensory stimulation and behavioural output in awake animals [22].
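To illustrate the Numba route mentioned above, here is a minimal, self-contained sketch of the core Blahut-Arimoto update for the IB problem of Equation (1), compiled with @njit. This is our illustration of the general technique, not embo's actual implementation, and it omits the handling of degenerate cases (e.g., clusters whose mass vanishes) that production code would need.

```python
import numpy as np
from numba import njit

@njit(cache=True)
def ib_update(pm_x, px, py_x, beta):
    """One Blahut-Arimoto update of the IB encoder p(m|x).

    pm_x : (M, X) array, current encoder p(m|x)
    px   : (X,)   array, marginal p(x)
    py_x : (Y, X) array, conditional p(y|x)
    beta : trade-off parameter of Equation (1)
    """
    n_m, n_x = pm_x.shape
    n_y = py_x.shape[0]

    # p(m) = sum_x p(m|x) p(x)
    pm = np.zeros(n_m)
    for m in range(n_m):
        for x in range(n_x):
            pm[m] += pm_x[m, x] * px[x]

    # p(y|m) = sum_x p(y|x) p(m|x) p(x) / p(m)
    py_m = np.zeros((n_y, n_m))
    for y in range(n_y):
        for m in range(n_m):
            for x in range(n_x):
                py_m[y, m] += py_x[y, x] * pm_x[m, x] * px[x]
            py_m[y, m] /= pm[m]

    # p(m|x) <- p(m) exp(-beta * KL(p(y|x) || p(y|m))), normalized over m
    new = np.zeros((n_m, n_x))
    for x in range(n_x):
        for m in range(n_m):
            kl = 0.0
            for y in range(n_y):
                if py_x[y, x] > 0.0:
                    kl += py_x[y, x] * np.log(py_x[y, x] / py_m[y, m])
            new[m, x] = pm[m] * np.exp(-beta * kl)
        new[:, x] /= new[:, x].sum()
    return new
```

Under Numba, explicit loops like these compile to efficient machine code, so the same code structure can serve both as documentation of the update equations and as a fast implementation.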
The recommended support channel for embo is its GitLab project, where issues can be reported, and patches and merge requests are welcome. Additionally, the maintainers can be contacted directly at their institutional email addresses.

Figure 1: Example IB curves. See embo's documentation for further detail on how these figures were generated. Note that the IB curve is always below the identity line and that the values of I(M : Y) and I(M : X) are never larger than the base-2 logarithm of the number of states (1 bit and 2 bits, respectively, corresponding to 2 and 4 states). These are conditions that the IB curve should always satisfy [1] and can be taken as sanity checks for embo's correct operation.

Figure 2: Demonstration of the DIB, inspired by Figure 2 in [12]. In this example, X can take on one out of 128 possible states, Y can take on one out of 32 states, and p(x) is close to uniform (see the notebook for details about the joint p(x, y)). Left: IB and DIB solutions for a range of β values, visualized in the "IB plane", where I(M : Y) is plotted against I(M : X). Right: the same solutions, visualized in the "DIB plane", where I(M : Y) is plotted against H(M). As expected from [12], in the IB plane the two methods behave similarly. In the DIB plane, however, the DIB performs better than the IB, in the sense that H(M) is much lower for the DIB than for the IB for any given value of I(M : Y).