(1) Overview

Introduction

Many biological and social systems are characterized by collective behavior: the correlated pattern of neural firing [], protein diversity in the immune system [], conflict participation in monkeys [], flocking in birds [], statistics of letters in words [], or consensus voting in the US Supreme Court [, ]. Statistical physics is a natural approach to probing such systems precisely because they are collective []. Recently, the development of numerical, analytic, and computational tools have made it feasible in these large collective systems to solve for the maximum entropy (maxent) model that reproduces system behavior, corresponding to solving an “inverse problem.” This approach contrasts with the typical problem in statistical physics where one postulates the microscopic model (the Hamiltonian) and works out the physical behavior of the system. In the inverse problem, we find the parameters that correspond to observed behavior of a known system. In many cases, this is a very difficult problem to solve and does not have an analytical solution, and we must rely on analytic approximation and numerical techniques to estimate the parameters.

The pairwise maxent model, the Ising model, has been of particular interest because of its simplicity and generality. A variety of algorithms have been proposed to solve the inverse Ising problem, but different approaches are disparately available on separate code bases in different coding languages, which makes comparison difficult and pedagogy more complicated. With ConIII, it is possible to solve the inverse Ising problem with a variety of algorithms in just a few lines of code.

ConIII is intended to provide a centralized resource for the inverse Ising problem and easy extension to other maxent problems. Although some of the implemented algorithms are specific to the pairwise Ising model, maxent models with arbitrary combinations of higher order constraints can be solved as well by specifying the particular constraints of the maxent model of interest.

In the first few sections of this paper, we give a brief overview of maxent and describe at a high level the algorithms implemented in this package. For those unfamiliar with maxent, we also provide some useful references like [] and the appendix of []. For those seeking more detail about the implemented algorithms, we provide references specific to each algorithm section. Then, we describe the architecture of the package and how to contribute.

What is maximum entropy?

Shannon introduced the concept of information entropy in his seminal paper about communication over a noisy channel []. Information entropy is the unique measure of uncertainty that follows from insisting on elementary principles of consistency. According to Shannon, the entropy over the probability distribution p(s) of possible discrete configurations S of a system is

(1)
S[p]=sSp(s)logp(s).

These configurations could be on-off patterns of firing in neurons, the arrangement of letters in a word, or the orientation of spins in a material.

When there is no structure in the distribution, meaning that the probability is uniform, entropy is at a maximum. In the context of communication theory as Shannon first discussed, this means that there is no structure to exploit to make a prediction about the next part of an incoming message; thus, maximum entropy means that each new part of the message is maximally “surprising.” At the other extreme, when the message consists of the same bit over and over again, we can always guess at the following part of the message and the signal has zero entropy. In the context of modeling, we use entropy not to refer to the difficulty of the message, but to our state of knowledge about it. Entropy precisely measures our uncertainty about the configuration in which we expect to find the system.

Maximum entropy, or maxent, is the formal framework for building models that are consistent with statistics from the data but otherwise as structureless as possible [, ]. We begin by choosing a set of K useful or important features from the data fk(s) that should be true for the model that we are trying to build. These could be whether or not a set of neurons fire together in a temporal bin or the pairwise coincidence for primates in a conflict. The average of this feature across the data set D with R samples is

(2)
fkdata=1RsDfk(s).

According to the model in which each observation s occurs with some probability p(s), the same average is calculated over all possible states

(3)
fk=sSp(s)fk(s).

We assert that the model should fit the K features while maximizing entropy. The standard procedure is to solve this by the method of Langrangian multipliers. We construct the Langrangian functional by introducing the multipliers λk.

(4)
[p] =sSp(s)logp(s)k=1Kλk(fkdatafk)

Then, we solve for the fixed point by taking the derivative with respect to λk. The resulting maxent model is a Boltzmann distribution over states:

(5)
p(s)=eE(s)/Z,

with relative negative log-likelihood (also known as the energy or Hamiltonian)

(6)
E(s)=k=1Kλkfk(s),

and normalization factor (also known as the partition function)

(7)
Z=sSeE(s).

Entropy is a convex function of p and the constraints are linear with respect to p, so the problem is convex and the maxent distribution unique. Readers familiar with statistical physics will recognize this as an alternative derivation of the microcanonical ensemble, demonstrating that statistical mechanics can be viewed as an inference procedure using the maxent principle [].

Finding the parameters λk that match the constraints ⟨fkdata is equivalent to minimizing the Kullback-Leibler divergence between the model and the data []

(8)
DKL(pdata||p)=spdatalog(pdata(s)p(s))
(9)
DKLλk=spdata(s)(E(s)logZ)λk=0fkdata=fk.

In other words, the parameters of the maxent model are the ones that minimize the information theoretic “distance” to the distribution of the data given the constraints. Note that these parameters are given by the data: once the constraints have been chosen, there is a single maxent solution, with no free parameters.

The Ising model

The Ising model is a statistical physics model of magnetism, also known as the pairwise maxent model []. It consists of a set of spins {si} with 2 possible orientations (up and down), each responds to its own external magnetic field hi and each pair is coupled to each other with pairwise coupling Jij. The strength of the magnetic field determines the tendency of each of the spins to orient in a particular direction and the couplings determine whether the spins tend to point together (Jij > 0) or against each other (Jij < 0). Typically, neighbors are defined as spins that interact with one another given by some underlying network structure. Figure 1 shows a fully-connected example.

Figure 1 

Example of a fully connected pairwise Ising model with positive and negative couplings. Each spin si (circle) can take one of two states (black or white, corresponding to –1 and 1) and is connected to every other spin in the system with a positive (red) or negative (blue) coupling. These states could describe the on-off patterns of firing in neurons, the orientation of spins in a material, or if each spin is no longer binary the arrangement of letters in a word (a Potts model).

The energy of each configuration determines its probability via Eq (5),

(10)
E(s)=i<jNJijsisji=1Nhisi,

such that lower energy states are more probable.

We can derive the Ising model from the perspective of maxent. Fixing the the means and pairwise correlations to those observed in the data

(11)
si=sidata
(12)
sisj=sisjdata

we go through the procedure of constructing the Langrangian from Eq 4

(13)
[p]=sp(s)logp(s)+i<jNJijsisj+iNhisi
(14)
[p]p(s)=logp(s)1+i<jNJijsisj+iNhisi
(15)
logp(s) =1+i<jNJijsisj+iNhisi
(16)
p(s)=eE(s)/Z

where the –1 in Eq 15 has been absorbed into the normalization factor

(17)
Z =seE(s).

such that the probability distribution is normalized ∑sp(s) = 1. Thus, the resulting model is exactly the Ising model mentioned earlier.

Despite the simplicity of the Ising model, the structure imposed by the discrete nature of the spins means that finding the parameters is challenging analytically and computationally. In the last few years, numerous techniques have been suggested for solving the inverse Ising problem exactly or approximately []. We have implemented some of them in ConIII and designed a package structure to make it easily extensible to include more methods. Here, we briefly describe the algorithms that are part of the first official version of the package. The goal is to give the user a sense of how they work without getting bogged down in heavy detail. For more detail, we suggest perusing the papers referenced in each section or the review []. For a complete beginner, it may be useful to first get familiar with a slower introduction like in the Appendices of Ref [], Ref [], or Ref [].

Inverse Ising methods implemented in ConIII

Enumeration

The naїve approach that only works for small systems is to write out the equations from Eq 9 and solve them numerically. After writing out all K equations,

(18)
fk=lnZλk=fkdata,

we can use any standard root-finding algorithm to find the parameters λk. This approach, however, involves enumerating all states of the system, whose number grows exponentially with system size.

For the Ising model, writing down the equations has a number of steps O(K22N), where K is the number of constraints and N the number of spins. Each evaluation of the objective in the root-finding algorithm will be of the same order. For relatively small systems, around N ≤ 15, this approach is feasible on a typical desktop computer and is a good way to test the results of a more complicated algorithm. This approach is implemented by the Enumerate class.

Monte Carlo Histogram (MCH)

Perhaps the most straightforward and most expensive computational approach is Monte Carlo Markov Chain (MCMC) sampling. A series of states sampled from a proposed p(s) is produced by MCMC to approximate ⟨fk⟩ and determine how close we are to matching ⟨fkdata. The parameters are then adjusted using a learning rule, and both sampling and learning are repeated until a stopping criterion is met. This can be combined with a variety of approximate gradient descent methods to reduce the number of sampling steps by predicting how the distribution will change if we modify the parameters slightly. The particular technique implemented in ConIII is the Monte Carlo Histogram (MCH) method [].

Since the sampling step is expensive, the idea behind MCH is to reuse a sample for more than one gradient descent step []. Given that we have a sample with probability distribution p(s) generated with parameters λk, we would like to estimate the proposed distribution p′(s) from adjusting our parameters λ′k = λk + Δλk. We can leverage our current sample to make this extrapolation.

(19)
p=ppp
(20)
p(s)=ZZekΔλkfk(s)p(s)

To estimate the average,

(21)
sp(s)fk(s) =ZZsp(s)ekΔλkfk(s)fk(s)

To be explicit about the fact that we only have a sampled approximation to p, we replace p with the sample distribution.

(22)
fk=ZZekΔλkfk(s)fk(s)sample

Likewise, the ratio of the partition function can be estimated

(23)
ZZ1/ekΔλkfk(s)sample

At each step, we update the Lagrangian multipliers {λk} while being careful to stay within the bounds of a reasonable extrapolation. One suggestion is to update the parameters with some inertia []

(24)
Δλk(t+1)=Δλk(t)+ϵΔλk(t1)
(25)
Δλk(t)=η(fkfk)

This has a fixed point at the correct parameters.

In practice, MCH can be difficult to tune properly and one must check in on the progress of the algorithm often. One issue is choosing how to set the learning rule parameters η and ∈. One suggestion for η is to shrink it as the inverse of the number of iterations []. Another issue is that parameters cannot be changed by too much when using the MCH approximation step or the extrapolation to λ′k will be inaccurate and the algorithm will fail to converge. In ConIII, this can be controlled by setting a bound on the maximum possible change in each parameter Δλmax and restricting the norm of the vector of change in parameters kΔλk2. Another issue is setting the parameters of the MCMC sampling routine. Both the burn time (the number of iterations before starting to sample) and sampling iterations (number of iterations between samples) must be large enough that we are sampling from the equilibrium distribution. Typically, these are found by measuring how long the energy or individual parameter values remain correlated as MCMC progresses. The parameters may need to be updated during the course of MCH because the sampling parameters may need to change with the estimated parameters of the model. For some regimes of parameter space, samples are correlated over long times and alternative sampling methods like Wolff or Swendsen-Wang would vastly reduce time to reach the equilibrium distribution although these are not included in the current release of ConIII. We do not discuss these sampling details here, but see Refs [, ] for examples.

The main computational cost for MCH lies in the sampling step. For each iteration of MCH, the runtime is proportional to the number of samples n, number of MCMC iterations T, and the number of constraints for the Ising model N2, O(TnN2), whereas the MCH estimate is relatively quick O(tnN2) because the number of MCH approximation steps needed to converge is much smaller than the number of MCMC sampling iterations t << T.

MCH is implemented in the MCH class.

Pseudolikelihood

The pseudolikelihood approach is an analytic approximation to the likelihood that drastically reduces the computational complexity of the problem and is exact as N []. We calculate the conditional probability of each spin si given the rest of the system {sj≠i}

(26)
p(si|{sji})=(1+e2si(hi+jiJijsj))1

Taking the logarithm, we define the approximate log-likelihood by summing over data points indexed by r:

(27)
f(hi,{Jij})=r=1Rlnp(si(r)|{sji}(r)).

In the limit where the ensemble is well sampled, the average over the data can be replaced by an average over the ensemble:

(28)
f(hi,{Jij}) =slnp(si|{sji})p(s;hi,{Jij}).

To find the point of maximum likelihood for a single spin si, we calculate the analytical gradient and Hessian, ∂f/∂Jij and ∂2f/∂JijJi′j′ for a Newton conjugate-gradient descent method. After maximizing likelihood for all spins, the maximum likelihood parameters may not satisfy the symmetry Jij = Jji. We impose the symmetry by insisting that

(29)
J'ij=(Jij+Jji)/2.

Pseudolikelihood is extremely fast and often surprisingly accurate. Each calculation of the gradient is order O(RN2) and Hessian O(RN3), which must be done for all N. With analytic forms for the gradient and Hessian, the conjugate-gradient descent method tends to converge quickly.

Pseudolikelihood for the Ising model is implemented in Pseudo.

Minimum Probability Flow (MPF)

Minimum probability flow involves analytically approximating how the probability distribution changes as we modify the configurations [, ]. In the methods so far mentioned, the approach has been to maximize the objective (the likelihood function) by immediately taking the derivative with respect to the parameters. With MPF, we first posit a set of dynamics that will lead the data distribution to equilibrate to that of the model. When these distributions are equivalent, then there is no “probability flow” between them. This technique is closely related to score matching, where we instead have a continuous state space and can directly take the derivative with respect to the states without specifying dynamics [].

First note that Monte Carlo dynamics (satisfying ergodicity and detailed balance) would lead to equilibration to the stationary distribution. The dynamics are specified by a transition matrix, an example of which is given in Ref []:

(30)
p˙s=ssΓsspsssΓssps
(31)
Γss=ɡssexp[12(EsEs)]

with transition probabilities Γss′ from state s′ to state s. The connectivity matrix ɡss′ specifies whether there is edge between states s and s′ such that probability can flow between them. By choosing a sparse ɡss′ while not breaking ergodicity, we can drastically reduce the computational cost of computing this matrix.

Imagine that we start with the distribution over the states as given by the data and run the Monte Carlo dynamics. When data and model distributions are different, probability will flow between them and indicate that the parameters must be changed. By minimizing a derivative of the Kullback-Leibler divergence, we measure how the difference between the model and the states in the data D changes when the dynamics are run for an infinitesimal amount of time.

(32)
L({λk})tDKL(p(0)||p(t)({λk}))=sDp˙s(λk)

The idea is that this derivative is also minimized with optimal parameters: the MPF algorithm looks for a minimum of the objective function L.

For the Ising model, each evaluation of the objective function where Γss′ connects each data state with G neighbors has runtime O(RGN2). In a large fully connected system, G2N would be prohibitively large so a sparse choice is necessary.

MPF is implemented in the MPF class.

Regularized mean-field method

One attractively simple and efficient approach uses a regularized version of mean-field theory. In the inverse Ising problem, mean-field theory is equivalent to treating each binary individual as instead having a continuously varying state (corresponding to its mean value). The inverse problem then turns into simply inverting the correlation matrix C []:

(33)
Jijmean-field=(C1)ijpi(1pi)pj(1pj),

where

(34)
Cij=pijpipjpi(1pi)pj(1pj),

and where pi corresponds to the frequency of individual i being in the active (+1) state and pij is the frequency of the pair i and j being simultanously in the active state.

A simple regularization scheme in this case is to discourage large values in the interaction matrix Jij. This corresponds to putting more weight on solutions that are closer to the case with no interactions (independent individuals). A particularly convenient form adds the following term, quadratic in Jij, to the negative log-likelihood:

(35)
γii<jJij2pi(1pi)pj(1pj).

In this case, the regularized version of the mean-field solution in (33) can be solved analytically, with the slowest computational step coming from the inversion of the correlation matrix. For details, see Refs. [, ].

The idea is then to vary the regularization strength γ to move between the non-interacting case (γ → ) and the naively calculated mean-field solution (33) (γ → 0). While there is no guarantee that varying this one parameter will produce solutions that are good enough to “fit within error bars,” this approach has been successful in at least one case of fitting social interactions [].

The inversion of the correlation matrix is relatively fast, bounded by O(N3). Finding the optimal γ involves Monte Carlo sampling from the model distribution, which has computational cost similar to MCH. It is, however, much more efficient because we are only optimizing a single parameter.

This is implemented in RegularizedMeanField.

Cluster expansion

Adaptive cluster expansion [, ] iteratively calculates terms in the cluster expansion of the entropy S:

(36)
SS0=ΓΔSΓ,

where the sum is over clusters Γ and in the exact case includes all 2N – 1 possible nonempty subsets of individuals in the system. In the simplest version of the expansion, one expands around S0 = 0. In some cases it can be more advantageous to expand around the independent individual solution or one of the mean-field solutions described in the previous section [].

The inverse Ising problem is solved independently on each of the clusters, which can be done exactly when the clusters are small. These results are used to construct a full interaction matrix Jij. The expansion starts with small clusters and expands to use larger clusters, neglecting any clusters whose contribution ΔSΓ to the entropy falls below a threshold. To find the best solution that does not overfit, the threshold is initially set at a large value and then lowered, gradually including more clusters in the expansion, until samples from the resulting Jij fit the desired statistics of the data sufficiently well.

The runtime will depend on the size of clusters included in the expansion. If the expansion is truncated at clusters of size n, the worst-case runtime would be O((Nn)2n). The point is that S can often be accurately estimated even when nN. The adaptive cluster expansion method is implemented in the ClusterExpansion class.

Implementation and architecture

The package is divided into three principal modules containing the algorithms for solving the inverse maxent problem (solvers.py), the Monte Carlo Markov Chain (MCMC) sampling algorithms (samplers.py), and supporting “utility” functions for the remaining modules (utils.py) as shown in Figure 2. Besides the utils.py module, the package is organized around classes that correspond to different algorithms. This class-based structure ensures that the state of the solver or sampler, including the data it was fit to and the current guess for the parameters, are all contained within the instance of the algorithm class. As a result, the current state of work can be saved and moved between workstations using the Python package dill.

Figure 2 

Brief summary of ConIII architecture. The principal modules are solvers.py, samplers.py, and utils.py. The module solvers.py contains classes based on Solver that each implement a different algorithm for solving the relevant inverse maxent problem accessible through the method Solver.solve(). The samplers.py module contains the Metropolis algorithm for Monte Carlo Markov Chain sampling and will support other samplers in future versions (gray font) including Wolff sampling, Swendsen-Wang sampling, and parallel tempering. The utils.py module contains supporting functions for the other modules such as the few examples listed. ConIII’s modularized structure ensures that contributed algorithms can be appended independently of existing code.

For the solvers, the different algorithms available are accessible from the coniii.solvers module as listed in Figure 2. These algorithm classes are all derived from a base Solver class as shown in Figure 2. The module Solver.solve serves as the interface for solving the inverse maxent problem. To keep the solution algorithms generic enough to solve a variety of different maxent problems, they all require that the user define the maxent model upon instantiation through the definition of keyword arguments like calc_observables. The particular methods required to specify the maxent problem differ by algorithm, but for the pairwise maxent problem we have made it easy by defining those functions as part of the package. These helper functions are available as part of the utils.py module and their use is demonstrated in the Jupyter notebook usage guide.

The MCMC sampling algorithms are likewise based on a class architecture derived from Sampler as shown in Figure 2. Each instance of Solver automatically instantiates this class under Solver.sampler and wraps calls to it. For the Ising model, this is an instance of Metropolis. Other sampling algorithms listed in the samplers.py box in Figure 2 will be released with later versions of this package.

Quality Control

For checks of basic functionality, the package is released with unit tests that can be run with the Python package pytest.

The most direct test of the algorithms is to generate a system where the parameters are known, sample from the system to generate a data set, and run the inverse solution to make sure that the correlations and parameters match the known values. With a finite sample, exact correspondence to the correct parameters is not expected although differences should decrease with a larger sample. Furthermore, most of the algorithms only return an approximate solution such that the fidelity of the found parameters to the original ones will depend on the sample size and whether or not the approximation is valid. The Jupyter notebook released with the software provides examples for using the algorithms included in ConIII for a random system of five spins. We recommend that the user run this notebook to check how well different algorithms converge to the solutions depending on the algorithm and sample size.

More importantly, the user can check if the algorithms match the expected correlations closely or not. How one checks the validity of a particular maxent model for data is beyond the scope of this paper, but we point the reader to the appendix of Ref [] where the methodology is explained in detail for a broad audience.

If there are any issues or bugs in the software, we organize improvements and patches through the GitHub repository where both issues can be filed and pull requests made.

(2) Availability

Operating system

Linux, MacOS, Windows

Programming language

Python 3.6, 3.7

Dependencies

Python packages multiprocess ≥ v0.70.5 and <v1, numpy, scipy, joblib, matplotlib, numba ≥ v0.39.0, dill.

Software location

Name: PyPI

Persistent identifier:https://pypi.org/project/coniii/

Licence: MIT License

Publisher: Edward D. Lee

Version published: v1.1.4

Date published: 1/6/2019

Code repository

Name: GitHub

Persistent identifier:https://github.com/eltrompetero/coniii

Licence: MIT License

Version published: v1.1.4

Name: Zenodo

Persistent identifier:https://doi.org/10.5281/zenodo.2236632

Licence: MIT License

Version published: v1.1.1

Language

English

(3) Reuse potential

To contribute either an algorithm for the inverse maxent problem or a sampling technique, we suggest following the template for the classes described in the base Solver and Sampler classes. New algorithms should be filed as a pull request to the GitHub repository along with an example solution that can be included in the usage guide Jupyter notebook and unit tests.

Documentation for the package is included as part of the GitHub repository and also hosted online at https://eddielee.co/coniii/index.html.