(1) Overview

Introduction

It is increasingly common to obtain sampled data in the form of graph or network realisations. This is either through construction of graph structures as summary statistics for multivariate spatial-temporal data or directly as a graph-network data set from say a social network realisation or the like. In such settings in practice, one often needs to draw inferences about two samples of such graphs or networks. That is, to test whether the collections of graphs/networks in one sample were generated from the same distribution as the collection of graphs/networks in the second sample.

Kernel methods have proven to be useful in pattern recognition tasks such as classification and can be further extended as an inference procedure to two-sample hypothesis testing on structured data. The method embeds the graphs into a reproducing kernel Hilbert space (RKHS) via a feature map which is then extended further to the embedding of a probability distribution. The two-sample null hypothesis is that the generating mechanism behind the two samples is the same and the test statistic, which is called the maximum mean discrepancy (MMD), is the largest distance between the means of the two sample embeddings [].

Graph kernels are already well established and widely used for solving classification tasks on graphs and can further be used to compare samples of graphs and to perform graph screening []. They provide a very flexible way of comparing graphs as they exist for a wide range of different graph structures, for example, weighted, directed, labelled, and attributed graphs. Their performance depends on their expressiveness, that is, their ability to distinguish non-isomorphic graphs. The difficulty of distinguishing two samples of graphs varies strongly based on the type of graphs.

Graph two-sample hypothesis testing is a problem that frequently arises in various disciplines, for example in bio informatics [], community detection [], and risk management []. Graph two-sample hypothesis have mostly been performed by using graph statistics such has the degree centrality and shortest paths. Although these methods can often give good performances they fail to take into account various attributes that are often present in real graphs such as node labels, edge labels, node attributes and edge weights. When the kernel two-sample hypothesis testing was introduced [] a flood gate opened to allow for testing of such attributes and therefore providing a flexible way of performing two-sample hypothesis testing. Luckily, there also exists a vast literature on graph kernels [, ]. Until now, there is no package which allows one to estimate graphs from real valued data matrices and perform hypothesis testing in a flexible manner. The package GTST provides three functionalities

  1. Functions to estimate the two-sample hypothesis test statistic;
  2. Functions to calculate different graph kernels; and
  3. A function to estimate graphs from data matrices.

The package allows other network science researches who may not be programmers to use the MMD testing framework.

There exists a python package called GraKel [] which is dedicated to calculating various graph kernels. The package is very user-friendly so the GTST user can use all graph kernels available in the Grakel package. The GTST package extends the choice of graph kernels available for use in graph two-sample testing, by also developing a collection of kernels not available in GraKel. Such as, the fast random walk kernels which is based on ideas from [] along with an additional fast random walk kernel for edge-labelled graphs; The Wasserstein Weisfeiler-Lehman Graph kernel [] whose original code was adjusted for the package needs. The Deep Graph kernel [], and the graph neural tangent kernel [] whose original code was adjusted for the package needs. The MONK estimator, which is a robust estimator of the MMD, was developed by MONK [] and they do provide the code online and in a packaged environment. However, we have adjusted the code slightly to allow for robust comparing of samples of different sizes. The MMDGraph then estimates the p-value of test by using a bootstrap or a permutation sampling scheme. The package also allows for estimating graphs using sklearn’s graphical lasso []. Additional preprocessing can be done by using the nonparanormal transform []. The best graph is found by using the EBIC criterion []. GTST assumes that the graphs passed are a networkx object []. One can additionally use pre-computed kernels to perform tests.

Implementation and architecture

Let G(V, S) denote a graph with vertex set V and edge set S. In the two-sample testing of graph-valued data, we assume we are given two sets of samples/observations that comprise collections of graph-valued data {G1, …, Gn} and {G1,,Gn} where Gi,GjΩ,i,j. The graphs in the two samples are all generated independently from two probability spaces (Ω,,) and (Ω,,), and the goal is to infer whether ℙ = ℚ. We note that in this general testing framework the vertex sets and edge-sets do not have to be equal, but they can be common if it is desirable for the application. This is therefore a very general testing framework. In the simplest case, where we assume both sets of sample graphs come from a common set of vertices, then the sample space Ω contains all possible edges that can occur in a graph G, that is Ω={(v1,v2),,(v1,v|V|),(v2,v1),,(v|V|,v|V|1)}. As the sample space is discrete we can define the σ-algebra as the power set of Ω, namely, =(Ω). The probability function :[0,1] then defines the probability of obtaining a certain graph in the sample set of graph-valued data. As an example we can define for instance a population distribution to be uniform

(G(V,S))=1(M|S|)

where M=(2|V|) is the total number of possible edges and G(V, S) is a graph with |V| vertices and |S| number of edges. This setting is illustrated in Figure 1.

Figure 1 

The graph two sample testing scenario. Here we have observed 4 graphs from ℙ and 3 graphs from ℚ. The sample space of ℙ and ℚ is the same (( 25) possible edges).

Now, returning to the concept of two-sample testing for graph-valued data. The goal is to infer whether the two samples of graphs are generated according to the same distribution. This involves developing a statistical test T({G}i=1n,{G}i=1n) to determine from the population samples whether there is sufficient evidence to reject the null hypothesis that both population distributions generating the two samples of graphs are equivalent, where T({G}i=1n,{G}i=1n):{G}i=1n×{G}i=1n{0,1} is a function that distinguishes between the null hypothesis and the alternative hypothesis:

H0:=H1:.

The test statistic used in this case is the largest distance between the expectation of some function w.r.t. to the two probability distributions. Let B be a class of functions f:ΩR. The maximum mean discrepancy (MMD) is defined as:

MMD[B,,]:=supfB(EG~[f(G)]EG~[f(G)]).

Where the class of function, B, is the unit ball in a RKHS, then the squared population MMD becomes kernalized:

MMD2[1,,]=E,[k(G,G)]2E,[k(G,G)]+E,[k(G,G)],

where k is some graph kernel. The kernel matrix plays a central role and is defined as:

Kij=k(Gi,Gj),

where {Gi}i=12n is the data. Note, both sampels are included in the data. In the context of two sample graph testing, we have two data sources, sample 1 of graphs assumed drawn from P and sample 2 of graphs assumed drawn from ℚ. It can be good to order the kernel matrix such that K has a block structured as follows:

(1)
K=[KKKK],

where Kℙℚ is the Kernel function evaluated at data points within the sample coming from the unknown distributions ℙ and ℚ. Kℙℙ, Kℚℙ and Kℚℚ are defined analogously. Note K=KT.

Different estimates of MMD2 are:

Unbiased estimate

An unbiased estimate of MMD2 for n and n’ samples of graphs is given by:

(2)
MMD^u2[B,,]=1n(n1)i=1njink(Gi,Gj)+1n(n1)i=1njink(Gi,Gj)2nni=1nj=1nk(Gi,Gj).

Biased estimate

A biased estimate of MMD2 for n and n’ samples of graphs is given by:

(3)
MMD^b2[B,,]=1n2i=1nj=1nk(Gi,Gj)+1n2i=1nj=1nk(Gi,Gj)2nni=1nj=1nk(Gi,Gj).

Unbiased linear time estimate

An unbiased can be computed in O(n) time. Assume that n = n’ and define n2 = ⌊n/2⌋, where ⌊∙⌋ is the floor function, then the linear estimate is computed as:

MMD^l2[B,,]=1n2i=1n2(k(G2i1,G2i)+k(G2i1,G2i)k(G2i1,G2i)k(G2i,G2i1)),

while MMD^l2 has higher variance than MMD^u2, it is computationally much more appealing.

Robust Estimate

Consider the partition {Sq}q1:Q of |Sq|=n/Q partitions. By the representer theorem [] we can express a function f within the RKHS space as:

f=iaik(,xi)+ibik(,yi).

A robust MMD estimator is found by solving:

maxc:cTKc1medq1:Q{1|Sq|[1q,1q]Kc},

where c=[a,b]2n,K=[K,K;K,K]2n×2n, where Kℙℚ is the Kernel function evaluated at data points within the sample coming from the unknown distributions ℙ and ℚ, and 1qn is an indicator vector of block q. For more details see [, ].

The package includes graph kernels such as:

  • Random walk kernels which can be used on weighted, directed, undirected, and bipartite graphs with node labels, node attributes, edge labels, and edge attributes.
  • Deep graph kernels can be used on node labelled binary graphs.
  • Graph neural tangent kernels can be used on binary graphs with node labels/attributes.
  • Wasserstein Weisfeiler-Lehman graph kernels can be used on undirected graphs with node labels and can be used on node-attributed graphs with the right embedding scheme as well.

Other graph kernels can be utilized via the graph kernel library GraKel [] which is dedicated to calculating various graph kernels. The package is very user-friendly so the GTST user can use all graph kernels available in the Grakel package. For more details on various graph kernels see [, , ].

Example

The workflow is as follows: 1) Use two data arrays to estimate two sequences/samples of graphs using the graphical lasso []. This step can be skipped if the practitioner already has the samples of graphs in a networkx format []. 2) Select a graph kernel. 3) Select an estimator of the MMD or try multiple estimators to obtain a p-value. A quick example for the random walk kernel is:

import numpy as np
import networkx as nx
import GTST
 
# Generate random samples
n1 = n2 = 50 # sample sizes
# generate two samples
 
g1 = [nx.fast_gnp_random_graph (30,0.3) for _ in range (n1)]
g2 = [nx.fast_gnp_random_graph (30,0.4) for _ in range (n2)]
 
# Random Walk, r is number of eigen-pairs, c is the discount constant
MMD_out = GTST.MMD()
MMD_out.fit(G1 = g1,
        G2 = g2,
        kernel = 'RW_ARKU_PLUS',
        mmd_estimators = 'MMD_u',
        r = 6,
        c = 0.001)
print (f" RW_ARKU_plus {MMD_out.p_values}")

For more exampels please see https://github.com/ragnarlevi/GTST.

Quality control

The requirements for running test for GTST is listed in the requirements_dev.txt in the Github page. The tests involve testing if kernels matrices are positive semi definite and whether the kernels, test statistic, and permutation method for the p-value are able to reject the null when the null is “extremely” false using known random data sets. Once the required testing packages have been installed, the tests can be performed by running the command:

pytest

in the root folder, which will take around 15 minutes. This will generate a coverage report which can be found in the htmlcov directory. To view it run:

cd htmlcov python - m http.server

and open the localhost link (something like http://localhost:8000/) in a browser.

To test GTST in a clean environment for all python versions from 3.7–3.10, we use Tox. This can be achieved by running

tox

in the root directory. Note that this takes significantly longer to run, so is best performed as a final check. Whenever the code is pushed to the remote repository, the Tox test suite is automatically run using GitHub actions. To investigate this process, consult the file found at .github/workflows/tests.yml on the github page. The output of the report can be found in the github actions tab.

A coverage report was made by using coveralls locally, but can be found by clicking the coverage shield on the github page.

(2) Availability

Operating system

This package can be run on any operating system where python can be run (GNU/Linux, Mac OSX, Windows).

Programming language

python 3.7+

Additional system requirements

None

Dependencies

Packages that are required are scipy>=1.6.1, scikit-learn>=1.0.2, tqdm>=4.59.0, numpy>=1.20.1, networkx>=2.5, POT>= 0.8.2.

The Grakel package is optional but is recommended to be installed so an extra suite of graph kernels can be used.

Software location

Archive

Name: Zenodo

Persistent identifier: https://doi.org/10.5281/zenodo.8037197

Licence: MIT

Publisher: Ragnar Leví Guðmundarson

Version published: 0.0.6

Date published: 14/06/23

Code repository

Name: Github

Persistent identifier: https://github.com/ragnarlevi/GTST

Licence: MIT

Date published: 11/11/22

Language

English

(3) Reuse potential

The code was originally used in the paper [] to compare pairwise asset return process relationships to study and understand risk and return in portfolio management practice. This allows one to statistically test for significance of any detected differences in portfolio diversification between any portfolio investment strategy when applying differing investment screening criteria or optimal investment strategies. We further remark that this package can further be used in other fields other than portfolio comparison. For example, the package allows, in a straight forwards manner, to compare different community structures, to detect changes in communities, to detect change-point events, to test for differences in traffic networks, and to compare ego-networks of entities.

Every function has docstrings to ensure clarity. Examples for all graph kernels, some Grakel kernels and estimators can be found on the Github page, along with an example of the graph estimation functionality. GTST is released under the MIT license and welcomes any contributions. We encourage users to submit feedback using GitHub issue tracker, or by emailing Ragnar.

List of Contributors

RLG wrote the software package, developed the methodology, wrote the paper, and tested the software. GWP guided the problem formulations and solutions conceptually.