(1) Overview

Introduction

A homogeneity test is a statistical method that checks whether two (or more) datasets come from the same distribution. In time series analysis, a homogeneity test is applied to detect one (or more) change-points (breakpoints) in the series, that is, points at which the distribution of the data changes. This ability to detect a change in distribution makes the homogeneity test essential in statistical analysis. Homogeneity is one of the most important assumptions in time series analysis, for example when predicting financial time series or when analysing and forecasting historical meteorological data. If non-homogeneity goes undetected, the choice of the best model can be influenced by the selected sample. Several tests are available to check homogeneity, viz. the Pettitt test, the Standard Normal Homogeneity Test (SNHT), Buishand’s test, etc. These tests can be performed using commercial software packages such as XLSTAT []. Programming languages such as R and Matlab have packages or scripts for these tests [, ]. Python is one of the tools most widely used by data scientists, and a large number of data analysis and research tools are developed in Python. However, until now there has been no Python package for homogeneity testing. A single Python package that performs the most widely used homogeneity tests will save time and bring diversity into the analysis. The pyHomogeneity package fills this gap.

Implementation and architecture

pyHomogeneity is a pure Python implementation of homogeneity tests. It is built on two core Python packages, NumPy [] and SciPy []. Vectorized operations are used instead of traditional for loops to improve calculation speed. The package can perform six homogeneity tests (three distinct tests, one of which, Buishand’s test, has four variants), all of which are widely used in time series analysis. The tests available in the pyHomogeneity package are briefly discussed below:

Pettitt Test

In 1979, A. N. Pettitt proposed a change-point detection test based on the Mann-Whitney two-sample test []. For a continuous dataset, the Pettitt statistic $U(k)$ can be calculated using the equation below:

$$U(k) = 2\sum_{i=1}^{k} r_i - k(n+1)$$

Where, $r_1, r_2, r_3, \ldots, r_k$ are the ranks of the $k$ observations $x_1, x_2, x_3, \ldots, x_k$ in the complete sample of $n$ observations, and $U(k)$ is calculated for every $k = 1, 2, 3, \ldots, n$. The value of $k$ at which $|U(k)|$ attains its maximum is the probable change point. The approximate probability for a two-sided test is given by

$$p = 2\exp\left(\frac{-6\,\left(\max|U(k)|\right)^2}{n^3 + n^2}\right)$$

This approximation is good for $p \le 0.5$ []. The probability or the critical values for the test statistic can also be estimated using Monte Carlo simulation.
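The two Pettitt equations above can be sketched in a few vectorized NumPy lines. This is only an illustrative sketch, not the package’s own code: the function name `pettitt_statistic` is hypothetical, ties are given average ranks (an assumption, since the formula is stated for continuous data), and the change point is reported as the 1-based index k, which may differ from the convention used by pyHomogeneity.

```python
import numpy as np
from scipy.stats import rankdata


def pettitt_statistic(x):
    """Illustrative Pettitt test: returns (k, max|U(k)|, approximate p).

    Hypothetical helper, not part of pyHomogeneity.
    """
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    r = rankdata(x)                       # ranks in the full sample (ties averaged)
    k = np.arange(1, n + 1)               # k = 1, 2, ..., n
    U = 2.0 * np.cumsum(r) - k * (n + 1)  # U(k) = 2 * sum_{i<=k} r_i - k(n+1)
    idx = int(np.argmax(np.abs(U)))       # position of max |U(k)|
    K = float(np.abs(U[idx]))
    # Approximate two-sided probability (stated to be good for p <= 0.5)
    p = float(2.0 * np.exp(-6.0 * K**2 / (n**3 + n**2)))
    return idx + 1, K, p


# A series with an obvious upward shift after the 50th observation
x = np.concatenate([np.arange(50), 100 + np.arange(50)])
cp, K, p = pettitt_statistic(x)
print(cp, K, p)
```

Because the cumulative sum yields U(k) for every k at once, no explicit loop over candidate change points is needed.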

Standard Normal Homogeneity Test (SNHT)

The Standard Normal Homogeneity Test (SNHT) is based on the Ratio Test Method []. This method is best suited to detecting non-homogeneity near the beginning and the end of the series []. $T(k)$ is calculated by comparing the mean of the first $k$ data of the record with the mean of the last $n-k$ data as follows:

$$T(k) = k\,\bar{z}_1^{\,2} + (n-k)\,\bar{z}_2^{\,2}$$

Where,

$$\bar{z}_1 = \frac{1}{k}\sum_{i=1}^{k}\frac{x_i - \bar{x}}{S}$$

$$\bar{z}_2 = \frac{1}{n-k}\sum_{i=k+1}^{n}\frac{x_i - \bar{x}}{S}$$

$$\bar{x} = \text{mean} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$S = \text{sample standard deviation} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

$n$ = number of sample data

$T(k)$ reaches its maximum value when a breakpoint is located at data point $k$. The test statistic $T_0$ is defined as:

$$T_0 = \max_{k} T(k)$$

The null hypothesis will be rejected if $T_0$ is above a certain level, which is estimated using Monte Carlo simulation.
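As a rough illustration of the SNHT definition and of how a Monte Carlo estimate might be obtained, the sketch below computes T(k) for k = 1, ..., n-1 and then compares T0 with values simulated from standard normal series of the same length. The function names, the `sim` and `seed` parameters, and the choice of a standard normal null model are all assumptions made for this example; the simulation scheme used inside pyHomogeneity may differ.

```python
import numpy as np


def snht_statistic(x):
    """Illustrative SNHT: returns (k, T0). Hypothetical helper."""
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    z = (x - x.mean()) / x.std(ddof=1)   # standardize with the sample SD (n - 1)
    k = np.arange(1, n)                  # k = 1..n-1 keeps both segments non-empty
    csum = np.cumsum(z)[:-1]             # sum of z_1..z_k for each k
    z1 = csum / k                        # mean of the first k standardized values
    z2 = (z.sum() - csum) / (n - k)      # mean of the last n - k standardized values
    T = k * z1**2 + (n - k) * z2**2
    idx = int(np.argmax(T))
    return idx + 1, float(T[idx])


def snht_monte_carlo_p(x, sim=2000, seed=0):
    """Share of simulated null series whose T0 is at least the observed T0."""
    rng = np.random.default_rng(seed)
    n = np.asarray(x).size
    _, t0 = snht_statistic(x)
    null_t0 = np.array(
        [snht_statistic(rng.standard_normal(n))[1] for _ in range(sim)]
    )
    return float(np.mean(null_t0 >= t0))


# A step change after the 30th observation
x = np.concatenate([np.zeros(30), np.ones(30)])
cp, t0 = snht_statistic(x)
print(cp, t0, snht_monte_carlo_p(x, sim=500))
```

The null hypothesis would be rejected in this sketch when the simulated probability falls below the chosen significance level.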

Buishand’s Test

In 1982, Buishand proposed a homogeneity test method based on adjusted partial sums []. The test statistic is given below:

$$S(k) = \sum_{i=1}^{k}\frac{x_i - \bar{x}}{\sigma}$$

Where,

$$\sigma = \text{standard deviation} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

The value of $k$ at which $|S(k)|$ attains its maximum is the probable change point.

Buishand proposed four ways to check the sensitivity of this homogeneity test. These are:

Q test

In this method, $Q$ is calculated using the equation given below, and critical values for the test statistic are obtained from the table by Buishand [] or by Monte Carlo simulation.

$$Q = \frac{\max|S(k)|}{\sqrt{n}}$$

Range test

In this method, the range $R$ is calculated using the equation below. Critical values for the test statistic can be derived from the table by Buishand [] or by Monte Carlo simulation.

$$R = \frac{\max S(k) - \min S(k)}{\sqrt{n}}$$

Likelihood Ratio test

The test statistic $V$ is calculated from the equation below []. Critical values for the test statistic are derived from Monte Carlo simulation.

$$V = \max\left(\frac{|S(k)|}{\sqrt{k(n-k)}}\right)$$

U Test

According to Buishand, the $U$ statistic is robust and good for detecting a change point in the middle of a series []. The $U$ statistic is calculated using the equation below. Critical values for the test statistic can be found in the table given by Buishand [] or estimated by Monte Carlo simulation.

$$U = \frac{1}{n(n+1)}\sum_{k=1}^{n-1} S(k)^2$$
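The four Buishand statistics share the same adjusted partial sums, so they can be computed together. The sketch below is a hedged illustration only: `buishand_statistics` is a hypothetical name, σ is taken as the standard deviation of the whole series with NumPy’s default divisor n (an assumption), and Q uses the absolute values of S(k).

```python
import numpy as np


def buishand_statistics(x):
    """Illustrative Buishand statistics Q, R, V, U and the index of max |S(k)|.

    Hypothetical helper, not part of pyHomogeneity.
    """
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    sigma = x.std()                        # assumption: divisor n (ddof=0)
    S = np.cumsum(x - x.mean()) / sigma    # S(k) for k = 1..n; S(n) is ~0
    k = np.arange(1, n)                    # k = 1..n-1 for V and U
    S_inner = S[:-1]                       # S(1)..S(n-1)
    Q = np.max(np.abs(S)) / np.sqrt(n)
    R = (S.max() - S.min()) / np.sqrt(n)
    V = np.max(np.abs(S_inner) / np.sqrt(k * (n - k)))
    U = np.sum(S_inner**2) / (n * (n + 1))
    cp = int(np.argmax(np.abs(S))) + 1     # 1-based k of the probable change point
    return {"cp": cp, "Q": float(Q), "R": float(R), "V": float(V), "U": float(U)}


# A step change after the 30th observation
x = np.concatenate([np.zeros(30), np.ones(30)])
print(buishand_statistics(x))
```

Each returned value would then be compared with a tabulated or simulated critical value for the corresponding test.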

Example

A quick example of pyHomogeneity usage is given below.

import numpy as np

import pyhomogeneity as hg

# Data generation for analysis

data = np.random.rand(360,1)

result = hg.pettitt_test(data)

print(result)

The output looks like this:

Pettitt_Test(h=False, cp=89, p=0.1428, U=3811.0, avg=mean(mu1=0.5487521427805625, mu2=0.46884198890609463))

The output is a named tuple, so users can access a specific result by name:

print(result.cp)

print(result.avg.mu1)

Alternatively, users can unpack the results directly:

h, cp, p, U, mu = hg.pettitt_test(data, 0.05)

Users can plot the results as follows (Figure 1):

import matplotlib.pyplot as plt

# Plot limits, detected change point, and the segment means

mn = 0

mx = len(data)

loc = result.cp

mu1 = result.avg.mu1

mu2 = result.avg.mu2

plt.figure(figsize=(16,6))

plt.plot(data, label="Observation")

plt.hlines(mu1, xmin=mn, xmax=loc, linestyles='--', colors='orange', lw=1.5, label='mu1 : ' + str(round(mu1,2)))

plt.hlines(mu2, xmin=loc, xmax=mx, linestyles='--', colors='g', lw=1.5, label='mu2 : ' + str(round(mu2,2)))

plt.axvline(x=loc, linestyle='-.', color='red', lw=1.5, label='Change point : ' + str(loc) + '\n p-value : ' + str(result.p))

plt.title('Title')

plt.xlabel('X')

plt.ylabel('Y')

plt.legend(loc='upper right')

plt.savefig("F:/homogeneiry_results_plot.jpg", dpi=600)

Figure 1

Homogeneity result plot.

Users can find more examples in the example section of pyHomogeneity’s GitHub repository.

Quality control

Tests for the pyHomogeneity package are performed using fixed random data for which the results are known, so the correctness of the functions is easily determined by comparing their output with the known results. Anyone can run the unit tests locally by using the command below in the root of a local copy of pyHomogeneity:

pytest -v

In addition, pyHomogeneity uses the continuous integration (CI) platform Travis CI for automatic testing after each change to the code base is uploaded to the source repository []. The tests on Travis CI have been performed using different Python versions (2.7, 3.4, 3.5, 3.6, 3.7, 3.8) on a Linux system. Users may raise issues on GitHub for additional support on using the package. Anyone can also contribute to this package. The contributor guideline can be found in the Contribution section on GitHub.

(2) Availability

Operating system

This package is platform independent, so it can be run on any operating system (GNU/Linux, Mac OSX, Windows) on which Python runs.

Programming language

Python 2.7 and 3.4+

Additional system requirements

None

Dependencies

pyHomogeneity is written using core Python packages. Only NumPy and SciPy are required to use it.

Software location

Archive

Name: Zenodo

Persistent identifier: http://doi.org/10.5281/zenodo.3785287

Licence: MIT

Publisher: Md. Manjurul Hussain Shourov

Version published: 1.1 and earlier versions. The DOI above always resolves to the latest version; previous versions can be identified by separate DOIs (see the versions section on the Zenodo repository page).

Date published: 04/05/2020

Code repository

Name: GitHub

Identifier: https://github.com/mmhs013/pyHomogeneity

Licence: MIT

Date published: 04/05/2020

Language

English

(3) Reuse potential

pyHomogeneity is a package for Python, a widely used and freely available programming language. Because the package is written in Python, it is platform-independent and can therefore be used by the majority of the data science community. It is a statistical analysis tool that performs different types of homogeneity tests on time series data, so it can be used for data quality checks in study or academic research. Many researchers have already started to use the pyHomogeneity package in their research [, , , ].

Every function has a docstring that explains what the function does and the options that the user can set. The user documentation of pyHomogeneity is hosted in the GitHub repository. The documentation contains examples that can be easily modified for different user scenarios. pyHomogeneity is released under the MIT license and welcomes contributions. We encourage users to submit feedback through the GitHub issue tracker or by emailing the authors.