METAPAPER WekaPyScript: Classification, Regression, and Filter Schemes for WEKA Implemented in Python

WekaPyScript is a package for the machine learning software WEKA that allows learning algorithms and preprocessing methods for classification and regression to be written in Python, as opposed to WEKA's implementation language, Java. This opens up WEKA to Python's machine learning and scientific computing ecosystem. Furthermore, due to Python's minimalist syntax, learning algorithms and preprocessing methods can be prototyped easily and utilised from within WEKA. WekaPyScript works by running a local Python server using the host's installation of Python; as a result, any libraries installed in the host installation can be leveraged when writing a script for WekaPyScript. Three example scripts (two learning algorithms and one preprocessing method) are presented.


Introduction
WEKA [1] is a popular machine learning workbench written in Java that allows users to easily classify, process, and explore data. There are many ways WEKA can be used: through the WEKA Explorer, users can visualise data, train learning algorithms for classification and regression, and examine performance metrics; in the WEKA Experimenter, datasets and algorithms can be compared in an automated fashion; or it can simply be invoked on the terminal or used as an external library in a Java project.
Another machine learning library that is becoming increasingly popular is Scikit-Learn [2], which is written in Python. Part of what makes Python attractive is its ease of use, minimalist syntax, and interactive nature, which make it an appealing language to learn for non-specialists. As a result of Scikit-Learn's popularity, the wekaPython [3] package was released, which allows users to build Scikit-Learn classifiers from within WEKA. While this package makes it easy to access the host of algorithms that Scikit-Learn provides, it does not provide the capability of executing external custom-made Python scripts, which limits WEKA's ability to make use of other interesting Python libraries. For example, in the world of deep learning (currently a hot topic in machine learning), Python is widely used, with libraries or wrappers such as Theano [4], Lasagne [5], and Caffe [6]. The ability to create classifiers in Python would open up WEKA to popular deep learning implementations.
In this paper we present a WEKA classifier and a WEKA filter, PyScriptClassifier and PyScriptFilter (under the umbrella "WekaPyScript"), that are able to call arbitrary Python scripts using the functionality provided by the wekaPython package. So long as the script conforms to what WekaPyScript expects, virtually any kind of Python code can be called. We present three example scripts in this paper: one that re-implements WEKA's ZeroR classifier (i.e., simply predicts the majority class from the training data), one that makes use of Theano in order to train a linear regression model, and a simple filter that standardises numeric attributes in the data. Theano is a symbolic expression library that allows users to construct arbitrarily complicated functions and automatically compute their derivatives, which makes it trivial to implement classifiers such as logistic regression or feedforward neural networks (according to Baydin et al. [7], the use of automatic differentiation in machine learning is scant).
In our research, we used this package to implement new loss functions for neural networks using Theano and compare them across datasets using the WEKA Experimenter.

Implementation and architecture
In this section, we explain how wekaPython is implemented and how WekaPyScript makes use of it to allow classifiers and filters to be implemented in Python.

wekaPython
WekaPyScript relies on a package for WEKA 3.7 called wekaPython [3]. This package provides a mechanism that allows the WEKA software, which is running in a JVM, to interact with CPython, the implementation of the Python language written in C. Although there are versions of the Python language that can execute in a JVM, there is a growing collection of Python libraries for scientific computing that are backed by fast C or Fortran implementations, and these are not available when using a JVM-based version of Python.
In order to execute Python scripts that can access packages incorporating native code, the wekaPython package uses a micro-service architecture. The package starts a small server, written in Python, and then communicates with it over local sockets. The server implements a simple protocol that allows WEKA to transfer and receive datasets, invoke CPython scripts, and retrieve the values of variables set in Python. The format for transporting datasets to and from Python is comma-separated values (CSV). On the Python side, the fast CSV parsing routine from the pandas package [8] is used to convert the CSV data read from a socket into a data frame data structure. On the WEKA side, WEKA's CSVLoader class is used to convert CSV data sent back from Python.
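To make the transport idea concrete, the following is a minimal sketch of sending a dataset as CSV over a local socket and parsing it on the receiving end. It is not wekaPython's actual protocol, which is not reproduced here: the socket pair stands in for the WEKA-to-server connection, the tiny dataset is invented for illustration, and the standard-library csv module is swapped in for pandas so that the sketch has no third-party dependencies.

```python
import csv
import io
import socket

# A connected pair of local sockets stands in for the WEKA <-> Python link.
sender, receiver = socket.socketpair()

# "WEKA side": serialise a tiny (made-up) dataset as CSV and send it.
rows = [
    ["sepallength", "class"],
    ["5.1", "Iris-setosa"],
    ["7.0", "Iris-versicolor"],
]
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
sender.sendall(buffer.getvalue().encode("utf-8"))
sender.close()  # closing signals end-of-data to the reader

# "Python side": read until end-of-data, then parse the CSV text.
chunks = []
while True:
    chunk = receiver.recv(4096)
    if not chunk:
        break
    chunks.append(chunk)
receiver.close()

text = b"".join(chunks).decode("utf-8")
records = list(csv.DictReader(io.StringIO(text)))
```

With pandas installed, the parsing step would instead read the same text into a data frame, which is the structure the text above describes.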
The two primary goals of the wekaPython package are to: (a) allow users of WEKA to execute arbitrary Python scripts in a Python console implemented in Java or as part of a data processing workflow; and (b) enable access to classification and regression schemes implemented in the Scikit-Learn [2] Python library. In the case of the former, users can write and execute scripts within a plug-in graphical environment that appears in WEKA's Explorer user interface, or by using a scripting step in WEKA's Knowledge Flow environment. In the case of the latter, the package provides a "wrapper" WEKA classifier implementation that executes Python scripts to run Scikit-Learn algorithms. Because the wrapper classifier implements WEKA's Classifier API, it works in the same way as a native WEKA classifier, which allows it to be processed by WEKA's evaluation routines and used in the Experimenter framework. Although the general scripting functionality provided by wekaPython allows users to write scripts that access machine learning libraries other than Scikit-Learn, such scripts do not appear as a native classifier to WEKA and cannot be evaluated in the same way as the Scikit-Learn wrapper. The goal of the WekaPyScript package described in this paper is to provide this functionality.

WekaPyScript
The new PyScriptClassifier and PyScriptFilter components contain various options such as the name of the Python script to execute and arguments to pass to the script when training or testing. The arguments are represented as a semicolon-separated list of variable assignments. All of WekaPyScript's options are described below in Table 1. Figures 1 and 2 show the GUI in the WEKA Explorer for PyScriptClassifier and PyScriptFilter, respectively.
When PyScriptClassifier/PyScriptFilter is invoked, it will utilise wekaPython to start up a Python server on localhost and construct a dictionary called args, which contains either the training or the testing data (depending on the context) and meta-data such as the attribute names and their types.This meta-data is described in Table 2.
This args dictionary can be augmented with extra arguments by using the -args option and passing a semicolon-separated list of variable assignments. For instance, if -args is alpha=0.01;reg='l2' then the dictionary args will have a variable called alpha (with value 0.01) and a variable reg (with value 'l2'), and these will be available for access at both training and testing time.
Given some Python script, PyScriptClassifier trains the model by calling a function in the specified Python script called train, passing it the args object; this function should return (in some form) something that can be used to reinstantiate the model. When the resulting WEKA model is saved to disk (e.g., through the command line or the WEKA Explorer), it is this returned model that gets serialised (thanks to wekaPython's ability to receive variables from the Python VM). If the -save flag is set, the WEKA model will internally store the Python script so that at testing time the script specified by -script is not needed. This is not ideal, however, if the script is going to be changed frequently in the future.
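The argument handling and training call described above can be sketched as follows. This is an illustration under stated assumptions, not WekaPyScript's own code: parse_extra_args and load_script are hypothetical helpers, the toy script and the X_train/y_train contents are invented, and importlib is used to load the script because the imp module is no longer available in recent Python releases.

```python
import ast
import importlib.util
import os
import tempfile


def parse_extra_args(spec):
    """Turn "alpha=0.01;reg='l2'" into {'alpha': 0.01, 'reg': 'l2'}.

    Hypothetical helper: it splits naively on ';', so values that
    themselves contain ';' are not supported.
    """
    extra = {}
    for assignment in filter(None, (part.strip() for part in spec.split(";"))):
        name, _, value = assignment.partition("=")
        extra[name.strip()] = ast.literal_eval(value.strip())
    return extra


def load_script(path):
    """Import a Python file as a module (hypothetical stand-in for imp.load_source)."""
    spec = importlib.util.spec_from_file_location("user_script", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# A toy script exposing the train(args) entry point described in the text.
SCRIPT = '''
def train(args):
    # The "model" is any object from which the script can rebuild its state.
    return {"alpha": args["alpha"], "reg": args["reg"]}
'''

with tempfile.TemporaryDirectory() as tmp:
    script_path = os.path.join(tmp, "toy.py")
    with open(script_path, "w") as handle:
        handle.write(SCRIPT)

    # Assumed data keys; the real args also carries meta-data (see Table 2).
    args = {"X_train": [[1.0], [2.0]], "y_train": [0, 1]}
    args.update(parse_extra_args("alpha=0.01;reg='l2'"))
    model = load_script(script_path).train(args)
```

Using ast.literal_eval rather than eval keeps the sketch from executing arbitrary expressions found in the option string; whether WekaPyScript makes the same choice is not stated in the text.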
When PyScriptClassifier needs to evaluate the model on test data, it deserialises the model, sends it back into the Python VM, and calls the script's test function with the args object and the model.
PyScriptFilter also has a train function that works in the same way. Instead of a test function, however, a filter has a process(args, model) function, which is applied to both the training and testing data. This function returns a modified version of the args object (this is because filters may change the structure, i.e., attributes, and contents of the data). This new args object is then automatically converted back into WEKA's internal ARFF file representation, which can then be input into another filter or classifier.
The skeleton of a Python filter is shown in Listing 2.

Example use
In this section we present three examples: a classification algorithm that simply predicts the majority class in the training data; an excerpt of a linear regressor that uses automatic differentiation; and a filter that standardises numeric attributes in the data.

ZeroR
The first example we present is one that re-implements WEKA's ZeroR classifier, which simply finds the majority class in the training set and uses that for all predictions (see Listing 3).
In the train function we simply count all the classes in y_train and return the index (starting from zero) of the majority class, m (lines 5-7). So for this particular script, the index of the majority class is the "model" that is returned. In line 15 of the test function, we convert the majority class index into a (one-hot encoded) probability distribution by indexing into a k × k identity matrix, and in line 16, return this vector for all n test instances (i.e., it returns an n × k array, where n is the number of test instances and k is the number of classes; for each row n_i, the entry n_im = 1 and the other entries in n_i are zero).
Here is an example use of this classifier from a terminal session (assuming it is run from the root directory of the WekaPyScript package, which includes zeror.py in its scripts directory and iris.arff in the datasets directory):

    java weka.Run .PyScriptClassifier \
      -cmd python \
      -script scripts/zeror.py \
      -t datasets/iris.arff \
      -no-cv
This example is run on the entire training set (i.e., no cross-validation is performed) since the standard -no-cv flag for WEKA is supplied. We have also used -cmd to tell WekaPyScript where the Python executable is located (in our case, it is on the PATH, so we only have to specify the executable name rather than the full path). If -cmd is not specified, then WekaPyScript will assume that the value is python. The output of this command is shown below in Listing 4.

Linear regression
We now present an example that uses Theano's automatic differentiation capability to train a linear regression model. We do not discuss the full script and instead present the gist of the example. To introduce some notation, let x = {x^(1), x^(2), ..., x^(n)} be the training examples, where x^(i) ∈ R^p, and y = {y^(1), y^(2), ..., y^(n)}, where y^(i) ∈ R. The sum-of-squares loss L(w, b) (Equation 1) is computed over these examples, where w ∈ R^p is the vector of coefficients for the linear regression model and b ∈ R is the intercept term. We can use gradient descent to iteratively update w and b, moving each against its partial derivative of the loss, scaled by a learning rate α (Equations 2 and 3). We repeat these updates until we reach a maximum number of epochs (i.e., scans through the training data) or until we reach convergence (with some tolerance ε). Fortunately, we do not need to manually compute the partial derivatives because Theano can do this for us. Listing 5 illustrates this.
In this code, which we would place into the train function of the script for PyScriptClassifier, we define our parameters w and b in lines 7-9, initialising w and b to zeros. In lines 12-13, we define our symbolic matrices x ∈ R^(n×p) and y ∈ R^(n×1), and in line 15, the output function h(x) = wx + b, where h(x) ∈ R^(n×1). In line 18, we finally compute the loss function in Equation 1, and in lines 20-21 we compute the gradients ∂L(w, b)/∂w and ∂L(w, b)/∂b. We define our learning rate α in line 23, and in line 24 we define the parameter updates as described in Equations 2 and 3. Finally, in line 26 we define the iter_train function: given some x and y (which can be the entire training set, or a mini-batch, or a single example), it will output the loss (Equation 1) and automatically update the parameters as per Equations 2 and 3.
We can run this example from a terminal session by executing:

    java weka.Run .PyScriptClassifier \
      -script scripts/linear-reg.py \
      -args "alpha=0.1;epsilon=0.00001" \
      -standardize \
      -t datasets/diabetes_numeric.arff \
      -no-cv

In this example we have used the -standardize flag to perform zero-mean unit-variance normalisation on all the numeric attributes. Also note that we did not have to explicitly specify alpha and epsilon, since the script has default values for these; this was done just to illustrate how arguments work. The output of this script is shown below in Listing 6.
Because we created a textual representation of the model with the describe function, we get the equation of the linear regression model in the output.
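For readers who want to see the gradient-descent procedure from this subsection in runnable form without Theano, the following is a minimal NumPy sketch. It deliberately replaces Theano's automatic differentiation with hand-derived gradients, and it assumes the loss is the mean of the squared residuals, L(w, b) = (1/n) Σ_i (w·x^(i) + b − y^(i))^2; the paper's Equation 1 may scale the sum differently. The function name fit_linear_regression, the stopping rule on the change in loss, and the synthetic data are all illustrative assumptions.

```python
import numpy as np


def fit_linear_regression(X, y, alpha=0.1, epsilon=1e-5, max_epochs=10000):
    """Batch gradient descent for y ~ X @ w + b (hypothetical helper)."""
    n, p = X.shape
    w = np.zeros(p)  # coefficients initialised to zeros, as in the text
    b = 0.0          # intercept initialised to zero
    prev_loss = np.inf
    for _ in range(max_epochs):
        residual = X @ w + b - y
        loss = np.mean(residual ** 2)
        # Assumed convergence test: stop once the loss barely changes.
        if abs(prev_loss - loss) < epsilon:
            break
        # Hand-derived gradients of the assumed mean-squared loss.
        grad_w = (2.0 / n) * (X.T @ residual)
        grad_b = 2.0 * residual.mean()
        w -= alpha * grad_w
        b -= alpha * grad_b
        prev_loss = loss
    return w, b


# Synthetic, noise-free data generated from known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.5
w, b = fit_linear_regression(X, y, epsilon=1e-12)
```

With automatic differentiation, the two gradient lines would be produced by the library from the loss expression, which is the convenience the text attributes to Theano.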

Standardise filter
Lastly, we present an example filter script that standardises all numeric attributes by subtracting the mean and dividing by the standard deviation. This is shown in Listing 7.
Listing 6: Output from linear-reg.py script.
In lines 11-18, we iterate through all attributes in the dataset and store the means and standard deviations for the numeric attributes. The "model" that we return in this script is a tuple of two lists (the means and standard deviations). In lines 26-28, we perform the standardisation. From there, we return the args object (which has changed due to the modification of X). We can run this example on the diabetes dataset; the output of this script is the transformed dataset. An excerpt of this, in WEKA's ARFF data format, is shown in Listing 8.

    @relation diabetes_numeric-weka.filters.pyscript.PyScriptFilter ...
    @attribute age numeric
    @attribute deficit numeric
    @attribute c_peptide numeric
    @data
    -0.952771, 0.006856, 4.8
    -0.057814, -1.116253, 4.1
    0.364805, 1.017655, 5.2
    0.389665, 0.048973, 5.5
    0.339945, -2.927268, 5
    ...
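The train and process steps described in this subsection can be sketched as follows. This is an illustration, not the package's own filter script: the keys X_train, X, attributes, and attr_types are assumptions about the contents of args, the population standard deviation is assumed, the guard against a zero standard deviation is an addition, and the small dataset is invented.

```python
import numpy as np


def train(args):
    # Assumed keys: "X_train" (data matrix), "attributes" (names in column
    # order), and "attr_types" (name -> "numeric" or "nominal").
    X = np.asarray(args["X_train"], dtype=float)
    means, sds = [], []
    for j, name in enumerate(args["attributes"]):
        if args["attr_types"][name] == "numeric":
            means.append(float(X[:, j].mean()))
            sds.append(float(X[:, j].std()))  # population SD (assumption)
        else:
            means.append(None)  # placeholder for non-numeric attributes
            sds.append(None)
    # The "model" is a tuple of two lists, as described in the text.
    return means, sds


def process(args, model):
    means, sds = model
    X = np.asarray(args["X"], dtype=float).copy()  # assumed key: "X"
    for j, (mean, sd) in enumerate(zip(means, sds)):
        # Skip non-numeric columns and constant columns (sd == 0).
        if mean is not None and sd > 0:
            X[:, j] = (X[:, j] - mean) / sd
    args["X"] = X
    return args


# Invented data: one numeric attribute and one nominal (index-coded) one.
args = {
    "attributes": ["age", "sex"],
    "attr_types": {"age": "numeric", "sex": "nominal"},
    "X_train": [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]],
    "X": [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]],
}
model = train(args)
result = process(args, model)
```

Because the statistics are stored in the model at training time, the same means and standard deviations are applied when process is later called on test data.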

Figure 1: The graphical user interface for PyScriptClassifier.

Figure 2: The graphical user interface for PyScriptFilter.
When PyScriptClassifier evaluates the model on test data, it runs the following code:

    cls = imp.load_source('test', <name of python script>)
    preds = cls.test(args, model)

In this example, test is a function that takes a variable called model in addition to args. This additional variable is the model that was previously returned by the train function. The test function returns an n × k Python list (i.e., not a NumPy array) in the case of classification (where n_i is the probability distribution over the k classes for the i-th test instance), and an n-long Python list in the case of regression. To get a textual representation of the model, users must also write a function called describe, which takes two arguments (the args object as described earlier, and the model itself) and returns some textual representation of the model (i.e., a string). This function is used as follows:

    cls = imp.load_source('describe', <name of python script>)
    model_description = cls.describe(args, model)

From the information described so far, the basic skeleton of a Python script implementing a classifier will look like what is shown in Listing 1.

Table 1: Options for PyScriptClassifier and PyScriptFilter (* = applicable only to PyScriptClassifier, ** = applicable only to PyScriptFilter). Note that the names in parentheses are the names of the options as shown in the Explorer GUI, as opposed to the terminal.

    Option                           Description
    -save (saveScript)               Save the script in the model? (E.g., do not dynamically load the script specified by -script at testing time.)
    -ignore-class (ignoreClass)**    Ignore class attribute? (See Table 2 for more information.)

Table 2: Data and meta-data variables passed into args (* = only applicable to PyScriptFilter).

    Variable                         Description
    X_train, y_train                 Data matrix and label vector for training data. If -ignore-class is set or the class attribute is not specified, y_train will not exist and the class values will instead be inside X_train as an extra column.