WekaPyScript: Classification, Regression, and Filter Schemes for WEKA Implemented in Python

Christopher Beckham; Mark Hall; Eibe Frank

doi:10.5334/jors.108

Figures & Tables

Table 1

Options for PyScriptClassifier and PyScriptFilter (* = applicable only to PyScriptClassifier, ** = applicable only to PyScriptFilter). The names in parentheses are the option names shown in the Explorer GUI; the flags are the names used on the command line.

Option	Description
`-cmd` (pythonCommand)	Name of the Python executable
`-script` (pythonFile)	Path to the Python script
`-args` (arguments)	Semicolon-separated list of arguments (variable assignments) to pass to the script when training or testing
`-binarize` (shouldBinarize)*	Should nominal attributes be converted to binary ones?
`-impute` (shouldImpute)*	Should missing values be imputed (with mean imputation)?
`-standardize` (shouldStandardize)*	Should attributes be standardised? (If imputation is set then this is done after it)
`-stdout` (printStdOut)	Print any stdout from Python script?
`-save` (saveScript)	Save the script in the model? (I.e., do not dynamically load the script specified by `-script` at testing time)
`-ignore-class` (ignoreClass)**	Ignore class attribute? (See Table 2 for more information.)
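
The `-args` format in Table 1 (semicolon-separated variable assignments) can be illustrated with a small parser. This is only a sketch: the helper `parse_args_string` and the example argument names are hypothetical, and the table does not say how the package itself evaluates the string. It assumes each value is a plain Python literal and that no value contains a semicolon.

```python
import ast

def parse_args_string(arg_string):
    """Turn 'a=1; b="x"' into {'a': 1, 'b': 'x'} (hypothetical helper)."""
    parsed = {}
    for assignment in arg_string.split(";"):
        assignment = assignment.strip()
        if not assignment:
            continue  # tolerate a trailing semicolon
        # split on the first '=' only, so values may contain '='
        name, _, value = assignment.partition("=")
        # literal_eval accepts numbers, strings, lists, dicts, None, ...
        parsed[name.strip()] = ast.literal_eval(value.strip())
    return parsed

example = parse_args_string("alpha=0.01; epochs=100; penalty='l2'")
print(example)
```

Using `ast.literal_eval` rather than `eval` keeps the sketch from executing arbitrary code passed on the command line.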

The graphical user interface for PyScriptClassifier.

The graphical user interface for PyScriptFilter.

Table 2

Data and meta-data variables passed into args (* = only applicable to PyScriptFilter).

Variable(s)	Description	Type
`X_train, y_train`	Data matrix and label vector for training data. If `-ignore-class` is set or the class attribute is not specified, `y_train` will not exist and will instead be inside `X_train` as an extra column	`numpy.ndarray (float64), numpy.ndarray (int64)`
`X, y*`	Data matrix and label vector for data, when PyScriptFilter calls the `process` method (see Listing 2)	`numpy.ndarray (float64), numpy.ndarray (int64)`
`X_test`	Data matrix for testing data	`numpy.ndarray (float64)`
`relation_name`	Relation name of ARFF	`str`
`class_type`	Type of class attribute (e.g., numeric, nominal)	`str`
`num_classes`	Number of classes	`int`
`attributes`	Names of attributes	`list`
`class`	Name of class attribute	`str`
`attr_values`	Dictionary mapping nominal/string attributes to their values	`dict`
`attr_types`	Dictionary mapping attribute names to their types (possible values are `nominal` or `numeric`)	`dict`
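
For testing a script outside WEKA, a stand-in for `args` can be built by hand from the keys in Table 2. The following is a minimal sketch with made-up toy values. The helper name `make_mock_args`, the attribute names, the n*1 shape of `y_train`, and the exact contents of `attributes` and `attr_values` are all assumptions for illustration, not the package's documented behaviour.

```python
import numpy as np

def make_mock_args():
    """Build a toy args dictionary using the keys from Table 2 (values are made up)."""
    X_train = np.array([[5.1, 3.5], [7.0, 3.2], [6.3, 3.3], [4.9, 3.0]],
                       dtype=np.float64)
    # Assumption: labels arrive as an n*1 column of class indices.
    y_train = np.array([[0], [1], [2], [0]], dtype=np.int64)
    return {
        "X_train": X_train,
        "y_train": y_train,
        "X_test": X_train.copy(),
        "relation_name": "toy",
        "class_type": "nominal",
        "num_classes": 3,
        "attributes": ["length", "width"],           # hypothetical names
        "class": "species",                          # hypothetical name
        "attr_values": {"species": ["a", "b", "c"]},
        "attr_types": {"length": "numeric", "width": "numeric"},
    }

args = make_mock_args()
for key in sorted(args):
    value = args[key]
    # arrays report their shape; other values report None
    print(key, type(value).__name__, getattr(value, "shape", None))
```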

Listing 1

Skeleton of a Python script for PyScriptClassifier.

def train(args):
    # code for training model
    pass

def test(args, model):
    # code for running model on new instances
    pass

def describe(args, model):
    # textual representation of model
    pass

Listing 2

Skeleton of a Python script for PyScriptFilter.

def train(args):
    # code for training filter
    pass

def process(args, model):
    # code for processing instances (training or testing)
    pass

Listing 3

Python implementation of ZeroR.

from collections import Counter
import numpy as np

def train(args):
    # flatten the label vector to 1-D before counting
    y_train = args["y_train"].flatten()
    counter = Counter(y_train)
    # the model is simply the index of the most frequent class
    return counter.most_common()[0][0]

def describe(args, model):
    return "Majority class: %i" % model

def test(args, model):
    num_classes = args["num_classes"]
    n = args["X_test"].shape[0]
    # one-hot distribution with all probability on the majority class
    majority_cls = np.eye(num_classes)[model].tolist()
    return [majority_cls for _ in range(n)]
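
The two idioms that Listing 3 relies on can be tried in isolation. The sketch below uses made-up labels with three equally frequent classes. It assumes Python 3.7 or later, where `Counter.most_common` orders elements with equal counts by first occurrence, so a tie resolves to the class seen first.

```python
from collections import Counter
import numpy as np

# toy labels: three classes with 50 instances each (made-up data)
y_train = np.array([0] * 50 + [1] * 50 + [2] * 50, dtype=np.int64)

# most_common() returns (label, count) pairs sorted by count;
# [0][0] picks the label of the first pair
majority = Counter(y_train).most_common()[0][0]

# indexing an identity matrix by a class index yields that class's one-hot row
one_hot = np.eye(3)[majority].tolist()

# one identical distribution per test instance (4 instances here)
predictions = [one_hot for _ in range(4)]
print(majority, one_hot, len(predictions))
```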

Listing 4

Output from zeror.py script.

Options: -script scripts/zeror.py

Majority class: 0

Time taken to build model: 2.54 seconds
Time taken to test model on training data: 0.02 seconds

=== Error on training data ===

Correctly Classified Instances                         50                                       33.3333 %
Incorrectly Classified Instances                      100                                       66.6667 %
Kappa statistic                                         0
Mean absolute error                                     0.4444
Root mean squared error                                 0.6667
Relative absolute error                               100      %
Root relative squared error                           141.4214 %
Coverage of cases (0.95 level)                         33.3333 %
Mean rel. region size (0.95 level)                     33.3333 %
Total Number of Instances                             150

=== Detailed Accuracy By Class ===
                 TP Rate   FP Rate   Precision   Recall   F-Measure   MCC     …
                 1.000     1.000     0.333       1.000    0.500       0.000   …
                 0.000     0.000     0.000       0.000    0.000       0.000   …
                 0.000     0.000     0.000       0.000    0.000       0.000   …
Weighted Avg.    0.333     0.333     0.111       0.333    0.167       …

=== Confusion Matrix ===

 a    b    c   <-- classified as
50    0    0 |  a = Iris-setosa
50    0    0 |  b = Iris-versicolor
50    0    0 |  c = Iris-virginica
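
The summary figures in Listing 4 can be checked by hand. The sketch below rebuilds the situation with 150 instances in three balanced classes, each given the one-hot prediction for class 0. It assumes the absolute and squared errors are averaged over every instance and every class of the predicted distribution, and that the relative errors are measured against a baseline predicting uniform class priors. These are assumptions about how the evaluation is computed, but under them the listing's values of 33.3333 %, 0.4444, 0.6667, 100 % and 141.4214 % are reproduced.

```python
import numpy as np

num_classes, n_per_class = 3, 50
y_true = np.repeat(np.arange(num_classes), n_per_class)

# one-hot encoding of the true classes, shape (150, 3)
actual = np.eye(num_classes)[y_true]
# every instance receives the distribution [1, 0, 0]
predicted = np.tile(np.eye(num_classes)[0], (y_true.size, 1))
# assumed baseline: uniform class priors [1/3, 1/3, 1/3]
prior = np.full_like(actual, 1.0 / num_classes)

accuracy = np.mean(predicted.argmax(axis=1) == y_true)
mae = np.mean(np.abs(predicted - actual))
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
rel_abs = 100.0 * mae / np.mean(np.abs(prior - actual))
rel_sq = 100.0 * rmse / np.sqrt(np.mean((prior - actual) ** 2))

print("accuracy %.4f %%" % (100.0 * accuracy))
print("MAE %.4f, RMSE %.4f" % (mae, rmse))
print("relative absolute error %.4f %%" % rel_abs)
print("root relative squared error %.4f %%" % rel_sq)
```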

Listing 5

Optimising sum-of-squares loss in Theano.

import theano
from theano import tensor as T
import numpy as np

# assume 5 attributes for this example
num_attributes = 5
w = theano.shared(
    np.zeros((num_attributes, 1)), name='w')
b = theano.shared(0.0, name='b')

# let x be an n*p matrix, and y be an n*1 matrix
x = T.dmatrix('x')
y = T.dmatrix('y')
# prediction is simply xw + b
out = T.dot(x, w) + b

# loss function is mean squared error
loss = T.mean((out - y)**2)
# compute gradient of cost w.r.t. w and b
g_w = T.grad(cost = loss, wrt = w)
g_b = T.grad(cost = loss, wrt = b)

alpha = 0.01
updates = [(w, w - alpha * g_w), (b, b - alpha * g_b)]

iter_train = theano.function(
    [x, y], outputs=loss, updates=updates)
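
For readers without Theano, the same update rule can be written in plain NumPy with the gradients derived by hand: for the mean squared error loss, the gradient with respect to w is (2/n) x^T (xw + b - y) and the gradient with respect to b is 2 * mean(xw + b - y). This is a minimal sketch and not the script used in the paper. The function name `gd_step`, the synthetic data, and the iteration count are assumptions for illustration.

```python
import numpy as np

def gd_step(w, b, x, y, alpha=0.01):
    """One gradient-descent step on mean squared error (hypothetical helper)."""
    err = x @ w + b - y                # residuals, shape (n, 1)
    loss = np.mean(err ** 2)           # loss before the update, as in Listing 5
    g_w = 2.0 * x.T @ err / x.shape[0]
    g_b = 2.0 * np.mean(err)
    return loss, w - alpha * g_w, b - alpha * g_b

# synthetic data with 5 attributes (made up for the example)
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))
y = x @ rng.normal(size=(5, 1)) + 0.5

w, b = np.zeros((5, 1)), 0.0
losses = []
for _ in range(200):
    loss, w, b = gd_step(w, b, x, y)
    losses.append(loss)

print("first loss %.4f, last loss %.4f" % (losses[0], losses[-1]))
```

Returning the updated parameters instead of mutating shared state keeps each step easy to test, at the cost of differing from Theano's `updates` mechanism.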

Listing 6

Output from linear-reg.py script.

Options: -script scripts/linear-reg.py…

f(x)=
   age *0.266773099848 +
   deficit *0.289990210412 +
   4.74354333559

Time taken to build model: 8.49 seconds
Time taken to test model on training data: 1.18 seconds

=== Error on training data ===

Correlation coefficient              0.607
Mean absolute error                  0.448
Root mean squared error              0.5659
Relative absolute error             82.3838 %
Root relative squared error         79.4711 %
Coverage of cases (0.95 level)       0      %
Mean rel. region size (0.95 level)   0      %
Total Number of Instances           43

Listing 7

Standardise filter in Python.

from wekapyscript import \
     ArffToArgs, get_header, instance_to_string
import numpy as np

def train(args):
    X_train = args["X_train"]
    means = []
    sds = []
    attr_types = args["attr_types"]
    attributes = args["attributes"]
    for i in range(0, X_train.shape[1]):
        if attr_types[attributes[i]] == "numeric":
            # nanmean/nanstd ignore NaN entries in the column
            means.append(np.nanmean(X_train[:,i]))
            sds.append(
                np.nanstd(X_train[:,i],ddof=1))
        else:
            # placeholders keep the lists aligned with the columns
            means.append(None)
            sds.append(None)
    return (means, sds)

def process(args, model):
    X = args["X"]
    attr_types = args["attr_types"]
    attributes = args["attributes"]
    means, sds = model
    for i in range(0, X.shape[1]):
        if attr_types[attributes[i]] == "numeric":
            # X is modified in place, so args["X"] is updated too
            X[:,i] = (X[:,i] - means[i]) / sds[i]
    return args
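
The column loop in Listing 7 can also be expressed with a boolean mask over the numeric columns. The following is a vectorised sketch of the same idea, not the package's code. The function names, the mask argument, and the toy matrix are assumptions, and unlike the listing it returns a copy rather than modifying the input in place.

```python
import numpy as np

def fit_standardiser(X, numeric_mask):
    """Per-column mean and sample standard deviation, ignoring NaN entries."""
    means = np.nanmean(X[:, numeric_mask], axis=0)
    sds = np.nanstd(X[:, numeric_mask], axis=0, ddof=1)
    return means, sds

def apply_standardiser(X, numeric_mask, means, sds):
    """Return a standardised copy; non-numeric columns are left untouched."""
    out = X.astype(float)  # astype copies by default
    out[:, numeric_mask] = (out[:, numeric_mask] - means) / sds
    return out

# toy data: columns 0 and 2 numeric, column 1 a nominal index (made up)
X = np.array([[1.0, 0.0, 10.0],
              [2.0, 1.0, np.nan],
              [3.0, 0.0, 30.0],
              [4.0, 1.0, 20.0]])
mask = np.array([True, False, True])

means, sds = fit_standardiser(X, mask)
Z = apply_standardiser(X, mask, means, sds)
print(Z)
```

Missing values stay missing after the transformation, since any arithmetic on NaN yields NaN.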

Listing 8

Output from standardise.py script.

@relation diabetes_numeric-weka.filters.pyscript.PyScriptFilter …

@attribute age numeric
@attribute deficit numeric
@attribute c_peptide numeric

@data

-0.952771, 0.006856, 4.8
-0.057814, -1.116253, 4.1
0.364805, 1.017655, 5.2
0.389665, 0.048973, 5.5
0.339945, -2.927268, 5
…