WekaPyScript: Classification, Regression, and Filter Schemes for WEKA Implemented in Python

Open Access | Aug 2016

Figures & Tables

Table 1

Options for PyScriptClassifier and PyScriptFilter (* = applicable only to PyScriptClassifier, ** = applicable only to PyScriptFilter). The names in parentheses are the option names shown in the Explorer GUI; the flags are the names used on the command line.

| Option | Description |
| --- | --- |
| -cmd (pythonCommand) | Name of the Python executable |
| -script (pythonFile) | Path to the Python script |
| -args (arguments) | Semicolon-separated list of arguments (variable assignments) to pass to the script when training or testing |
| -binarize (shouldBinarize)* | Should nominal attributes be converted to binary ones? |
| -impute (shouldImpute)* | Should missing values be imputed (with mean imputation)? |
| -standardize (shouldStandardize)* | Should attributes be standardised? (If imputation is set then this is done after it) |
| -stdout (printStdOut) | Print any stdout from Python script? |
| -save (saveScript) | Save the script in the model? (E.g., do not dynamically load the script specified by -script at testing time) |
| -ignore-class (ignoreClass)** | Ignore class attribute? (See Table 2 for more information.) |
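
The -args option takes a semicolon-separated list of variable assignments. The sketch below shows what such a string might look like and one way it could be turned into a dictionary. `parse_args_string` is a hypothetical helper written for this illustration, not WekaPyScript's own parser, and the argument names `alpha`, `epochs`, and `schedule` are invented.

```python
import ast


def parse_args_string(arg_string):
    """Turn 'a=1;b=0.5' into {'a': 1, 'b': 0.5} (illustrative helper only)."""
    parsed = {}
    for assignment in arg_string.split(";"):
        assignment = assignment.strip()
        if not assignment:
            continue  # tolerate a trailing semicolon
        name, _, value = assignment.partition("=")
        # literal_eval accepts Python literals only, so arbitrary code is rejected
        parsed[name.strip()] = ast.literal_eval(value.strip())
    return parsed


# Hypothetical argument string, as it might be given to -args
example = parse_args_string("alpha=0.01;epochs=500;schedule=[10, 20]")
print(example)
```

Splitting on `;` is a simplification: a string value that itself contains a semicolon would need a more careful tokeniser.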
Figure 1

The graphical user interface for PyScriptClassifier.

Figure 2

The graphical user interface for PyScriptFilter.

Table 2

Data and meta-data variables passed into args (* = only applicable to PyScriptFilter).

| Variable(s) | Description | Type |
| --- | --- | --- |
| X_train, y_train | Data matrix and label vector for training data. If -ignore-class is set or the class attribute is not specified, y_train will not exist; the class values will instead appear in X_train as an extra column | numpy.ndarray (float64), numpy.ndarray (int64) |
| X, y* | Data matrix and label vector for data when PyScriptFilter calls the process method (see Listing 2) | numpy.ndarray (float64), numpy.ndarray (int64) |
| X_test | Data matrix for testing data | numpy.ndarray (float64) |
| relation_name | Relation name of ARFF | str |
| class_type | Type of class attribute (e.g., numeric, nominal) | str |
| num_classes | Number of classes | int |
| attributes | Names of attributes | list |
| class | Name of class attribute | str |
| attr_values | Dictionary mapping nominal/string attributes to their values | dict |
| attr_types | Dictionary mapping attribute names to their types (possible values are either nominal or numeric) | dict |
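
When developing a script outside WEKA, it can help to build a small stand-in for the args dictionary by hand. The sketch below is one such mock, using the keys from the table above. The attribute names and data are invented, and several details are assumptions rather than documented behaviour: that y_train is an n-by-1 column, that nominal values are stored as indices, and that the attributes list excludes the class.

```python
import numpy as np

# Hypothetical stand-in for the dictionary a script receives.
# All names and numbers here are invented for illustration.
args = {
    "X_train": np.array([[5.1, 0.0], [4.9, 1.0], [6.3, 0.0]], dtype=np.float64),
    "y_train": np.array([[0], [0], [1]], dtype=np.int64),  # assumed n-by-1
    "relation_name": "toy",
    "class_type": "nominal",
    "num_classes": 2,
    "attributes": ["length", "colour"],
    "class": "species",
    "attr_values": {"colour": ["red", "blue"]},
    "attr_types": {"length": "numeric", "colour": "nominal"},
}

# Indices of the numeric columns, looked up through the meta-data
numeric_cols = [
    i for i, name in enumerate(args["attributes"])
    if args["attr_types"][name] == "numeric"
]

# y_train is absent when the class is ignored or unset, so check before use
has_labels = "y_train" in args

print(numeric_cols, has_labels, args["X_train"][:, numeric_cols].shape)
```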
Listing 1

Skeleton of a Python script for PyScriptClassifier.

def train(args):
    # code for training model
    pass

def test(args, model):
    # code for running model on new instances
    pass

def describe(args, model):
    # textual representation of model
    pass
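
The skeleton implies a calling order: train produces a model object, which is then handed back to test and describe. The sketch below mimics that order with a hypothetical driver. Everything in it is an assumption made for illustration: `run_script` is not part of WekaPyScript, the class-prior classifier is a toy, the functions carry a `prior_` prefix to keep them distinct from a real script, and plain lists stand in for NumPy arrays so the example has no dependencies.

```python
from collections import Counter


def prior_train(args):
    # Model = relative frequency of each class label in the training data
    counts = Counter(args["y_train"])
    total = float(len(args["y_train"]))
    return [counts[c] / total for c in range(args["num_classes"])]


def prior_test(args, model):
    # One probability distribution per test instance
    return [list(model) for _ in args["X_test"]]


def prior_describe(args, model):
    return "Class priors: %s" % model


def run_script(train_fn, test_fn, describe_fn, train_args, test_args):
    """Hypothetical driver: train once, then reuse the returned model."""
    model = train_fn(train_args)
    return describe_fn(train_args, model), test_fn(test_args, model)


train_args = {"y_train": [0, 0, 1, 2], "num_classes": 3}
test_args = {"X_test": [[1.0], [2.0]], "num_classes": 3}
summary, predictions = run_script(
    prior_train, prior_test, prior_describe, train_args, test_args)
print(summary)
print(predictions)
```

The point of the design is that the model is an ordinary Python object returned by the training function, so the other two functions need no shared global state.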
Listing 2

Skeleton of a Python script for PyScriptFilter.

def train(args):
    # code for training filter
    pass

def process(args, model):
    # code for processing instances (training or testing)
    pass
Listing 3

Python implementation of ZeroR.

from collections import Counter
import numpy as np

def train(args):
    # flatten y_train to 1-D before counting labels
    y_train = args["y_train"].flatten()
    counter = Counter(y_train)
    # the model is simply the index of the most frequent class
    return int(counter.most_common(1)[0][0])

def describe(args, model):
    return "Majority class: %i" % model

def test(args, model):
    num_classes = args["num_classes"]
    n = args["X_test"].shape[0]
    # one-hot distribution with all probability on the majority class
    majority_cls = np.eye(num_classes)[model].tolist()
    return [majority_cls for _ in range(n)]
Listing 4

Output from zeror.py script.

Options: -script scripts/zeror.py

Majority class: 0

Time taken to build model: 2.54 seconds
Time taken to test model on training data: 0.02 seconds

=== Error on training data ===

Correctly Classified Instances                         50                                       33.3333 %
Incorrectly Classified Instances                      100                                       66.6667 %
Kappa statistic                                         0
Mean absolute error                                     0.4444
Root mean squared error                                 0.6667
Relative absolute error                               100      %
Root relative squared error                           141.4214 %
Coverage of cases (0.95 level)                         33.3333 %
Mean rel. region size (0.95 level)                     33.3333 %
Total Number of Instances                             150

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    …
                 1.000    1.000    0.333      1.000   0.500      0.000  …
                 0.000    0.000    0.000      0.000   0.000      0.000  …
                 0.000    0.000    0.000      0.000   0.000      0.000  …
Weighted Avg.    0.333    0.333    0.111      0.333   0.167      …

=== Confusion Matrix ===

  a   b   c   <-- classified as
 50   0   0 |   a = Iris-setosa
 50   0   0 |   b = Iris-versicolor
 50   0   0 |   c = Iris-virginica
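
Several figures in this output can be reproduced by hand from the confusion matrix, which is a useful sanity check on a new script. The sketch below does the arithmetic. It assumes the absolute and squared errors are averaged over the three class slots of each one-hot prediction; that assumption is not stated in the output, but it reproduces the reported 0.4444 and 0.6667.

```python
import math

num_classes = 3
# Confusion matrix from the listing: every instance is predicted as class a
confusion = [[50, 0, 0], [50, 0, 0], [50, 0, 0]]
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(num_classes))
accuracy = correct / float(total)

# A wrong one-hot prediction differs from the one-hot truth in two of the
# three class slots, so its per-class mean absolute (and squared) error is 2/3.
wrong = total - correct
per_instance_error = 2.0 / num_classes
mae = wrong * per_instance_error / total
rmse = math.sqrt(wrong * per_instance_error / total)

# Precision and F-measure for class a (recall is 1.0), then weighted by
# class support (50 of 150); the other two classes contribute zero.
precision_a = confusion[0][0] / float(sum(row[0] for row in confusion))
f_measure_a = 2 * precision_a * 1.0 / (precision_a + 1.0)
weighted_precision = precision_a * 50 / total
weighted_f = f_measure_a * 50 / total

print(round(accuracy, 4), round(mae, 4), round(rmse, 4))
print(round(weighted_precision, 3), round(weighted_f, 3))
```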
Listing 5

Optimising sum-of-squares loss in Theano.

import theano
from theano import tensor as T
import numpy as np

# assume 5 attributes for this example
num_attributes = 5
w = theano.shared(
    np.zeros((num_attributes, 1)), name='w')
b = theano.shared(0.0, name='b')

# let x be an n*p matrix, and y be an n*1 matrix
x = T.dmatrix('x')
y = T.dmatrix('y')
# prediction is simply xw + b
out = T.dot(x, w) + b

# loss function is mean squared error
loss = T.mean((out - y)**2)
# compute gradient of cost w.r.t. w and b
g_w = T.grad(cost = loss, wrt = w)
g_b = T.grad(cost = loss, wrt = b)

alpha = 0.01
updates = [(w, w - alpha * g_w), (b, b - alpha * g_b)]

iter_train = theano.function(
    [x, y], outputs=loss, updates=updates)
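
The listing leaves the gradients to Theano's symbolic differentiation. As a cross-check, the sketch below writes the same update in plain NumPy, with the gradients of the mean-squared-error loss derived by hand. The function `iter_train_numpy`, the synthetic data, and the iteration count are assumptions made for this illustration and do not come from the original script.

```python
import numpy as np


def iter_train_numpy(x, y, w, b, alpha=0.01):
    """One gradient-descent step on mean squared error; returns (loss, w, b)."""
    n = x.shape[0]
    residual = x.dot(w) + b - y          # shape (n, 1)
    loss = np.mean(residual ** 2)        # loss evaluated before the update
    g_w = (2.0 / n) * x.T.dot(residual)  # d loss / d w, shape (p, 1)
    g_b = 2.0 * np.mean(residual)        # d loss / d b, scalar
    return loss, w - alpha * g_w, b - alpha * g_b


# Synthetic, noiseless regression problem with 5 attributes
rng = np.random.RandomState(0)
x = rng.randn(100, 5)
true_w = np.arange(1.0, 6.0).reshape(5, 1)
y = x.dot(true_w) + 0.5

w, b = np.zeros((5, 1)), 0.0
losses = []
for _ in range(2000):
    loss, w, b = iter_train_numpy(x, y, w, b)
    losses.append(loss)

print(losses[0], losses[-1])
```

Because the data are noiseless, the loss should shrink towards zero and the weights should approach `true_w`, which makes a mistake in either gradient easy to spot.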
Listing 6

Output from linear-reg.py script.

Options: -script scripts/linear-reg.py…

f(x)=
   age *0.266773099848 +
   deficit *0.289990210412 +
   4.74354333559

Time taken to build model: 8.49 seconds
Time taken to test model on training data: 1.18 seconds

=== Error on training data ===

Correlation coefficient              0.607
Mean absolute error                  0.448
Root mean squared error              0.5659
Relative absolute error             82.3838 %
Root relative squared error         79.4711 %
Coverage of cases (0.95 level)       0      %
Mean rel. region size (0.95 level)   0      %
Total Number of Instances           43
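
The model description at the top of this output is the kind of text a describe function returns. The source of linear-reg.py is not shown here, so the sketch below is only a guess at how such a string could be assembled; `format_linear_model` is a hypothetical helper, and the coefficients are copied from the output above purely as sample input.

```python
def format_linear_model(attributes, weights, intercept):
    """Build a multi-line 'f(x)=' description of a linear model."""
    lines = ["f(x)="]
    for name, weight in zip(attributes, weights):
        lines.append("   %s * %s +" % (name, weight))
    lines.append("   %s" % intercept)
    return "\n".join(lines)


description = format_linear_model(
    ["age", "deficit"],
    [0.266773099848, 0.289990210412],
    4.74354333559,
)
print(description)
```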
Listing 7

Standardise filter in Python.

from wekapyscript import \
     ArffToArgs, get_header, instance_to_string
import numpy as np

def train(args):
    X_train = args["X_train"]
    means = []
    sds = []
    attr_types = args["attr_types"]
    attributes = args["attributes"]
    for i in range(0, X_train.shape[1]):
        if attr_types[attributes[i]] == "numeric":
            # NaN-aware statistics, so NaN entries are skipped
            means.append(np.nanmean(X_train[:, i]))
            # ddof=1 gives the sample standard deviation
            sds.append(
                np.nanstd(X_train[:, i], ddof=1))
        else:
            # placeholders keep the lists aligned with column indices
            means.append(None)
            sds.append(None)
    return (means, sds)

def process(args, model):
    X = args["X"]
    attr_types = args["attr_types"]
    attributes = args["attributes"]
    means, sds = model
    for i in range(0, X.shape[1]):
        if attr_types[attributes[i]] == "numeric":
            # X is modified in place, so the returned args carries the result
            X[:, i] = (X[:, i] - means[i]) / sds[i]
    return args
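
A small worked example makes the two NumPy calls in this filter concrete. The sketch below applies the same formula to a single toy column containing a missing value; the numbers are invented for illustration.

```python
import numpy as np

# Toy column with a missing value
column = np.array([1.0, 2.0, np.nan, 3.0])

mean = np.nanmean(column)            # NaN is ignored: (1 + 2 + 3) / 3 = 2.0
sd = np.nanstd(column, ddof=1)       # sample SD: sqrt((1 + 0 + 1) / 2) = 1.0
standardised = (column - mean) / sd  # the NaN entry stays NaN

print(mean, sd, standardised)
```

One case the filter does not guard against is a constant column: its standard deviation is zero, so the division would produce NaN or infinite values rather than a standardised column.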
Listing 8

Output from standardise.py script.

@relation diabetes_numeric-weka.filters.pyscript.PyScriptFilter …

@attribute age numeric
@attribute deficit numeric
@attribute c_peptide numeric

@data

-0.952771, 0.006856, 4.8
-0.057814, -1.116253, 4.1
0.364805, 1.017655, 5.2
0.389665, 0.048973, 5.5
0.339945, -2.927268, 5
…
DOI: https://doi.org/10.5334/jors.108 | Journal eISSN: 2049-9647
Language: English
Submitted on: Dec 9, 2015
Accepted on: Jul 1, 2016
Published on: Aug 8, 2016
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2016 Christopher Beckham, Mark Hall, Eibe Frank, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.