User Manual
A comprehensive user manual is here. This page contains a few examples for basic learning pipelines and is a great place to start.
GURLS (GURLS++) basically consists of a set of tasks, each one belonging to a predefined category, and of a method (a class in the C++ implementation) called GURLS Core that is responsible for processing an ordered sequence of tasks called a pipeline. An additional "options structure", often referred to as OPT, is used to store all the configuration parameters needed to customize the behaviour of the tasks. Tasks receive configuration parameters from the options structure in read-only mode and, after terminating, their results are appended to the structure by the GURLS Core in order to make them available to the subsequent tasks. This allows the user to easily skip the execution of some tasks in a pipeline, by simply inserting the desired results directly into the options structure. All tasks belonging to the same category can be interchanged with each other, so that the user can easily choose how each task shall be carried out.
Gurls Design
The gurls command accepts exactly four arguments:
- The data points, stored in an NxD matrix (N samples, D variables).
- The labels, encoded (for One-vs-All classification) in an NxT matrix.
- An options structure.
- A job-id number.
The three main fields of the options structure are:
- opt.name: defines a name for a given experiment.
- opt.seq: specifies the sequence of tasks to be executed.
- opt.process: specifies what to do with each task, using the following codes:
- 0 = Ignore
- 1 = Compute
- 2 = Compute and save
- 3 = Load from file
- 4 = Explicitly delete
A pipeline is specified by assigning the field seq of the options structure as
{'<CATEGORY1>:<TASK1>';'<CATEGORY2>:<TASK2>';...}
These tasks can be combined in order to build different train-test pipelines. The most popular learning pipelines are outlined in the following.
We want to run the training on a dataset {Xtr,ytr} and the test on a different dataset {Xte,yte}. We are interested in the precision-recall performance measure as well as the average classification accuracy. In order to train a linear classifier using a leave-one-out cross-validation approach, we just need the following lines of code:
name = 'ExampleExperiment';
opt = defopt(name);
opt.seq = {'paramsel:loocvprimal','rls:primal','pred:primal','perf:macroavg'};
opt.process{1} = [2,2,0,0];
opt.process{2} = [3,3,2,2];
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
The meaning of the above code fragment is the following:
- For the training data: compute the regularization parameter lambda by minimizing the classification error via leave-one-out cross-validation and save the result; solve RLS for a linear classifier in the primal space and save the solution; ignore the remaining tasks.
- For the test data: load the previously computed lambda (useful if you want to keep this value for further reference) and the classifier, predict the outputs on the test set and save them, then evaluate the performance measure 'macroavg' and save it.
The field opt.name is implicitly assigned by the defopt function, which sets it to its only input argument. The fields opt.seq and opt.process have to be explicitly assigned.
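As noted above, a task can be skipped by inserting its result directly into the options structure. The following is a minimal sketch of this mechanism, assuming the regularization parameter is already known; the field name opt.paramsel.lambdas is the one read by the rls tasks in recent GURLS releases, so check your version if the pipeline complains.
name = 'SkipParamselExample'; % hypothetical experiment name
opt = defopt(name);
opt.seq = {'paramsel:loocvprimal','rls:primal','pred:primal','perf:macroavg'};
opt.paramsel.lambdas = 1e-3; % hand-picked lambda (illustrative value)
opt.process{1} = [0,2,0,0];  % 0 = ignore: the paramsel task is never run
opt.process{2} = [0,3,2,2];
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
The rls:primal task then uses the lambda stored in the options structure instead of one produced by a parameter selection task.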
In the next example the data are z-score normalized and the regularization parameter is selected via hold-out cross-validation:
name = 'ExampleExperiment';
opt = defopt(name);
[Xtr] = norm_zscore(Xtr, ytr, opt);
[Xte] = norm_testzscore(Xte, yte, opt);
opt.seq = {'split:ho','paramsel:hoprimal','rls:primal','pred:primal','perf:macroavg'};
opt.process{1} = [2,2,2,0,0];
opt.process{2} = [3,3,3,2,2];
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
Here the training set is first normalized and the column-wise means and standard deviations are saved to file. Then the test data are normalized according to the statistics computed on the training set.
The same linear classifier can be trained in the dual formulation, building a linear kernel and selecting the regularization parameter via leave-one-out cross-validation:
name = 'ExampleExperiment';
opt = defopt(name);
opt.seq = {'kernel:linear','paramsel:loocvdual','rls:dual','pred:dual','perf:macroavg'};
opt.process{1} = [2,2,2,0,0];
opt.process{2} = [3,3,3,2,2];
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
GURLS can also be used for regression:
name = 'ExampleExperiment';
opt = defopt(name);
opt.seq = {'paramsel:hoprimal','rls:primal','pred:primal','perf:rmse'};
opt.process{1} = [2,2,0,0];
opt.process{2} = [3,3,2,2];
opt.hoperf = @perf_rmse;
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
Note that the objective function used for parameter selection, opt.hoperf, is explicitly set to @perf_rmse, i.e. the root mean square error, whereas in the first example it is left to its default @perf_macroavg, which evaluates the average classification accuracy per class. The same code can be used for multiple output regression.
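As a concrete illustration of the multiple output case, the following sketch runs the very same pipeline on synthetic data (sizes and noise level are arbitrary choices): the labels are simply stored as an NxT matrix, one column per output.
N = 200; D = 10; T = 3; % assumed sizes for the toy problem
W = randn(D, T);        % ground-truth linear map
Xtr = randn(N, D); ytr = Xtr*W + 0.1*randn(N, T);
Xte = randn(N, D); yte = Xte*W + 0.1*randn(N, T);
name = 'MultiOutputRegressionExample'; % hypothetical experiment name
opt = defopt(name);
opt.seq = {'paramsel:hoprimal','rls:primal','pred:primal','perf:rmse'};
opt.process{1} = [2,2,0,0];
opt.process{2} = [3,3,2,2];
opt.hoperf = @perf_rmse;
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)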
A nonlinear classifier based on the Gaussian kernel is trained as follows:
name = 'ExampleExperiment';
opt = defopt(name);
opt.seq = {'paramsel:siglam','kernel:rbf','rls:dual','predkernel:traintest','pred:dual','perf:macroavg'};
opt.process{1} = [2,2,2,0,0,0];
opt.process{2} = [3,3,3,2,2,2];
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
Here parameter selection for the Gaussian kernel requires choosing both the regularization parameter λ and the kernel parameter σ, and is performed by selecting the task siglam for the category paramsel. Once the value of the kernel parameter σ has been chosen, the Gaussian kernel is built through the kernel task with option rbf.
The same Gaussian kernel classifier can be trained selecting both parameters via hold-out validation:
name = 'ExampleExperiment';
opt = defopt(name);
opt.seq = {'split:ho','paramsel:siglamho','kernel:rbf','rls:dual','predkernel:traintest','pred:dual','perf:macroavg'};
opt.process{1} = [2,2,2,2,0,0,0];
opt.process{2} = [3,3,3,3,2,2,2];
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
The following example trains a linear classifier with a stochastic gradient descent optimizer:
name = 'ExampleExperiment';
opt = defopt(name);
opt.seq = {'paramsel:calibratesgd','rls:pegasos','pred:primal','perf:macroavg'};
opt.process{1} = [2,2,0,0];
opt.process{2} = [3,3,2,2];
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
Here the optimization is carried out using a stochastic gradient descent algorithm, namely Pegasos (Shalev-Shwartz, Singer and Srebro, 2007).
Finally, a classifier can be trained using random features:
name = 'ExampleExperiment';
opt = defopt(name);
opt.seq = {'split:ho','paramsel:horandfeats','kernel:randfeats','rls:randfeats','pred:randfeats','perf:macroavg'};
opt.process{1} = [2,2,2,2,0,0];
opt.process{2} = [3,3,3,3,2,2];
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
This computes a classifier for the primal formulation of RLS using the random features approach proposed by Rahimi and Recht (2007). In this approach the primal formulation is used in a new space built through random projections of the input data.
As GURLS and GURLS++ share the same design, we refer to the section GURLS Usage and here describe only what changes in the C++ implementation.
In C++ the counterpart of the gurls function is the GURLS class, with its only method run, whereas the defopt function has its equivalent in the class GurlsOptionsList. In the 'demo' directory you will find GURLSloocvprimal.cpp, which implements exactly the first example described above (linear classifier, primal case, leave-one-out cross-validation).
In order to obtain other pipelines, you simply have to change the code fragment where the pipeline is defined
*seq << "paramsel:loocvprimal" << "optimizer:rlsprimal" << "pred:primal" << "perf:macroavg" << "perf:precrec";
and the one where the sequence of instructions is defined
*process1 << GURLS::computeNsave << GURLS::computeNsave << GURLS::ignore << GURLS::ignore << GURLS::ignore; process->addOpt("one", process1);
*process2 << GURLS::load << GURLS::load << GURLS::computeNsave << GURLS::computeNsave << GURLS::computeNsave; process->addOpt("two", process2);
with the desired task pipeline and instruction sequence. For example, for the Gaussian kernel classifier with hold-out cv, the code defining the task pipeline must be
*seq <<"split:ho"<<"paramsel:siglamho"<<"kernel:rbf"<<"optimizer:rlsdual"; *seq <<"pred:dual"<<"predkernel:traintest"<<"perf:macroavg";
the code fragment specifying the sequence of instructions for the training process must be
*process1 << GURLS::computeNsave << GURLS::computeNsave << GURLS::computeNsave << GURLS::computeNsave; *process1 << GURLS::ignore << GURLS::ignore << GURLS::ignore;
and the code fragment specifying the sequence of instructions for the testing process must be
*process2 << GURLS::load << GURLS::load << GURLS::load << GURLS::load; *process2 << GURLS::computeNsave << GURLS::computeNsave << GURLS::computeNsave;
The bGURLS package includes all the design patterns described for GURLS, and has been complemented with additional big data and distributed computation capabilities. Big data support is obtained using a data structure called bigarray, which makes it possible to handle data matrices as large as a machine's available space on the hard drive rather than its RAM: the entire dataset is stored on disk and only small chunks are loaded in memory when required.
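To make the bigarray idea concrete, the following sketch shows in plain MATLAB (without the bigarray class) how a quantity such as X'X can be accumulated one block of rows at a time, so that only blocksize samples are ever held in memory; the file names and sizes are those of the bio data set used in the demos below, and this blockwise accumulation is essentially the idea behind the pre-computed matrices used later in the demo.
n = 24942; d = 68; T = 1; blocksize = 1000; % sizes of the bio data set used below
XtX = zeros(d, d);
Xty = zeros(d, T);
for i1 = 1:blocksize:n
    i2 = min(i1 + blocksize - 1, n);
    Xb = csvread('bio_unique/X.csv', i1-1, 0, [i1-1, 0, i2-1, d-1]); % one block of input rows
    yb = csvread('bio_unique/y.csv', i1-1, 0, [i1-1, 0, i2-1, T-1]); % the matching labels
    XtX = XtX + Xb' * Xb; % accumulate X'X blockwise
    Xty = Xty + Xb' * yb; % accumulate X'y blockwise
end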
bGURLS relies on a simple interface -- developed ad-hoc and called Gurls Distributed Manager (GDM) -- to distribute matrix-matrix multiplications, thus allowing users to perform the important task of kernel matrix computation on a distributed network of computing nodes. After this step, the subsequent tasks behave as in GURLS.
bGurls Design
The bGURLS Core is identified with the bgurls command, which behaves like gurls. As gurls, it accepts exactly four arguments:
- The bigarray of the input data.
- The bigarray of the labels vector.
- An options structure.
- A job-id number.
The options structure is initialized by the bigdefopt function with default fields and values. Most of the main fields of the options structure are the same as in GURLS; however, bgurls requires the options structure to have the additional field files, which must be a structure with the following fields:
- Xva_filename: the prefix of the files that constitute the bigarray of the input data used for validation
- yva_filename: the prefix of the files that constitute the bigarray of the labels vector used for validation
- pred_filename: the prefix of the files that constitute the bigarray of the predicted labels for the test set
- XtX_filename: the name of the file where the pre-computed matrix X'X is stored
- Xty_filename: the name of the file where the pre-computed matrix X'y is stored
- XvatXva_filename: the name of the file where the pre-computed matrix Xva'Xva is stored
- Xvatyva_filename: the name of the file where the pre-computed matrix Xva'yva is stored
Let us consider the demo bigdemoA.m in the demo directory to better understand the usage of bGURLS. The demo computes a linear classifier with the regularization parameter chosen via hold-out validation, and then evaluates the prediction accuracy on a test set.
The data set used in the demo is the bio data set used in Lauer and Guermeur (2011), which is saved in the demo directory as a .zip file, 'bio_unique.zip', containing two files:
- 'X.csv': containing the input nxd data matrix, where n is the number of samples (24,942) and d is the number of variables (68)
- 'Y.csv': containing the input nx1 label vector
In the following we examine the salient parts of the demo in detail. First unzip the data file
unzip('bio_unique.zip','bio_unique')
and set the name of the data files
filenameX = 'bio_unique/X.csv'; % nxd input data matrix
filenameY = 'bio_unique/y.csv'; % nx1 or 1xn labels vector
Now set the size of the blocks for the bigarrays (matrices of size blocksize-by-d must fit into memory):
blocksize = 1000;
the fraction of total samples to be used for testing:
test_hoproportion = .2;
the fraction of training samples to be used for validation:
va_hoproportion = .2;
and the directory where all processed data is going to be stored:
dpath = 'bio_data_processed';
Now set the prefix of the files that will constitute the bigarrays
mkdir(dpath)
files.Xtrain_filename = fullfile(dpath, 'bigarrays/Xtrain');
files.ytrain_filename = fullfile(dpath, 'bigarrays/ytrain');
files.Xtest_filename = fullfile(dpath, 'bigarrays/Xtest');
files.ytest_filename = fullfile(dpath, 'bigarrays/ytes');
files.Xva_filename = fullfile(dpath, 'bigarrays/Xva');
files.yva_filename = fullfile(dpath, 'bigarrays/yva');
and the name of the files where pre-computed matrices will be stored
files.XtX_filename = fullfile(dpath, 'XtX.mat');
files.Xty_filename = fullfile(dpath, 'Xty.mat');
files.XvatXva_filename = fullfile(dpath, 'XvatXva.mat');
files.Xvatyva_filename = fullfile(dpath, 'Xvatyva.mat');
We are now ready to prepare the data for bGURLS.
The following command reads the files filenameX and filenameY blockwise -- thus avoiding loading the whole files at once -- and stores them in the bigarray format, after having split the data into train, validation and test sets:
bigTrainTestPrepare(filenameX, filenameY,files,blocksize,va_hoproportion,test_hoproportion)
Bigarrays are now stored in the files specified in the structure files.
We can now precompute the matrices that will be repeatedly used in the training phase, and store them in the files specified in the structure files:
bigMatricesBuild(files)
The data set is now prepared for running the learning pipeline with the bgurls command. This phase behaves almost exactly as in GURLS. The only differences are that:
- we do not need to load the data into memory, but simply 'load' the bigarrays, that is, load the information necessary to access the data blockwise;
- we have to specify in the options structure where the already computed matrix multiplications and the bigarrays for validation data are stored.
name = fullfile(wpath,'gurls');
opt = bigdefopt(name);
opt.seq = {'paramsel:dhoprimal','rls:dprimal','pred:primal','perf:macroavg'};
opt.process{1} = [2,2,0,0];
opt.process{2} = [3,3,2,2];
Note that no task is defined for the split category, as the data have already been split in the preprocessing phase and the bigarrays for validation were built then.
In the following fragment of code we add to the options structure the information relative to the already computed matrix multiplications and to the validation bigarrays:
opt.files = files;
opt.files = rmfield(opt.files, {'Xtrain_filename';'ytrain_filename';'Xtest_filename';'ytest_filename'}); % not used by bgurls
opt.files.pred_filename = fullfile(dpath, 'bigarrays/pred');
Note that we have also defined where the predicted labels shall be stored as bigarray.
Now we have to 'load' the bigarrays for training
X = bigarray.Obj(files.Xtrain_filename);
y = bigarray.Obj(files.ytrain_filename);
X.Transpose(true);
y.Transpose(true);
and run bgurls on the training set
bgurls(X,y,opt,1)
In order to run the testing process, we first have to 'load' the bigarray variables for the test data
X = bigarray.Obj(files.Xtest_filename);
y = bigarray.Obj(files.ytest_filename);
X.Transpose(true);
y.Transpose(true);
and then we can finally run bgurls on the test set
bgurls(X,y,opt,2);
Now you should have a mat file named 'gurls.mat' in your path. This file contains all the information about your experiment. If you want to see the mean accuracy, for example, load the file in your workspace and type
>> mean(opt.perf.acc)
If you are interested in visualizing or printing stats and facts about your experiment, check the documentation about the summarizing functions in the gurls package.
Two other demos can be found in the 'demo' directory. The three demos differ in the format of the input data, as we tried to provide examples for the most common data formats.
The data set used in bigdemoB is again the bio data set, though in a slightly different format, as it is already split into train and test data. The functions bigTrainPrepare and bigTestPrepare take care of preparing the train and test sets separately.
The data set used in bigdemoC is the ImageNet data set, which is automatically downloaded from http://bratwurst.mit.edu/sbow.tar when running the demo. This data set is stored in 1000 .mat files, where the i-th file contains the variable x, a dxn_i input data matrix for the n_i samples of class i. The function bigTrainTestPrepare_manyfiles takes care of preparing the bigarrays for the ImageNet data format.
Note that, while the bio data set is not properly a big data set, the ImageNet data set occupies about 1 GB of RAM and can thus be considered a big data set.
In order to run bGURLS on other data formats, one can simply use bigdemoA after having substituted the line
bigTrainTestPrepare(filenameX, filenameY,files,blocksize,va_hoproportion,test_hoproportion)
with a suitable fragment of code. The remainder of the data preparation, that is the computation and storage of the relevant matrices, can be left unchanged.
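As an example of such a substitution, the sketch below assumes the data ship as a single MATLAB .mat file (saved with -v7.3 so it can be read partially) containing an n-by-d matrix X and an n-by-1 vector y; the file and variable names are hypothetical. The file is read one block at a time via matfile and dumped to csv files that the documented bigTrainTestPrepare call can then consume.
S = matfile('mydata.mat'); % hypothetical file name; -v7.3 format allows partial reads
[n, ~] = size(S, 'X');
filenameX = fullfile(dpath, 'X.csv');
filenameY = fullfile(dpath, 'y.csv');
for i1 = 1:blocksize:n
    i2 = min(i1 + blocksize - 1, n);
    dlmwrite(filenameX, S.X(i1:i2, :), '-append'); % append one block of inputs
    dlmwrite(filenameY, S.y(i1:i2, 1), '-append'); % append the matching labels
end
bigTrainTestPrepare(filenameX, filenameY, files, blocksize, va_hoproportion, test_hoproportion)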
You can visualize the results of one or more experiments (i.e. GURLS pipelines) using the summary_* functions.
Below we show the usage of this set of functions for two sets of experiments, each run 5 times. First we have to run the experiments: nRuns contains the number of runs for each experiment, and filestr contains the names of the experiments.
nRuns = {5,5};
filestr = {'hoprimal'; 'hodual'};
for i = 1:nRuns{1}
    opt = defopt([filestr{1} '_' num2str(i)]);
    opt.seq = {'paramsel:loocvprimal','rls:primal','pred:primal','perf:macroavg','perf:precrec'};
    opt.process{1} = [2,2,0,0,0];
    opt.process{2} = [3,3,2,2,2];
    gurls(Xtr, ytr, opt, 1)
    gurls(Xte, yte, opt, 2)
end
for i = 1:nRuns{2}
    opt = defopt([filestr{2} '_' num2str(i)]);
    opt.seq = {'kernel:linear','paramsel:loocvdual','rls:dual','pred:dual','perf:macroavg','perf:precrec'};
    opt.process{1} = [2,2,2,0,0,0];
    opt.process{2} = [3,3,3,2,2,2];
    gurls(Xtr, ytr, opt, 1)
    gurls(Xte, yte, opt, 2)
end
In order to visualize the results, we have to specify in the variable fields which fields of opt are to be displayed (as many plots as there are elements of fields will be generated):
>> fields = {'perf.ap','perf.acc'};
We can generate "per-class" plots with the following command:
>> summary_plot(filestr,fields,nRuns)
and "global" plots with:
>> summary_overall_plot(filestr,fields,nRuns)
The following generates a "global" table:
>> summary_table(filestr, fields, nRuns)
Finally, the following plots the time taken by each step of the pipeline, for performance reference:
>> plot_times(filestr,nRuns)