User Manual

A comprehensive user manual is available here. This page contains a few examples of basic learning pipelines and is a great place to start.

Design

GURLS (GURLS++) consists of a set of tasks, each belonging to a predefined category, and of a method (a class in the C++ implementation) called the GURLS Core, which is responsible for processing an ordered sequence of tasks called a pipeline. An additional "options structure", often referred to as OPT, stores all the configuration parameters needed to customize the tasks' behaviour. Tasks receive configuration parameters from the options structure in read-only mode and, after terminating, their results are appended to the structure by the GURLS Core so that they are available to the subsequent tasks. This allows the user to skip the execution of some tasks in a pipeline by simply inserting the desired results directly into the options structure. All tasks belonging to the same category can be interchanged with each other, so the user can easily choose how each task shall be carried out.
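For instance, if a suitable value of the regularization parameter is already known, one can write it into the options structure and mark the corresponding parameter-selection task to be skipped in the process vector (see GURLS Usage below). A minimal sketch, assuming the result is stored under a field named paramsel.lambdas (the exact field name is an assumption here):

 % Hypothetical sketch: supply a pre-chosen regularization parameter so the
 % paramsel task does not need to be computed; the field name is an assumption.
 opt.paramsel.lambdas = 0.01;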

(Figure: GURLS design)

GURLS Usage

The gurls command accepts exactly four arguments:

  • The data points, stored in an NxD matrix.
  • The encoded labels (for One-vs-All), stored in an NxT matrix.
  • An options structure.
  • A job-id number.
Each time the data change (e.g. when going from the training phase to the testing phase), gurls needs to be called again.

The three main fields in the options structure are:

  • opt.name: defines a name for a given experiment.
  • opt.seq: specifies the sequence of tasks to be executed.
  • opt.process: specifies what to do with each task, using the following codes:
    • 0 = Ignore
    • 1 = Compute
    • 2 = Compute and save
    • 3 = Load from file
    • 4 = Explicitly delete
The gurls command executes an ordered sequence of tasks, the 'pipeline', specified in the field seq of the options structure as
  {'<CATEGORY1>:<TASK1>';'<CATEGORY2>:<TASK2>';...}

These tasks can be combined in order to build different train-test pipelines. The most popular learning pipelines are outlined in the following.

Examples

Linear classifier, primal case, leave one out cv

We want to run training on a dataset {Xtr,ytr} and testing on a different dataset {Xte,yte}, measuring performance as the average classification accuracy per class. In order to train a linear classifier, choosing the regularization parameter by leave-one-out cross-validation, we just need the following lines of code:

 name = 'ExampleExperiment';
 opt = defopt(name);
 opt.seq = {'paramsel:loocvprimal','rls:primal','pred:primal','perf:macroavg'};
 opt.process{1} = [2,2,0,0];
 opt.process{2} = [3,3,2,2];
 gurls(Xtr, ytr, opt, 1)
 gurls(Xte, yte, opt, 2)

The meaning of the above code fragment is the following:

  • For the training data: compute the regularization parameter lambda by maximizing the classification accuracy via leave-one-out cross-validation, and save the result; solve RLS for a linear classifier in the primal space and save the solution; ignore the rest.
  • For the test data: load the previously selected lambda (this is important if you want to keep this value for further reference) and load the classifier; predict the outputs on the test set and save them; evaluate the performance measure 'macroavg' and save the result.
Note that the field opt.name is implicitly set by the defopt function, which assigns its only input argument to it. The fields opt.seq and opt.process have to be explicitly assigned.
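Once both jobs have run, the results are saved in a .mat file named after opt.name (the bGURLS example below inspects opt.perf.acc in the same way). A minimal sketch for a quick check, assuming the default save location:

 % Load the saved experiment and inspect the mean per-class accuracy
 % (assumes the experiment was saved as 'ExampleExperiment.mat'):
 load([name '.mat']);
 mean(opt.perf.acc)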

Normalized data, linear classifier, primal case, hold-out cv

 name = 'ExampleExperiment';
 opt = defopt(name);
 [Xtr] = norm_zscore(Xtr, ytr, opt); 
 [Xte] = norm_testzscore(Xte, yte, opt); 
 opt.seq = {'split:ho','paramsel:hoprimal','rls:primal','pred:primal','perf:macroavg'};
 opt.process{1} = [2,2,2,0,0];
 opt.process{2} = [3,3,3,2,2];
 gurls(Xtr, ytr, opt, 1)
 gurls(Xte, yte, opt, 2)

Here the training set is first normalized, and the column-wise means and standard deviations are saved to file. Then the test data are normalized according to the statistics computed on the training set.
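Conceptually, the normalization amounts to the following (an illustrative sketch, not the norm_zscore internals):

 % Z-score normalization: compute per-column statistics on the training set
 % and reuse them, unchanged, on the test set.
 mu = mean(Xtr, 1);                                     % column-wise means
 sd = std(Xtr, 0, 1);                                   % column-wise standard deviations
 Xtr = bsxfun(@rdivide, bsxfun(@minus, Xtr, mu), sd);   % normalize training data
 Xte = bsxfun(@rdivide, bsxfun(@minus, Xte, mu), sd);   % normalize test data with training stats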

Linear classifier, dual case, leave one out cv

 name = 'ExampleExperiment';
 opt = defopt(name);
 opt.seq = {'kernel:linear', 'paramsel:loocvdual', 'rls:dual', 'pred:dual', 'perf:macroavg'};
 opt.process{1} = [2,2,2,0,0];
 opt.process{2} = [3,3,3,2,2];
 gurls(Xtr, ytr, opt, 1)
 gurls(Xte, yte, opt, 2)

Linear regression, primal case, hold-out cv

 name = 'ExampleExperiment';
 opt = defopt(name);
 opt.seq = {'paramsel:hoprimal','rls:primal','pred:primal','perf:rmse'};
 opt.process{1} = [2,2,0,0];
 opt.process{2} = [3,3,2,2];
 opt.hoperf = @perf_rmse;
 gurls(Xtr, ytr, opt, 1)
 gurls(Xte, yte, opt, 2)

Here GURLS is used for regression. Note that the hold-out performance measure is explicitly set to @perf_rmse, i.e. root mean square error, whereas in the first example opt.hoperf was left at its default @perf_macroavg, which evaluates the average classification accuracy per class. The same code can be used for multiple-output regression.
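For reference, the quantity reported by perf:rmse corresponds to the following computation (an illustrative sketch; ypred stands for the predicted outputs and is not a variable defined above):

 % Root mean square error between predictions and test labels:
 rmse = sqrt(mean((ypred(:) - yte(:)).^2));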

Gaussian kernel classifier, leave one out cv

 name = 'ExampleExperiment';
 opt = defopt(name);
 opt.seq = {'paramsel:siglam', 'kernel:rbf', 'rls:dual', 'predkernel:traintest', 'pred:dual', 'perf:macroavg'};
 opt.process{1} = [2,2,2,0,0,0];
 opt.process{2} = [3,3,3,2,2,2];
 gurls(Xtr, ytr, opt, 1)
 gurls(Xte, yte, opt, 2)

Here parameter selection for the Gaussian kernel requires choosing both the regularization parameter λ and the kernel parameter σ, and is performed by selecting the task siglam for the category paramsel. Once the value of the kernel parameter σ has been chosen, the Gaussian kernel is built through the kernel task with option rbf.
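For reference, the kernel built in this step has the standard Gaussian form; a minimal sketch of one common parameterization (the exact scaling used by GURLS may differ):

 % Gaussian (RBF) kernel on the rows of X with width sigma:
 sq = sum(X.^2, 2);                          % squared norm of each sample
 D2 = bsxfun(@plus, sq, sq') - 2*(X*X');     % pairwise squared distances
 K = exp(-D2 / (2*sigma^2));                 % NxN kernel matrix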

Gaussian kernel classifier, hold-out cv

 name = 'ExampleExperiment';
 opt = defopt(name);
 opt.seq = {'split:ho','paramsel:siglamho', 'kernel:rbf', 'rls:dual', 'predkernel:traintest', 'pred:dual', 'perf:macroavg'};
 opt.process{1} = [2,2,2,2,0,0,0];
 opt.process{2} = [3,3,3,3,2,2,2];
 gurls(Xtr, ytr, opt, 1)
 gurls(Xte, yte, opt, 2)

Linear classifier via stochastic gradient descent

 name = 'ExampleExperiment';
 opt = defopt(name);
 opt.seq = {'paramsel:calibratesgd','rls:pegasos','pred:primal','perf:macroavg'};
 opt.process{1} = [2,2,0,0];
 opt.process{2} = [3,3,2,2];
 gurls(Xtr, ytr, opt, 1)
 gurls(Xte, yte, opt, 2)

Here the optimization is carried out with a stochastic gradient descent algorithm, namely Pegasos (Shalev-Shwartz, Singer and Srebro, 2007).
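To give the flavour of the method, a single Pegasos-style update looks as follows (an illustrative sketch, not the GURLS implementation; w, lambda and the step counter t are assumed to be initialized beforehand):

 % One Pegasos step for the hinge loss with L2 regularization:
 i = randi(size(Xtr, 1));                % draw a training example at random
 eta = 1 / (lambda * t);                 % decaying step size
 w = (1 - eta*lambda) * w;               % shrinkage from the regularizer
 if ytr(i) * (Xtr(i,:) * w) < 1          % margin violated: take a gradient step
     w = w + eta * ytr(i) * Xtr(i,:)';
 end
 t = t + 1;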

Random features RLS classifier, hold-out cv

 name = 'ExampleExperiment';
 opt = defopt(name);
 opt.seq = {'split:ho','paramsel:horandfeats','kernel:randfeats','rls:randfeats','pred:randfeats','perf:macroavg'};
 opt.process{1} = [2,2,2,2,0,0];
 opt.process{2} = [3,3,3,3,2,2];
 gurls(Xtr, ytr, opt, 1)
 gurls(Xte, yte, opt, 2)

This computes a classifier for the primal formulation of RLS using the random features approach proposed by Rahimi and Recht (2007). In this approach the primal formulation is used in a new space built through random projections of the input data.
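The projection can be sketched as follows (illustrative only, not the GURLS implementation; the number of random features D and the kernel width sigma are assumptions):

 % Random Fourier features (Rahimi & Recht, 2007) approximating a Gaussian
 % kernel of width sigma; d is the input dimension.
 D = 500;                                         % number of random features (assumption)
 W = randn(D, d) / sigma;                         % frequencies drawn from the kernel's spectrum
 b = 2*pi * rand(D, 1);                           % random phases
 Z = sqrt(2/D) * cos(bsxfun(@plus, Xtr*W', b'));  % NxD feature map; primal RLS then runs on Z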

GURLS++ Usage

Since GURLS and GURLS++ share the same design, we refer to the section GURLS Usage and describe here only what changes in the C++ implementation. In C++ the counterpart of the gurls function is the GURLS class, with its only method run, whereas the function defopt has its equivalent in the class GurlsOptionsList. In the 'demo' directory you will find GURLSloocvprimal.cpp, which implements the first example described in the section Linear classifier, primal case, leave one out cv.

In order to obtain other pipelines, you simply have to change the code fragment where the pipeline is defined

 *seq << "paramsel:loocvprimal" << "optimizer:rlsprimal" << "pred:primal" << "perf:macroavg" << "perf:precrec";

and the one where the sequence of instructions is defined

 *process1 << GURLS::computeNsave << GURLS::computeNsave << GURLS::ignore << GURLS::ignore << GURLS::ignore;
 process->addOpt("one", process1);
 *process2 << GURLS::load << GURLS::load << GURLS::computeNsave << GURLS::computeNsave << GURLS::computeNsave;
 process->addOpt("two", process2);

with the desired task pipeline and instruction sequence. For example, for the case Gaussian kernel classifier, hold-out cv, the code defining the task pipeline must be

 *seq <<"split:ho"<<"paramsel:siglamho"<<"kernel:rbf"<<"optimizer:rlsdual";
 *seq <<"pred:dual"<<"predkernel:traintest"<<"perf:macroavg";

the code fragment specifying the sequence of instructions for the training process must be

 *process1 << GURLS::computeNsave << GURLS::computeNsave << GURLS::computeNsave << GURLS::computeNsave;
 *process1 << GURLS::ignore << GURLS::ignore << GURLS::ignore;

and the code fragment specifying the sequence of instructions for the testing process must be

 *process2 << GURLS::load << GURLS::load << GURLS::load << GURLS::load;
 *process2 << GURLS::computeNsave << GURLS::computeNsave << GURLS::computeNsave;

bGURLS Usage

The bGURLS package includes all the design patterns described for GURLS, complemented with additional big-data and distributed-computation capabilities. Big data support is obtained through a data structure called bigarray, which allows handling data matrices as large as the machine's available disk space rather than its RAM: the entire dataset is stored on disk and only small chunks are loaded into memory when required.
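The blockwise pattern this enables can be sketched as follows (conceptual only; readBlock is a hypothetical helper, not part of the bigarray API):

 % Conceptual sketch: accumulate X'X over an on-disk matrix one block at a time,
 % so that only blocksize rows ever reside in memory.
 XtX = zeros(d, d);                               % d is the number of variables
 for b = 1:nBlocks
     Xb = readBlock(filenameX, b, blocksize);     % hypothetical blockwise reader
     XtX = XtX + Xb' * Xb;
 end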

bGURLS relies on a simple interface -- developed ad-hoc and called Gurls Distributed Manager (GDM) -- to distribute matrix-matrix multiplications, thus allowing users to perform the important task of kernel matrix computation on a distributed network of computing nodes. After this step, the subsequent tasks behave as in GURLS.

(Figure: bGURLS design)

The bGURLS Core is identified with the bgurls command, which behaves like gurls and likewise accepts exactly four arguments:

  • The bigarray of the input data.
  • The bigarray of the labels vector.
  • An options structure.
  • A job-id number.
The options structure is built through the bigdefopt function with default fields and values. Most of the main fields in the options structure are the same as in GURLS; however, bgurls requires the options structure to have the additional field files, which must be a structure with fields:
  • Xva_filename: the prefix of the files that constitute the bigarray of the input data used for validation
  • yva_filename: the prefix of the files that constitute the bigarray of the labels vector used for validation
  • pred_filename: the prefix of the files that constitute the bigarray of the predicted labels for the test set
  • XtX_filename: the name of the file where the pre-computed matrix X'X is stored
  • Xty_filename: the name of the file where the pre-computed matrix X'y is stored
  • XvatXva_filename: the name of the file where the pre-computed matrix Xva'Xva is stored
  • Xvatyva_filename: the name of the file where the pre-computed matrix Xva'yva is stored

bGURLS example

Let us consider the demo bigdemoA.m in the demo directory to better understand the usage of bGURLS. The demo computes a linear classifier with the regularization parameter chosen via hold-out validation, and then evaluates the prediction accuracy on a test set. The data set used in the demo is the bio data set used in Lauer and Guermeur (2011), which is saved in the demo directory as a .zip file, 'bio_unique.zip', containing two files:

  • 'X.csv': the input nxd data matrix, where n is the number of samples (24,942) and d is the number of variables (68)
  • 'Y.csv': the input nx1 labels vector
Note that the bio data set is not properly a big data set, as it could fit in memory; however, it is large enough to make the use of bGURLS reasonable.

In the following we examine the salient parts of the demo in detail. First unzip the data file

 unzip('bio_unique.zip','bio_unique')

and set the name of the data files

 filenameX = 'bio_unique/X.csv'; %nxd input data matrix
 filenameY = 'bio_unique/y.csv'; %nx1 or 1xn labels vector

Now set the size of the blocks for the bigarrays (matrices of size blocksize x d must fit into memory):

 blocksize = 1000; 

the fraction of total samples to be used for testing:

 test_hoproportion = .2;

the fraction of training samples to be used for validation:

 va_hoproportion = .2;  

and the directory where all processed data is going to be stored:

 dpath = 'bio_data_processed'; 

Now set the prefix of the files that will constitute the bigarrays

 mkdir(dpath)
 files.Xtrain_filename = fullfile(dpath, 'bigarrays/Xtrain');
 files.ytrain_filename = fullfile(dpath, 'bigarrays/ytrain');
 files.Xtest_filename = fullfile(dpath, 'bigarrays/Xtest');
 files.ytest_filename = fullfile(dpath, 'bigarrays/ytest');
 files.Xva_filename = fullfile(dpath, 'bigarrays/Xva');
 files.yva_filename = fullfile(dpath, 'bigarrays/yva');

and the name of the files where pre-computed matrices will be stored

 files.XtX_filename = fullfile(dpath, 'XtX.mat');
 files.Xty_filename = fullfile(dpath, 'Xty.mat');
 files.XvatXva_filename = fullfile(dpath,'XvatXva.mat');
 files.Xvatyva_filename = fullfile(dpath, 'Xvatyva.mat');

We are now ready to prepare the data for bGURLS. The following command reads the files filenameX and filenameY blockwise -- thus avoiding loading the whole files at once -- and stores them in the bigarray format, after splitting the data into training, validation and test sets:

 bigTrainTestPrepare(filenameX, filenameY,files,blocksize,va_hoproportion,test_hoproportion)

The bigarrays are now stored in the files specified in the structure files. We can now precompute the matrices that will be used repeatedly in the training phase, and store them in the files specified in the same structure:

 bigMatricesBuild(files)

The data set is now ready for running the learning pipeline with the bgurls command. This phase behaves almost exactly as in GURLS. The only differences are that:

  • we do not need to load the data into memory, but simply 'load' the bigarrays, that is, load the information necessary to access the data blockwise;
  • we have to specify in the options structure where the pre-computed matrix multiplications and the bigarrays for the validation data are stored.
Let us first define the options structure as in GURLS:

 name = fullfile(wpath,'gurls');
 opt = bigdefopt(name);
 opt.seq = {'paramsel:dhoprimal','rls:dprimal','pred:primal','perf:macroavg'};
 opt.process{1} = [2,2,0,0];
 opt.process{2} = [3,3,2,2];

Note that no task is defined for the split category, as the data have already been split in the preprocessing phase, when the bigarrays for validation were built. In the following fragment of code we add to the options structure the information about the pre-computed matrix multiplications and the validation bigarrays:

 opt.files = files;
 opt.files = rmfield(opt.files,{'Xtrain_filename';'ytrain_filename';'Xtest_filename';'ytest_filename'}); %not used by bgurls
 opt.files.pred_filename = fullfile(dpath, 'bigarrays/pred');

Note that we have also defined where the predicted labels shall be stored as bigarray.

Now we have to 'load' bigarrays for training

 X = bigarray.Obj(files.Xtrain_filename);
 y = bigarray.Obj(files.ytrain_filename);	
 X.Transpose(true);
 y.Transpose(true);

and run bgurls on the training set

 bgurls(X,y,opt,1)

In order to run the testing process, we first have to 'load' the bigarray variables for the test data

 X = bigarray.Obj(files.Xtest_filename);
 y = bigarray.Obj(files.ytest_filename);	
 X.Transpose(true);
 y.Transpose(true);

and then we can finally run bgurls on the test set

 bgurls(X,y,opt,2);

Now you should have a .mat file named 'gurls.mat' in your path. This file contains all the information about your experiment. If you want to see the mean accuracy, for example, load the file into your workspace and type

 >> mean(opt.perf.acc)

If you are interested in visualizing or printing stats and facts about your experiment, check the documentation of the summarizing functions in the GURLS package.

Dealing with other data formats

Two other demos can be found in the 'demo' directory. The three demos differ in the format of the input data, as we tried to provide examples for the most common data formats. The data set used in bigdemoB is again the bio data set, though in a slightly different format, as it is already split into train and test data. The functions bigTrainPrepare and bigTestPrepare take care of preparing the train and test sets separately.

The data set used in bigdemoC is the ImageNet data set, which is automatically downloaded from http://bratwurst.mit.edu/sbow.tar when running the demo. This data set is stored in 1000 .mat files, where the i-th file contains a variable x, a dxn_i input data matrix holding the n_i samples of class i. The function bigTrainTestPrepare_manyfiles takes care of preparing the bigarrays for the ImageNet data format. Note that, while the bio data is not properly a big data set, the ImageNet data occupies about 1 GB and can thus reasonably be called big.

In order to run bGURLS on other data formats, one can simply use bigdemoA after substituting the line

 bigTrainTestPrepare(filenameX, filenameY,files,blocksize,va_hoproportion,test_hoproportion)

with a suitable fragment of code. The remainder of the data preparation, that is the computation and storage of the relevant matrices, can be left unchanged.

Results visualization

You can visualize the results of one or more experiments (i.e. GURLS pipelines) using the summary_* functions. Below we show the usage of this set of functions for two experiments, each one run 5 times. First we have to run the experiments; nRuns contains the number of runs for each experiment, and filestr contains the names of the experiments.

 nRuns = {5,5};
 filestr = {'hoprimal'; 'hodual'};
 for i = 1:nRuns{1}
   opt = defopt([filestr{1} '_' num2str(i)]);
   opt.seq = {'paramsel:loocvprimal','rls:primal','pred:primal','perf:macroavg','perf:precrec'};
   opt.process{1} = [2,2,0,0,0];
   opt.process{2} = [3,3,2,2,2];
   gurls(Xtr, ytr, opt, 1)
   gurls(Xte, yte, opt, 2)
 end
 for i = 1:nRuns{2}
   opt = defopt([filestr{2} '_' num2str(i)]);
   opt.seq = {'kernel:linear', 'paramsel:loocvdual', 'rls:dual', 'pred:dual', 'perf:macroavg', 'perf:precrec'};
   opt.process{1} = [2,2,2,0,0,0];
   opt.process{2} = [3,3,3,2,2,2];
   gurls(Xtr, ytr, opt, 1)
   gurls(Xte, yte, opt, 2)
 end

In order to visualize the results, we have to specify in the variable fields which fields of opt are to be displayed (as many plots as there are elements of fields will be generated):

 >> fields = {'perf.ap','perf.acc'};

We can generate "per-class" plots with the following command:

 >> summary_plot(filestr,fields,nRuns) 

and “global” plots with:

 >> summary_overall_plot(filestr,fields,nRuns) 

and a “global” table with:

 >> summary_table(filestr, fields, nRuns) 

Finally, the times taken by each step of the pipeline can be plotted, for performance reference, with:

 >> plot_times(filestr,nRuns)