Table of Contents
=================
- Introduction
- Installation
- Data Format
- Usage
- Examples
- Additional Information
Introduction
============
XIA-NB is a C++ implementation of the Naive Bayes classifier, a well-known generative classification algorithm used in applications such as text classification. The event models used here require discrete feature values. XIA-NB represents instances with the multinomial event model and learns parameters by maximum likelihood estimation with Laplace smoothing. Feature vectors are stored in a sparse data structure for higher computational speed.
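The estimation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the XIA-NB source: the names `SparseVec`, `NbModel`, `train_multinomial` and `classify` are hypothetical, and applying add-one smoothing to the class prior is a choice made for this sketch.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Sparse instance: (feature id, count) pairs with 1-based feature ids.
typedef std::vector<std::pair<int, int> > SparseVec;

struct NbModel {
    std::vector<double> log_prior;               // log P(c), index = class id - 1
    std::vector<std::vector<double> > log_cond;  // log P(f | c), [class][feature id - 1]
};

// Multinomial event model with add-one (Laplace) smoothing.
// labels[i] is the 1-based class id of instances[i].
NbModel train_multinomial(const std::vector<SparseVec>& instances,
                          const std::vector<int>& labels,
                          int num_classes, int num_features) {
    std::vector<double> docs(num_classes, 0.0);   // instances per class
    std::vector<double> total(num_classes, 0.0);  // summed feature counts per class
    std::vector<std::vector<double> > count(
        num_classes, std::vector<double>(num_features, 0.0));

    for (std::size_t i = 0; i < instances.size(); ++i) {
        const int c = labels[i] - 1;
        docs[c] += 1.0;
        for (std::size_t k = 0; k < instances[i].size(); ++k) {
            const int f = instances[i][k].first - 1;
            const int v = instances[i][k].second;
            count[c][f] += v;
            total[c] += v;
        }
    }

    const double n = static_cast<double>(instances.size());
    NbModel m;
    m.log_prior.resize(num_classes);
    m.log_cond.assign(num_classes, std::vector<double>(num_features, 0.0));
    for (int c = 0; c < num_classes; ++c) {
        // Smoothing the prior too is an assumption of this sketch.
        m.log_prior[c] = std::log((docs[c] + 1.0) / (n + num_classes));
        for (int f = 0; f < num_features; ++f) {
            m.log_cond[c][f] =
                std::log((count[c][f] + 1.0) / (total[c] + num_features));
        }
    }
    return m;
}

// Returns the 1-based class id with the highest log score.
int classify(const NbModel& m, const SparseVec& x) {
    int best = 0;
    double best_score = 0.0;
    for (std::size_t c = 0; c < m.log_prior.size(); ++c) {
        double score = m.log_prior[c];
        for (std::size_t k = 0; k < x.size(); ++k) {
            score += x[k].second * m.log_cond[c][x[k].first - 1];
        }
        if (c == 0 || score > best_score) {
            best = static_cast<int>(c);
            best_score = score;
        }
    }
    return best + 1;
}
```

Only the non-zero entries of an instance are visited during training and classification, which is where the sparse representation saves time.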
Installation
============
On Linux, type `make' to build the `nb_learn' and `nb_classify' programs. Run either program without arguments to print its usage.
On Windows, refer to `Makefile' to build them, or use the pre-built binaries in the `windows' directory.
Data Format
===========
Training and testing data files use the following format:
<label> <index1>:<value1> <index2>:<value2> ...
.
.
.
Each line contains one instance and ends with a '\n' character.
<label> is an integer indicating the class id. Class ids should range from 1 to the number of classes. For example, the class ids are 1, 2, 3 and 4 for a 4-class classification problem.
<label> and the <index>:<value> pairs are separated by a '\t' character. <index> is a positive integer denoting the feature id. Feature ids should range from 1 to the size of the feature set. For example, the feature ids are 1, 2, ..., 10 if the feature set has 10 dimensions. Indices must be in ASCENDING order. <value> is a number denoting the feature value. The value must be an INTEGER, since the event models used here require discrete feature values.
If a feature value equals 0, omitting its <index>:<value> pair is encouraged, to save storage space and computation time.
Labels in the testing file are only used to calculate accuracy or errors. If they are unknown, fill the first column with any valid class label.
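To make the format rules concrete, here is a minimal sketch of a parser for one line. It is an illustration only, not the parser used by `nb_learn': `Instance', `to_int' and `parse_instance' are hypothetical names, the sample line in the usage note is made up, and accepting any whitespace (tab or space) between fields is an assumption.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct Instance {
    int label;
    std::vector<std::pair<int, int> > features;  // (index, value), ascending index
};

// Converts a whole token to int; rejects trailing characters such as "3.5".
int to_int(const std::string& s) {
    std::size_t pos = 0;
    const int v = std::stoi(s, &pos);  // throws std::invalid_argument if not numeric
    if (pos != s.size()) throw std::invalid_argument("not an integer: " + s);
    return v;
}

// Parses "<label> <index1>:<value1> <index2>:<value2> ...".
Instance parse_instance(const std::string& line) {
    std::istringstream in(line);
    Instance inst;
    std::string token;
    if (!(in >> token)) throw std::invalid_argument("empty line");
    inst.label = to_int(token);
    if (inst.label < 1) throw std::invalid_argument("class ids start at 1");

    int prev = 0;
    while (in >> token) {
        const std::size_t colon = token.find(':');
        if (colon == std::string::npos)
            throw std::invalid_argument("expected <index>:<value>");
        const int index = to_int(token.substr(0, colon));
        const int value = to_int(token.substr(colon + 1));
        // prev starts at 0, so this also enforces index >= 1.
        if (index <= prev)
            throw std::invalid_argument("indices must be positive and ascending");
        prev = index;
        // Zero values carry no information, so they are dropped (sketch choice).
        if (value != 0) inst.features.push_back(std::make_pair(index, value));
    }
    return inst;
}
```

For instance, `parse_instance("2\t1:3 4:1 7:2")' yields label 2 with three feature pairs, while a line whose indices are out of order raises `std::invalid_argument'.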
Usage
=====
XIA-NB learning module
usage: nb_learn [options] training_file model_file
options: -h        -> help
         -e [0,1]  -> 0: multi-variate Bernoulli event model
                   -> 1: multinomial event model (default)
         -s [0]    -> Laplace smoothing (default)

XIA-NB classification module
usage: nb_classify [options] testing_file model_file output_file
options: -h        -> help
         -e [0,1]  -> 0: multi-variate Bernoulli event model
                   -> 1: multinomial event model (default)
         -f [0..2] -> 0: only output class label (default)
                   -> 1: output class label with log-likelihood
                   -> 2: output class label with probability
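The `-f 1' and `-f 2' formats report the same decision on different scales. How `nb_classify' converts between them is not stated here; one common approach, sketched below as an assumption, normalizes the per-class log scores with the log-sum-exp trick so that very negative log-likelihoods do not underflow. The function name `normalize_log_scores' is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Turns per-class log scores into probabilities that sum to 1.
// Subtracting the maximum first keeps std::exp from underflowing to 0.
std::vector<double> normalize_log_scores(const std::vector<double>& log_scores) {
    std::vector<double> probs(log_scores.size(), 0.0);
    if (log_scores.empty()) return probs;

    const double max_score =
        *std::max_element(log_scores.begin(), log_scores.end());
    double sum = 0.0;
    for (std::size_t i = 0; i < log_scores.size(); ++i) {
        probs[i] = std::exp(log_scores[i] - max_score);
        sum += probs[i];
    }
    for (std::size_t i = 0; i < probs.size(); ++i) {
        probs[i] /= sum;  // sum >= 1 because the max term contributes exp(0)
    }
    return probs;
}
```

The class ranking is unchanged by this normalization, so the predicted label is the same under all three output formats.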
Examples
========
The "data" directory contains a dataset for a text classification task. This dataset
has six class labels and more than 250,000 features.
For learning with the default multinomial event model:
> nb_learn data/train.samp data/nb.mod
For learning with the multi-variate Bernoulli event model:
> nb_learn -e 0 data/train.samp data/nb0.mod
For classifying with the default multinomial event model and the default output format:
> nb_classify data/test.samp data/nb.mod data/nb.out
For classifying with the multi-variate Bernoulli event model and the log-likelihood output:
> nb_classify -e 0 -f 1 data/test.samp data/nb0.mod data/nb0.out
Additional Information
======================
For any questions and comments, please email [email protected].