WingBaldridge_EMNLP2014
This page explains the process of replicating the results of:
Benjamin Wing and Jason Baldridge. Hierarchical Discriminative Classification for Text-Based Geolocation. EMNLP 2014. Doha, Qatar.
The first step is to get the code. Check out or download the code from
https://github.com/utcompling/textgrounder/commits/emnlp-2014-release-candidate-same-results
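For example, using git (the branch name here is taken from the URL above):
$ git clone https://github.com/utcompling/textgrounder.git
$ cd textgrounder
$ git checkout emnlp-2014-release-candidate-same-results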
You'll need to set the $TEXTGROUNDER_DIR variable to the root of the TextGrounder source code, and add $TEXTGROUNDER_DIR/bin to your $PATH variable. You will also need to make sure that $JAVA_HOME is appropriately set to the location of your Java installation (e.g. /Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home on Mac OS X Mavericks).
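For example, assuming the code was checked out to $HOME/textgrounder (an illustrative location; adjust the paths to your own checkout and Java installation), you might run:
$ export TEXTGROUNDER_DIR=$HOME/textgrounder
$ export PATH=$TEXTGROUNDER_DIR/bin:$PATH
$ export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home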
The experiments in the paper used Vowpal Wabbit to do logistic regression. You will need to download and compile the program from the following location:
https://github.com/JohnLangford/vowpal_wabbit
Doing this will also require the C++ Boost library. It may be easiest to install this using a package manager. For example, on Mac OS X, using MacPorts, run
sudo port install boost
This will install Boost in /opt/local/lib and /opt/local/include. (The Vowpal Wabbit Makefile knows about these locations.) Finally, make sure that the vw executable is in your PATH.
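Putting this together, a typical download-and-build sequence might look like the following (these steps reflect the Makefile-based build of that era and may differ for newer Vowpal Wabbit versions):
$ git clone https://github.com/JohnLangford/vowpal_wabbit.git
$ cd vowpal_wabbit
$ make
Then copy the resulting vw binary somewhere on your PATH (or run sudo make install, if your version's Makefile supports it).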
Next you'll need the data. The Wikipedia and Twitter data sets are available from
http://web.corral.tacc.utexas.edu/utcompling/wing-baldridge-2014/
For instructions on obtaining and processing the CoPhIR data set, please contact the first author ([email protected]).
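For example, you could mirror the whole directory with wget (assuming the server permits recursive retrieval; the data sets are large, so make sure you have sufficient disk space):
$ wget -r --no-parent http://web.corral.tacc.utexas.edu/utcompling/wing-baldridge-2014/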
Run textgrounder build compile. This should download a number of packages and then compile the code.
To run the program using Vowpal Wabbit as the classifier and doing hierarchical classification, using a uniform grid, do the following:
$ tg-geolocate --memory MEMORY $PATH_TO_CORPUS/$CORPUS_NAME \
    --ranker hier-classifier --classifier vowpal-wabbit --classify-features gram-binary \
    --num-levels 3 --dpc $DEGREES_PER_CELL --subdivide-factor $SUBDIVIDE_FACTOR \
    --beam-size $BEAM_SIZE --vw-args "--bfgs -b 26 --passes 40 --holdout_off" \
    --nested-vw-args "--bfgs -b 24 --passes 12 --holdout_off" \
    --debug parallel-hier-classifier,parallel-evaluation --eval-set (dev|test)
where dev|test indicates either dev or test, for the development or test sets, respectively.
Note that this uses the default Vowpal Wabbit parameters of 26-bit features and 40 passes for the top level, and 24-bit features and 12 passes for the lower levels. Some of the experiments used different parameters, as described in the paper. DEGREES_PER_CELL, SUBDIVIDE_FACTOR and BEAM_SIZE are as described in the paper. The amount of memory required varies somewhat depending on the particular experiment and is specified as e.g. 32g for 32 gigabytes, which is the default that we used for most experiments.
Note that the --debug flags in this case aren't actually related to debugging, but turn on parallel training (parallel-hier-classifier) and evaluation (parallel-evaluation). This is useful on multi-core machines but will increase the total memory usage, as multiple instances of Vowpal Wabbit will run at once. (The memory used here is separate from the memory specified using --memory, which only controls the Java virtual machine.)
The above experiment may take a long time, and will save all the learned models in /tmp or equivalent. You can use --load-vw-model, --save-vw-model, --load-vw-submodels, --save-vw-submodels, --load-vw-submodels-levels and --save-vw-submodels-levels to load and/or save the top-level or lower-level models.
For example, to load a previously-saved top-level model and save out lower-level models, run as follows:
$ tg-geolocate ... --load-vw-model $CORPUS.bfgs.b26.passes40.dpc$DEGREES_PER_CELL.model \
--save-vw-submodels $CORPUS.hier.dpc$DEGREES_PER_CELL.subdiv$SUBDIVIDE_FACTOR.bfgs.b24.passes12.l%l.i%i.submodel
To load the level 1 and level 2 models, while saving level 3 models, run as follows:
$ tg-geolocate ... --load-vw-model $CORPUS.bfgs.b26.passes40.dpc$DEGREES_PER_CELL.model \
    --load-vw-submodels $CORPUS.hier.dpc$DEGREES_PER_CELL.subdiv$SUBDIVIDE_FACTOR.bfgs.b24.passes12.l%l.i%i.submodel \
    --load-vw-submodels-levels 2 \
    --save-vw-submodels $CORPUS.hier.dpc$DEGREES_PER_CELL.subdiv$SUBDIVIDE_FACTOR.bfgs.b24.passes12.l%l.i%i.submodel \
    --save-vw-submodels-levels 3
To run the program using Vowpal Wabbit as the classifier and doing hierarchical classification, using a K-d grid, do the following:
$ tg-geolocate --memory MEMORY $PATH_TO_CORPUS/$CORPUS_NAME \
    --ranker hier-classifier --classifier vowpal-wabbit --classify-features gram-binary \
    --num-levels 3 --kd-tree --kd-bucket-size $BUCKET_SIZE \
    --subdivide-factor $SUBDIVIDE_FACTOR --beam-size $BEAM_SIZE \
    --vw-args "--bfgs -b 26 --passes 40 --holdout_off" \
    --nested-vw-args "--bfgs -b 24 --passes 12 --holdout_off" \
    --debug parallel-hier-classifier,parallel-evaluation --eval-set (dev|test)
Compared with the uniform-grid invocation above, this only replaces the --dpc parameter with --kd-tree and --kd-bucket-size.
To run the program using Vowpal Wabbit as the classifier but doing flat logistic regression instead of hierarchical classification, using a uniform grid, and saving out the model (so that it can, e.g., be loaded as level 1 of a hierarchical classifier), do the following:
$ tg-geolocate --memory MEMORY $PATH_TO_CORPUS/$CORPUS_NAME \
    --ranker classifier --classifier vowpal-wabbit --classify-features gram-binary \
    --dpc $DEGREES_PER_CELL --vw-args "--bfgs -b 26 --passes 40 --holdout_off" \
    --debug parallel-evaluation --eval-set (dev|test) \
    --save-vw-model $CORPUS.bfgs.b26.passes40.dpc$DEGREES_PER_CELL.model
To run the program using Naive Bayes as the classifier, using a uniform grid, do the following:
$ tg-geolocate --memory MEMORY $PATH_TO_CORPUS/$CORPUS_NAME --ranker naive-bayes --dpc $DEGREES_PER_CELL --eval-set (dev|test)
The main thing to note in this case is that --debug parallel-evaluation should probably not be specified; doing so seems to greatly increase the memory requirements, and there is already parallelism built in even without it.
Please contact Ben Wing ([email protected]) with any questions about replicating these results. This program can take some effort to get up and running, so please feel free to ask for help.