Siyuan's Fork of Waleking's Fork of AutoPhrase
- Sync with the original repository to fetch some fixes made by shangjingbo1226.
Please cite the following two papers if you are using this tool. Thanks!
- Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss, Jiawei Han, "Automated Phrase Mining from Massive Text Corpora", accepted by IEEE Transactions on Knowledge and Data Engineering, Feb. 2018.
- Jialu Liu*, Jingbo Shang*, Chi Wang, Xiang Ren and Jiawei Han, "Mining Quality Phrases from Massive Text Corpora", Proc. of 2015 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'15), Melbourne, Australia, May 2015. (* equally contributed, slides)
The original version is shangjingbo1226/AutoPhrase.
This fork is mainly designed for SparseTP, a topic modeling tool for phrases, which is going to be published in the 29th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'17).
- Efficient Topic Modeling on Phrases via Sparsity, Weijing Huang, Wei Chen, Tengjiao Wang and Shibo Tao, Proceedings of the 29th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'17), Boston, USA, Nov 2017. (slides)
This fork makes three main modifications.
- We provide an entry script, runAutoPhrase.sh, that processes the raw input file and produces the final result file input_forTopicModel.txt, which is used as the input of SparseTP.
- We add filter.py to remove low-quality phrases (e.g., score < 0.5) and write the remaining high-quality phrases to results/filtered_phrases.txt. With these high-quality phrases, we update src/segment.cpp to segment the raw input file. Finally, we add prepare_for_topicmodeling.py to produce the result file input_forTopicModel.txt, in which each line represents a single document of the corpus in the format word_1,word_2,word_3,...,word_n,phrases_1,...,phrases_m\n.
- We add several running examples that provide a "one click" way to learn how to use this tool. Running example 1 processes the 20newsgroups dataset; running example 2 processes the Wikipedia articles under the Mathematics category (the json data are available at Dropbox); running examples 3 and 4 process Chemistry (available json data) and Argentina (available json data).
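The score-based filtering step described above can be illustrated with a minimal sketch. This is not the actual filter.py: the function name filter_phrases is hypothetical, and the sketch assumes each line of the ranked phrase list has the form score, a tab, then the phrase.

```python
def filter_phrases(lines, threshold=0.5):
    """Keep (score, phrase) pairs whose score is at least `threshold`.

    Assumption: each input line looks like "<score>\t<phrase>".
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        score_text, _, phrase = line.partition("\t")
        try:
            score = float(score_text)
        except ValueError:
            continue  # skip lines that do not start with a number
        if phrase and score >= threshold:
            kept.append((score, phrase))
    return kept


if __name__ == "__main__":
    sample = ["0.95\tmachine learning\n", "0.42\tof the\n", "0.50\ttopic model\n"]
    for score, phrase in filter_phrases(sample):
        print(score, phrase)
```

Phrases scoring below the threshold are dropped, matching the "score < 0.5" example; the threshold is a parameter so other cut-offs can be tried.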
bash runAutoPhrase.sh $input_file
$input_file is the path of the input file, which contains the whole corpus, with each line representing a single document.
The result file will be stored in results/input_forTopicModel.txt.
1, bash runningExample1.sh
After running on the 20newsgroups dataset, the result file can be found as results/input_forTopicModel.txt.
Or for a quick view without running, the result can be downloaded from Dropbox.
For example, the following line from the result file represents one document after phrase extraction: alt,introduction,april,version,introduction,atheism,mathew,...,read,article,mathew,version,pgp signed message,frequently asked questions,faq files,strong atheism,weak atheism,strong atheism,god exists,point of view,weak atheism,...,god exists,peer pressure,pgp signature,pgp signature
2, bash runningExample2.sh
After running on the Mathematics Wiki dataset, the result file can be found as results/input_forTopicModel.txt.
Or for a quick view without running, the result can be downloaded from Dropbox.
For example, the following line from the result file represents one document after phrase extraction: kohli,scientist,lab,cambridge,majority,research,field,machine,learning,vision,contributions,game,theory,psychometrics,picture,josh,semantic,paint,kinect,fusion,voxel,crf,inference,microsoft research,discrete algorithms,programming language,higher order,graphical models
3, bash runningExample3.sh
After running on the Chemistry Wiki dataset, the result file can be found as results/input_forTopicModel.txt.
Or for a quick view without running, the result can be downloaded from Dropbox.
4, bash runningExample4.sh
After running on the Argentina Wiki dataset, the result file can be found as results/input_forTopicModel.txt.
Or for a quick view without running, the result can be downloaded from Dropbox.
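The result lines shown in the running examples mix single words and phrases in one comma-separated list. A minimal sketch of reading such a line back is given below. The function name is hypothetical, and the sketch assumes that any item containing a space is a multi-word phrase; single-word phrases cannot be told apart from ordinary words this way.

```python
def split_words_and_phrases(line):
    """Split one result line into (words, multi_word_phrases).

    Assumption: items are comma-separated and an item with a space
    in it is a multi-word phrase.
    """
    items = [item for item in line.rstrip("\n").split(",") if item]
    words = [item for item in items if " " not in item]
    phrases = [item for item in items if " " in item]
    return words, phrases


if __name__ == "__main__":
    example = "read,article,mathew,version,pgp signed message,strong atheism\n"
    words, phrases = split_words_and_phrases(example)
    print(words)
    print(phrases)
```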
We tested runAutoPhrase.sh on a single machine with a 4-core 3.4GHz CPU and 24GB of RAM. To see how it handles a very large input file, we used all Wikipedia pages as input: 5,738,260 articles, 2,036,099,636 tokens, 10.67GB. To fit within our limited memory, we split this file into 5 smaller ones of about 2.1GB each and ran AutoPhrase on them sequentially; each 2.1GB file used 24GB of memory. After 12.5 hours, we obtained the processed result for all Wikipedia pages.
In short, the performance is summarized in the following table.
setting | input file size | memory cost | time cost |
---|---|---|---|
Directly | 2.1GB | 24 GB | 2.5 hours |
Running on 5 split files sequentially | 10.67GB | 24 GB | 12.5 hours |
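The splitting step is not part of the scripts described here. One way to do it is sketched below, assuming the corpus has one document per line so that splitting on line boundaries keeps documents intact. The function name split_corpus and the part_N.txt file names are hypothetical.

```python
import os


def split_corpus(path, num_parts, out_dir):
    """Split a one-document-per-line corpus into at most `num_parts`
    files of roughly equal byte size, never breaking a line."""
    os.makedirs(out_dir, exist_ok=True)
    # Target size per part, rounded up (ceiling division).
    target = max(1, -(-os.path.getsize(path) // num_parts))
    part_paths = []
    out = None
    written = 0
    with open(path, "rb") as src:
        for line in src:
            # Start a new part when the current one is full,
            # unless we already opened the last allowed part.
            if out is None or (written >= target and len(part_paths) < num_parts):
                if out is not None:
                    out.close()
                part_path = os.path.join(out_dir, "part_%d.txt" % len(part_paths))
                part_paths.append(part_path)
                out = open(part_path, "wb")
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return part_paths
```

Each returned path could then be passed to runAutoPhrase.sh in turn, which corresponds to the sequential setting in the table.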
- Fix a few bugs in the pre-processing and post-processing, i.e., Tokenizer.java. Previously, when the corpus contained characters like /, the results could be wrong or errors could occur.
- When the phrasal segmentation is serving new text, for the phrases (every token is seen in the training corpus) provided in the knowledge base (wiki_quality.txt), the score is set to 1.0. Previously, it was effectively infinite.
- Support extremely large corpora (e.g., 100GB or more). Please uncomment the // define LARGE line at the beginning of src/utils/parameters.h before you run AutoPhrase on such a large corpus.
- Quality phrases (every token is seen in the raw corpus) provided in the knowledge base will be incorporated during the phrasal segmentation, even if their frequencies are smaller than MIN_SUP.
- Stopwords will be treated as low-quality single-word phrases.
- Model files are saved separately. Please check the variable MODEL in both auto_phrase.sh and phrasal_segmentation.sh.
- The end of line is also a separator for sentence splitting.
(compared to SegPhrase)
- Minimized Human Effort. We develop a robust positive-only distant training method to estimate the phrase quality by leveraging existing general knowledge bases.
- Support Multiple Languages: English, Spanish, and Chinese. The language in the input will be automatically detected.
- High Accuracy. We propose a POS-guided phrasal segmentation model that incorporates POS tags when a POS tagger is available. Meanwhile, the new framework is able to extract single-word quality phrases.
- High Efficiency. A better indexing and an almost lock-free parallelization are implemented, which lead to both running time speedup and memory saving.
Linux or MacOS with g++ and Java installed.
Ubuntu:
- g++ 4.8
$ sudo apt-get install g++-4.8
- Java 8
$ sudo apt-get install openjdk-8-jdk
- curl
$ sudo apt-get install curl
MacOS:
- g++ 6
$ brew install gcc6
- Java 8
$ brew update; brew tap caskroom/cask; brew install Caskroom/cask/java
$ ./auto_phrase.sh
The default run will download an English corpus from the server of our data mining group and run AutoPhrase to get 3 ranked lists of phrases as well as 2 segmentation model files under the MODEL (i.e., models/DBLP) directory.
- AutoPhrase.txt: the unified ranked list for both single-word phrases and multi-word phrases.
- AutoPhrase_multi-words.txt: the sub-ranked list for multi-word phrases only.
- AutoPhrase_single-word.txt: the sub-ranked list for single-word phrases only.
- segmentation.model: AutoPhrase's segmentation model (saved for later use).
- token_mapping.txt: the token mapping file for the tokenizer (saved for later use).
You can change RAW_TRAIN to your own corpus, and you may also want to change MODEL to a different name.
We also provide an auxiliary function to highlight the phrases in context based on our phrasal segmentation model. There are two thresholds you can tune at the top of the script. The model can also handle unknown tokens (i.e., tokens that did not occur in the corpus used in the phrase mining step).
First, you need to specify AutoPhrase's segmentation model, i.e., MODEL. The default value is set to be consistent with auto_phrase.sh.
$ ./phrasal_segmentation.sh
The segmentation results will be put under the MODEL directory as well (i.e., models/DBLP/segmentation.txt). The highlighted phrases will be enclosed by the phrase tags (e.g., <phrase>data mining</phrase>).
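Because highlighted phrases are wrapped in phrase tags, they can be pulled out of the segmentation output with a regular expression. The following is a minimal sketch, not part of the tool; the helper names are hypothetical, and it assumes the tags are never nested.

```python
import re

# Non-greedy match so adjacent tagged phrases are captured separately.
PHRASE_TAG = re.compile(r"<phrase>(.*?)</phrase>")


def extract_phrases(segmented_text):
    """Return the tagged phrases in order of appearance."""
    return PHRASE_TAG.findall(segmented_text)


def strip_tags(segmented_text):
    """Return the text with the phrase tags removed."""
    return PHRASE_TAG.sub(r"\1", segmented_text)


if __name__ == "__main__":
    text = "we study <phrase>data mining</phrase> and <phrase>topic models</phrase>"
    print(extract_phrases(text))
    print(strip_tags(text))
```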
If domain-specific knowledge bases are available, such as MeSH terms, there are two ways to incorporate them.
- (recommended) Append your known quality phrases to the file data/EN/wiki_quality.txt.
- Replace the file data/EN/wiki_quality.txt with your known quality phrases.
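The recommended option, appending, can be sketched as follows. The sketch assumes the knowledge base file holds one phrase per line, which is not stated above, and it skips phrases that are already present so that repeated runs do not add duplicates. The function names are hypothetical.

```python
def merge_quality_phrases(existing, new_phrases):
    """Return the existing phrases followed by the new ones not already listed.

    Assumption: one phrase per entry; comparison is exact after
    stripping surrounding whitespace.
    """
    merged = [p.strip() for p in existing if p.strip()]
    seen = set(merged)
    for phrase in new_phrases:
        phrase = phrase.strip()
        if phrase and phrase not in seen:
            seen.add(phrase)
            merged.append(phrase)
    return merged


def append_to_knowledge_base(path, new_phrases):
    """Rewrite the file at `path` with the merged phrase list."""
    with open(path, encoding="utf-8") as f:
        existing = f.readlines()
    merged = merge_quality_phrases(existing, new_phrases)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged) + "\n")
```

For example, append_to_knowledge_base("data/EN/wiki_quality.txt", mesh_terms) would add a list of MeSH terms, where mesh_terms is a list of strings you supply.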
In fact, our tokenizer supports many different languages, including Arabic (AR), German (DE), English (EN), Spanish (ES), French (FR), Italian (IT), Japanese (JA), Portuguese (PT), Russian (RU), and Chinese (CN). If the language detection is wrong, you can also manually specify the language by modifying the TOKENIZER command in the bash script auto_phrase.sh using the two-letter code for that language. For example, the following one forces the language to be English.
TOKENIZER="-cp .:tools/tokenizer/lib/*:tools/tokenizer/resources/:tools/tokenizer/build/ Tokenizer -l EN"
We also provide a default tokenizer together with a dummy POS tagger in tools/tokenizer.
It uses the StandardTokenizer in Lucene and always assigns the tag UNKNOWN to each token.
To enable this feature, please add -l OTHER to the TOKENIZER command in the bash script auto_phrase.sh.
TOKENIZER="-cp .:tools/tokenizer/lib/*:tools/tokenizer/resources/:tools/tokenizer/build/ Tokenizer -l OTHER"
If you want to incorporate your own tokenizer and/or POS tagger, please create a new class extending SpecialTagger in tools/tokenizer. You may refer to StandardTagger as an example.
You may try to search online or create your own list.
Meanwhile, you have to add two lists of quality phrases in data/OTHER/wiki_quality.txt and data/OTHER/wiki_all.txt.
The phrases in wiki_quality should be of high confidence, while wiki_all, as its superset, could be a little noisy. For more details, please refer to tools/wiki_enities.
### Default Run
sudo docker run -v $PWD/results:/autophrase/results -it \
-e FIRST_RUN=1 -e ENABLE_POS_TAGGING=1 \
-e MIN_SUP=30 -e THREAD=10 \
remenberl/autophrase
./autophrase.sh
The results will be available in the results folder.
### User Specified Input
Assuming the path to the input file is ./data/input.txt:
sudo docker run -v $PWD/data:/autophrase/data -v $PWD/results:/autophrase/results -it \
-e RAW_TRAIN=data/input.txt \
-e FIRST_RUN=1 -e ENABLE_POS_TAGGING=1 \
-e MIN_SUP=30 -e THREAD=10 \
remenberl/autophrase
./autophrase.sh