
iclementine/TreebankPreprocessing

 
 


TreebankPreprocessing

Python scripts for preprocessing the Penn Treebank and the Chinese Treebank.

When building a parser, preprocessing the treebanks is a tedious chore. We need to:

  • Split the dataset into train/dev/test sets, following the conventional splits.
  • Remove the XML tags inside CTB.
  • Combine the multiline bracketed files into one file, with one sentence per line.

I wondered why there were no open-source tools for these tedious tasks, so I decided to write one myself. Hopefully it will save you some time.
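To make the third task concrete, here is a minimal sketch of collapsing multiline bracketed trees into one line each. It is not the code used by the scripts in this repository (those rely on NLTK); the function name `iter_trees` and the sample trees are made up for illustration, and the sketch assumes that literal parentheses only occur as tree brackets.

```python
from typing import Iterator


def iter_trees(text: str) -> Iterator[str]:
    """Yield each top-level bracketed tree in `text` as a single line.

    Assumption: "(" and ")" only ever appear as tree brackets.
    """
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level tree begins here
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0 and start is not None:
                # Collapse every run of whitespace, newlines included,
                # into a single space.
                yield " ".join(text[start:i + 1].split())
                start = None


# Hypothetical two-tree input in the usual multiline layout.
sample = """
( (S
    (NP (DT The) (NN cat))
    (VP (VBD sat))))
( (S (NP (PRP It)) (VP (VBD slept))))
"""

lines = list(iter_trees(sample))
for line in lines:
    print(line)
```

Tracking bracket depth rather than blank lines means the sketch does not depend on how the trees are separated in the source file.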

These scripts convert the original Penn Treebank (PTB) and Chinese Treebank 5.1 (CTB) corpora into the conventional data setup used by Chen and Manning (2014) and Dyer et al. (2015). The detailed splits are:

  • PTB Training: 02–21. Development: 22. Test: 23.
  • CTB Training: 001–815, 1001–1136. Development: 886–931, 1148–1151. Test: 816–885, 1137–1147.
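The splits above can be written down as a small lookup. This is only a sketch: the names `PTB_SPLITS`, `CTB_SPLITS`, `ptb_split` and `ctb_split` are hypothetical and are not taken from `ptb.py` or `ctb.py`. It assumes you have already extracted the WSJ section number or the CTB file number as an integer.

```python
from typing import Optional

# WSJ section numbers; range() excludes its upper bound.
PTB_SPLITS = {
    "train": [range(2, 22)],   # sections 02-21
    "dev": [range(22, 23)],    # section 22
    "test": [range(23, 24)],   # section 23
}

# CTB 5.1 file numbers.
CTB_SPLITS = {
    "train": [range(1, 816), range(1001, 1137)],
    "dev": [range(886, 932), range(1148, 1152)],
    "test": [range(816, 886), range(1137, 1148)],
}


def _lookup(table: dict, number: int) -> Optional[str]:
    """Return the split containing `number`, or None if it is unused."""
    for split, ranges in table.items():
        if any(number in r for r in ranges):
            return split
    return None


def ptb_split(section: int) -> Optional[str]:
    return _lookup(PTB_SPLITS, section)


def ctb_split(file_number: int) -> Optional[str]:
    return _lookup(CTB_SPLITS, file_number)


print(ptb_split(22))    # dev
print(ctb_split(816))   # test
print(ctb_split(950))   # None: not part of any split
```

Returning `None` for numbers outside every range makes it easy to skip files such as WSJ sections 00, 01 and 24, which belong to none of the three sets.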

Let's get started.

Required software

  • Python3
  • NLTK

PTB

1. Import PTB into NLTK

Parsing the bracketed files relies on NLTK. Please follow the NLTK instructions and put BROWN and WSJ into nltk_data/corpora/ptb, e.g.

ptb
├── BROWN
└── WSJ
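Before running the script, it can help to confirm that the folder layout matches the tree above. The helper below is a hypothetical sketch using only the standard library; it is not part of this repository. It checks a directory you pass in and makes no attempt to find `nltk_data` itself, since NLTK searches several locations (listed in `nltk.data.path`).

```python
import tempfile
from pathlib import Path
from typing import List


def missing_ptb_folders(ptb_root: Path) -> List[str]:
    """Return the expected sub-folders that are absent under `ptb_root`."""
    return [name for name in ("BROWN", "WSJ") if not (ptb_root / name).is_dir()]


# Demonstration on a throwaway directory that only contains WSJ.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "ptb"
    (root / "WSJ").mkdir(parents=True)
    demo_missing = missing_ptb_folders(root)

print(demo_missing)  # ['BROWN']
```

On a real setup you would call it with your own path, for example `missing_ptb_folders(Path.home() / "nltk_data" / "corpora" / "ptb")`, assuming that is where your NLTK data lives.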

2. Run ptb.py

This script does all the work for you; it only requires a path in which to store the output.

usage: ptb.py [-h] --output OUTPUT

Combine Penn Treebank WSJ MRG files into train/dev/test set

optional arguments:
  -h, --help       show this help message and exit
  --output OUTPUT  The folder where to store the output
                   train.txt/dev.txt/test.txt

E.g.

$ python3 ptb.py --output ptb-combined
Importing ptb from nltk

Generating ptb-combined/train.txt
1875 files...
100.00%
39832 sentences.

Generating ptb-combined/dev.txt
83 files...
100.00%
1700 sentences.

Generating ptb-combined/test.txt
100 files...
100.00%
2416 sentences.
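A quick sanity check on the generated files is to count the non-empty lines and confirm that each one is a complete bracketed tree. The sketch below assumes the one-sentence-per-line format described earlier; `is_balanced` and `count_sentences` are hypothetical helper names, not functions provided by `ptb.py`.

```python
from typing import Iterable


def is_balanced(line: str) -> bool:
    """True if every "(" in `line` is closed and no ")" comes too early."""
    depth = 0
    for ch in line:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def count_sentences(lines: Iterable[str]) -> int:
    """Count non-empty lines, raising if one is not a complete tree."""
    count = 0
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        if not is_balanced(line):
            raise ValueError(f"line {number} is not a complete bracketed tree")
        count += 1
    return count


# Made-up lines standing in for the contents of an output file.
demo = ["(S (NP a) (VP b))\n", "\n", "(S (NP c))\n"]
print(count_sentences(demo))  # 2
```

Since a file object is an iterable of lines, `count_sentences(open("ptb-combined/train.txt"))` should work on the real output, and the result should match the sentence count printed in the log if the format assumption holds.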

CTB

The CTB is a little messy: it contains extra XML tags in every gold tree and is not natively supported by NLTK. You need to specify the CTB root path (the folder containing index.html).

usage: ctb.py [-h] --ctb CTB --output OUTPUT

Combine Chinese Treebank 5.1 fid files into train/dev/test set

optional arguments:
  -h, --help       show this help message and exit
  --ctb CTB        The root path to Chinese Treebank 5.1
  --output OUTPUT  The folder where to store the output
                   train.txt/dev.txt/test.txt

E.g.

$ python3 ctb.py --ctb corpus/ctb5.1 --output ctb5.1-combined
Converting CTB: removing xml tags...
Importing to nltk...

Generating ctb5.1-combined/train.txt
773 files...
100.00%
16083 sentences.

Generating ctb5.1-combined/dev.txt
36 files...
100.00%
803 sentences.

Generating ctb5.1-combined/test.txt
81 files...
100.00%
1910 sentences.
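The tag-removal step reported at the start of the log can be approximated as follows. This is a sketch under assumptions, not the logic of `ctb.py`: it assumes that every markup line starts with `<` and that tree lines never do. The sample text, including its tag names, is invented for illustration and is not copied from the corpus.

```python
def strip_tag_lines(text: str) -> str:
    """Drop every line whose first non-space character is "<".

    Assumption: markup occupies whole lines and tree lines start with "(".
    """
    kept = [
        line
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("<")
    ]
    return "\n".join(kept)


# Hypothetical input: one tree wrapped in made-up markup lines.
sample = """<DOC>
<DOCID> sample_001 </DOCID>
<S ID=1>
( (IP (NP (NN 经济))
      (VP (VV 增长))))
</S>
</DOC>
"""

cleaned = strip_tag_lines(sample)
print(cleaned)
```

Filtering whole lines avoids parsing the markup at all, which matters if the tags are not well-formed XML (an unquoted attribute, as in the sample, would be rejected by a strict XML parser).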

Then you can start your research. Enjoy it!
