PerlPipelineExtractor

Legacy single-threaded extractor written in Perl. External libraries such as SVMheaderparse, ParsCit, and PDFBox are invoked.

Java Wrapper/Runner for CSX Extractor

To build and run the project:

ant jar    # builds the jar in the dist directory and copies other resources there
ant run    # starts the program

These commands should be run from the application's root directory.

Everything in the cpy directory is copied to the dist folder along with the jar whenever `ant jar` or `ant run` is run.


Config Important

The config options should be set appropriately in the config/config.properties file. Note that these settings are read only once, at startup, so changing them while the program is running has no effect until it is restarted.
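The read-once behaviour can be pictured with a minimal Java sketch. This is an illustration under assumptions, not the project's actual code: the class `StartupConfig`, its methods, and the key `exampleKey` are all hypothetical names. Only the file path config/config.properties comes from this README.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical illustration of settings that are loaded a single time. */
public final class StartupConfig {
    // Filled once in the constructor and never reloaded afterwards.
    private final Properties snapshot = new Properties();

    private StartupConfig(Reader source) throws IOException {
        snapshot.load(source);
    }

    /** Loads a properties file such as config/config.properties. */
    public static StartupConfig fromFile(Path file) {
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            return new StartupConfig(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Same as fromFile, but reads the settings from a string. */
    public static StartupConfig fromText(String text) {
        try {
            return new StartupConfig(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the value captured at startup, or the fallback if the key is absent. */
    public String get(String key, String fallback) {
        return snapshot.getProperty(key, fallback);
    }

    public static void main(String[] args) {
        // "exampleKey" is a made-up key used only for this demonstration.
        StartupConfig config = fromText("exampleKey=someValue\n");
        System.out.println(config.get("exampleKey", "unset"));
    }
}
```

Because the snapshot is filled once and never reloaded, edits made to the file after startup are invisible to a running process, which matches the behaviour described above.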

The various Perl modules in the lib directory also have their own Config files where some options can be set.


Stopping The Program

Modify the dist/runtime.properties file so the 'stopProcessing' property is set to true.
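Unlike config.properties, this file evidently has to be re-read while the program runs. A minimal sketch of how such a stop flag could be polled is shown below. The class `StopFlag`, its methods, and the idea of checking between units of work are assumptions for illustration; only the file name and the 'stopProcessing' property come from this README.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical illustration of polling a stop flag in a properties file. */
public final class StopFlag {
    private StopFlag() {}

    /** Reads the properties file from disk each time it is called. */
    public static Properties load(Path file) {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** True only when stopProcessing is set to "true" (case-insensitive). */
    public static boolean isStopRequested(Properties props) {
        return Boolean.parseBoolean(props.getProperty("stopProcessing", "false").trim());
    }

    /** Runs one unit of work at a time, re-reading the flag before each one. */
    public static void runUntilStopped(Path file, Runnable unitOfWork) {
        while (!isStopRequested(load(file))) {
            unitOfWork.run();
        }
    }

    public static void main(String[] args) {
        // In-memory demonstration; a real run would pass dist/runtime.properties.
        Properties props = new Properties();
        System.out.println(isStopRequested(props));
        props.setProperty("stopProcessing", "true");
        System.out.println(isStopRequested(props));
    }
}
```

In this sketch the flag is checked between units of work, so a stop request would take effect only after the current unit finishes. Whether the real program behaves the same way is not stated here.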


Project Structure

project root
|
/build         # java class files generated by compiler - directory created automatically on build
build.xml      # ant build file
/config        # holds the config.properties file which contains project settings
/converters    # contains binaries and needed files for pdf to text converters
/cpy           # all files in here get copied to the dist directory on `ant jar` or `ant run`
/crfpp         # contains crf_learn and crf_test binaries as well as the traindata folder; apparently used by ParsCit
/dist          # where the built jar file is placed as well as working resources during runtime - generated on `ant jar`
/lib           # contains Perl libraries for parsing, jar files required by the Java program, and the parseDocuments.pl script executed by the jar
/logs          # contains log files from each run of jar 
/resources     # contains resources such as dictionaries used by perl scripts during parsing
/src           # contains java source code
/svm-light     # purpose undocumented; presumably SVM-light files used by SVMheaderparse
/tmp           # holds inconsequential files used temporarily

Troubleshooting

If an error like this appears when trying to run TET:

/lib/ld-linux.so.2: bad ELF interpreter: No such file or directory

it's most likely caused by missing 32-bit libraries on a 64-bit system. See http://stackoverflow.com/questions/8328250/centos-64-bit-bad-elf-interpreter for a solution.