Legend single-threaded extractor written in Perl. External libraries such as SVMheaderparse, ParsCit, and PDFBox are invoked.

SeerLabs/PerlPipelineExtractor

PerlPipelineExtractor

Java Wrapper/Runner for CSX Extractor

To build and run the project, run these commands from the application's root directory:

    ant jar    # builds the jar in the dist directory and copies other resources there
    ant run    # starts the program

Everything in the cpy directory is copied to the dist directory along with the jar whenever `ant jar` or `ant run` is run.


Config (Important)

The config options should be set appropriately in the config/config.properties file. These settings are read only once, at startup, so changing them while the program is running has no effect until it is restarted.
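The read-once behaviour can be pictured with a small Java sketch. This is an illustration only, not the project's actual loader: the `StartupConfig` class and the `inputDir` key are hypothetical names, and the file is assumed to be a standard Java properties file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical illustration: settings are loaded once and never re-read. */
final class StartupConfig {
    private final Properties values = new Properties();

    /** Reads every property from the given source exactly once. */
    StartupConfig(Reader source) {
        try {
            values.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a snapshot from a properties file on disk. */
    static StartupConfig fromFile(Path file) {
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            return new StartupConfig(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the value captured at startup, or the fallback if the key is absent. */
    String get(String key, String fallback) {
        return values.getProperty(key, fallback);
    }

    public static void main(String[] args) {
        // Path taken from this README; run from the application's root directory.
        StartupConfig config = fromFile(Paths.get("config", "config.properties"));
        // "inputDir" is a made-up key used only for this example.
        System.out.println(config.get("inputDir", "(not set)"));
    }
}
```

Because the snapshot is taken in the constructor, later edits to the file are invisible to the running process, which matches the behaviour described above.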

The various Perl modules in the lib directory also have their own Config files where some options can be set.


Stopping The Program

Modify the dist/runtime.properties file so the 'stopProcessing' property is set to true.
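How the runner checks this flag is not documented here. The sketch below shows one way such a check could work, assuming the file is a standard Java properties file that is re-read between units of work. The `StopFlag` class, the one-second polling interval, and the relative path are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical illustration of polling a stop flag in a properties file. */
final class StopFlag {
    private StopFlag() {}

    /** Returns true only if the source sets stopProcessing to "true" (case-insensitive). */
    static boolean isStopRequested(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Boolean.parseBoolean(props.getProperty("stopProcessing", "false").trim());
    }

    /** Re-reads the file on every call so that manual edits are picked up. */
    static boolean isStopRequested(Path file) {
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            return isStopRequested(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed location relative to the application's root directory.
        Path runtimeFile = Paths.get("dist", "runtime.properties");
        while (!isStopRequested(runtimeFile)) {
            // Placeholder for one unit of work, e.g. extracting one document.
            Thread.sleep(1000L);
        }
        System.out.println("stopProcessing is true; shutting down.");
    }
}
```

With a loop of this shape, the program would finish its current unit of work before noticing the flag, so the stop is not instantaneous.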


Project Structure

project root
|
/build         # Java class files generated by the compiler; created automatically on build
build.xml      # ant build file
/config        # holds the config.properties file, which contains project settings
/converters    # binaries and supporting files for the PDF-to-text converters
/cpy           # all files in here are copied to the dist directory on `ant jar` or `ant run`
/crfpp         # crf_learn and crf_test binaries plus the traindata folder; probably used by ParsCit
/dist          # the built jar file and working resources used at runtime; generated on `ant jar`
/lib           # Perl libraries for parsing, jar files required by the Java program, and the parseDocuments.pl script executed by the jar
/logs          # log files from each run of the jar
/resources     # resources such as dictionaries used by the Perl scripts during parsing
/src           # Java source code
/svm-light     # SVM-light files; their exact use is not documented
/tmp           # temporary files that are safe to discard

Troubleshooting

If an error like this appears when trying to run TET:

/lib/ld-linux.so.2: bad ELF interpreter: No such file or directory

it is most likely caused by missing 32-bit libraries on a 64-bit system. See http://stackoverflow.com/questions/8328250/centos-64-bit-bad-elf-interpreter for a solution.
