A Java program to process Wiki database XML dumps, especially the French Wikivoyage.
Example command line: unxz < frwikivoyage-20140401-pages-meta-history.xml.xz | java -jar WkProcessor-1.0-SNAPSHOT.jar
This program will:
- Build the hierarchy of French Wikivoyage pages (the parent relationship is derived from the "Dans" template in the latest revision of each page; a minimal sketch follows this list).
- Collect the number of pages and characters for each country.
- If a directory "out" exists in the current directory, the XML for each page is saved there (this takes a lot of disk space).
- Only the pages in the main namespace are processed.
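The "Dans" handling and per-country statistics described above could look roughly like the following. This is a hypothetical sketch, not the actual WkProcessor code: the class and method names, the regex, and the use of the wikitext length as the character count are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DansTemplateSketch {
    // Matches {{Dans|Parent page}} in the wikitext of the latest revision (assumed syntax).
    private static final Pattern DANS =
            Pattern.compile("\\{\\{\\s*Dans\\s*\\|\\s*([^}|]+)\\s*\\}\\}", Pattern.CASE_INSENSITIVE);

    // page title -> parent title, i.e. the page hierarchy
    private final Map<String, String> parentOf = new HashMap<>();
    // country -> {page count, character count}
    private final Map<String, long[]> statsByCountry = new HashMap<>();

    /** Record one main-namespace page, given its latest revision's wikitext. */
    public void addPage(String title, String latestWikitext, String country) {
        Matcher m = DANS.matcher(latestWikitext);
        if (m.find()) {
            parentOf.put(title, m.group(1).trim());
        }
        long[] stats = statsByCountry.computeIfAbsent(country, c -> new long[2]);
        stats[0]++;                           // one more page for this country
        stats[1] += latestWikitext.length();  // character count approximated by wikitext length
    }
}
```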
TODO (probably not anytime soon):
- Find a better way to collect the stats.
- Find out how to use XPath with decent performance.
- Find a better way to map XML elements to Java objects, maybe with JAXB, or by writing a Python pulldom equivalent (see the JAXB sketch after this list).
- Try to use a thread pool.
- Write a complete framework:
  - Load the page files back into Java objects
  - Provide methods to manipulate Wiki documents
  - Find templates
  - Parse sections, listings, lists…
  - Allow loading from a database dump, a database, or a Wikipedia site
  - Allow writing pipelines and filters
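For the JAXB idea mentioned in the TODO list, a <page> element of the dump could be mapped along these lines. This is a hypothetical sketch under assumptions: the field names and the subset of elements shown are not from WkProcessor, and a real dump declares an XML namespace (e.g. http://www.mediawiki.org/xml/export-0.x/) that would also need to be configured, for instance via a package-level @XmlSchema.

```java
import java.util.List;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

// Minimal mapping of a dump <page> element (JAXB is a separate dependency on JDK 11+).
@XmlRootElement(name = "page")
@XmlAccessorType(XmlAccessType.FIELD)
public class Page {
    @XmlElement(name = "title")
    public String title;

    @XmlElement(name = "ns")
    public int namespace;

    @XmlElement(name = "revision")
    public List<Revision> revisions;

    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Revision {
        @XmlElement(name = "id")
        public long id;

        @XmlElement(name = "text")
        public String text;
    }
}
```

Combined with a StAX reader positioned on each <page> start tag, Unmarshaller.unmarshal(XMLStreamReader, Page.class) could then hydrate one page at a time without loading the whole dump into memory.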