A Java program to perform processing on Wiki database XML dump

Fabiensk/WkProcessor

WkProcessor

A Java program for processing a Wiki database XML dump, especially that of French Wikivoyage.

Example command line: unxz < frwikivoyage-20140401-pages-meta-history.xml.xz | java -jar WkProcessor-1.0-SNAPSHOT.jar
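Because the command above pipes the decompressed dump into the program, the input arrives on standard input as one large XML stream. The following is only a minimal sketch of how such a stream can be read with the JDK's StAX API; it is not the code used by WkProcessor. The class name DumpPageCounter and its method are hypothetical, and the sketch assumes the usual MediaWiki export layout in which each <page> element carries one <ns> child.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: counts main-namespace pages in a MediaWiki XML dump. */
final class DumpPageCounter {

    /** Streams the dump and counts the <ns> elements whose text is "0". */
    static int countMainNamespacePages(InputStream in) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "ns".equals(reader.getLocalName())) {
                    // getElementText() returns the text content and leaves
                    // the reader on the matching end tag.
                    if ("0".equals(reader.getElementText().trim())) {
                        count++;
                    }
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the dump", e);
        }
    }

    public static void main(String[] args) {
        // Reads the decompressed dump from standard input, as in the pipe above.
        System.out.println(countMainNamespacePages(System.in));
    }
}
```

A streaming reader keeps memory use flat regardless of dump size, which matters for a full-history dump; the trade-off is that page state has to be tracked by hand while the events go by.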

This program will:

  • Build the hierarchy of French Wikivoyage pages. Each page's parent is taken from the "Dans" template in its latest revision.
  • Collect the number of pages and characters for each country.
  • Save the XML of each page to a directory named "out", if one exists in the current directory (this takes a lot of disk space).
  • Process only the pages in the main namespace.
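The hierarchy and statistics steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the program's actual implementation: the class HierarchySketch and its methods are hypothetical, the "Dans" template is assumed to take the parent title as its first parameter, the set of country titles is assumed to be supplied by the caller, and "characters" is taken to mean the length of the latest revision's wikitext in UTF-16 code units.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the hierarchy and per-country statistics. */
final class HierarchySketch {

    // Captures the first parameter of {{Dans|Parent}} (assumed template shape).
    private static final Pattern DANS = Pattern.compile(
            "\\{\\{\\s*Dans\\s*\\|\\s*([^|}]+?)\\s*[|}]", Pattern.CASE_INSENSITIVE);

    /** Returns the parent title named by the template, or null if absent. */
    static String parentOf(String wikitext) {
        Matcher m = DANS.matcher(wikitext);
        return m.find() ? m.group(1) : null;
    }

    /** Walks up the parent chain until a known country title is reached. */
    static String countryOf(String title, Map<String, String> parents,
                            Set<String> countries) {
        Set<String> seen = new HashSet<>(); // guards against parent cycles
        String current = title;
        while (current != null && seen.add(current)) {
            if (countries.contains(current)) {
                return current;
            }
            current = parents.get(current);
        }
        return null;
    }

    /** Maps each country to {page count, character count}; a country page counts itself. */
    static Map<String, long[]> statsByCountry(Map<String, String> latestText,
                                              Set<String> countries) {
        Map<String, String> parents = new HashMap<>();
        for (Map.Entry<String, String> page : latestText.entrySet()) {
            String parent = parentOf(page.getValue());
            if (parent != null) {
                parents.put(page.getKey(), parent);
            }
        }
        Map<String, long[]> stats = new TreeMap<>();
        for (Map.Entry<String, String> page : latestText.entrySet()) {
            String country = countryOf(page.getKey(), parents, countries);
            if (country != null) {
                long[] s = stats.computeIfAbsent(country, k -> new long[2]);
                s[0]++;                          // pages
                s[1] += page.getValue().length(); // characters (UTF-16 units)
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("France", "{{Dans|Europe}} Pays.");
        pages.put("Rennes", "{{Dans|France}} Ville.");
        Set<String> countries = new HashSet<>();
        countries.add("France");
        statsByCountry(pages, countries).forEach((country, s) ->
                System.out.println(country + ": " + s[0] + " pages, " + s[1] + " chars"));
    }
}
```

Resolving countries in a second pass, after all parent links are known, avoids depending on the order in which pages appear in the dump.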

TODO (probably not anytime soon):

  • Find a better way to collect the stats.
  • Find out how to use xpath with decent performance.
  • Find a better way to map XML objects into Java objects. Maybe with JAXB, or by writing a Python pulldom equivalent.
  • Try to use the thread pool.
  • Write a complete framework
  • Load the page files back to Java objects
  • Provide methods to manipulate Wiki documents
  • Find templates
  • Parse sections, listings, lists…
  • Allow loading from a database dump, a database, or the Wikipedia site
  • Allow writing pipelines and filters
