Wikipedia Redirects

Java projects for extracting and searching Wikipedia redirects (alternative titles).

Created by Michael Gloger as a school assignment at FIIT STU Bratislava: http://vi.ikt.ui.sav.sk/User:Michael.Gloger?view=home

The main goal of this project was to implement a parser that finds alternative titles for Wikipedia pages by parsing the article XML dump files. Among other details, each page record contains the page title and a flag indicating whether the page is a redirect to another page. If it is, its title can be treated as an alternative title of the page it redirects to.
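
The idea can be sketched with the JDK's streaming XML reader (StAX), which visits one element at a time and so never needs the whole dump in memory. This is only an illustration, not necessarily how the Parser project is written: the names `RedirectSketch` and `extractRedirects` are made up for this example, and it assumes the redirect flag appears as a `<redirect title="..."/>` element inside `<page>`, as in the MediaWiki XML export format.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: class and method names are not taken from the project.
final class RedirectSketch {

    /**
     * Streams through a dump and calls sink.accept(alternativeTitle, targetTitle)
     * for every page that carries a redirect element.
     */
    static void extractRedirects(Reader xml, BiConsumer<String, String> sink) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The dump has no DTD; disabling DTD support is a common precaution.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(xml);
            String title = null;   // title of the page currently being read
            String target = null;  // redirect target, if the page has one
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("page")) {
                        title = null;
                        target = null;
                    } else if (name.equals("title")) {
                        title = reader.getElementText();
                    } else if (name.equals("redirect")) {
                        // Assumed format: <redirect title="Target page" />
                        target = reader.getAttributeValue(null, "title");
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("page")) {
                    if (title != null && target != null) {
                        sink.accept(title, target);
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Wrapped in an unchecked exception to keep the sketch short.
            throw new IllegalStateException("Malformed dump", e);
        }
    }

    public static void main(String[] args) {
        String sample =
            "<mediawiki>"
            + "<page><title>UK</title><redirect title=\"United Kingdom\" /></page>"
            + "<page><title>United Kingdom</title></page>"
            + "</mediawiki>";
        extractRedirects(new StringReader(sample),
            (alt, target) -> System.out.println(alt + " -> " + target));
    }
}
```

Passing each pair to a callback instead of collecting a list keeps memory use flat, which matters for an input of this size; a real run would pass a file reader and write each pair straight to the CSV output.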

Please note that this project does not add any new functionality. Wikipedia already provides online services such as "What links here", which lists, among other things, the pages that refer to a given page. The project was more of a challenge, because the input XML files are larger than 50 GB and contain more than 14 million page records.

This repository contains two Java projects:

  1. Parser - parses Wikipedia XML dumps and saves the alternative title data to a CSV file
  2. Server - reads alternative titles from the file, indexes them in Lucene, and provides REST services for page search plus a web page to display the results