red

Automatic Data Extraction from Result Pages

Jinsong Guo - University of Oxford
Valter Crescenzi - Università Roma Tre
Giovanni Grasso - University of Calabria, Italy / Meltwater - Wrapidity, UK
Tim Furche - University of Oxford / Meltwater - Wrapidity, UK
Georg Gottlob - University of Oxford / TU Wien

We present RED, an automatic approach and prototype system for extracting data records from result pages of websites that follow a widespread publishing pattern: the data about a set of objects (usually returned in response to a query submitted through a web form) is presented as one or several paginated pages, each containing a list of result records. Every result record contains the main attributes of a single object and links to a detail page that publishes further information about that object, possibly extending the set of attributes. We leverage the intra-site redundancy implied by this publishing pattern to design an effective, fully unsupervised, and domain-independent method for extracting from result pages the data that are also published in the corresponding detail pages. Compared with previous unsupervised methods, our method achieves significantly higher accuracy while automatically selecting only the relevant attributes, a task that is outside the scope of traditional fully unsupervised approaches. Compared with previous supervised methods, RED scales to a large number of websites across various domains, reaching similar accuracy without requiring human intervention for each distinct domain.
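The redundancy idea can be illustrated with a toy sketch. This is not the RED algorithm, whose details are not given here; it only shows the underlying intuition that a value appearing both in a result record and on the linked detail page is more likely to be a relevant attribute than page furniture. The function names, the exact-match comparison after normalization, and the sample data are all assumptions made for illustration.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so equal values compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def redundant_values(record_texts, detail_texts):
    """Return the record values that also occur on the linked detail page.

    Both arguments are lists of text-node strings (hypothetical input format).
    """
    detail = {normalize(t) for t in detail_texts}
    return [t for t in record_texts if normalize(t) in detail]


# Hypothetical result record and its detail page, given as text nodes.
record = ["Canon EOS 90D", "$1,199.00", "Add to cart", "4.5 stars"]
detail = ["Canon EOS 90D", "Price:", "$1,199.00", "Sensor: 32.5 MP", "4.5 stars"]

# "Add to cart" is dropped because it does not appear on the detail page.
kept = redundant_values(record, detail)
```

A real system would need fuzzier matching (formatting differences, truncated values) and would have to locate records and detail links first; the sketch leaves all of that out.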

