USCDataScience/autoextractor

This branch is 70 commits ahead of, 2 commits behind thammegowda/autoextractor:master.

Latest commit: 218cb59 · Oct 27, 2021

Auto Extractor

An intelligent extractor library that learns the structure of its input web pages and then derives a strategy for scraping their structured content.
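To give a feel for what comparing pages by style can look like, here is a minimal sketch that scores two pages by the Jaccard similarity of their CSS class names. It is an illustration under stated assumptions, not this library's API: the class `StyleSimilarity`, the method `jaccard`, and the sample class names are all hypothetical, and the toolkit's own similarity measures may be computed differently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of autoextractor.
public class StyleSimilarity {

    /**
     * Jaccard similarity |A ∩ B| / |A ∪ B| between two sets of CSS class names.
     * Two empty sets are treated as identical (similarity 1.0) by assumption.
     */
    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        // Copy before retainAll so the caller's set is not modified.
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // Made-up class names standing in for two pages built from a similar template.
        Set<String> page1 = Set.of("nav", "title", "price", "footer");
        Set<String> page2 = Set.of("nav", "title", "footer", "sidebar");
        System.out.println(jaccard(page1, page2));
    }
}
```

A pairwise score like this could feed a clustering step such as the shared-near-neighbor method listed in the references. Scoring structural similarity would need a tree-based measure such as tree edit distance instead.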

Links

Developers:

Citation:

If you use this work, please cite: https://ieeexplore.ieee.org/abstract/document/7785739

@inproceedings{gowda2016clustering,
  title={Clustering Web Pages Based on Structure and Style Similarity (Application Paper)},
  author={Gowda, Thamme and Mattmann, Chris A},
  booktitle={Information Reuse and Integration (IRI), 2016 IEEE 17th International Conference on},
  pages={175--180},
  year={2016},
  organization={IEEE}
}

References:

  • K. Zhang and D. Shasha, "Simple fast algorithms for the editing distance between trees and related problems," SIAM Journal on Computing, vol. 18, no. 6, pp. 1245-1262, Dec. 1989.
  • R. A. Jarvis and E. A. Patrick, "Clustering using a similarity measure based on shared near neighbors," IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1025-1034, Nov. 1973.

About

A toolkit for clustering web pages based on various similarity measures.

Languages

  • Java 68.1%
  • JavaScript 15.2%
  • Scala 12.8%
  • HTML 3.6%
  • CSS 0.3%