Web Parsing
Work-in-progress migration from Google Code
JavaScript object that creates a unique CSS selector for a given element.
Python APTED algorithm for the Tree Edit Distance
An efficient approximation for tree edit-distance.
A PyTorch implementation of "SimGNN: A Neural Network Approach to Fast Graph Similarity Computation" (WSDM 2019).
SIGIR-2022 Webformer: Pre-training with Web Pages for Information Retrieval
WebRED is a large and diverse manually annotated dataset for extracting relationships from a variety of text found on the World Wide Web.
WebNav: A New Large-Scale Task for Natural Language based Sequential Decision Making
Simplified DOM Trees for Transferable Attribute Extraction from the Web
Python package (to be) for converting raw HTML files to IR vectors, either Python dictionaries or NumPy arrays. Aims to transparently handle encoding and HTML issues.
Algorithm that converts an HTML document into a vectorized object suitable for neural networks.
Formasaurus uses machine learning to tell you the type of an HTML form and of its fields.
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities