Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Updated Dec 23, 2024 - Python
Crawler for linguistic corpora
Praaline is an open-source system to manage, annotate, visualise and analyse spoken language corpora
Collector and speech cutter for LibriVox audiobooks
Ebook Corpus - A parser and extractor for electronic books
Article title, authors, date and body extraction dataset.
Katya, or The Liberated Corpus: a text corpus that allows you to request and scrape any web resource!
A corpus builder for evaluation of plagiarism detection tools
The user interface for the Corpus & Repository of Writing, built in Angular
Crawl Ask.fm Q&A lists and create a corpus for ML.
The canonical resources to build the backend for a corpus/repository management framework for Crow, the Corpus and Repository of Writing
This is a text corpus management system for the German linguistics department of the University of Basel.
App and scripts that work with the corpus builder CorpusCook to keep a corpus updated with corrections to wrong predictions
Polish-language chatbot based on the Transformer architecture, trained on movie subtitles collected by web scraping.
Extract text from Vikidia/Wikipedia articles [fr]
Corpus Development Software for Machine Translation
Builds Wikipedia corpora in I5 (a TEI-based format)
Automated text preprocessing pipeline for large corpora. Features customizable filters for diacritics, stop words, punctuation, and regex.
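As a rough illustration of what a preprocessing pipeline with customizable filters for diacritics, stop words, punctuation, and regex might look like, here is a minimal sketch using only the Python standard library. It is not taken from any repository listed here; every function name (`strip_diacritics`, `strip_punctuation`, `remove_stop_words`, `regex_replace`, `preprocess`) and the sample stop-word list are hypothetical and chosen only for illustration.

```python
import re
import string
import unicodedata


def strip_diacritics(text):
    # Decompose characters (NFD) and drop combining marks, e.g. "é" -> "e".
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_punctuation(text):
    # Removes ASCII punctuation only; punctuation from other scripts is kept.
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stop_words(stop_words):
    # Build a filter that drops whitespace-separated tokens found in stop_words.
    stop = {word.lower() for word in stop_words}

    def _filter(text):
        return " ".join(tok for tok in text.split() if tok.lower() not in stop)

    return _filter


def regex_replace(pattern, replacement=""):
    # Build a filter that substitutes every match of a user-supplied pattern.
    compiled = re.compile(pattern)

    def _filter(text):
        return compiled.sub(replacement, text)

    return _filter


def preprocess(text, filters):
    # Apply the filters in order, then collapse runs of whitespace.
    for apply_filter in filters:
        text = apply_filter(text)
    return " ".join(text.split())


# Hypothetical configuration: the order matters, since URLs must be removed
# before punctuation stripping would break them apart.
filters = [
    regex_replace(r"https?://\S+"),
    strip_diacritics,
    strip_punctuation,
    remove_stop_words(["the", "a", "is"]),
]
result = preprocess("The café is at https://example.org, a nice place!", filters)
print(result)
```

Keeping each filter as a plain function from string to string makes the pipeline easy to reorder or extend. For a large corpus, the same `preprocess` call could be applied line by line while streaming files, rather than loading everything into memory.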