corpus-builder

Here are 19 public repositories matching this topic...

adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Updated Dec 23, 2024
Python

google / corpuscrawler

Crawler for linguistic corpora

crawling linguistics corpus-linguistics corpus-builder minority-language

Updated Dec 5, 2023
Python

praaline / Praaline

Praaline is an open-source system to manage, annotate, visualise and analyse spoken language corpora

annotations corpus visualisation linguistics corpus-linguistics speech-processing corpus-builder corpus-tools speech-analysis spoken-language-processing

Updated Sep 21, 2022
C

carlfm01 / librivox-tools

Collector and speech cutter for LibriVox audiobooks

data-collector speech-to-text corpus-builder corpus-tools librivox

Updated Dec 8, 2022
C#

uma-pi1 / OPIEC-pipeline

dohliam / ebook-corpus

Ebook Corpus - A parser and extractor for electronic books

corpus mobi epub ebooks corpus-linguistics fb2 corpus-builder ebook-parsing

Updated Aug 6, 2019
Ruby

AndyTheFactory / article-extraction-dataset

Article title, authors, date and body extraction dataset.

text-mining news html-to-markdown scraping corpus news-aggregator text-extraction dataset web-scraping readability datasets scraping-websites html2text news-crawler corpus-builder corpus-tools article-extractor text-cleaning text-preprocessing

Updated Mar 26, 2024
HTML

thecsw / katya-dev

Katya, or The Liberated Corpus, is a text corpus that allows you to request and scrape any web resource!

corpus russian tagger corpus-linguistics corpus-generator corpus-builder text-corpus russian-literature corpus-processing corpus-analysis

Updated Mar 14, 2024
Go

FerreroJeremy / Plagiarized-Corpus-Generator

A corpus builder for evaluation of plagiarism detection tools

plagiarism corpus-generator corpus-builder

Updated Dec 12, 2016
PHP

writecrow / crow_frontend

The user interface for the Corpus & Repository of Writing, built in Angular

natural-language-processing angular corpus corpora corpus-linguistics corpus-builder

Updated Oct 16, 2024
TypeScript

tubone24 / askfm-qa-crawler

Crawl Ask.fm Q&A lists and create a corpus for ML.

crawler selenium chromedriver corpus-builder askfm

Updated Dec 15, 2023
Python

writecrow / crow_backend

The canonical resources to build the backend for a corpus/repository management framework for Crow, the Corpus and Repository of Writing

api natural-language-processing backend corpus corpus-linguistics corpus-generator corpus-builder

Updated Nov 5, 2024
PHP

sorinmarti / fruechtekorb

This is a text corpus management system for the German linguistics department of the University of Basel.

corpus linguistics corpus-linguistics corpus-builder

Updated Apr 15, 2020
PHP

c0ntradicti0n / CorpusCookApp

App and scripts that work with the corpus-builder CorpusCook to keep a corpus updated with corrections of wrong predictions

amp python3 twisted corpus-linguistics nlp-machine-learning corpus-builder kivy-application

Updated Mar 20, 2020
Python

adpaczek / chatbot

Polish-language chatbot based on the Transformer architecture, trained on movie subtitles collected by web scraping.

nlp chatbot transformer web-scraping corpus-builder polish-nlu

Updated Jun 30, 2024
Jupyter Notebook

CristinaGHolgado / vikitext

Extract text from Vikidia/Wikipedia articles [fr]

corpus readability corpus-builder wikipedia-scraper text-simplification french-nlp vikidia

Updated Jul 20, 2021
Python

binayachaudari / Corpus-Development-Software

Corpus Development Software for Machine Translation

machine-learning machine-translation corpus-builder

Updated Apr 23, 2024
JavaScript

IDS-Mannheim / Wikipedia-Corpus-Builder

Builds Wikipedia corpora in I5 (a TEI-based format)

wikipedia xml tei corpus-builder wikipedia-corpus

Updated Jun 21, 2022
Java

jhlopesalves / CorpusAid

Automated text preprocessing pipeline for large corpora. Features customizable filters for diacritics, stop words, punctuation, and regex.

python natural-language-processing regex corpus-linguistics data-cleaning corpus-builder corpus-tools corpus-processing text-preprocessing data-cleaning-automation

Updated Dec 16, 2024
Python

