Add Wikipedia crawler ? (300+ languages) #78

hugolpz · 2021-02-25T15:36:39Z

A quick search shows you that CorpusCrawler does not crawl or use Wikipedia. I don't know Python but it seems feasible, either from scratch on Wikipedia API (1) or using existing server-side tools (2).

Assess interest

Assess how many Wikipedia languages are not in UNILEX. See Comparing languages of LinguaLibre vs UNILEX unicode-org/unilex#14 .
Assess quality of wikipedia raw text data in minority languages.
Compare gain to other available public corpora such Tatoeba (358 languages).

Crawling via API

By using and loading available list of articles per wikipedia, then scrap the sites. If too large, could be limited to max=n articles.

Given an iso code such as Ndonga's ng :

download List of page titles in main namespace archive (see below)
get the articles into a python list variable (python)
code a crawler in /Lib/corpuscrawler/util.py, following other crawler as examples 1, which query Wikipedia API, extract the valuable text, save the text. (python)
Update relevant crawlers /Lib/corpuscrawler/

Wikipedia API provides text

Various formats available:

format : The format of the output.
- jsont : Output data in JSON format.
- jsonfmt : Output data in JSON format (pretty-print in HTML).
- nonet : Output nothing.
- phpt : Output data in serialised PHP format.
- phpfmt : Output data in serialised PHP format (pretty-print in HTML).
- rawfmt : Output data, including debugging elements, in JSON format (pretty-print in HTML).
- xmlt : Output data in XML format.
- xmlfmt : Output data in XML format (pretty-print in HTML).

List of Wikipedia (~300)

List_of_Wikipedias
List of dumps - Wikipedia and others wiki projects.

List of articles per Wikipedia

For convenience, I use the tiny Ndonga (ng) Wikipedia (8 articles), easier to explore by hand.

For larger demo, you could also inspect similar URLs with the iso of :

Language	Native	iso	Articles
Ndonga	Oshiwambo	ng	8
Inuktitut	ᐃᓄᒃᑎᑐᑦ/inuktitut	iu	514
Samoan	Gagana Samoa	sm	985
Igbo	Igbo	ig	2,085
Central Bikol	Bikol Central	bcl	10,824

Namespaces

On all wikis. See also here

0: (main)
1: Talk:
2: User:
3: User_talk:

Dumps' & paths

List of dumps
- /ngwiki/20200220 - manual (change the date)
- /ngwiki/latest - directory
  - /ngwiki-latest-all-titles.gz
  - /ngwiki-latest-all-titles-in-ns0.gz - articles only

Using Wikipedia extractors ?

Hybrid approach

ISO: get the list of all local wiki's iso codes.
Downloads: loop over each language code, download the dump.
Extract: use extractor above, zip each language
Cloud: put text result online.
Crawl: in util.py, code a simple crawler which get just that .zip, convert back to txt content, add to the corpora.

cc: @brawer

The text was updated successfully, but these errors were encountered:

hugolpz · 2021-03-08T07:26:36Z

Discussion engaged with the Wikimedia Foundation's Dumb-Generation project.
See : phabricator.wikimedia.org > T276723

hugolpz · 2021-03-27T17:14:33Z

Python processing :

Wikicompiler is a fully extensible python library that compile and evaluate text from Wikipedia dump. You can extract text, do text analysis or even evaluate the AST(Abstract Syntax Tree) yourself.
Topics: python, compiler, mediawiki, wikipedia, wikitext, wikipedia-dump, wikitext-parser.

One presentation is in italian but has some interesting nugets: here. The gist:

GTOqaz · 2022-10-31T16:10:50Z

@hugolpz That's impressive.

hugolpz · 2022-11-01T21:53:16Z

@GTOqaz there are some upcoming Google crawling on 2000 languages, I hope they will make some data available, especially frequency lists.

hugolpz · 2024-02-17T19:47:27Z

There are ready-to-download open licence Wikipedia corpora available.

Project introduction	Type	Languages (2024)	Portal all	Language specific	Download link	Comments
OpenSubtitles 2016/2018	Subtitles Parallel sentences Monolingual sentences	75	Portal	`br&en`	bre (mono)	'''Source:''' * P. Lison and J. Tiedemann (2016), ''"OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles"'', http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf . '''Licence:''' unclear, "The corpora is made freely available to the research community on the OPUS website" − Lison and Tiedemann (2016).
Wortschatz by Leipzig	Sentences Monolingual	290+	-	bre	bre 100k sentences (2021)	List of sentences corpora : API reference > https://api.wortschatz-leipzig.de/ws/corpora
CC-100	Sentences Monolingual	115	Portal	n.a.	br (mono)	« No claims of intellectual property are made on the work of preparation of the corpus. »

hugolpz changed the title ~~Add wikipedia as a source ?~~ Add Wikipedia crawler ? Feb 25, 2021

hugolpz mentioned this issue Feb 25, 2021

harfbuzz-testing-wikipedia #8

Open

hugolpz changed the title ~~Add Wikipedia crawler ?~~ Add Wikipedia crawler ? (300+ languages) Feb 25, 2021

hugolpz mentioned this issue Mar 28, 2021

Add examples of output iwasingh/Wikicompiler#2

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Wikipedia crawler ? (300+ languages) #78

Add Wikipedia crawler ? (300+ languages) #78

hugolpz commented Feb 25, 2021 •

edited

Loading

hugolpz commented Mar 8, 2021

hugolpz commented Mar 27, 2021 •

edited

Loading

GTOqaz commented Oct 31, 2022

hugolpz commented Nov 1, 2022

hugolpz commented Feb 17, 2024

Add Wikipedia crawler ? (300+ languages) #78

Add Wikipedia crawler ? (300+ languages) #78

Comments

hugolpz commented Feb 25, 2021 • edited Loading

Assess interest

Crawling via API

Wikipedia API provides text

List of Wikipedia (~300)

List of articles per Wikipedia

Namespaces

Dumps' & paths

Using Wikipedia extractors ?

Hybrid approach

hugolpz commented Mar 8, 2021

hugolpz commented Mar 27, 2021 • edited Loading

GTOqaz commented Oct 31, 2022

hugolpz commented Nov 1, 2022

hugolpz commented Feb 17, 2024

hugolpz commented Feb 25, 2021 •

edited

Loading

hugolpz commented Mar 27, 2021 •

edited

Loading