Jibril-Frej/khmer_gpt

khmer_gpt

A small tutorial covering the basics needed to start making a Language Model in Khmer

Prerequisites

conda

Download Khmer Wikipedia

wget https://dumps.wikimedia.org/kmwiki/20231220/kmwiki-20231220-pages-articles.xml.bz2
bzip2 -dk kmwiki-20231220-pages-articles.xml.bz2
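
If the bzip2 command-line tool is not available, the decompression step can also be done with Python's standard bz2 module. The sketch below is an alternative and not part of the original tutorial; the function name decompress_bz2 is hypothetical. Like the -k flag, it keeps the compressed file in place.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src: str, chunk_size: int = 1024 * 1024) -> Path:
    """Decompress `src` (a .bz2 file) next to itself and keep the original."""
    src_path = Path(src)
    if src_path.suffix != ".bz2":
        raise ValueError(f"expected a .bz2 file, got {src_path.name}")
    dst_path = src_path.with_suffix("")  # drops the trailing ".bz2"
    # Stream in chunks so the whole dump never has to fit in memory.
    with bz2.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst_path
```

Calling decompress_bz2("kmwiki-20231220-pages-articles.xml.bz2") would write kmwiki-20231220-pages-articles.xml in the same directory.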

Use wikiextractor to extract the text from the XML dump. The -b argument sets the maximum size of each output file; here we set it to 200M so that all the text ends up in a single file to analyse.

python -m wikiextractor.WikiExtractor kmwiki-20231220-pages-articles.xml --json -b 200M
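
With --json, wikiextractor writes one JSON object per line. A minimal sketch for reading that output back is shown below. It assumes each object has "title" and "text" fields and that the output lands in the default location text/AA/wiki_00; check both against your wikiextractor version. The function name iter_articles is hypothetical.

```python
import json
from typing import Iterator, Tuple


def iter_articles(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, text) pairs from a JSON-lines file, skipping empty articles."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            article = json.loads(line)
            text = article.get("text", "")
            if text.strip():
                yield article.get("title", ""), text
```

For example, sum(len(text) for _, text in iter_articles("text/AA/wiki_00")) gives a rough character count of the corpus, under the path assumption above.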
