Update README
adrianeboyd committed Jul 12, 2022
1 parent 0572e94 commit 8ff747e
Showing 1 changed file with 5 additions and 9 deletions.
14 changes: 5 additions & 9 deletions README.md
@@ -5,9 +5,6 @@
This project downloads, extracts and preprocesses texts from a number of
sources and trains vectors with [floret](https://github.com/explosion/floret).

-By default, the project trains floret vectors for Korean for use in `md` and
-`lg` spaCy pipelines.
-
Prerequisites:
- linux (it may largely work on osx but this is not tested or maintained)
- a large amount of hard drive space (e.g. ~100GB total for Korean, which has
@@ -43,8 +40,7 @@ language or switch to `"latest"`.

#### OSCAR 21.09

-The dataset [`oscar-corpus/OSCAR-2109`](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109)
-requires you to:
+The dataset [`oscar-corpus/OSCAR-2109`](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109) requires you to:
- create a Hugging Face Hub account
- agree to the dataset terms to access: https://huggingface.co/datasets/oscar-corpus/OSCAR-2109
- authenticate with `huggingface-cli login`
@@ -170,7 +166,7 @@ inputs have changed.
| Workflow | Steps |
| --- | --- |
| `prepare-text` | `extract-wikipedia` → `tokenize-wikipedia` → `extract-opensubtitles` → `tokenize-opensubtitles` → `extract-newscrawl` → `tokenize-newscrawl` → `tokenize-oscar` → `create-input` |
-| `train-vectors` | `compile-floret` → `train-floret-vectors-md` → `train-floret-vectors-lg` |
+| `train-vectors` | `compile-floret` → `train-floret-vectors-md` → `train-floret-vectors-lg` → `train-fasttext-vectors` |

### 🗂 Assets

@@ -181,8 +177,8 @@ in the project directory.
| File | Source | Description |
| --- | --- | --- |
| `software/floret` | Git | |
-| `/scratch/vectors/downloaded/wikipedia/kowiki-20220201-pages-articles.xml.bz2` | URL | |
-| `/scratch/vectors/downloaded/opensubtitles/ko.txt.gz` | URL | |
-| `/scratch/vectors/downloaded/newscrawl/ko/news.2020.ko.shuffled.deduped.gz` | URL | |
+| `/scratch/vectors/downloaded/wikipedia/enwiki-20220301-pages-articles.xml.bz2` | URL | |
+| `/scratch/vectors/downloaded/opensubtitles/en.txt.gz` | URL | |
+| `/scratch/vectors/downloaded/newscrawl/en/news.2020.en.shuffled.deduped.gz` | URL | |

<!-- SPACY PROJECT: AUTO-GENERATED DOCS END (do not remove) -->
