move all corpora and resources to tests/files/
sir-kokabi committed Jul 7, 2023
1 parent a68c5ea commit 1a0262e
Showing 59 changed files with 221 additions and 2,583 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/test.yml
@@ -32,9 +32,9 @@ jobs:
          poetry lock
          poetry install --with dev
-      - name: Download resources
+      - name: Download test files
        run: |
-          git clone https://github.com/sir-kokabi/resources.git resources
+          git clone https://github.com/sir-kokabi/resources.git tests/files/
      - name: Run tests
        run: poetry run poe test
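
With the test files now cloned into tests/files/ by CI, tests can resolve corpora and models relative to the test directory rather than the project root. A minimal sketch of such a lookup helper, assuming a pytest-style layout — the helper name, file layout, and error message are illustrative, not code from this repository:

# hypothetical tests/conftest.py helper — resolves files cloned into tests/files/ by the CI step above
from pathlib import Path

FILES_DIR = Path(__file__).parent / "files"

def resource_path(name: str) -> str:
    """Return the path of a corpus or model under tests/files/, failing loudly if it is missing."""
    path = FILES_DIR / name
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found; clone https://github.com/sir-kokabi/resources.git into tests/files/ first."
        )
    return str(path)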
6 changes: 5 additions & 1 deletion .gitignore
@@ -159,4 +159,8 @@ sample.py

# ipynb files

-ss.ipynb
+ss.ipynb
+
+.ruff_cache
+
+tests/files/
25 changes: 11 additions & 14 deletions README.md
@@ -42,19 +42,16 @@ To install the latest version of Hazm, run the following command in your terminal:

Alternatively, you can install the latest update from GitHub (this version may be unstable and buggy):

    pip install git+https://github.com/roshan-research/hazm.git

-then **download [resources.zip (~4 MB)](https://github.com/sir-kokabi/resources/releases/download/0.9.0/resources.zip)** and extract it to a to a folder named `resources` in the root of your project.
-
-Finally if you do not want to train and use your own model, you can download our pre-trained models:
-
-| **Module name** | **Size** |
-|:------------------------ |:-------- |
-| [**Download WordEmbedding**](https://mega.nz/file/GqZUlbpS#XRYP5FHbPK2LnLZ8IExrhrw3ZQ-jclNSVCz59uEhrxY) | ~ 5 GB |
-| [**Download SentEmbedding**](https://mega.nz/file/WzR0QChY#J1nG-HGq0UJP69VMY8I1YGl_MfEAFCo5iizpjofA4OY) | ~ 1 GB |
-| [**Download DependencyParser**](https://drive.google.com/file/d/1Ww3xsZC5BXY5eN8-2TWo40G-WvppkXYD/view?usp=drive_link) | ~ 60 MB |
-| [**Download POSTagger**](https://drive.google.com/file/d/1Q3JK4NVUC2t5QT63aDiVrCRBV225E_B3) | ~ 18 MB |
-| [**Download Chunker**](https://drive.google.com/file/d/16hlAb_h7xdlxF4Ukhqk_fOV3g7rItVtk) | ~ 4 MB |
+Finally if you do not want to train and use your own model, you can download our pre-trained models:
+
+| **Module name** | **Size** |
+| :--------------------------------------------------------------------------------------------------------------------- | :------- |
+| [**Download WordEmbedding**](https://mega.nz/file/GqZUlbpS#XRYP5FHbPK2LnLZ8IExrhrw3ZQ-jclNSVCz59uEhrxY) | ~ 5 GB |
+| [**Download SentEmbedding**](https://mega.nz/file/WzR0QChY#J1nG-HGq0UJP69VMY8I1YGl_MfEAFCo5iizpjofA4OY) | ~ 1 GB |
+| [**Download DependencyParser**](https://drive.google.com/file/d/1Ww3xsZC5BXY5eN8-2TWo40G-WvppkXYD/view?usp=drive_link) | ~ 60 MB |
+| [**Download POSTagger**](https://drive.google.com/file/d/1Q3JK4NVUC2t5QT63aDiVrCRBV225E_B3) | ~ 18 MB |
+| [**Download Chunker**](https://drive.google.com/file/d/16hlAb_h7xdlxF4Ukhqk_fOV3g7rItVtk) | ~ 4 MB |

## Usage

@@ -77,16 +74,16 @@ Finally if you do not want to train and use your own model, you can download our
>>> lemmatizer.lemmatize('می‌روم')
'رفت#رو'

->>> tagger = POSTagger(model='resources/pos_tagger.model')
+>>> tagger = POSTagger(model='pos_tagger.model')
>>> tagger.tag(word_tokenize('ما بسیار کتاب می‌خوانیم'))
[('ما', 'PRO'), ('بسیار', 'ADV'), ('کتاب', 'N'), ('می‌خوانیم', 'V')]

->>> chunker = Chunker(model='resources/chunker.model')
+>>> chunker = Chunker(model='chunker.model')
>>> tagged = tagger.tag(word_tokenize('کتاب خواندن را دوست داریم'))
>>> tree2brackets(chunker.parse(tagged))
'[کتاب خواندن NP] [را POSTP] [دوست داریم VP]'

->>> word_embedding = WordEmbedding(model_type = 'fasttext', model_path = 'resources/word2vec.bin')
+>>> word_embedding = WordEmbedding(model_type = 'fasttext', model_path = 'word2vec.bin')
>>> word_embedding.doesnt_match(['سلام' ,'درود' ,'خداحافظ' ,'پنجره'])
'پنجره'
>>> word_embedding.doesnt_match(['ساعت' ,'پلنگ' ,'شیر'])
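
Note that the updated usage examples pass bare file names (pos_tagger.model, chunker.model, word2vec.bin) instead of resources/-prefixed paths, so the caller decides where the downloaded models live. A minimal sketch under that assumption — the download location and the pathlib resolution are illustrative, not part of the README:

# sketch: resolve the downloaded model relative to this script so the bare
# file name works regardless of the current working directory (assumed layout)
from pathlib import Path
from hazm import POSTagger, word_tokenize

model_path = Path(__file__).parent / 'pos_tagger.model'  # wherever you extracted the download
tagger = POSTagger(model=str(model_path))
print(tagger.tag(word_tokenize('ما بسیار کتاب می‌خوانیم')))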
10 changes: 0 additions & 10 deletions corpora/bijankhan.txt

This file was deleted.

13 changes: 0 additions & 13 deletions corpora/dadegan.conll

This file was deleted.

26 changes: 0 additions & 26 deletions corpora/dadegan.conllu

This file was deleted.

90 changes: 0 additions & 90 deletions corpora/degarbayan/corpus_pair.xml

This file was deleted.

26 changes: 0 additions & 26 deletions corpora/hamshahri/1996/ham2_960623.xml

This file was deleted.
