Commit

Merge branch 'roshan-research:master' into master
MortezaMahdaviMortazavi authored Feb 28, 2024
2 parents c613ce0 + 09886e2 commit 8e76490
Showing 46 changed files with 3,166 additions and 16 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
@@ -13,7 +13,7 @@ jobs:

- name: Get changed files
id: changed-files
- uses: tj-actions/changed-files@v35
+ uses: tj-actions/changed-files@v41
with:
files: |
**/*.py
3 changes: 0 additions & 3 deletions .gitignore
@@ -124,9 +124,6 @@ venv.bak/
# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
45 changes: 45 additions & 0 deletions README.md
@@ -27,6 +27,26 @@
| Chunker | **93.4%** |
| Lemmatizer | **89.9%** |

| | Metric | Value |
| ------------------------------ | --------------- | ------- |
| **SpacyPOSTagger** | Precision | 0.99250 |
| | Recall | 0.99249 |
| | F1-Score | 0.99249 |
| **EZ Detection in SpacyPOSTagger** | Precision | 0.99301 |
| | Recall | 0.99297 |
| | F1-Score | 0.99298 |
| **SpacyChunker** | Accuracy | 96.53% |
| | F-Measure | 95.00% |
| | Recall | 95.17% |
| | Precision | 94.83% |
| **SpacyDependencyParser** | TOK Accuracy | 99.06 |
| | UAS | 92.30 |
| | LAS | 89.15 |
| | SENT Precision | 98.84 |
| | SENT Recall | 99.38 |
| | SENT F-Measure | 99.11 |
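
As a quick sanity check on the table above, the F-Measure rows for SpacyChunker and for SENT in SpacyDependencyParser can be recomputed from the listed precision and recall. This is a minimal sketch that assumes the reported F-Measure is the balanced F1 score (the harmonic mean of precision and recall), which the table does not state; the function name `f_measure` is only illustrative. The SpacyPOSTagger rows are left out because those scores may be averaged per class, in which case the harmonic mean of the listed values need not reproduce the reported F1.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Percentages copied from the table above.
chunker_f = f_measure(precision=94.83, recall=95.17)
sent_f = f_measure(precision=98.84, recall=99.38)

print(f"{chunker_f:.2f}")  # 95.00
print(f"{sent_f:.2f}")     # 99.11
```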


## Introduction

[**Hazm**](https://www.roshan-ai.ir/hazm/) is a Python library to perform natural language processing tasks on Persian text. It offers various features for analyzing, processing, and understanding Persian text. You can use Hazm to normalize text, tokenize sentences and words, lemmatize words, assign part-of-speech tags, identify dependency relations, create word and sentence embeddings, or read popular Persian corpora.
@@ -62,6 +82,11 @@ Finally if you want to use our pretrained models, you can download it from the l
| [**Download POSTagger**](https://drive.google.com/file/d/1Q3JK4NVUC2t5QT63aDiVrCRBV225E_B3) | ~ 18 MB |
| [**Download DependencyParser**](https://drive.google.com/file/d/1MDapMSUXYfmQlu0etOAkgP5KDiWrNAV6/view?usp=share_link) | ~ 15 MB |
| [**Download Chunker**](https://drive.google.com/file/d/16hlAb_h7xdlxF4Ukhqk_fOV3g7rItVtk) | ~ 4 MB |
| [**Download spacy_pos_tagger_parsbertpostagger**](https://huggingface.co/roshan-research/spacy_pos_tagger_parsbertpostagger) | ~ 630 MB |
| [**Download spacy_pos_tagger_parsbertpostagger95**](https://huggingface.co/roshan-research/spacy_pos_tagger_parsbertpostagger95)| ~ 630 MB |
| [**Download spacy_chunker_uncased_bert**](https://huggingface.co/roshan-research/spacy_chunker_uncased_bert) | ~ 650 MB |
| [**Download spacy_chunker_parsbert**](https://huggingface.co/roshan-research/spacy_chunker_parsbert) | ~ 630 MB |
| [**Download spacy_dependency_parser**](https://huggingface.co/roshan-research/spacy_dependency_parser) | ~ 630 MB |
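
The spaCy-based models in the table above are hosted as Hugging Face repositories, so one way to fetch them from a script is the `huggingface_hub` package. The following is a rough sketch and not part of Hazm: the helper names `repo_id` and `download_model` are hypothetical, and it is an assumption that the downloaded directory can be passed directly as `model_path` to the corresponding Hazm class.

```python
HF_ORG = "roshan-research"

# Repository names taken from the download table above.
SPACY_MODELS = (
    "spacy_pos_tagger_parsbertpostagger",
    "spacy_pos_tagger_parsbertpostagger95",
    "spacy_chunker_uncased_bert",
    "spacy_chunker_parsbert",
    "spacy_dependency_parser",
)


def repo_id(model_name: str) -> str:
    """Build the Hugging Face repository id for one of the listed models."""
    if model_name not in SPACY_MODELS:
        raise ValueError(f"unknown model: {model_name!r}")
    return f"{HF_ORG}/{model_name}"


def download_model(model_name: str, target_dir: str) -> str:
    """Download the whole repository into target_dir and return its path."""
    # Imported lazily so repo_id() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id(model_name), local_dir=target_dir)
```

For example, `download_model("spacy_chunker_parsbert", "models/spacy_chunker_parsbert")` would fetch roughly 630 MB, so it is worth calling only once and reusing the local copy.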

## Usage

@@ -88,11 +113,28 @@ Finally if you want to use our pretrained models, you can download it from the l
>>> tagger.tag(word_tokenize('ما بسیار کتاب می‌خوانیم'))
[('ما', 'PRO'), ('بسیار', 'ADV'), ('کتاب', 'N'), ('می‌خوانیم', 'V')]

>>> spacy_posTagger = SpacyPOSTagger(model_path = 'MODELPATH')
>>> spacy_posTagger.tag(tokens = ['من', 'به', 'مدرسه', 'ایران', 'رفته_بودم', '.'])
[('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN,EZ'), ('ایران', 'NOUN'), ('رفته_بودم', 'VERB'), ('.', 'PUNCT')]

>>> posTagger = POSTagger(model = 'pos_tagger.model', universal_tag = False)
>>> posTagger.tag(tokens = ['من', 'به', 'مدرسه', 'ایران', 'رفته_بودم', '.'])
[('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN'), ('ایران', 'NOUN'), ('رفته_بودم', 'VERB'), ('.', 'PUNCT')]

>>> chunker = Chunker(model='chunker.model')
>>> tagged = tagger.tag(word_tokenize('کتاب خواندن را دوست داریم'))
>>> tree2brackets(chunker.parse(tagged))
'[کتاب خواندن NP] [را POSTP] [دوست داریم VP]'

>>> spacy_chunker = SpacyChunker(model_path = 'model_path')
>>> tree = spacy_chunker.parse(sentence = [('نامه', 'NOUN,EZ'), ('ایشان', 'PRON'), ('را', 'ADP'), ('دریافت', 'NOUN'), ('داشتم', 'VERB'), ('.', 'PUNCT')])
>>> print(tree)
(S
(NP نامه/NOUN,EZ ایشان/PRON)
(POSTP را/ADP)
(VP دریافت/NOUN داشتم/VERB)
./PUNCT)

>>> word_embedding = WordEmbedding(model_type = 'fasttext', model_path = 'word2vec.bin')
>>> word_embedding.doesnt_match(['سلام' ,'درود' ,'خداحافظ' ,'پنجره'])
'پنجره'
@@ -103,6 +145,9 @@ Finally if you want to use our pretrained models, you can download it from the l
>>> parser.parse(word_tokenize('زنگ‌ها برای که به صدا درمی‌آید؟'))
<DependencyGraph with 8 nodes>

>>> spacy_parser = SpacyDependencyParser(tagger=tagger, lemmatizer=lemmatizer)
>>> spacy_parser.parse_sents([word_tokenize('زنگ‌ها برای که به صدا درمی‌آید؟')])

```
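
In the usage examples above, `SpacyPOSTagger` returns `'NOUN,EZ'` for a word that carries an ezafe, while `POSTagger` returns plain `'NOUN'` for the same token. Downstream code that only wants the part-of-speech tag may need to separate the two. The sketch below assumes the combined tag always has the form `POS,EZ` as in that example output; `split_ezafe` is a hypothetical helper, not a Hazm function.

```python
from typing import List, Tuple


def split_ezafe(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str, bool]]:
    """Turn (word, 'POS' or 'POS,EZ') pairs into (word, POS, has_ezafe) triples."""
    result = []
    for word, tag in tagged:
        parts = tag.split(",")
        # The first part is the POS tag; any later part may be the EZ marker.
        result.append((word, parts[0], "EZ" in parts[1:]))
    return result


# Output of spacy_posTagger.tag(...) copied from the example above.
tagged = [('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN,EZ'),
          ('ایران', 'NOUN'), ('رفته_بودم', 'VERB'), ('.', 'PUNCT')]

ezafe_words = [word for word, _, has_ezafe in split_ezafe(tagged) if has_ezafe]
print(ezafe_words)  # ['مدرسه']
```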

## Documentation
5 changes: 5 additions & 0 deletions docs/css/bootstrap.min.css

Large diffs are not rendered by default.
