update README
howl-anderson committed Oct 18, 2024
1 parent cdafe86 commit 2744199
Showing 1 changed file with 26 additions and 0 deletions.
26 changes: 26 additions & 0 deletions README.md
@@ -192,6 +192,32 @@ Output:
['', '', '电话', '', '15555555555', '', '邮箱', '', '[email protected]', ',', '工作', '单位', '', 'Tokyo', 'University', '']
```

## How to Train and Use Your Own Models

### How to Train

MicroTokenizer also provides tools to help you train your own models.

```python
from MicroTokenizer.training.train import train

# Training data is passed as a list, so you can supply multiple corpus files
train(["./corpus.txt"], "./model_data")
```
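
The training call needs a corpus file on disk. Below is a minimal sketch of preparing one, assuming the corpus is plain UTF-8 text with one pre-segmented sentence per line and tokens separated by single spaces. That format is an assumption here, so check what your MicroTokenizer version expects. The `write_corpus` helper and the sample sentences are hypothetical and not part of MicroTokenizer.

```python
from pathlib import Path


def write_corpus(sentences, path):
    """Write pre-segmented sentences to `path`, one sentence per line.

    ASSUMPTION: tokens are separated by single spaces. Verify the corpus
    format your MicroTokenizer version expects before training on it.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Skip empty sentences so the file contains no blank lines.
    lines = [" ".join(tokens) for tokens in sentences if tokens]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


# Hypothetical sample data, for illustration only.
sample = [
    ["我", "喜欢", "自然", "语言", "处理"],
    ["今天", "天气", "很", "好"],
]
corpus_path = write_corpus(sample, "./corpus.txt")
```

The resulting `./corpus.txt` can then be passed to `train` in a list, as in the snippet above.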

### How to Use Your Own Models

```python
# import your tokenizer, XXXTokenizer is just a placeholder
from MicroTokenizer import XXXTokenizer

model_dir = "path/to/your/model"
input_text = "Your text to be tokenized"

tokenizer = XXXTokenizer.load(model_dir)
tokens = tokenizer.segment(input_text)
print("Tokens:", tokens)
```
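
Once a tokenizer is loaded, the same `segment` call can be applied to many lines of text. The sketch below assumes only that the tokenizer object exposes `segment(text)` as in the snippet above. `WhitespaceTokenizer` is a hypothetical stand-in so the example runs without a trained model, and `segment_lines` is an illustrative helper, not a MicroTokenizer function.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in that mimics the segment() method used above."""

    def segment(self, text):
        return text.split()


def segment_lines(tokenizer, lines):
    """Tokenize each non-empty line with any object that has segment(text)."""
    return [tokenizer.segment(line.strip()) for line in lines if line.strip()]


# Replace the stand-in with your own loaded tokenizer,
# e.g. XXXTokenizer.load(model_dir).
tokenizer = WhitespaceTokenizer()
results = segment_lines(tokenizer, ["hello world", "", "  foo bar baz "])
for tokens in results:
    print(tokens)
```

Because the helper depends only on the `segment` method, switching to a different trained model requires no change beyond the loading line.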

## Algorithm Explanation
