forked from PaddlePaddle/PaddleNLP
FasterTokenizer->FastTokenizer (PaddlePaddle#3719)
Showing 173 changed files with 782 additions and 755 deletions.
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,105 @@ | ||
# FastTokenizer | ||
|
||
------------------------------------------------------------------------------------------ | ||
|
||
<p align="center"> | ||
<a href="./LICENSE"><img src="https://img.shields.io/badge/license-Apache%202-dfd.svg"></a> | ||
<a href="https://github.com/PaddlePaddle/PaddleNLP/releases"><img src="https://img.shields.io/github/v/release/PaddlePaddle/PaddleNLP?color=ffa"></a> | ||
<a href=""><img src="https://img.shields.io/badge/python-3.6.2+-aff.svg"></a> | ||
<a href=""><img src="https://img.shields.io/badge/os-linux%2C%20win%2C%20mac-pink.svg"></a> | ||
<a href="https://github.com/PaddlePaddle/PaddleNLP/graphs/contributors"><img src="https://img.shields.io/github/contributors/PaddlePaddle/PaddleNLP?color=9ea"></a> | ||
<a href="https://github.com/PaddlePaddle/PaddleNLP/commits"><img src="https://img.shields.io/github/commit-activity/m/PaddlePaddle/PaddleNLP?color=3af"></a> | ||
<a href="https://pypi.org/project/paddlenlp/"><img src="https://img.shields.io/pypi/dm/paddlenlp?color=9cf"></a> | ||
<a href="https://github.com/PaddlePaddle/PaddleNLP/issues"><img src="https://img.shields.io/github/issues/PaddlePaddle/PaddleNLP?color=9cc"></a> | ||
<a href="https://github.com/PaddlePaddle/PaddleNLP/stargazers"><img src="https://img.shields.io/github/stars/PaddlePaddle/PaddleNLP?color=ccf"></a> | ||
</p> | ||
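FastTokenizer is an easy-to-use, powerful, cross-platform, high-performance text preprocessing library. It integrates several Tokenizer implementations widely used in industry and supports text preprocessing for different NLP scenarios, such as text classification, reading comprehension, and sequence labeling. Combined with the PaddleNLP Tokenizer module, it gives users efficient, general-purpose text preprocessing during training and inference. | ||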
|
||
## Features | ||
|
||
- High performance. The core is implemented in C++, so it is much faster than typical pure-Python Tokenizer implementations. On text classification tasks, FastTokenizer is up to 20x faster than the Python Tokenizer. | ||
- Cross-platform. FastTokenizer currently supports Windows x64, Linux x64, and MacOS 10.14+. | ||
- Multiple programming languages. FastTokenizer can be used from both C++ and Python. | ||
- Flexible. Users can build a Tokenizer that fits their needs by combining different FastTokenizer components. | ||
|
||
## Quick Start | ||
|
||
This section describes how to use the Python version of FastTokenizer. For the C++ version, see the [FastTokenizer C++ Demo](./fast_tokenizer/demo/README.md). | ||
|
||
### Prerequisites | ||
|
||
- Windows 64-bit | ||
- Linux x64 | ||
- MacOS 10.14+ (on M1 Macs, an x86_64 build of Anaconda must be used as the Python environment in order to install and use FastTokenizer) | ||
- Python 3.6 ~ 3.9 | ||
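The requirements above can be checked from Python before installing. The following is only a minimal sketch: the helper name `check_prerequisites` and the sets it uses are hypothetical, and it does not verify the MacOS 10.14+ minimum version.

```python
import platform
import sys

# Supported targets as listed above; the names here are hypothetical.
SUPPORTED_SYSTEMS = {"Windows", "Linux", "Darwin"}
X64_MACHINES = {"x86_64", "amd64"}


def check_prerequisites(system, machine, py_version):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if system not in SUPPORTED_SYSTEMS:
        problems.append(f"unsupported OS: {system}")
    if machine.lower() not in X64_MACHINES:
        problems.append(f"unsupported architecture: {machine} (x64 required)")
    if not ((3, 6) <= tuple(py_version[:2]) <= (3, 9)):
        problems.append(f"unsupported Python: {py_version[0]}.{py_version[1]}")
    return problems


# Check the interpreter that is running this script.
current = check_prerequisites(platform.system(), platform.machine(), sys.version_info)
print(current or "environment looks compatible")
```

A native arm64 Python on an M1 Mac is reported as an unsupported architecture here, which matches the note that an x86_64 Python environment is required on those machines.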
|
||
### Installing FastTokenizer | ||
|
||
```shell | ||
pip install fast_tokenizer | ||
``` | ||
|
||
### FastTokenizer usage example | ||
|
||
- Prepare the vocabulary | ||
|
||
```shell | ||
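# Linux and Mac users can run the following command to download the test vocabulary and save it as ernie_vocab.txt, the file name used in the example below. Windows users can download it in a browser and save it under the same name. | ||
wget https://bj.bcebos.com/paddlenlp/models/transformers/ernie/vocab.txt -O ernie_vocab.txt | ||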
``` | ||
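For platforms without `wget`, the same file can be fetched with the Python standard library. This is a sketch only: `download_vocab` is a hypothetical helper, and the destination name `ernie_vocab.txt` is chosen to match the file name read in the tokenization example.

```python
import urllib.request
from pathlib import Path

VOCAB_URL = "https://bj.bcebos.com/paddlenlp/models/transformers/ernie/vocab.txt"


def download_vocab(url=VOCAB_URL, dest="ernie_vocab.txt"):
    """Download the vocabulary once and return the local path."""
    path = Path(dest)
    # Skip the download if the destination already exists.
    if not path.exists():
        urllib.request.urlretrieve(url, str(path))
    return path


# Usage (needs network access):
# download_vocab()
```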
|
||
- Tokenization example | ||
|
||
The FastTokenizer library ships with Tokenizers commonly used in NLP tasks, such as ErnieFastTokenizer. The following shows basic usage of FastTokenizer. | ||
|
||
```python | ||
from fast_tokenizer import ErnieFastTokenizer, models | ||
# 1. Load the vocabulary | ||
vocab = models.WordPiece.read_file("ernie_vocab.txt") | ||
# 2. Instantiate an ErnieFastTokenizer object | ||
fast_tokenizer = ErnieFastTokenizer(vocab) | ||
# 3. Tokenize | ||
output = fast_tokenizer.encode("我爱中国") | ||
# 4. Print the results | ||
print("ids: ", output.ids) | ||
print("type_ids: ", output.type_ids) | ||
print("tokens: ", output.tokens) | ||
print("offsets: ", output.offsets) | ||
print("attention_mask: ", output.attention_mask) | ||
``` | ||
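The example above loads its vocabulary through `models.WordPiece`. As a rough illustration of what WordPiece matching does to a single word, here is a pure-Python sketch of greedy longest-match-first lookup. It assumes the BERT-style `##` continuation prefix and `[UNK]` token, uses a made-up toy vocabulary, and is not FastTokenizer's C++ implementation.

```python
def wordpiece(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into sub-tokens by greedy longest-match-first lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Nothing matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, invented for this illustration.
toy_vocab = {"un", "##aff", "##able", "token", "##izer"}
print(wordpiece("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece("tokenizer", toy_vocab))  # ['token', '##izer']
print(wordpiece("xyz", toy_vocab))        # ['[UNK]']
```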
|
||
### Accelerating the PaddleNLP Tokenizer module with FastTokenizer | ||
|
||
The PaddleNLP Tokenizer module can be applied directly to the text preprocessing stage of model training and inference deployment, and the appropriate Tokenizer is instantiated through `AutoTokenizer.from_pretrained`. By default, `AutoTokenizer` loads a regular Python implementation of the Tokenizer, which is slower than the C++ FastTokenizer. To improve performance, the PaddleNLP Tokenizer module now supports FastTokenizer as a backend that accelerates the tokenization stage. In the existing Tokenizer loading interface, adding the keyword argument `use_fast=True` and leaving the rest of the code unchanged is enough to load the Fast version of the Tokenizer. For example: | ||
|
||
```python | ||
from paddlenlp.transformers import AutoTokenizer | ||
|
||
# Load the Python version of the Tokenizer (default) | ||
tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh') | ||
# Enable use_fast to load the Fast version of the Tokenizer | ||
fast_tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh', use_fast=True) | ||
|
||
text1 = tokenizer('自然语言处理') | ||
text2 = fast_tokenizer('自然语言处理') | ||
|
||
print(text1) | ||
print(text2) | ||
``` | ||
|
||
PaddleNLP currently provides Fast versions of four Tokenizers: BERT, ERNIE, TinyBERT, and ERNIE-M. Tokenizers for other models do not yet have a Fast version. | ||
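To see how much the Fast version helps on a given workload, the two tokenizers can be timed on the same texts. The sketch below is a generic timing harness, not part of PaddleNLP: `time_tokenizer` and `speedup` are hypothetical helpers, and `str.split` stands in for the tokenizers so that the snippet runs without PaddleNLP installed.

```python
import time


def time_tokenizer(tokenize, texts, repeats=3):
    """Return the best wall-clock time (seconds) to tokenize all texts."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            tokenize(text)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline, candidate, texts, repeats=3):
    """Ratio of baseline time to candidate time; > 1 means candidate is faster."""
    return time_tokenizer(baseline, texts, repeats) / time_tokenizer(candidate, texts, repeats)


# Stand-in callables so the sketch runs without PaddleNLP installed.
texts = ["natural language processing"] * 1000
print(f"speedup: {speedup(str.split, str.split, texts):.2f}x")
```

With PaddleNLP installed, the `tokenizer` and `fast_tokenizer` objects from the example above can be passed as `baseline` and `candidate`, since both are called directly on a string.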
|
||
## FAQ | ||
|
||
Q: I have already set `use_fast=True` in `AutoTokenizer.from_pretrained`. Why does text preprocessing performance seem unchanged? | ||
|
||
A: There are three situations in which setting `use_fast=True` may not improve performance: | ||
1. fast_tokenizer is not installed. If `use_fast` is enabled without the fast_tokenizer library installed, PaddleNLP emits the following warning: "Can't find the fast_tokenizer package, please ensure install fast_tokenizer correctly. " | ||
|
||
2. The Tokenizer type being loaded does not yet have a Fast version. Fast versions currently exist for four Tokenizers: BERT, ERNIE, TinyBERT, and ERNIE-M. If `use_fast` is enabled for a Tokenizer without a Fast version, PaddleNLP emits the following warning: "The tokenizer XXX doesn't have the fast version. Please check the map paddlenlp.transformers.auto.tokenizer.FASTER_TOKENIZER_MAPPING_NAMES to see which fast tokenizers are currently supported." | ||
|
||
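3. The texts to tokenize are too short (for example, an average length below 5). In this case tokenization may not be the bottleneck of the whole preprocessing pipeline, so switching to FastTokenizer does not improve overall performance. | ||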
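Cases 1 and 3 above can be checked up front with the standard library. This is a minimal sketch with hypothetical helper names (`is_installed`, `average_length`). It assumes that `fast_tokenizer` is the importable module name, and it takes the threshold of 5 from the example in case 3.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def average_length(texts):
    """Average number of characters per text; 0.0 for an empty list."""
    return sum(len(t) for t in texts) / len(texts) if texts else 0.0


if is_installed("fast_tokenizer"):
    print("fast_tokenizer is available")
else:
    print("fast_tokenizer is not installed; use_fast=True is unlikely to take effect")

# Very short inputs: tokenization is probably not the bottleneck.
sample = ["ok", "yes", "no"]
if average_length(sample) < 5:
    print("texts are very short; FastTokenizer may not change overall performance")
```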
|
||
## Related documentation | ||
|
||
[FastTokenizer Compilation Guide](docs/compile/README.md) |
File renamed without changes.
6 changes: 3 additions & 3 deletions
faster_tokenizer/docs/compile/README.md → fast_tokenizer/docs/compile/README.md