
Bump #9


Open
wants to merge 40 commits into base: master
Commits (40)
2e2e755
Add CTC Decoder for Wave2Vec models (#693)
Narsil May 20, 2021
4b0dc6b
Fix SPM conversions (#686)
LysandreJik May 20, 2021
3cf957e
Bump handlebars from 4.7.6 to 4.7.7 in /bindings/node (#700)
dependabot[bot] May 20, 2021
7574349
Bump y18n from 4.0.0 to 4.0.3 in /bindings/node (#708)
dependabot[bot] May 20, 2021
8f639b4
Bump hosted-git-info from 2.8.8 to 2.8.9 in /bindings/node (#702)
dependabot[bot] May 20, 2021
bd19584
Bump lodash from 4.17.19 to 4.17.21 in /bindings/node (#701)
dependabot[bot] May 20, 2021
4b7f8c2
Fix CHANGELOG.md
n1t0 May 24, 2021
c046da7
Fix stripping strings containing Unicode characters (#707)
Narsil May 24, 2021
3a002c1
Python - prepare for release 0.10.3
n1t0 May 24, 2021
755e5f5
Remove support for Python 3.5 (#714)
n1t0 May 24, 2021
d83772d
Fixing tokenizers with 1.53 (updated some dependencies + clippy) (#764)
Narsil Jul 21, 2021
256a71c
Clippy 1.54. (#773)
Narsil Aug 11, 2021
96c122c
Bump ws from 7.3.1 to 7.4.6 in /bindings/node (#721)
dependabot[bot] Aug 12, 2021
5d1b0a9
Bump glob-parent from 5.1.1 to 5.1.2 in /bindings/node (#734)
dependabot[bot] Aug 12, 2021
ab3d3bc
Bump tar from 4.4.13 to 4.4.17 in /bindings/node (#775)
dependabot[bot] Aug 12, 2021
46bed54
Bump path-parse from 1.0.6 to 1.0.7 in /bindings/node (#774)
dependabot[bot] Aug 12, 2021
da4c7b1
Add a way to specify the unknown token in `SentencePieceUnigramTokeni…
SaulLu Aug 12, 2021
6616e69
Expand documentation of UnigramTrainer (#770)
sgugger Aug 12, 2021
71fb73e
update lexical-core because 0.7.4 doesn't compile (#758)
KoichiYasuoka Aug 12, 2021
c1100dc
Fix typo in documentation (#743)
kingyiusuen Aug 13, 2021
e2bf8da
Add SplitDelimiterBehavior to Punctuation constructor (#657)
vladdy Aug 13, 2021
5982498
Switch git dependencies in Cargo.toml back to regular versions (#728)
geofft Aug 13, 2021
e7dd643
Fix word level tokenizer determinism (#718)
lucacampanella Aug 13, 2021
e71e5be
Rust - Add from_pretrained on Tokenizer
n1t0 Aug 19, 2021
e44fdee
Python - Add bindings to Tokenizer.from_pretrained
n1t0 Aug 19, 2021
6f9e867
Better export for FromPretrainedParameters
n1t0 Aug 19, 2021
528c9a5
Node - Add bindings to Tokenizer.from_pretrained
n1t0 Aug 19, 2021
a4d0f3d
Update docs for from_pretrained
n1t0 Aug 19, 2021
ad7090a
Improve READMEs for from_pretrained
n1t0 Aug 19, 2021
35c96e5
Add tests for from_pretrained
n1t0 Aug 24, 2021
c65b72d
Rust - Prepare for release 0.11.0 (#789)
n1t0 Aug 31, 2021
e68aecc
Python - Update Cargo.lock
n1t0 Sep 2, 2021
23cf8c6
Bump tar from 4.4.17 to 4.4.19 in /bindings/node (#792)
dependabot[bot] Sep 2, 2021
b8b584d
Python - Pretty json saving defaults to true (#793)
n1t0 Sep 2, 2021
884bfb7
Prepare node release (#794)
n1t0 Sep 2, 2021
36204c8
Exclude node 15.x for windows
n1t0 Sep 2, 2021
fd316bd
Update esaxx-rs to 0.1.7 to fix building on windows
n1t0 Sep 2, 2021
2143a24
resolve merge conflicts
tscholak Sep 3, 2021
bd75835
bump
tscholak Sep 3, 2021
e061fe1
try to bump cayon-cond
tscholak Sep 3, 2021
901 changes: 886 additions & 15 deletions Cargo.lock

Large diffs are not rendered by default.

21 changes: 21 additions & 0 deletions bindings/node/CHANGELOG.md
@@ -1,3 +1,24 @@
# [0.8.0](https://github.com/huggingface/tokenizers/compare/node-v0.7.0...node-v0.8.0) (2021-09-02)

### BREAKING CHANGES
- Many improvements on the Trainer ([#519](https://github.com/huggingface/tokenizers/pull/519)).
The files must now be provided first when calling `tokenizer.train(files, trainer)`.

### Features
- Adding the `TemplateProcessing`
- Add `WordLevel` and `Unigram` models ([#490](https://github.com/huggingface/tokenizers/pull/490))
- Add `nmtNormalizer` and `precompiledNormalizer` normalizers ([#490](https://github.com/huggingface/tokenizers/pull/490))
- Add `templateProcessing` post-processor ([#490](https://github.com/huggingface/tokenizers/pull/490))
- Add `digitsPreTokenizer` pre-tokenizer ([#490](https://github.com/huggingface/tokenizers/pull/490))
- Add support for mapping to sequences ([#506](https://github.com/huggingface/tokenizers/pull/506))
- Add `splitPreTokenizer` pre-tokenizer ([#542](https://github.com/huggingface/tokenizers/pull/542))
- Add `behavior` option to the `punctuationPreTokenizer` ([#657](https://github.com/huggingface/tokenizers/pull/657))
- Add the ability to load tokenizers from the Hugging Face Hub using `fromPretrained` ([#780](https://github.com/huggingface/tokenizers/pull/780))

### Fixes
- Fix a bug where long tokenizer.json files would be incorrectly deserialized ([#459](https://github.com/huggingface/tokenizers/pull/459))
- Fix RobertaProcessing deserialization in PostProcessorWrapper ([#464](https://github.com/huggingface/tokenizers/pull/464))

# [0.7.0](https://github.com/huggingface/tokenizers/compare/node-v0.6.2...node-v0.7.0) (2020-07-01)

### BREAKING CHANGES
2 changes: 1 addition & 1 deletion bindings/node/Makefile
@@ -27,7 +27,7 @@ $(DATA_DIR)/small.txt : $(DATA_DIR)/big.txt

$(DATA_DIR)/roberta.json :
$(dir_guard)
-wget https://storage.googleapis.com/tokenizers/roberta.json -O $@
+wget https://huggingface.co/roberta-large/raw/main/tokenizer.json -O $@

$(DATA_DIR)/tokenizer-wiki.json :
$(dir_guard)
13 changes: 13 additions & 0 deletions bindings/node/lib/bindings/decoders.d.ts
@@ -36,3 +36,16 @@ export function metaspaceDecoder(replacement?: string, addPrefixSpace?: boolean)
* This suffix will be replaced by whitespaces during the decoding
*/
export function bpeDecoder(suffix?: string): Decoder;

/**
* Instantiate a new CTC Decoder
* @param [pad_token='<pad>'] The pad token used by CTC to delimit a new token.
* @param [word_delimiter_token='|'] The word delimiter token. It will be replaced by a space.
* @param [cleanup=true] Whether to clean up some tokenization artifacts,
* mainly spaces before punctuation and some abbreviated English forms.
*/
export function ctcDecoder(
pad_token?: string,
word_delimiter_token?: string,
cleanup?: boolean
): Decoder;
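
For orientation, here is an editor's minimal sketch of the collapse rule a CTC decoder applies (merge consecutive repeats, drop the pad token, map the word delimiter token to a space); `ctcCollapse` is a hypothetical helper, not the native implementation, and the `cleanup` flag is not modeled:

```ts
// Illustrative sketch of CTC decoding (not the binding's native code).
// Steps: 1) merge consecutive duplicate tokens, 2) drop the pad token,
// 3) replace the word delimiter token with a space.
function ctcCollapse(
  tokens: string[],
  padToken = "<pad>",
  wordDelimiterToken = "|"
): string {
  const merged: string[] = [];
  for (const token of tokens) {
    if (merged[merged.length - 1] !== token) merged.push(token); // merge repeats
  }
  return merged
    .filter((token) => token !== padToken) // drop pad tokens
    .map((token) => (token === wordDelimiterToken ? " " : token)) // delimiter -> space
    .join("");
}

// Mirrors the test below:
// ctcCollapse(["<pad>", "h", "h", "e", "e", "l", "l", "<pad>", "l", "l", "o"]) === "hello"
```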
1 change: 1 addition & 0 deletions bindings/node/lib/bindings/decoders.js
@@ -5,4 +5,5 @@ module.exports = {
wordPieceDecoder: native.decoders_WordPiece,
metaspaceDecoder: native.decoders_Metaspace,
bpeDecoder: native.decoders_BPEDecoder,
ctcDecoder: native.decoders_CTC,
};
13 changes: 12 additions & 1 deletion bindings/node/lib/bindings/decoders.test.ts
@@ -1,4 +1,4 @@
-import { bpeDecoder, metaspaceDecoder, wordPieceDecoder } from "./decoders";
+import { bpeDecoder, ctcDecoder, metaspaceDecoder, wordPieceDecoder } from "./decoders";

describe("wordPieceDecoder", () => {
it("accepts `undefined` as first parameter", () => {
@@ -31,3 +31,14 @@ describe("bpeDecoder", () => {
expect(bpeDecoder(undefined)).toBeDefined();
});
});

describe("ctcDecoder", () => {
it("accepts `undefined` as parameter", () => {
expect(ctcDecoder(undefined)).toBeDefined();
});
it("encodes correctly", () => {
expect(
ctcDecoder().decode(["<pad>", "h", "h", "e", "e", "l", "l", "<pad>", "l", "l", "o"])
).toEqual("hello");
});
});
10 changes: 7 additions & 3 deletions bindings/node/lib/bindings/pre-tokenizers.d.ts
@@ -90,10 +90,14 @@ export function charDelimiterSplitPreTokenizer(delimiter: string): PreTokenizer;

/**
* Returns a new Punctuation PreTokenizer.
- * This pre-tokenizer splits tokens on punctuation.
- * Each occurrence of a punctuation character will be treated separately.
+ * This pre-tokenizer splits tokens on punctuation according to the provided behavior.
+ * Each occurrence of a punctuation character is treated separately.
+ *
+ * @param [behavior="isolated"] The behavior to use when splitting.
+ *   Choices: "removed", "isolated", "mergedWithPrevious", "mergedWithNext",
+ *   "contiguous"
 */
-export function punctuationPreTokenizer(): PreTokenizer;
+export function punctuationPreTokenizer(behavior?: string): PreTokenizer;

/**
* Returns a new Sequence PreTokenizer.
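The `behavior` values map to the Rust `SplitDelimiterBehavior` enum. As a hedged illustration, the comments below follow that enum's documented example ("the-final--countdown" split on "-") rather than output captured from these bindings:

```ts
import { punctuationPreTokenizer } from "./pre-tokenizers"; // relative import, as in the tests

// Each behavior decides what happens to the matched punctuation itself:
const preTokenizer = punctuationPreTokenizer("mergedWithPrevious");

// behavior              "the-final--countdown" split on "-"
// "removed"             ["the", "final", "countdown"]
// "isolated"            ["the", "-", "final", "-", "-", "countdown"]
// "mergedWithPrevious"  ["the-", "final-", "-", "countdown"]
// "mergedWithNext"      ["the", "-final", "-", "-countdown"]
// "contiguous"          ["the", "-", "final", "--", "countdown"]
```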
5 changes: 5 additions & 0 deletions bindings/node/lib/bindings/pre-tokenizers.test.ts
@@ -43,6 +43,11 @@ describe("punctuationPreTokenizer", () => {
const processor = punctuationPreTokenizer();
expect(processor.constructor.name).toEqual("PreTokenizer");
});

it("instantiates correctly with non-default split delimeter", () => {
const processor = punctuationPreTokenizer("removed");
expect(processor.constructor.name).toEqual("PreTokenizer");
});
});

describe("splitPreTokenizer", () => {
22 changes: 22 additions & 0 deletions bindings/node/lib/bindings/tokenizer.d.ts
@@ -7,6 +7,19 @@ import { PreTokenizer } from "./pre-tokenizers";
import { RawEncoding } from "./raw-encoding";
import { Trainer } from "./trainers";

export interface FromPretrainedOptions {
/**
* The revision to download
* @default "main"
*/
revision?: string;
/**
* The auth token to use to access private repositories on the Hugging Face Hub
* @default undefined
*/
authToken?: string;
}

export interface TruncationOptions {
/**
* The length of the previous sequence to be included in the overflowing sequence
@@ -123,6 +136,15 @@
*/
static fromString(s: string): Tokenizer;

/**
* Instantiate a new Tokenizer from an existing file on the
* Hugging Face Hub. Any model repo containing a `tokenizer.json`
* can be used here.
* @param identifier A model identifier on the Hub
* @param options Additional options
*/
static fromPretrained(identifier: string, options?: FromPretrainedOptions): Tokenizer;

/**
* Add the given tokens to the vocabulary
*
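A short usage sketch for the new entry point, assuming the package's top-level `tokenizers` export re-exports `Tokenizer`; the private repo name and token value are placeholders:

```ts
import { Tokenizer } from "tokenizers";

// Any Hub repo containing a `tokenizer.json` works; `revision` defaults to "main".
const tokenizer = Tokenizer.fromPretrained("bert-base-cased");

// A private repo would additionally pass the options declared above
// ("my-org/my-private-model" and "hf_xxx" are placeholders):
const privateTokenizer = Tokenizer.fromPretrained("my-org/my-private-model", {
  revision: "main",
  authToken: "hf_xxx",
});
```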
1 change: 1 addition & 0 deletions bindings/node/lib/bindings/tokenizer.js
@@ -3,6 +3,7 @@ const native = require("./native");
class Tokenizer extends native.tokenizer_Tokenizer {
static fromString = native.tokenizer_Tokenizer_from_string;
static fromFile = native.tokenizer_Tokenizer_from_file;
static fromPretrained = native.tokenizer_Tokenizer_from_pretrained;
}

module.exports = {
28 changes: 28 additions & 0 deletions bindings/node/lib/bindings/tokenizer.test.ts
@@ -64,6 +64,7 @@ describe("Tokenizer", () => {

expect(typeof Tokenizer.fromFile).toBe("function");
expect(typeof Tokenizer.fromString).toBe("function");
expect(typeof Tokenizer.fromPretrained).toBe("function");

expect(typeof tokenizer.addSpecialTokens).toBe("function");
expect(typeof tokenizer.addTokens).toBe("function");
@@ -94,6 +95,33 @@
expect(typeof tokenizer.train).toBe("function");
});

it("can be instantiated from the hub", async () => {
let tokenizer: Tokenizer;
let encode: (
sequence: InputSequence,
pair?: InputSequence | null,
options?: EncodeOptions | null
) => Promise<RawEncoding>;
let output: RawEncoding;

tokenizer = Tokenizer.fromPretrained("bert-base-cased");
encode = promisify(tokenizer.encode.bind(tokenizer));
output = await encode("Hey there dear friend!", null, { addSpecialTokens: false });
expect(output.getTokens()).toEqual(["Hey", "there", "dear", "friend", "!"]);

tokenizer = Tokenizer.fromPretrained("anthony/tokenizers-test");
encode = promisify(tokenizer.encode.bind(tokenizer));
output = await encode("Hey there dear friend!", null, { addSpecialTokens: false });
expect(output.getTokens()).toEqual(["hey", "there", "dear", "friend", "!"]);

tokenizer = Tokenizer.fromPretrained("anthony/tokenizers-test", {
revision: "gpt-2",
});
encode = promisify(tokenizer.encode.bind(tokenizer));
output = await encode("Hey there dear friend!", null, { addSpecialTokens: false });
expect(output.getTokens()).toEqual(["Hey", "Ġthere", "Ġdear", "Ġfriend", "!"]);
});

describe("addTokens", () => {
it("accepts a list of string as new tokens when initial model is empty", () => {
const model = BPE.empty();
@@ -62,9 +62,7 @@ type SentencePieceBPETokenizerConfig = SentencePieceBPETokenizerOptions &
/**
* Represents the BPE algorithm, with the pretokenization used by SentencePiece
*/
-export class SentencePieceBPETokenizer extends BaseTokenizer<
-  SentencePieceBPETokenizerConfig
-> {
+export class SentencePieceBPETokenizer extends BaseTokenizer<SentencePieceBPETokenizerConfig> {
private static readonly defaultOptions: SentencePieceBPETokenizerConfig = {
addPrefixSpace: true,
replacement: "▁",