- The API surface is now considered stable (except the parts explicitly marked as experimental).
- Support for `netstandard2.0` was added.
- Native AOT is fully supported.
- FastBertTokenizer is now almost allocation-free. This makes single-threaded encoding slightly faster and yields larger gains when encoding multi-threaded.
## Breaking Changes
- **Method signature changed:** The `Encode` overload that returned `ReadOnlyMemory<long>` now returns `Memory<long>` instead. The old design made sense because the returned memory points to a buffer internal to FastBertTokenizer; onnxruntime, however, requires `Memory<T>` rather than `ReadOnlyMemory<T>`. Writing to the buffer from outside doesn't break FastBertTokenizer, so it's okay to expose the buffer as `Memory<long>` to simplify usage with onnxruntime.
```diff
- public (ReadOnlyMemory<long> InputIds, ReadOnlyMemory<long> AttentionMask, ReadOnlyMemory<long> TokenTypeIds) Encode(string input, int maximumTokens = 512, int? padTo = null)
+ public (Memory<long> InputIds, Memory<long> AttentionMask, Memory<long> TokenTypeIds) Encode(string input, int maximumTokens = 512, int? padTo = null)
```
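For context, a hedged sketch of how the `Memory<long>` return values can be wrapped by onnxruntime without copying. The model name and the tokenizer loading call are illustrative, and the `OrtValue.CreateTensorValueFromMemory` overload taking `Memory<T>` is assumed from the Microsoft.ML.OnnxRuntime API; with `ReadOnlyMemory<long>` this would not compile:

```csharp
using FastBertTokenizer;
using Microsoft.ML.OnnxRuntime;

var tokenizer = new BertTokenizer();
await tokenizer.LoadFromHuggingFaceAsync("bert-base-uncased"); // assumed loading API

var (inputIds, attentionMask, _) =
    tokenizer.Encode("Lorem ipsum dolor sit amet.", padTo: 512);

// onnxruntime wraps the Memory<long> buffers directly, no copy required.
long[] shape = { 1, inputIds.Length };
using var inputIdsOrt = OrtValue.CreateTensorValueFromMemory(
    OrtMemoryInfo.DefaultInstance, inputIds, shape);
using var attentionMaskOrt = OrtValue.CreateTensorValueFromMemory(
    OrtMemoryInfo.DefaultInstance, attentionMask, shape);
```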
- Some APIs are now marked as experimental. None were before, so you might need to add `<NoWarn>FBERTTOK001</NoWarn> <!-- Experimental FastBertTokenizer features -->` to your csproj if you use them.
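In a project file, suppressing the warning could look like this (a minimal sketch; the target framework and property group placement are up to your project):

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
    <!-- Experimental FastBertTokenizer features -->
    <NoWarn>$(NoWarn);FBERTTOK001</NoWarn>
  </PropertyGroup>
</Project>
```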
- The `Tokenize` methods that were previously marked as obsolete (they had merely been renamed to `Encode`) have been removed.
## Other
- Fixed #39: added `Decode` support for input_id sequences that don't start at a word prefix.