
New nvtext::wordpiece_tokenizer APIs #17600

Open
wants to merge 95 commits into
base: branch-25.04
Choose a base branch
from

Conversation

@davidwendt davidwendt commented Dec 16, 2024

Description

Creates a new word-piece-tokenizer which replaces the existing subword-tokenizer in nvtext.
The subword-tokenizer logic is split out and specialized to perform basic tokenizing with the word-piece logic only.
The normalizing step is already a separate API. The output will be a lists column of tokens only.

The first change is that the new API uses "wordpiece" instead of "subword" in its names. Here are the two C++ API declarations:

std::unique_ptr<wordpiece_vocabulary> load_wordpiece_vocabulary(
  cudf::strings_column_view const& input,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);

The vocabulary is loaded as a strings column and the returned object can be used on multiple calls to the next API:

std::unique_ptr<cudf::column> wordpiece_tokenize(
  cudf::strings_column_view const& input,
  wordpiece_vocabulary const& vocabulary,
  cudf::size_type max_words_per_row,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);

This returns a lists column of integers which represent the tokens for each row. The max_words_per_row parameter stops the tokenizing process for a row once that number of input words (characters delimited by spaces) has been read. This means a row may produce more than max_words_per_row tokens if a single word resolves to multiple tokens.
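To illustrate that behavior, here is a minimal pure-Python sketch of greedy longest-match word-piece tokenizing (this is not the CUDA implementation; the sample vocabulary and the BERT-style `##` continuation prefix are assumptions for illustration only):

```python
# Hypothetical sketch of greedy longest-match word-piece tokenizing.
# A word not matched by any vocabulary piece maps to the unknown token.
def wordpiece_tokenize(text, vocab, max_words_per_row, unk_id=0):
    tokens = []
    for word in text.split()[:max_words_per_row]:  # stop after max words
        start = 0
        while start < len(word):
            # greedy longest-match: try the longest remaining piece first
            for end in range(len(word), start, -1):
                piece = word[start:end] if start == 0 else "##" + word[start:end]
                if piece in vocab:
                    tokens.append(vocab[piece])
                    break
            else:
                tokens.append(unk_id)  # no piece matched: emit unknown token
                break
            start = end
    return tokens

vocab = {"[UNK]": 0, "token": 1, "##izer": 2, "test": 3}
print(wordpiece_tokenize("tokenizer test", vocab, max_words_per_row=2))
# two input words produce three tokens: [1, 2, 3]
```

Note how "tokenizer" alone yields two tokens ("token" + "##izer"), so the row's token count exceeds its word count.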

Note that this API expects the input strings to already be normalized -- that is, processed by the nvtext::normalize_characters API, which is also being reworked in #17818.

The Python interface has the following pattern:

from cudf.core.wordpiece_tokenize import WordPieceVocabulary

input_string = .... # output of the normalizer
vocab_file = os.path.join(datadir, "bert_base_cased_sampled/vocab.txt")
vc = cudf.read_text(vocab_file, delimiter="\n", strip_delimiters=True)
wpt = WordPieceVocabulary(vc)
wpr = wpt.tokenize(input_string)

The output is a lists column of tokens and no longer the tensor-data and metadata format.
If that format is needed, we can consider a third API that converts this output to it.
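For reference, such a conversion could look roughly like this pure-Python sketch (the function name and layout here are hypothetical, not an existing cudf API); it right-pads each row of tokens to a fixed sequence length and builds an attention mask, similar in spirit to the old subword-tokenizer tensor output:

```python
# Hypothetical sketch: convert lists-of-tokens output into padded
# token ids plus an attention mask (1 = real token, 0 = padding).
def to_tensor_data(token_lists, max_seq_len, pad_id=0):
    ids, mask = [], []
    for row in token_lists:
        row = row[:max_seq_len]              # truncate over-long rows
        pad = max_seq_len - len(row)
        ids.append(row + [pad_id] * pad)     # right-pad with pad_id
        mask.append([1] * len(row) + [0] * pad)
    return ids, mask

ids, mask = to_tensor_data([[1, 2, 3], [4]], max_seq_len=4)
# ids  -> [[1, 2, 3, 0], [4, 0, 0, 0]]
# mask -> [[1, 1, 1, 0], [1, 0, 0, 0]]
```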

Closes #17507

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Dec 16, 2024
@davidwendt davidwendt self-assigned this Dec 16, 2024
copy-pr-bot bot commented Dec 16, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions bot added the CMake CMake build issue label Dec 16, 2024
@github-actions github-actions bot added Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels Jan 3, 2025
@davidwendt

/ok to test

@davidwendt

/ok to test

rapids-bot bot pushed a commit that referenced this pull request Feb 12, 2025
Recent build failure in #17600 indicated undefined `std::iota`. This PR adds the appropriate `#include <numeric>` in the source files where this is called.

Authors:
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #17983
@davidwendt

/ok to test

@davidwendt davidwendt marked this pull request as ready for review February 12, 2025 16:46
@davidwendt davidwendt requested review from a team as code owners February 12, 2025 16:46
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Feb 12, 2025
@davidwendt

Adding reference here to issue #12403 as well.
This may be closed in a future PR when the current subword-tokenizer API is deprecated.

@shrshi shrshi self-requested a review February 13, 2025 22:55
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[FEA] Switch Subword Tokenizer to use text file instead of hash file