
New nvtext::wordpiece_tokenizer APIs #17600

Open
wants to merge 95 commits into
base: branch-25.04
Choose a base branch
from

Conversation

@davidwendt davidwendt commented Dec 16, 2024

Description

Creates a new word-piece-tokenizer which replaces the existing subword-tokenizer in nvtext.
The subword-tokenizer logic is split out and specialized to perform basic tokenizing with the word-piece logic only.
The normalizing step is already a separate API. The output will be a lists column of tokens only.

The first change is that the new API uses "wordpiece" instead of "subword" in its names. Here are the two C++ API declarations:

std::unique_ptr<wordpiece_vocabulary> load_wordpiece_vocabulary(
  cudf::strings_column_view const& input,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);

The vocabulary is loaded as a strings column and the returned object can be used on multiple calls to the next API:

std::unique_ptr<cudf::column> wordpiece_tokenize(
  cudf::strings_column_view const& input,
  wordpiece_vocabulary const& vocabulary,
  cudf::size_type max_words_per_row,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);

This returns a lists column of integers which represent the tokens for each row. The max_words_per_row parameter stops the tokenizing process for a row once that number of input words (characters delimited by spaces) has been read. This means a row may produce more than max_words_per_row tokens if a single word resolves to multiple tokens.
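To illustrate that behavior, here is a minimal pure-Python sketch of greedy longest-match word-piece tokenizing (this is not the CUDA implementation; the sample vocabulary and the BERT-style `##` continuation prefix are assumptions for illustration only):

```python
# Hypothetical sketch of greedy longest-match word-piece tokenizing.
# A word not matched by any vocabulary piece maps to the unknown token.
def wordpiece_tokenize(text, vocab, max_words_per_row, unk_id=0):
    tokens = []
    for word in text.split()[:max_words_per_row]:  # stop after max words
        start = 0
        while start < len(word):
            # greedy longest-match: try the longest remaining piece first
            for end in range(len(word), start, -1):
                piece = word[start:end] if start == 0 else "##" + word[start:end]
                if piece in vocab:
                    tokens.append(vocab[piece])
                    break
            else:
                tokens.append(unk_id)  # no piece matched: emit unknown token
                break
            start = end
    return tokens

vocab = {"[UNK]": 0, "token": 1, "##izer": 2, "test": 3}
print(wordpiece_tokenize("tokenizer test", vocab, max_words_per_row=2))
# two input words produce three tokens: [1, 2, 3]
```

Note how "tokenizer" alone yields two tokens ("token" + "##izer"), so the row's token count exceeds its word count.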

Note that this API expects the input strings to already be normalized -- that is, processed by the nvtext::normalize_characters API, which is also being reworked in #17818.

The Python interface has the following pattern:

from cudf.core.wordpiece_tokenize import WordPieceVocabulary

input_string = .... # output of the normalizer
vocab_file = os.path.join(datadir, "bert_base_cased_sampled/vocab.txt")
vc = cudf.read_text(vocab_file, delimiter="\n", strip_delimiters=True)
wpt = WordPieceVocabulary(vc)
wpr = wpt.tokenize(input_string)

The output is a lists column of tokens and no longer the tensor-data and metadata format.
If that format is needed, we can consider a third API that converts this output to it.
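For reference, such a conversion could look roughly like this pure-Python sketch (the function name and layout here are hypothetical, not an existing cudf API); it right-pads each row of tokens to a fixed sequence length and builds an attention mask, similar in spirit to the old subword-tokenizer tensor output:

```python
# Hypothetical sketch: convert lists-of-tokens output into padded
# token ids plus an attention mask (1 = real token, 0 = padding).
def to_tensor_data(token_lists, max_seq_len, pad_id=0):
    ids, mask = [], []
    for row in token_lists:
        row = row[:max_seq_len]              # truncate over-long rows
        pad = max_seq_len - len(row)
        ids.append(row + [pad_id] * pad)     # right-pad with pad_id
        mask.append([1] * len(row) + [0] * pad)
    return ids, mask

ids, mask = to_tensor_data([[1, 2, 3], [4]], max_seq_len=4)
# ids  -> [[1, 2, 3, 0], [4, 0, 0, 0]]
# mask -> [[1, 1, 1, 0], [1, 0, 0, 0]]
```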

Closes #17507

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Dec 16, 2024
@davidwendt davidwendt self-assigned this Dec 16, 2024
copy-pr-bot bot commented Dec 16, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions bot added the CMake CMake build issue label Dec 16, 2024
@github-actions github-actions bot added Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels Jan 3, 2025
@davidwendt

/ok to test

@davidwendt

/ok to test

rapids-bot bot pushed a commit that referenced this pull request Feb 12, 2025
Recent build failure in #17600 indicated undefined `std::iota`. This PR adds the appropriate `#include <numeric>` in the source files where this is called.

Authors:
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #17983
@davidwendt

/ok to test

@davidwendt davidwendt marked this pull request as ready for review February 12, 2025 16:46
@davidwendt davidwendt requested review from a team as code owners February 12, 2025 16:46
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Feb 12, 2025
@davidwendt

Adding reference here to issue #12403 as well.
This may be closed in a future PR when the current subword-tokenizer API is deprecated.

@shrshi shrshi self-requested a review February 13, 2025 22:55
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[FEA] Switch Subword Tokenizer to use text file instead of hash file