Fix hang on invalid UTF-8 data in string_view iterator (#18039)
The `cudf::string_view::const_iterator` provides functions that navigate through UTF-8 variable-length characters appropriately. 
This PR fixes the iterator increment logic to prevent a possible infinite loop when the iterator wraps memory that is not valid UTF-8.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Shruti Shivakumar (https://github.com/shrshi)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #18039
davidwendt authored Feb 20, 2025
1 parent cc5626b commit 8bef542
Showing 1 changed file with 6 additions and 3 deletions.
9 changes: 6 additions & 3 deletions cpp/include/cudf/strings/string_view.cuh
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019-2024, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2025, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -159,8 +159,11 @@ __device__ inline string_view::const_iterator::const_iterator(string_view const&
 
 __device__ inline string_view::const_iterator& string_view::const_iterator::operator++()
 {
-  if (byte_pos < bytes)
-    byte_pos += strings::detail::bytes_in_utf8_byte(static_cast<uint8_t>(p[byte_pos]));
+  if (byte_pos < bytes) {
+    // max is used to prevent an infinite loop on invalid UTF-8 data
+    byte_pos +=
+      cuda::std::max(1, strings::detail::bytes_in_utf8_byte(static_cast<uint8_t>(p[byte_pos])));
+  }
   ++char_pos;
   return *this;
 }
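
The effect of the clamp can be illustrated with a small host-side sketch. This is not the libcudf code. `utf8_width_sketch` is a hypothetical stand-in for `strings::detail::bytes_in_utf8_byte`, written under the assumption that a continuation byte (`0b10xxxxxx`) reports a width of 0, and `steps_to_end` is a hypothetical helper that models a caller looping until the byte position reaches the end of the data. A step limit is included so the unclamped variant returns instead of hanging.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for strings::detail::bytes_in_utf8_byte.
// Assumption: a continuation byte (0b10xxxxxx) reports a width of 0,
// while valid lead bytes report 1 to 4.
constexpr int utf8_width_sketch(std::uint8_t byte)
{
  return 1 + ((byte & 0xF0) == 0xF0)   // 4-byte lead
           + ((byte & 0xE0) == 0xE0)   // 3-byte (or longer) lead
           + ((byte & 0xC0) == 0xC0)   // 2-byte (or longer) lead
           - ((byte & 0xC0) == 0x80);  // continuation byte
}

// Hypothetical helper: counts the increments needed for byte_pos to reach
// `bytes`. Gives up after `max_steps` and returns -1, so the unclamped
// variant reports "no progress" rather than looping forever.
inline int steps_to_end(unsigned char const* p, int bytes, bool clamp, int max_steps = 64)
{
  int byte_pos = 0;
  for (int step = 0; step < max_steps; ++step) {
    if (byte_pos >= bytes) { return step; }
    int const width = utf8_width_sketch(p[byte_pos]);
    // With clamp == true this mirrors the max(1, ...) guard in the patch.
    byte_pos += clamp ? std::max(1, width) : width;
  }
  return -1;  // byte_pos never reached the end
}
```

Under these assumptions, the bytes `{'a', 0x80, 'b'}` take 3 steps with the clamp, while the unclamped walk stalls on the stray `0x80` and returns -1. Well-formed input such as `{0x61, 0xC3, 0xA9}` takes 2 steps either way, so the clamp changes nothing for valid UTF-8.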