Add multi-source reading to JSON reader benchmarks #17688

Merged

Conversation

shrshi
Contributor

@shrshi shrshi commented Jan 7, 2025

Description

Depends on #17708
Enables benchmarking of the multi-source, multi-batch JSON reader by adding a benchmark axis for the number of input sources.
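
As a rough illustration of what an extra benchmark axis means, the sketch below enumerates the configurations produced by crossing a data-size axis with a source-count axis. It is a standalone stand-in, not the benchmark code from this PR: `bench_config` and `enumerate_configs` are hypothetical names, and the real benchmarks register their axes through the benchmarking framework rather than by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one benchmark configuration (not a cudf type).
struct bench_config {
  std::size_t data_size;    // total input size in bytes
  std::size_t num_sources;  // number of input sources the reader is given
};

// Cross the existing data-size axis with the new source-count axis.
// Every (data_size, num_sources) pair becomes one benchmark run.
std::vector<bench_config> enumerate_configs(std::vector<std::size_t> const& data_sizes,
                                            std::vector<std::size_t> const& source_counts)
{
  std::vector<bench_config> configs;
  configs.reserve(data_sizes.size() * source_counts.size());
  for (auto const size : data_sizes) {
    for (auto const n : source_counts) {
      assert(n > 0);  // a run needs at least one source
      configs.push_back({size, n});
    }
  }
  return configs;
}
```

With two data sizes and three source counts, this yields six runs, so each added axis value multiplies the benchmark matrix.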

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Jan 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jan 7, 2025
@shrshi shrshi changed the title Add multi-source reading to JSON reader benchmarks [DNR] Add multi-source reading to JSON reader benchmarks Jan 7, 2025
@shrshi
Contributor Author

shrshi commented Jan 7, 2025

Null probability has been set to zero in the random table generator so that the JSON reader benchmark does not fail for multi-source multi-batch reading. Once issue #17689 is resolved, the change to the table generator will be reverted.

@shrshi shrshi added cuIO cuIO issue Performance Performance related issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jan 28, 2025
@shrshi shrshi marked this pull request as ready for review January 28, 2025 23:15
@shrshi shrshi requested a review from a team as a code owner January 28, 2025 23:15
@shrshi shrshi changed the title [DNR] Add multi-source reading to JSON reader benchmarks Add multi-source reading to JSON reader benchmarks Jan 28, 2025
Member

@mhaseeb123 mhaseeb123 left a comment

Changes look good to me

Member

@PointKernel PointKernel left a comment

One small question; otherwise looks good.


json_read_common(source_sink, num_rows, state, comptype, data_size);
std::vector<char> out_buffer;
auto sink = cudf::io::sink_info(&out_buffer);
Member

Passing a pointer to a vector in modern C++ is generally odd. Are you referring to out_buffer.data() or just out_buffer?

Contributor Author

I'm using the data sink constructor that accepts a pointer to an output host vector:

explicit sink_info(std::vector<char>* buffer) : _type(io_type::HOST_BUFFER), _buffers({buffer}) {}

It is indeed a little awkward to pass a pointer to a vector, but I'm not sure if there's a cleaner way to construct a sink_info object.
@vuule do you have any suggestions on a better approach?

Member

It seems that the _buffers data member of sink_info should be a vector of char spans, but addressing this is beyond the scope of the current PR.

Contributor

How did I miss these comments?
The writer resizes the vector, as the output size is not known in advance. Passing a span unfortunately does not work here.
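
The point about resizing can be sketched as follows. This is a minimal illustration, not the cudf writer: `append_to_sink` is a hypothetical function, and it assumes only what the comment states, namely that the output size is unknown up front and the vector must grow as data arrives. A span is a fixed-size view over existing storage, so it could not grow in this way.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical writer step (not the cudf implementation).
// The caller owns the vector; the writer receives a pointer so it can
// grow the caller's buffer, since the final size is not known in advance.
void append_to_sink(std::vector<char>* sink, std::string const& chunk)
{
  assert(sink != nullptr);
  // insert() enlarges the vector as needed; a fixed-size view cannot do this.
  sink->insert(sink->end(), chunk.begin(), chunk.end());
}
```

A caller would declare `std::vector<char> out_buffer;` and pass `&out_buffer`, mirroring how the benchmark hands its buffer to `sink_info`.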

Member

Ah OK, then the buffer owner could use a shared pointer to avoid passing raw pointers around.
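
A sketch of the shared-pointer idea, under the assumption that the sink would share ownership of the buffer with the caller. `shared_buffer_sink` is a hypothetical class written for illustration; nothing in this thread indicates that `sink_info` offers such a constructor.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sink that co-owns its output buffer (not a cudf type).
class shared_buffer_sink {
 public:
  explicit shared_buffer_sink(std::shared_ptr<std::vector<char>> buffer)
    : _buffer(std::move(buffer))
  {
    assert(_buffer != nullptr);
  }

  // Append a chunk; the vector grows because the output size is not known up front.
  void write(std::string const& chunk)
  {
    _buffer->insert(_buffer->end(), chunk.begin(), chunk.end());
  }

 private:
  std::shared_ptr<std::vector<char>> _buffer;  // keeps the buffer alive while the sink exists
};
```

The trade-off is that the caller must allocate the buffer with `std::make_shared` instead of declaring a local vector, in exchange for a buffer that cannot dangle while the sink is alive.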

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Jan 29, 2025
@shrshi
Contributor Author

shrshi commented Jan 29, 2025

/merge

@rapids-bot rapids-bot bot merged commit 33a6a09 into rapidsai:branch-25.02 Jan 29, 2025
116 checks passed
@shrshi shrshi deleted the json-multithreaded-compio-perf branch January 29, 2025 18:16
Labels
  • 5 - Ready to Merge: Testing and reviews complete, ready to merge
  • cuIO: cuIO issue
  • improvement: Improvement / enhancement to an existing function
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change
  • Performance: Performance related issue
4 participants