Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove static column vectors from window function tests. #18099

Merged

Conversation

mythrocks
Copy link
Contributor

Fixes #18079.

This commit fixes the failures reported in #18079, where the use of static column vector objects in the tests causes the use of a CUDA runtime context before it's been initialized, causing the tests to fail with:

parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle

The solution is to switch the static column vectors to runtime, as a member of the test utility class rolling_runner.

@mythrocks mythrocks requested a review from a team as a code owner February 26, 2025 00:10
Copy link

copy-pr-bot bot commented Feb 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 26, 2025
@mythrocks mythrocks added bug Something isn't working tests Unit testing for project non-breaking Non-breaking change labels Feb 26, 2025
Fixes rapidsai#18079.

This commit fixes the failures reported in rapidsai#18079, where the use of static
column vector objects in the tests causes the use of a CUDA runtime context
before it's been initialized, causing the tests to fail with:
```
parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle
```

Signed-off-by: MithunR <[email protected]>
@mythrocks mythrocks force-pushed the remove-static-vectors-window-tests branch from e9182bc to 0a44371 Compare February 26, 2025 00:14
@bdice
Copy link
Contributor

bdice commented Feb 26, 2025

We've seen static initialization issues like this before, but it's been a while. It's too bad this has bitten us again.

@pxLi
Copy link
Member

pxLi commented Feb 26, 2025

Can we target this branch at 25.02? We will need this for the 25.02.01 release thanks

@mythrocks
Copy link
Contributor Author

mythrocks commented Feb 26, 2025

Sorry, chaps. It looks like there's another failure unearthed by this fix. Something jitify related:

/home/mithunr/workspace/dev/spark-rapids-jni/2/thirdparty/cudf/cpp/tests/rolling/grouped_rolling_test.cpp:131: Failure
Expected: output = cudf::grouped_rolling_window( keys, input, preceding_window, following_window, min_periods, op) doesn't throw an exception.
  Actual: it throws std::runtime_error with description "Jitify fatal error: Deserialization failed".

I'm checking if this turns up only with -DBUILD_SHARED_LIBS=OFF as we have with spark-rapids-jni.

This is a red herring. Please ignore.

@davidwendt
Copy link
Contributor

davidwendt commented Feb 26, 2025

So this pattern looks suspect

  CUDF_TEST_EXPECT_COLUMNS_EQUAL(*run_rolling(*AGG_COUNT_NON_NULL),
                                 ints_column{{0, 1, 2, 2, 2, 2, 0, 1, 2, 2}, nulls_at({0, 6})});

The run_rolling operator returns std::unique_ptr<cudf::column> and then the * returns a column_view and once the column_view is created the cudf::column is destroyed and so the column_view points to garbage.
Also the same for the ints_column which is destroyed after being converted to a column_view.

This pattern needs to be changed to hold onto the data for the life of the call and check

  auto result = run_rolling(*AGG_COUNT_NON_NULL);
  auto ints = ints_column{{0, 1, 2, 2, 2, 2, 0, 1, 2, 2}, nulls_at({0, 6})};
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(*result, ints);

The previous static implementation may have been saving us here a bit.
Now that the static has been removed these are now failing.

@mythrocks
Copy link
Contributor Author

mythrocks commented Feb 26, 2025

Argh, you're right, @davidwendt. That needs fixing. I'll take another crack at it.

Actually, no, I think I misunderstood what you were pointing out.

the * returns a column_view and once the column_view is created the cudf::column is destroyed and so the column_view points to garbage.

I remember this now. This should be alright. The temporary object (unique_ptr<column>) should remain alive until the encompassing full expression has been evaluated:
https://en.cppreference.com/w/cpp/language/lifetime

In this case, the destructor wouldn't be called until CUDF_TEST_EXPECT_COLUMNS_EQUAL has returned. (Or, cudf::test::detail::expect_columns_equal has returned, per the macro definition.)

I don't think the jitify failure is because of this.

@mythrocks
Copy link
Contributor Author

/ok to test

@mythrocks
Copy link
Contributor Author

The jitify error I'm seeing is a red herring. It's happening on my box with a fresh checkout and build. This must be an artifact on my workstation. It's irrelevant to this PR.

I've kicked off the build, to see how it fares. If this change looks good to the reviewers, and CI passes, we should be alright.

@mythrocks
Copy link
Contributor Author

I could use advice to address @pxLi's request: Does it make sense to retarget this to branch-25.02 and then merge forward to branch-25.04, or the other way around?

@davidwendt
Copy link
Contributor

I could use advice to address @pxLi's request: Does it make sense to retarget this to branch-25.02 and then merge forward to branch-25.04, or the other way around?

Since this is a fix to the test only, I don't see a reason to backport this to 25.02.
There is no change to libcudf and I believe we do not package/publish tests anywhere.

@davidwendt
Copy link
Contributor

/ok to test

@sameerz
Copy link
Contributor

sameerz commented Feb 26, 2025

I could use advice to address @pxLi's request: Does it make sense to retarget this to branch-25.02 and then merge forward to branch-25.04, or the other way around?

Since this is a fix to the test only, I don't see a reason to backport this to 25.02. There is no change to libcudf and I believe we do not package/publish tests anywhere.

This test is failing in our gb100 tests. We are targeting gb100 for 25.02.1. Hence the request to backport to 25.02.

@mythrocks
Copy link
Contributor Author

It doesn't look like the narwhals test failure is related, AFAICT:

    def test_select_duplicates(constructor: Constructor) -> None:
        df = nw.from_native(constructor({"a": [1, 2]})).lazy()
        with pytest.raises(ValueError, match="Expected unique|duplicate|more than one"):
>           df.select("a", nw.col("a") + 1).collect()

The other checks seem to have gone through.

@davidwendt, thank you for re-kicking the tests. I was wondering if you might have another look at this fix.

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@davidwendt
Copy link
Contributor

The narwhals failure should not prevent this from merging.
image

@mythrocks
Copy link
Contributor Author

mythrocks commented Feb 26, 2025

Any objection to our merging this to 25.04?

I'll see about getting this on to 25.02.1.
We need not rock the boat on 25.02 on this account. This is only a test fix.

@mythrocks
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 54e740a into rapidsai:branch-25.04 Feb 26, 2025
104 of 105 checks passed
@mythrocks mythrocks deleted the remove-static-vectors-window-tests branch February 26, 2025 19:12
@mythrocks
Copy link
Contributor Author

Thank you for the reviews, all. I've merged this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change tests Unit testing for project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] ROLLING_TEST branch 25.02 fails when built with static cuda runtime
5 participants