Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bazel: upgrade to clang-19 and fix sanitizers #25120

Open
wants to merge 11 commits into
base: dev
Choose a base branch
from

Conversation

rockwotj
Copy link
Contributor

  • bazel: only enable batch cache test if using seastar allocator
  • rpc: don't move service::_methods
  • datalake/translation: include what you use
  • iobuf/fuzz: include what you use
  • ct: make pipeline stage id move assignable
  • bazel: suppress leaks in openssl-fips
  • bazel: upgrade to clang-19
  • datalake: make benchmark test fast

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

Comment on lines -243 to +244
"@seastar//:use_system_allocator": [],
"//conditions:default": ["@platforms//:incompatible"],
"@seastar//:use_system_allocator": ["@platforms//:incompatible"],
"//conditions:default": [],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, i wonder why it was working before... that's concerning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default build uses the seastar allocator still, so it was just always disabled

Comment on lines 72 to 73
{{service_name}}_service_base({{service_name}}_service_base&& o) noexcept
: _sc(std::move(o._sc)), _ssg(std::move(o._ssg)), _methods(std::move(o._methods)) {}
: _sc(std::move(o._sc)), _ssg(std::move(o._ssg)) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must never move these RPC services because this would be very
buggy otherwise.

was the intention to instead delete the move consturctor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm game for deleting both the move constructor and move assignment operator - I don't think either are sound for _methods. I do think we don't need to move the methods and it's fine with this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah what I was trying to say was, today in the code base there is no way we in practice move any services today, because there would be mass bugs with rpcs invoking the wrong object. Maybe we should formalize this and delete the move ctor and assignment operator

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should formalize this and delete the move ctor and assignment operator

Yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed a commit to remove them.

@rockwotj rockwotj marked this pull request as ready for review February 20, 2025 14:56
@rockwotj rockwotj enabled auto-merge February 20, 2025 15:03
"linux-x86_64": ["http://redpanda-core-toolchain.s3-website-us-east-1.amazonaws.com/llvm-19.1.1-ubuntu-22.04-x86_64.tar.gz"],
},
llvm_version = "19.1.7",
# Currently our custom build of clang causes false positives with sanitizers, so we use the default clang distribution.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, bummer. I guess something about the way we build it causes this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah exactly, I'm not sure what. We may have to go back to custom builds and figure out what flag is calling this, because in CI we need older libraries, but locally I need newer ones

@@ -1 +1,2 @@
leak:seastar::net::conntrack::handle::~handle
leak:fips/self_test.c
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note: self tests are run once per shard and, unfortunately, are within the validated section of the OpenSSL FIPS module which means we cannot touch it and maintain validation.

@michael-redpanda
Copy link
Contributor

It looks like CI cpp lint (which I assume uses clang-format-18) passes against the changes in 35318ed. Do we need to ensure that core knows to use clang-format in bazel rather than in vtools? Should we remove task lint:cpp in vtools to ensure it isn't accidentally used and change CI to use bazel?

@rockwotj
Copy link
Contributor Author

Should we remove task lint:cpp in vtools to ensure it isn't accidentally used and change CI to use bazel?

Yeah people should use bazel run //tools:clang_format now

@@ -240,8 +240,8 @@ redpanda_cc_btest(
"batch_cache_reclaim_test.cc",
],
target_compatible_with = select({
"@seastar//:use_system_allocator": [],
"//conditions:default": ["@platforms//:incompatible"],
"@seastar//:use_system_allocator": ["@platforms//:incompatible"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a WHY to the commit messages 🙇‍♂️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@michael-redpanda
Copy link
Contributor

Should we remove task lint:cpp in vtools to ensure it isn't accidentally used and change CI to use bazel?

Yeah people should use bazel run //tools:clang_format now

Should apply as well to buildkite?

@dotnwat
Copy link
Member

dotnwat commented Feb 20, 2025

Should apply as well to buildkite?

yeh i think it runs as a github action?

@rockwotj
Copy link
Contributor Author

Should apply as well to buildkite?

yeh i think it runs as a github action?

bazel run --lockfile_mode=off //tools:clang_format

This was incorrectly configured, the test relies on the seastar
allocator's reclaim ability, which is required to pass. As is this
always is disabled on CI but we only want to disable the test if the
system allocator is being used because the reclaim ability is not
available.
_methods is an array of non_copyable functions that have `[this]`
pointers captured, we cannot move them into a new object because
the `this` pointer become invalidated and pointing to the old object.
We must never move these RPC services because this would be very
buggy otherwise.
clang 19 is more strict on this
clang 19 is more strict here
libc++ 19 is more strict, but it requires types are copy assignable, and
according to the standard copy assignable types must be move asssignable
as well.

https://en.cppreference.com/w/cpp/named_req/CopyAssignable
In getting the sanitizers work in openssl tests with fips enabled we
need to suppress leaks there. The benchmarks are full of them, and I
assume these are not caught in CMake because benchmarks don't run with
sanitizers.
Upgrade to the latest clang. The real motivation here is that we want
the sanitizers to work, however for some reason the custom built clang
we use has false positives with coroutines and the sanitizers.

However clang-18 is compiled against a super old version of ubuntu and
missing libraries are missing (namely `libtinfo.5.so`) and not available
on recent ubuntu and fedora distributions.

Since we're upgrading the clang-19 we need to address a few deprecations
in libc++, so we upgrade seastar for this.
In bazel we test benchmarks by running them once, well 1000 batches is
really slow in debug mode, so running once doesn't do anything. Instead
change the number of batches in debug mode to make the benchmark fast.
This isn't really safe because pointers are taken to `this`, we don't
use these move operators anyways and they are a big footgun, so just
remove them.
@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#62070
test_id test_kind job_url test_status passed
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/62070#0195252c-680c-4d5e-a96c-890949790246 FLAKY 1/2
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.mixed_versions=False.with_tiered_storage=False.with_iceberg=True.with_chunked_compaction=True.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/62070#0195252c-680b-45fd-98e5-6ef8af855866 FLAKY 1/2
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic ducktape https://buildkite.com/redpanda/redpanda/builds/62070#0195252c-680c-4d5e-a96c-890949790246 FLAKY 1/2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants