Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GTEST/PROTO: Enforce short protocol in proto mock #10397

Closed

Conversation

iyastreb
Copy link
Contributor

What?

test_ucp_proto_mock_rcx.mock_iface_attr may fail when iface does not have UCT_IFACE_FLAG_AM_SHORT capability.
failing job
raw log

In this case short protocol is missing:

Key op flags: 2, attr: 0
0-8246 desc: copy-in, config: rc_mlx5/mlx5_2:1
8247-377094 desc: multi-frag copy-in, config: rc_mlx5/mlx5_2:1
377095-18446744073709551615 desc: rendezvous zero-copy read from remote, config: rc_mlx5/mlx5_2:1

To solve this issue we can enforce UCT_IFACE_FLAG_AM_SHORT in mock iface

@iyastreb iyastreb requested a review from tvegas1 December 20, 2024 15:51
Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems the fix is for rc_mlx5 mock, why would it affect EFA @tvegas1 ?

@tvegas1
Copy link
Contributor

tvegas1 commented Dec 20, 2024

actually, it's an issue in EFA PR because rc_x am.iov=3. Pushed 4->3 for now, but mlx5 really does not support more than 3?

@yosefe
Copy link
Contributor

yosefe commented Dec 22, 2024

actually, it's an issue in EFA PR because rc_x am.iov=3. Pushed 4->3 for now, but mlx5 really does not support more than 3?

Yes, UCT_IB_MLX5_AM_ZCOPY_MAX_IOV is 3, but it should not affect EFA

@rakhmets
Copy link
Collaborator

rakhmets commented Dec 31, 2024

@evgeny-leksikov
This looks like a valid error: https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=93167&view=logs&j=f82f4351-a795-5e93-2b8c-a35dae1a1866&t=37d2334b-da4d-5c74-8723-b3c332df9c8f&l=35770

Direct leak of 48 byte(s) in 1 object(s) allocated from:
#0 0xffffb9a9a4b4 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0xffffb87c2078  (/lib/aarch64-linux-gnu/libmlx5.so.1+0x62078)
#2 0xffffb8e32308 in uct_ib_mlx5_devx_obj_create src/uct/ib/mlx5/dv/ib_mlx5_dv.c:422
#3 0xffffb8e45810 in uct_ib_mlx5_devx_mem_attach src/uct/ib/mlx5/dv/ib_mlx5dv_md.c:3021
#4 0xffffb8e9bbe0 in uct_gga_mlx5_rkey_resolve src/uct/ib/mlx5/gga/gga_mlx5.c:219
#5 0xffffb8e9bbe0 in uct_gga_mlx5_ep_get_zcopy src/uct/ib/mlx5/gga/gga_mlx5.c:537
#6 0xffffb944a270 in uct_ep_get_zcopy src/uct/api/uct.h:3004

uct_ib_mlx5_devx_obj_create is called from uct_gga_mlx5_ep_get_zcopy via uct_gga_mlx5_rkey_resolve. And I could not find where we clean the allocated resources.

@evgeny-leksikov
Copy link
Contributor

@evgeny-leksikov This looks like a valid error: https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=93167&view=logs&j=f82f4351-a795-5e93-2b8c-a35dae1a1866&t=37d2334b-da4d-5c74-8723-b3c332df9c8f&l=35770

Direct leak of 48 byte(s) in 1 object(s) allocated from:
#0 0xffffb9a9a4b4 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0xffffb87c2078  (/lib/aarch64-linux-gnu/libmlx5.so.1+0x62078)
#2 0xffffb8e32308 in uct_ib_mlx5_devx_obj_create src/uct/ib/mlx5/dv/ib_mlx5_dv.c:422
#3 0xffffb8e45810 in uct_ib_mlx5_devx_mem_attach src/uct/ib/mlx5/dv/ib_mlx5dv_md.c:3021
#4 0xffffb8e9bbe0 in uct_gga_mlx5_rkey_resolve src/uct/ib/mlx5/gga/gga_mlx5.c:219
#5 0xffffb8e9bbe0 in uct_gga_mlx5_ep_get_zcopy src/uct/ib/mlx5/gga/gga_mlx5.c:537
#6 0xffffb944a270 in uct_ep_get_zcopy src/uct/api/uct.h:3004

uct_ib_mlx5_devx_obj_create is called from uct_gga_mlx5_ep_get_zcopy via uct_gga_mlx5_rkey_resolve. And I could not find where we clean the allocated resources.

This resource should be freed on rkey release. Need to check the test, gga transport does not support short (only zcopy), probably it's related somehow.

@yosefe
Copy link
Contributor

yosefe commented Jan 6, 2025

Per offline discussion, need to check why SHORT capability is not present for rc_mlx5 in some cases

@iyastreb
Copy link
Contributor Author

iyastreb commented Jan 6, 2025

UCP/PROTO: Consider RNDV_PERF_DIFF #10401

Here is the failing job from another PR, unrelated to EFA
Investigating...

@iyastreb
Copy link
Contributor Author

iyastreb commented Jan 6, 2025

UCP/PROTO: Consider RNDV_PERF_DIFF #10401

Here is the failing job from another PR, unrelated to EFA Investigating...

Sorry, that one was still caused by EFA.
And other failures from #10401 are caused by changed perf calculation logic

log1, log2, log3

It seems we do not need this fix, as the EFA problem is fixed, and others are unrelated

@iyastreb iyastreb closed this Jan 6, 2025
@iyastreb
Copy link
Contributor Author

iyastreb commented Jan 6, 2025

@evgeny-leksikov @rakhmets

This memory leak is unrelated to this PR, and btw I see that failure in other jobs regularly.
It seems to be caused by the concurrent mem_attach from multiple threads, and is fixed by locking a mutex in #10401 (still not merged)

    pthread_mutex_lock(&mem_attach_lock);
    status = uct_ib_mlx5_devx_mem_attach(uct_md, &rkey_handle->packed_mkey,
                                         &attach_params,
                                         (uct_mem_h*)&md_rkey_handle->memh);
    pthread_mutex_unlock(&mem_attach_lock);

@iyastreb iyastreb deleted the gtest-proto-mock-enforce-short branch January 8, 2025 11:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants