prov/sharp: Implement sharp_domain2() using sharp_coll API #9
Some collective providers do not support all collective operations. In such cases, unsupported operations should not affect the final test result. fi_query_collective() is used to check operation support. A new function, test_query(), calls fi_query_collective() with parameters defined in the extended coll_test structure. The test is silently skipped if fi_query_collective() returns -FI_ENOSYS. Signed-off-by: Tomasz Gromadzki <[email protected]>
Collective providers must be closed while rxm_ep is being closed to avoid an error when closing the AV: "off_coll:av:ofi_av_close_lightweight():400<warn> AV is busy". Signed-off-by: Tomasz Gromadzki <[email protected]>
rxm_mc is added to hide the implementation details of the multicast address used by collective operations. Signed-off-by: Tomasz Gromadzki <[email protected]>
A collective provider (i.e. SHARP) needs to execute a collective operation on the supporting collective provider while running the join() operation. This operation is executed on peer_ep, with peer_mc as the address of the operation. Signed-off-by: Tomasz Gromadzki <[email protected]>
The rxm provider uses peer_mc_context to deliver a reference to rxm_mc.mc_fid to any collective operation (fi_join()). peer_mc_context.mc_fid is used as the context of the fi_join() completion event. Signed-off-by: Tomasz Gromadzki <[email protected]>
util_coll:fi_join() called with the FI_PEER flag restores peer_mc_context.mc_fid and uses it as the actual context of the fi_join() operation. This also includes reporting the join operation completion with mc_fid as the event's context. Signed-off-by: Tomasz Gromadzki <[email protected]>
coll_cq implementation can be reused by other collective providers. Signed-off-by: Tomasz Gromadzki <[email protected]>
The peer provider must create peer_eq for the offload provider so that the offload provider can report events to the peer provider. Signed-off-by: Tomasz Gromadzki <[email protected]>
…initialization It is the rxm provider's responsibility to initialize the collective offload provider's fabric. Otherwise collective offload functionality will not be available. Signed-off-by: Tomasz Gromadzki <[email protected]>
…rity Collective offload capabilities are reported if the offload provider is available; otherwise the util collective provider capabilities are reported. Signed-off-by: Tomasz Gromadzki <[email protected]>
…_ONLY is set fi_query_collective() reports only collective offload provider capabilities if OFI_OFFLOAD_PROV_ONLY flag is set. Otherwise the sum of both providers' capabilities is reported. Signed-off-by: Tomasz Gromadzki <[email protected]>
offload_coll_mask value is calculated based on the actual offload capabilities confirmed by fi_query_collective(). Signed-off-by: Tomasz Gromadzki <[email protected]>
The TODO.txt file defines the next development steps. The list is ordered by priority. Signed-off-by: Tomasz Gromadzki <[email protected]>
…er is available To let all collective tests pass, force rxm to provide only the offload collective domain capabilities. THIS IS ONLY FOR TEST PURPOSES. In the future, tests for the offload provider shall be executed with the OFI_OFFLOAD_PROV_ONLY flag set in the fi_query_collective() call. Signed-off-by: Tomasz Gromadzki <[email protected]>
Signed-off-by: Tomasz Gromadzki <[email protected]>
Signed-off-by: Tomasz Gromadzki <[email protected]>
peer_mc_context is used as fi_join() context when FI_PEER flag is set. Signed-off-by: Tomasz Gromadzki <[email protected]>
Signed-off-by: Tomasz Gromadzki <[email protected]>
Both collective operations are implemented by calling the peer collective operation and transparently passing the completion back to the peer CQ. Signed-off-by: Tomasz Gromadzki <[email protected]>
Signed-off-by: Tomasz Gromadzki <[email protected]>
Add mocks for sharp_coll_init and sharp_coll_finalize. Signed-off-by: Lukasz Dorau <[email protected]>
…_coll_init-and-sharp_coll_finalize prov/sharp: Add mocks for sharp_coll_init and sharp_coll_finalize
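The skip logic described in the first commit message above (query support first, skip silently on -FI_ENOSYS) can be sketched roughly as follows. This is a self-contained illustration, not the code from the PR: struct coll_test_sketch, run_one() and the query_*/run_* callbacks are hypothetical names, and the query callback stands in for a real fi_query_collective() call so that the sketch builds without libfabric installed.

```c
#include <errno.h>

/* libfabric aliases FI_ENOSYS to ENOSYS in <rdma/fi_errno.h>; it is
 * redefined here only so the sketch compiles without libfabric. */
#ifndef FI_ENOSYS
#define FI_ENOSYS ENOSYS
#endif

enum test_result { TEST_PASS, TEST_FAIL, TEST_SKIP };

/* Hypothetical stand-in for an entry of the extended coll_test structure. */
struct coll_test_sketch {
	const char *name;
	int (*query)(void); /* would wrap fi_query_collective() in real code */
	int (*run)(void);   /* the collective test body */
};

/* Run one test, skipping it when the provider lacks the operation. */
static enum test_result run_one(const struct coll_test_sketch *t)
{
	int ret = t->query();

	if (ret == -FI_ENOSYS)
		return TEST_SKIP; /* unsupported: must not affect the result */
	if (ret)
		return TEST_FAIL; /* any other query error is a real failure */

	return t->run() ? TEST_FAIL : TEST_PASS;
}

/* Example callbacks used to exercise run_one(). */
static int query_unsupported(void) { return -FI_ENOSYS; }
static int query_ok(void) { return 0; }
static int run_ok(void) { return 0; }
static int run_bad(void) { return -1; }
```

A caller would count only TEST_FAIL results toward the overall exit status, so a skipped operation leaves the final result unchanged.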
58f44d7 to 1965577
Reviewed 20 of 26 files at r1, 1 of 2 files at r2, 1 of 2 files at r3, 2 of 3 files at r4, all commit messages.
Reviewable status: 23 of 27 files reviewed, 3 unresolved discussions (waiting on @ldorau)
prov/sharp/src/sharp_domain.c
line 264 at r3 (raw file):
err_free_domain:
	free(domain);
	return ret;
The return value should be converted to a libfabric-specific error code FI_E...
Code quote:
ret
prov/sharp/src/sharp_domain.c
line 233 at r4 (raw file):
struct sharp_coll_config config = {
	/* XXX */
	/* ??? */ .ib_dev_list = NULL, /**< IB device name, port list. (const char *) */
	/* ??? */ .user_progress_num_polls = 0, /**< Number of polls to do before calling user progress. (int) */
Suggestion:
.user_progress_num_polls = -1
prov/sharp/src/sharp_domain.c
line 247 at r4 (raw file):
	.world_local_rank = 0, /**< relative rank of this process on this node within its job. */
	/* ??? */ .enable_thread_support = 0, /**< enable multi threaded support. */
	/* ??? */ .oob_ctx = context, /**< context for OOB functions in sharp_coll_init */
Suggestion:
.oob_ctx = NULL
Reviewed 2 of 26 files at r1.
Reviewable status: 25 of 27 files reviewed, 3 unresolved discussions (waiting on @ldorau)
Reviewable status: 25 of 27 files reviewed, 3 unresolved discussions (waiting on @grom72)
prov/sharp/src/sharp_domain.c
line 264 at r3 (raw file):
Previously, grom72 (Tomasz Gromadzki) wrote…
Return value to be converted to libfabric specific error code FI_E...
How do you want to convert enum sharp_error_no to FI_E...?
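For readers following this thread, one possible shape for such a conversion is a small switch-based mapping. The sketch below is not taken from the PR and does not reflect a decision by the reviewers: the SKETCH_SHARP_* enumerators and their values are hypothetical placeholders for the real enum sharp_error_no, sharp_err_to_fi() is an invented helper name, and the FI_* macros are redefined locally (the FI_EOTHER value in particular is an assumption to be checked against rdma/fi_errno.h) so that the example builds without either library.

```c
#include <errno.h>

/* Local stand-ins so the sketch builds without libfabric. libfabric
 * aliases most FI_* codes to errno values; FI_EOTHER's value here is
 * an assumption and should be taken from <rdma/fi_errno.h>. */
#ifndef FI_SUCCESS
#define FI_SUCCESS 0
#define FI_ENOMEM  ENOMEM
#define FI_ENOSYS  ENOSYS
#define FI_EOTHER  256
#endif

/* Hypothetical placeholder for enum sharp_error_no; the real names and
 * values must come from the SHARP header. */
enum sketch_sharp_error {
	SKETCH_SHARP_SUCCESS   = 0,
	SKETCH_SHARP_ERROR     = -1,
	SKETCH_SHARP_ENOT_SUPP = -2,
	SKETCH_SHARP_ENOMEM    = -3
};

/* Map a SHARP-style status to a negative libfabric-style error code. */
static int sharp_err_to_fi(int sharp_ret)
{
	switch (sharp_ret) {
	case SKETCH_SHARP_SUCCESS:
		return FI_SUCCESS;
	case SKETCH_SHARP_ENOT_SUPP:
		return -FI_ENOSYS;
	case SKETCH_SHARP_ENOMEM:
		return -FI_ENOMEM;
	default:
		/* Anything without a close equivalent becomes a generic error. */
		return -FI_EOTHER;
	}
}
```

With a helper of this kind, the err_free_domain path could return the mapped value instead of the raw status; which SHARP codes deserve a dedicated FI_E... value is left open here.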
prov/sharp/src/sharp_domain.c
line 233 at r4 (raw file):
struct sharp_coll_config config = {
	/* XXX */
	/* ??? */ .ib_dev_list = NULL, /**< IB device name, port list. (const char *) */
	/* ??? */ .user_progress_num_polls = 0, /**< Number of polls to do before calling user progress. (int) */
Done.
prov/sharp/src/sharp_domain.c
line 247 at r4 (raw file):
	.world_local_rank = 0, /**< relative rank of this process on this node within its job. */
	/* ??? */ .enable_thread_support = 0, /**< enable multi threaded support. */
	/* ??? */ .oob_ctx = context, /**< context for OOB functions in sharp_coll_init */
Done.
1965577 to 3803925
Signed-off-by: Lukasz Dorau <[email protected]>
3803925 to 3791eb7
Moved to:
https://reviewable.io/reviews/grom72/libfabric/4
grom72#4
Reviewable status: 25 of 38 files reviewed, 3 unresolved discussions (waiting on @grom72)
Done
If a posted receive matches with a saved receive, we may need to increment the rx counter. Set the rx counter increment callback to match that of the posted receive. This fixes an assert in xnet_cntr_inc() accessing a NULL cntr_inc function pointer.

Program received signal SIGABRT, Aborted.
0x0000155552d4d37f in raise () from /lib64/libc.so.6
#0 0x0000155552d4d37f in raise () from /lib64/libc.so.6
#1 0x0000155552d37db5 in abort () from /lib64/libc.so.6
#2 0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3 0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6
#4 0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347
#5 0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354
#6 0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153
#7 0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188
#8 0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445
#9 0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558
#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91
#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212

Signed-off-by: Sean Hefty <[email protected]>
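The fix described in this commit message (copy the counter-increment callback from the posted receive to the saved receive before completing it) can be illustrated with a minimal sketch. Everything below is hypothetical and simplified: struct sketch_xfer_entry and the sketch_* functions are invented for illustration and are not the xnet provider's real types or functions, and the real code asserts on a NULL callback where this sketch returns an error.

```c
#include <stddef.h>

/* Hypothetical, simplified transfer entry with a counter callback. */
struct sketch_xfer_entry {
	void (*cntr_inc)(struct sketch_xfer_entry *entry);
	unsigned long *cntr; /* counter the callback increments */
};

/* Example rx counter increment callback. */
static void sketch_rx_cntr_inc(struct sketch_xfer_entry *entry)
{
	if (entry->cntr)
		(*entry->cntr)++;
}

/* When a posted receive matches a saved one, the saved entry inherits
 * the posted entry's counter callback so completion can use it. */
static void sketch_match_saved(struct sketch_xfer_entry *saved,
			       const struct sketch_xfer_entry *posted)
{
	saved->cntr_inc = posted->cntr_inc;
	saved->cntr = posted->cntr;
}

/* Complete a saved entry; a NULL callback is reported as an error here. */
static int sketch_complete_saved(struct sketch_xfer_entry *saved)
{
	if (!saved->cntr_inc)
		return -1;
	saved->cntr_inc(saved);
	return 0;
}
```

Without the copy in sketch_match_saved(), completing the saved entry would hit the NULL callback, which corresponds to the assert shown in the backtrace above.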
Depends on: