prov/sharp: Implement sharp_domain2() using sharp_coll API #9

Closed

Conversation

ldorau
Member

@ldorau ldorau commented Dec 19, 2022

Depends on:


This change is Reviewable

grom72 and others added 25 commits December 3, 2022 19:59
Some collective providers do not support all collective operations.
In such a case unsupported operations will not affect the final test result.

fi_query_collective() is used to check operation support.

A new function, test_query(), calls fi_query_collective() with parameters defined in
the extended coll_test structure.
The test is silently skipped if fi_query_collective() returns -FI_ENOSYS.

Signed-off-by: Tomasz Gromadzki <[email protected]>
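
The skip logic described in this commit message can be sketched roughly as follows. This is a self-contained illustration, not the code from the commit: `demo_query()`, `struct demo_coll_test`, `demo_run_tests()` and `DEMO_ENOSYS` are hypothetical stand-ins for fi_query_collective(), the extended coll_test structure, the test runner and FI_ENOSYS, so that the sketch builds without libfabric.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for FI_ENOSYS so the sketch builds without libfabric headers. */
#define DEMO_ENOSYS ENOSYS

/* Hypothetical operation identifiers (not libfabric's enum fi_collective_op). */
enum demo_op { DEMO_BARRIER, DEMO_ALLREDUCE, DEMO_ALLGATHER };

/* Hypothetical analogue of the extended coll_test structure. */
struct demo_coll_test {
	enum demo_op op;   /* operation passed to the capability query */
	int (*run)(void);  /* test body; returns 0 on success */
};

/* Hypothetical query: pretends the provider does not support allgather. */
int demo_query(enum demo_op op)
{
	return op == DEMO_ALLGATHER ? -DEMO_ENOSYS : 0;
}

int demo_pass(void) { return 0; }
int demo_fail(void) { return -1; }

/*
 * Runs every test whose operation is supported.
 * Returns the number of failed tests; *skipped receives the number of
 * tests that were silently skipped because the query returned -DEMO_ENOSYS.
 */
int demo_run_tests(const struct demo_coll_test *tests, size_t n, size_t *skipped)
{
	int failures = 0;

	*skipped = 0;
	for (size_t i = 0; i < n; i++) {
		int ret = demo_query(tests[i].op);

		if (ret == -DEMO_ENOSYS) {
			(*skipped)++;  /* unsupported: does not affect the result */
			continue;
		}
		if (ret || tests[i].run())
			failures++;
	}
	return failures;
}
```

With a table containing a passing barrier test, a failing allgather test and a passing allreduce test, the allgather entry is skipped and the run reports no failures.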
Collective providers must be closed during rxm_ep closing to avoid error on AV closing:
"off_coll:av:ofi_av_close_lightweight():400<warn> AV is busy".

Signed-off-by: Tomasz Gromadzki <[email protected]>
rxm_mc added to hide implementation details of multicast address of collective
operations.

Signed-off-by: Tomasz Gromadzki <[email protected]>
A collective provider (i.e. SHARP) requires a collective operation to be executed
on the supporting collective provider while the join() operation is running.
Such an operation is executed on peer_ep with peer_mc as the address of the operation.

Signed-off-by: Tomasz Gromadzki <[email protected]>
The rxm provider uses peer_mc_context to deliver a reference to rxm_mc.mc_fid to
any collective operation (fi_join()). peer_mc_context.mc_fid is used as
the context of the fi_join() completion event.

Signed-off-by: Tomasz Gromadzki <[email protected]>
util_coll:fi_join() called with the FI_PEER flag restores peer_mc_context.mc_fid and
uses it as the actual context of the fi_join() operation. This also includes reporting
the join operation completion with mc_fid as the event's context.

Signed-off-by: Tomasz Gromadzki <[email protected]>
coll_cq implementation can be reused by other collective providers.

Signed-off-by: Tomasz Gromadzki <[email protected]>
The peer provider must create peer_eq for the offload provider so that the offload
provider can report events to the peer provider.

Signed-off-by: Tomasz Gromadzki <[email protected]>
…initialization

It is the rxm provider's responsibility to initialize the collective offload provider's fabric.
Otherwise, collective offload functionality will not be available.

Signed-off-by: Tomasz Gromadzki <[email protected]>
…rity

Collective offload capabilities are reported if the offload provider is available;
otherwise, util collective provider capabilities are reported.

Signed-off-by: Tomasz Gromadzki <[email protected]>
…_ONLY is set

fi_query_collective() reports only collective offload provider capabilities if
OFI_OFFLOAD_PROV_ONLY flag is set. Otherwise the sum of both providers' capabilities
is reported.

Signed-off-by: Tomasz Gromadzki <[email protected]>
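
One way to read the capability rule in this commit message is as a bitmask selection, sketched below. This is an assumption-laden illustration: `demo_reported_caps()` and `DEMO_OFFLOAD_PROV_ONLY` are hypothetical names, the flag value is arbitrary, and "sum" is interpreted here as the bitwise union of the two capability masks.

```c
#include <stdint.h>

/* Hypothetical stand-in for the OFI_OFFLOAD_PROV_ONLY flag; the value is arbitrary. */
#define DEMO_OFFLOAD_PROV_ONLY (1ULL << 0)

/*
 * Returns the capability mask to report:
 * - only the offload provider's capabilities if the flag is set,
 * - otherwise the union of offload and util capabilities.
 */
uint64_t demo_reported_caps(uint64_t offload_caps, uint64_t util_caps,
			    uint64_t flags)
{
	if (flags & DEMO_OFFLOAD_PROV_ONLY)
		return offload_caps;

	return offload_caps | util_caps;
}
```

For example, with offload capabilities 0x1 and util capabilities 0x6, the function reports 0x1 when the flag is set and 0x7 otherwise.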
offload_coll_mask value is calculated based on the actual offload capabilities
confirmed by fi_query_collective().

Signed-off-by: Tomasz Gromadzki <[email protected]>
The TODO.txt file defines the next development steps. The list is ordered by priority.

Signed-off-by: Tomasz Gromadzki <[email protected]>
…er is available

To let all collective tests pass, force rxm to provide only offload collective
domain capabilities. THIS IS ONLY FOR TEST PURPOSES. In the future, tests for
the offload provider shall be executed with the OFI_OFFLOAD_PROV_ONLY flag set in
the fi_query_collective() call.

Signed-off-by: Tomasz Gromadzki <[email protected]>
peer_mc_context is used as fi_join() context when FI_PEER flag is set.

Signed-off-by: Tomasz Gromadzki <[email protected]>
Signed-off-by: Tomasz Gromadzki <[email protected]>
Both collective operations are implemented by calling the peer collective operation and
transparently passing the completion back to the peer CQ.

Signed-off-by: Tomasz Gromadzki <[email protected]>
Add mocks for sharp_coll_init and sharp_coll_finalize.

Signed-off-by: Lukasz Dorau <[email protected]>
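
A mock of an init/finalize pair could look roughly like the sketch below. It is not the mock added by this commit: the `mock_` names and both structures are hypothetical stand-ins, and the real types and signatures come from the SHARP headers, which are not reproduced here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the real types are defined by the SHARP library. */
struct mock_sharp_coll_context {
	int initialized;
};

struct mock_sharp_coll_init_spec {
	int world_rank;
	int world_size;
};

/* Allocates a dummy context instead of talking to real hardware. */
int mock_sharp_coll_init(const struct mock_sharp_coll_init_spec *spec,
			 struct mock_sharp_coll_context **ctx)
{
	if (!spec || !ctx)
		return -1;

	*ctx = calloc(1, sizeof(**ctx));
	if (!*ctx)
		return -1;

	(*ctx)->initialized = 1;
	return 0;
}

/* Releases the dummy context created by mock_sharp_coll_init(). */
int mock_sharp_coll_finalize(struct mock_sharp_coll_context *ctx)
{
	if (!ctx)
		return -1;

	free(ctx);
	return 0;
}
```

A mock of this shape lets provider code be built and exercised on machines without the offload hardware, since init and finalize succeed without touching a device.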
…_coll_init-and-sharp_coll_finalize

prov/sharp: Add mocks for sharp_coll_init and sharp_coll_finalize
@ldorau ldorau marked this pull request as draft December 19, 2022 10:32
@ldorau ldorau force-pushed the prov-sharp-sharp_domain2-with-sharp_coll-API branch 2 times, most recently from 58f44d7 to 1965577 Compare December 21, 2022 07:09
Collaborator

@grom72 grom72 left a comment


Reviewed 20 of 26 files at r1, 1 of 2 files at r2, 1 of 2 files at r3, 2 of 3 files at r4, all commit messages.
Reviewable status: 23 of 27 files reviewed, 3 unresolved discussions (waiting on @ldorau)


prov/sharp/src/sharp_domain.c line 264 at r3 (raw file):

err_free_domain:
	free(domain);
	return ret;

Return value to be converted to libfabric specific error code FI_E...

Code quote:

ret

prov/sharp/src/sharp_domain.c line 233 at r4 (raw file):

	struct sharp_coll_config config = { /* XXX */
/* ??? */	.ib_dev_list = NULL,		/**< IB device name, port list. (const char *) */
/* ??? */	.user_progress_num_polls = 0,	/**< Number of polls to do before calling user progress. (int) */

Suggestion:

.user_progress_num_polls = -1

prov/sharp/src/sharp_domain.c line 247 at r4 (raw file):

		.world_local_rank = 0,			/**< relative rank of this process on this node within its job. */
/* ??? */	.enable_thread_support = 0,		/**< enable multi threaded support. */
/* ??? */	.oob_ctx = context,			/**< context for OOB functions in sharp_coll_init */

Suggestion:

.oob_ctx = NULL

Collaborator

@grom72 grom72 left a comment


Reviewed 2 of 26 files at r1.
Reviewable status: 25 of 27 files reviewed, 3 unresolved discussions (waiting on @ldorau)

Member Author

@ldorau ldorau left a comment


Reviewable status: 25 of 27 files reviewed, 3 unresolved discussions (waiting on @grom72)


prov/sharp/src/sharp_domain.c line 264 at r3 (raw file):

Previously, grom72 (Tomasz Gromadzki) wrote…

Return value to be converted to libfabric specific error code FI_E...

How do you want to convert enum sharp_error_no to FI_E... ?
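
One possible answer is a small translation function, sketched below under stated assumptions. The enum is a hypothetical stand-in for enum sharp_error_no (its names and values are invented for illustration), and the targets are plain negative errno values, on the assumption that the matching FI_E... codes are defined in terms of errno.

```c
#include <errno.h>

/* Hypothetical stand-in for enum sharp_error_no; names and values are illustrative. */
enum demo_sharp_error {
	DEMO_SHARP_SUCCESS = 0,
	DEMO_SHARP_ERROR = -1,
	DEMO_SHARP_ENOT_SUPP = -2,
	DEMO_SHARP_ENOMEM = -3,
};

/*
 * Translates a (hypothetical) SHARP status into a negative errno-style
 * code. Unknown or generic failures collapse to -EIO.
 */
int demo_sharp_to_fi_errno(int sharp_ret)
{
	switch (sharp_ret) {
	case DEMO_SHARP_SUCCESS:
		return 0;
	case DEMO_SHARP_ENOT_SUPP:
		return -ENOSYS;
	case DEMO_SHARP_ENOMEM:
		return -ENOMEM;
	default:
		return -EIO;
	}
}
```

The error path would then return the translated value rather than the raw ret.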


prov/sharp/src/sharp_domain.c line 233 at r4 (raw file):

	struct sharp_coll_config config = { /* XXX */
/* ??? */	.ib_dev_list = NULL,		/**< IB device name, port list. (const char *) */
/* ??? */	.user_progress_num_polls = 0,	/**< Number of polls to do before calling user progress. (int) */

Done.


prov/sharp/src/sharp_domain.c line 247 at r4 (raw file):

		.world_local_rank = 0,			/**< relative rank of this process on this node within its job. */
/* ??? */	.enable_thread_support = 0,		/**< enable multi threaded support. */
/* ??? */	.oob_ctx = context,			/**< context for OOB functions in sharp_coll_init */

Done.

@ldorau ldorau force-pushed the prov-sharp-sharp_domain2-with-sharp_coll-API branch from 1965577 to 3803925 Compare December 21, 2022 11:07
@ldorau ldorau force-pushed the prov-sharp-sharp_domain2-with-sharp_coll-API branch from 3803925 to 3791eb7 Compare December 21, 2022 11:45
Member Author

@ldorau ldorau left a comment


Moved to:
https://reviewable.io/reviews/grom72/libfabric/4
grom72#4

Reviewable status: 25 of 38 files reviewed, 3 unresolved discussions (waiting on @grom72)

@ldorau
Member Author

ldorau commented Dec 21, 2022

@ldorau ldorau closed this Dec 21, 2022
@ldorau
Member Author

ldorau commented Dec 21, 2022

Reviewed 20 of 26 files at r1, 1 of 2 files at r2, 1 of 2 files at r3, 2 of 3 files at r4, all commit messages.
Reviewable status: 23 of 27 files reviewed, 3 unresolved discussions (waiting on @ldorau)

prov/sharp/src/sharp_domain.c line 264 at r3 (raw file):

err_free_domain:
	free(domain);
	return ret;

Return value to be converted to libfabric specific error code FI_E...

Code quote:

ret

Done

grom72 pushed a commit that referenced this pull request Mar 24, 2023
If a posted receive matches with a saved receive, we may need to
increment the rx counter.  Set the rx counter increment callback
to match that of the posted receive.  This fixes an assert in
xnet_cntr_inc() accessing a NULL cntr_inc function pointer.

Program received signal SIGABRT, Aborted.
0x0000155552d4d37f in raise () from /lib64/libc.so.6
#0  0x0000155552d4d37f in raise () from /lib64/libc.so.6
#1  0x0000155552d37db5 in abort () from /lib64/libc.so.6
#2  0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6
#4  0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347
#5  0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354
#6  0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153
#7  0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188
#8  0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445
#9  0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558
#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91
#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212

Signed-off-by: Sean Hefty <[email protected]>