comm: create local_group/remote_group before comm commit #7237

Open
wants to merge 19 commits into
base: main

hzhou
Contributor

@hzhou hzhou commented Dec 11, 2024

Pull Request Description

[Dependency PR #7235 ]

At first there was only MPI_COMM_WORLD. Later we added dynamic processes, MPI groups, and, most recently, MPI sessions. Because these later concepts were not part of the original vision, they were implemented by hacking rather than by design. For example, instead of creating a group locally and then creating a communicator from that group, we do it in reverse. We assume the communicator is always there (an MPI_COMM_WORLD which can later be split and recombined) and we derive groups from existing communicators. In the original design, the entire process addressing system is based on communicators. It is a mess! The addition of MPI sessions throws a wrench into this mess because communicators are now not always there.

The current situation:

  • MPIR comm uses a mapper, an address system based on parent communicators
    • MPIR_Comm_map_t
  • Device layer constructs its own address system from the mapper
    • ch3 constructs MPIDI_VCRT table for each communicator based on the mapper
      • Optimizations to reuse vcrt in the dup case
    • ch4 constructs MPIDI_rank_map_t which refers to a global avt_mgr (av table manager)
  • The device layer constructs a process id lpid, accessed using MPID_Comm_get_lpid
  • When needed, MPI groups are constructed from a communicator using lpids

This convoluted mess exists because we designed lpid to be an opaque, mysterious value owned by the device layer. Within the current upstream code base, we have 4 address systems:

  • MPIR Comm mapper
  • ch3 VCRT
  • ch4 av table manager
  • MPI Group
    And we are about to add "MPI Session PSET", yet another address system.

I propose to unify them all into a single system and make MPI Group a first-class citizen.
We can build a universal address system on the (world_idx, world_rank) combination.
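
As a minimal sketch of what such an address could look like, the snippet below packs (world_idx, world_rank) into one 64-bit integer. The type name, the function names, and the 32/32 bit split are all assumptions made for illustration; the actual MPIR_Lpid layout in this PR may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical process id type; NOT the actual MPIR_Lpid definition. */
typedef uint64_t example_lpid_t;

/* Assumption: upper 32 bits hold world_idx, lower 32 bits hold world_rank. */
example_lpid_t example_lpid_make(uint32_t world_idx, uint32_t world_rank)
{
    return ((example_lpid_t) world_idx << 32) | (example_lpid_t) world_rank;
}

/* Recover the index of the world the process belongs to. */
uint32_t example_lpid_world_idx(example_lpid_t lpid)
{
    return (uint32_t) (lpid >> 32);
}

/* Recover the rank of the process within its world. */
uint32_t example_lpid_world_rank(example_lpid_t lpid)
{
    return (uint32_t) (lpid & 0xffffffffu);
}
```

With this encoding, processes in the first world (world_idx 0) have an id equal to their world rank, which keeps the common single-world case simple.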

In hindsight, we should have designed it the session way:

  • the process manager creates the world
  • process init discovers the world
  • the process creates MPI groups
  • the process creates communicators from groups
  • the device layer makes communicators function by establishing the communication layer

The PR tries to do just that.

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your companies PR approval manager.

hzhou added 10 commits December 10, 2024 19:40
This test requires access to MPICH internals, thus it won't be used with
the current design.
We no longer use this file.
Hide the internal fields of MPIR_Group from unnecessary access.

Outside group_util.c and group_impl.c, code only needs to assume the MPIR_Lpid
integer type, creation routines based on an lpid map or an lpid stride
description, and an access routine to look up the lpid from a group rank.
For most external usages, we only need MPIR_Group_rank_to_lpid.
Avoid accessing group internal fields.
Group similar functions together to facilitate refactoring.
There are no changes in this commit other than moving functions around.

The 4 incl/excl functions are very similar.

The 3 difference/intersection/union functions are very similar.
Use MPIR_Group_{rank_to_lpid,lpid_to_rank} to avoid directly accessing
MPIR_Group internal fields.

For most group creation routines, just populate an lpid lookup map and
call MPIR_Group_create_map to create the group.
* add option to use stride to describe group composition
* remove the linked list design
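
A rough sketch of how a group could be described either by an explicit lpid map or by a stride, with rank/lpid lookups on top. Every name here (example_group_t, example_group_create_map, and so on) is hypothetical and only mirrors the idea in the commit message; it is not the MPIR_Group implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; not the real MPIR_Lpid / MPIR_Group definitions. */
typedef uint64_t example_lpid_t;

typedef struct {
    int size;
    example_lpid_t *map;        /* non-NULL: explicit rank -> lpid table */
    example_lpid_t offset;      /* used only when map == NULL */
    example_lpid_t stride;      /* used only when map == NULL, must be > 0 */
} example_group_t;

/* Create a group from an explicit lpid table (the table is copied). */
example_group_t *example_group_create_map(int size, const example_lpid_t *map)
{
    assert(size > 0 && map != NULL);
    example_group_t *g = malloc(sizeof(*g));
    assert(g != NULL);
    g->size = size;
    g->map = malloc((size_t) size * sizeof(*g->map));
    assert(g->map != NULL);
    memcpy(g->map, map, (size_t) size * sizeof(*g->map));
    g->offset = 0;
    g->stride = 0;
    return g;
}

/* Create a group whose rank r maps to offset + r * stride. */
example_group_t *example_group_create_stride(int size, example_lpid_t offset,
                                             example_lpid_t stride)
{
    assert(size > 0 && stride > 0);
    example_group_t *g = malloc(sizeof(*g));
    assert(g != NULL);
    g->size = size;
    g->map = NULL;
    g->offset = offset;
    g->stride = stride;
    return g;
}

/* Look up the lpid of a group rank. */
example_lpid_t example_group_rank_to_lpid(const example_group_t *g, int rank)
{
    assert(rank >= 0 && rank < g->size);
    if (g->map)
        return g->map[rank];
    return g->offset + (example_lpid_t) rank * g->stride;
}

/* Return the rank of lpid in the group, or -1 if it is not a member. */
int example_group_lpid_to_rank(const example_group_t *g, example_lpid_t lpid)
{
    if (g->map) {
        for (int i = 0; i < g->size; i++)
            if (g->map[i] == lpid)
                return i;
        return -1;
    }
    if (lpid < g->offset || (lpid - g->offset) % g->stride != 0)
        return -1;
    example_lpid_t rank = (lpid - g->offset) / g->stride;
    return rank < (example_lpid_t) g->size ? (int) rank : -1;
}

void example_group_free(example_group_t *g)
{
    free(g->map);
    free(g);
}
```

The stride form needs no per-rank storage and answers lpid_to_rank in constant time, while the map form handles arbitrary membership at the cost of a linear search in this sketch.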
This is the same as MPID_Comm_get_lpid.

NOTE: we will remove MPID_Comm_get_lpid as well once we move the
ownership of lpid to the MPIR layer.
@hzhou hzhou force-pushed the 2412_group_comm branch 2 times, most recently from f3257ba to 0cf5832 Compare December 12, 2024 16:41
There is no real difference between lpid and gpid. Thus rename gpid in
the device layer to lpid for clarification.

Replace uint64_t with MPIR_Lpid as the type of lpid. This
improves consistency.
@hzhou hzhou force-pushed the 2412_group_comm branch 3 times, most recently from b204a9a to acda531 Compare December 12, 2024 21:59
We need a device-independent way of identifying processes. One way is to
use the combination of (world_idx, world_rank). Thus, we need to maintain a
list of worlds so that the world_idx points to the world record.

This may not fit the concept of an MPI group, but since the group needs a
way to identify processes, it seems the most closely related place.

The first world, world_idx 0, is always initialized at init.

Due to session re-init, we need to make sure to reset num_worlds to 0 at
finalize.

New worlds will be added upon spawning or connecting dynamic processes
(to-be-implemented).
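
The bookkeeping described in this commit message could be sketched roughly as follows. The table size, the record fields, and all function names are assumptions made for illustration; only the behavior (world 0 created at init, num_worlds reset at finalize, later worlds appended) comes from the text above.

```c
#include <assert.h>
#include <stdio.h>

#define EXAMPLE_MAX_WORLDS 8    /* hypothetical fixed capacity */
#define EXAMPLE_NAME_MAX 64     /* hypothetical name buffer size */

/* Hypothetical world record; the fields in MPICH may differ. */
typedef struct {
    char name[EXAMPLE_NAME_MAX];
    int num_procs;
} example_world_t;

example_world_t example_worlds[EXAMPLE_MAX_WORLDS];
int example_num_worlds = 0;

/* Append a world record and return its world_idx, or -1 if the table is full. */
int example_add_world(const char *name, int num_procs)
{
    if (example_num_worlds >= EXAMPLE_MAX_WORLDS)
        return -1;
    int idx = example_num_worlds++;
    snprintf(example_worlds[idx].name, EXAMPLE_NAME_MAX, "%s", name);
    example_worlds[idx].num_procs = num_procs;
    return idx;
}

/* The first world, world_idx 0, is always created at init. */
void example_init(int world_size)
{
    assert(example_num_worlds == 0);
    int idx = example_add_world("world-0", world_size);
    assert(idx == 0);
    (void) idx;
}

/* Reset the count so that a later re-init starts again from world_idx 0. */
void example_finalize(void)
{
    example_num_worlds = 0;
}
```

Without the reset in example_finalize, a second init after finalize would trip the assertion in example_init, which is the session re-init hazard the message mentions.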
Add builtin MPIR_GROUP_WORLD and MPIR_GROUP_SELF, so we can create
builtin communicators from builtin groups.
Internally the only reason to duplicate a group is to copy from NULL session
to a new session.  Otherwise, we can just use the same group and increment the
reference count.
Since builtin groups can be returned to users, users should be allowed
to free them. They are reference counted anyway.
To make MPI group a first-class citizen, we will always have a group
before creating communicators, so that when the device layer activates
communicators, e.g. in MPID_Comm_commit_pre_hook, it can rely on the
group to look up the involved processes. It also removes the need
to maintain any other process addressing schemes.
In many places we just return MPIR_Group_empty without incrementing the
ref_count. This is fixable. But for now, let's avoid freeing it.
The init_comm does the release manually.
Add assertions to make sure the local_group and remote_group (for
inter communicators) are always set before MPID_Comm_commit_pre_hook.
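
The invariant can be illustrated with a small standalone model. The structs and function names below are hypothetical simplifications, not the MPIR_Comm or MPIR_Group definitions; the sketch only shows the rule that an intracommunicator needs its local_group, and an intercommunicator needs both groups, before the commit hook runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified group and communicator records. */
typedef struct {
    int ref_count;
    int size;
} example_group_t;

typedef enum { EXAMPLE_INTRACOMM, EXAMPLE_INTERCOMM } example_comm_kind_t;

typedef struct {
    example_comm_kind_t kind;
    example_group_t *local_group;
    example_group_t *remote_group;      /* only meaningful for intercomms */
} example_comm_t;

/* The communicator shares the group, so take a reference instead of copying. */
void example_comm_set_local_group(example_comm_t *comm, example_group_t *group)
{
    group->ref_count++;
    comm->local_group = group;
}

/* Return 1 when every group the commit hook relies on has been set. */
int example_comm_groups_ready(const example_comm_t *comm)
{
    if (comm->local_group == NULL)
        return 0;
    if (comm->kind == EXAMPLE_INTERCOMM && comm->remote_group == NULL)
        return 0;
    return 1;
}

/* Stand-in for the commit path: check the invariant, then run device hooks. */
int example_comm_commit(example_comm_t *comm)
{
    assert(example_comm_groups_ready(comm));
    /* A device-layer pre-commit hook would look up processes via the groups here. */
    return 0;
}
```

Taking a reference on the shared group follows the pattern in the diff for MPIR_init_comm_world, where local_group is set to MPIR_GROUP_WORLD_PTR and MPIR_Group_add_ref is called.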
@hzhou
Contributor Author

hzhou commented Dec 13, 2024

test:mpich/ch3/most
test:mpich/ch4/most

All ✔️

@hzhou hzhou marked this pull request as ready for review December 13, 2024 18:56
@hzhou hzhou changed the title from "comm: create communicator from group" to "comm: create local_group/remote_group before comm commit" Dec 22, 2024
@hzhou
Contributor Author

hzhou commented Dec 26, 2024

test:mpich/ch4/ofi

@hzhou hzhou requested a review from yfguo January 2, 2025 19:58
if (sizeof(MPIR_Lpid) == 8) {
lpid_datatype = MPI_UINT64_T;
} else {
MPIR_Assert(sizeof(MPIR_Lpid) == 4);
Contributor

Is there a case where MPIR_Lpid is defined as uint32_t? I thought it has been uint64_t everywhere since #7235.

@@ -30,6 +30,9 @@ int MPIR_init_comm_world(void)
MPIR_Process.comm_world->remote_size = MPIR_Process.size;
Contributor

If the remote_group is NULL, the remote_size should probably be 0.
Or, if we want to keep local_size == remote_size for now to lessen the impact on existing code, we should update the comment in the struct definition and maybe add a TODO for future cleanup.

@@ -30,6 +30,9 @@ int MPIR_init_comm_world(void)
MPIR_Process.comm_world->remote_size = MPIR_Process.size;
MPIR_Process.comm_world->local_size = MPIR_Process.size;

MPIR_Process.comm_world->local_group = MPIR_GROUP_WORLD_PTR;
MPIR_Group_add_ref(MPIR_GROUP_WORLD_PTR);
Contributor

Maybe explicitly set remote_group to NULL to avoid an uninitialized value.

@@ -494,6 +496,7 @@ int MPIR_Comm_create_inter(MPIR_Comm * comm_ptr, MPIR_Group * group_ptr, MPIR_Co

MPIR_Assert(remote_size >= 0);


Contributor

remove blank line.

* to initialize the init_comm, e.g. to eliminate potential
* runtime features for stability during init */
* to initialize the init_comm, e.g. to eliminate potential
* runtime features for stability during init */
Contributor

What is changed here?
