Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gpu: misc fixes #5176

Merged
merged 4 commits into from
Mar 29, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions src/include/mpiimpl.h
Original file line number Diff line number Diff line change
Expand Up @@ -238,6 +238,8 @@ typedef struct MPIR_Session MPIR_Session;
#include "mpir_nettopo.h"
#include "mpir_impl.h"

#include "mpir_gpu_util.h"

/*****************************************************************************/
/******************** PART 6: DEVICE "POST" FUNCTIONALITY ********************/
/*****************************************************************************/
Expand Down
82 changes: 82 additions & 0 deletions src/include/mpir_gpu_util.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
/*
* Copyright (C) by Argonne National Laboratory
* See COPYRIGHT in top-level directory
*/

#ifndef MPIR_GPU_UTIL_H_INCLUDED
#define MPIR_GPU_UTIL_H_INCLUDED

/* If buf is device memory, allocate a host buffer that can hold the data,
* return the host buffer, or NULL if buf is not device memory */
MPL_STATIC_INLINE_PREFIX void *MPIR_gpu_host_alloc(const void *buf,
MPI_Aint count, MPI_Datatype datatype)
{
MPL_pointer_attr_t attr;
MPIR_GPU_query_pointer_attr(buf, &attr);

if (attr.type != MPL_GPU_POINTER_DEV) {
return NULL;
} else {
MPI_Aint extent, true_lb, true_extent;
MPIR_Datatype_get_extent_macro(datatype, extent);
MPIR_Type_get_true_extent_impl(datatype, &true_lb, &true_extent);

MPI_Aint shift = true_lb;
MPI_Aint size = true_extent;
if (count > 1) {
if (extent >= 0) {
size += extent * (count - 1);
} else {
MPI_Aint size_before = (-extent) * (count - 1);
/* extra counts are being packed *before*, so need *increase* the size */
size += size_before;
/* the allocated pointer will be *already* shifted by this amount */
shift -= size_before;
}
}
void *host_buf = MPL_malloc(size, MPL_MEM_OTHER);
MPIR_Assert(host_buf);

host_buf = (char *) host_buf - shift;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So true_lb is always <= 0 when count == 1 or extent >= 0? Am I reading this right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. true_lb works whether it's positive or negative, it is an adjustment from the buffer pointer and the arithmetic will work for either positive or negative. However, the true_lb returned from MPIR_Type_get_true_extent_impl is the true_lb for a single count datatype. When extent is positive, the true_lb remain the same for multiple count. so no adjustment is needed. However, when extent is negative, the true_lb with multiple count will be further extended into the negative direction, thus requires further adjustment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm trying to understand is we are returning an allocated buffer, but with the address potentially shifted outside of the malloc'd region in the case of positive true_lb. Wouldn't writing to or dereferencing that address lead to a potential crash? How do we guard against these scenarios? Is it because MPIR_Localcopy will only be used with a datatype that would not do this?

Copy link
Contributor Author

@hzhou hzhou Mar 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't writing to or dereferencing that address lead to a potential crash?

The dataype packing/unpacking routine should only access the actual data location specified by the MPI datatype. Only a bug will result in a wrong access.

PS: think about MPI_BOTTOM, it is okay as long as the datatype is correctly constructed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, only accessing this with a datatype-aware copy routine makes sense. I think it should be OK. I'd still like to see it go thru Jenkins again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code has already passed Jenkins both here #5176 (comment) and in the gpu testing #5000 (comment) . The last push is only a rebase with PR #5005

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my previous comments, I am going to go ahead merge this, then get #5000 rebased. Any issues should pop up in that PR anyway.

return host_buf;
Copy link
Contributor Author

@hzhou hzhou Mar 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raffenet Thought about it a bit more. The previous fix didn't account for the case when extent is negative, which will put the extra count data before the first data, thus we may need to allocate extra size to account for.

Could you review this commit again? Meanwhile, I'll rebase and test it in PR #5000 -- although I don't think DTPool will generate any datatype with negative extent.

}
}

MPL_STATIC_INLINE_PREFIX void MPIR_gpu_host_free(void *host_buf,
MPI_Aint count, MPI_Datatype datatype)
{
MPI_Aint extent, true_lb, true_extent;
MPIR_Datatype_get_extent_macro(datatype, extent);
MPIR_Type_get_true_extent_impl(datatype, &true_lb, &true_extent);

MPI_Aint shift = true_lb;
if (count > 1 && extent < 0) {
MPI_Aint size_before = (-extent) * (count - 1);
shift -= size_before;
}
/* The pointers was previously shifted */
host_buf = (char *) host_buf + shift;
MPL_free(host_buf);
}

/* If buf is device memory, allocate a host buffer and copy the content, and return the host
* buffer. Return NULL if buf is not a device memory */
MPL_STATIC_INLINE_PREFIX void *MPIR_gpu_host_swap(const void *buf,
MPI_Aint count, MPI_Datatype datatype)
{
void *host_buf = MPIR_gpu_host_alloc(buf, count, datatype);
if (host_buf) {
MPIR_Localcopy(buf, count, datatype, host_buf, count, datatype);
}

return host_buf;
}

MPL_STATIC_INLINE_PREFIX void MPIR_gpu_swap_back(void *host_buf, void *gpu_buf,
MPI_Aint count, MPI_Datatype datatype)
{
MPIR_Localcopy(host_buf, count, datatype, gpu_buf, count, datatype);
MPIR_gpu_host_free(host_buf, count, datatype);
}

#endif /* MPIR_GPU_UTIL_H_INCLUDED */
68 changes: 27 additions & 41 deletions src/mpi/coll/src/coll_impl.c
Original file line number Diff line number Diff line change
Expand Up @@ -209,45 +209,25 @@ int MPII_Coll_comm_cleanup(MPIR_Comm * comm)
goto fn_exit;
}

/* For reduction-type collective, this routine swaps the (potential) device buffers
* with allocated host-buffer */
void MPIR_Coll_host_buffer_alloc(const void *sendbuf, const void *recvbuf, MPI_Aint count,
MPI_Datatype datatype, void **host_sendbuf, void **host_recvbuf)
{
MPL_pointer_attr_t attr;
*host_sendbuf = NULL;
*host_recvbuf = NULL;
MPI_Aint extent = 0;

void *tmp;
if (sendbuf != MPI_IN_PLACE) {
MPIR_GPU_query_pointer_attr(sendbuf, &attr);
if (attr.type == MPL_GPU_POINTER_DEV) {
MPI_Aint true_extent;
MPI_Aint true_lb;

MPIR_Datatype_get_extent_macro(datatype, extent);
MPIR_Type_get_true_extent_impl(datatype, &true_lb, &true_extent);
extent = MPL_MAX(extent, true_extent);

*host_sendbuf = MPL_malloc(extent * count, MPL_MEM_COLL);
MPIR_Assert(*host_sendbuf);
MPIR_Localcopy(sendbuf, count, datatype, *host_sendbuf, count, datatype);
}
tmp = MPIR_gpu_host_swap(sendbuf, count, datatype);
*host_sendbuf = tmp;
} else {
*host_sendbuf = NULL;
}

MPIR_GPU_query_pointer_attr(recvbuf, &attr);
if (attr.type == MPL_GPU_POINTER_DEV) {
if (!extent) {
MPI_Aint true_extent;
MPI_Aint true_lb;

MPIR_Datatype_get_extent_macro(datatype, extent);
MPIR_Type_get_true_extent_impl(datatype, &true_lb, &true_extent);
extent = MPL_MAX(extent, true_extent);
}

*host_recvbuf = MPL_malloc(extent * count, MPL_MEM_COLL);
MPIR_Assert(*host_recvbuf);
if (sendbuf == MPI_IN_PLACE)
MPIR_Localcopy(recvbuf, count, datatype, *host_recvbuf, count, datatype);
if (sendbuf == MPI_IN_PLACE) {
tmp = MPIR_gpu_host_swap(recvbuf, count, datatype);
*host_recvbuf = tmp;
} else {
tmp = MPIR_gpu_host_alloc(recvbuf, count, datatype);
*host_recvbuf = tmp;
}
}

Expand All @@ -260,23 +240,29 @@ void MPIR_Coll_host_buffer_free(void *host_sendbuf, void *host_recvbuf)
void MPIR_Coll_host_buffer_swap_back(void *host_sendbuf, void *host_recvbuf, void *in_recvbuf,
MPI_Aint count, MPI_Datatype datatype, MPIR_Request * request)
{
if (host_recvbuf == NULL) {
/* no copy at completion necessary, just return */
if (!host_sendbuf && !host_recvbuf) {
/* no copy (or free) at completion necessary, just return */
return;
}

if (request == NULL || MPIR_Request_is_complete(request)) {
/* operation is complete, copy the data and return */
MPIR_Localcopy(host_recvbuf, count, datatype, in_recvbuf, count, datatype);
MPIR_Coll_host_buffer_free(host_sendbuf, host_recvbuf);
if (host_sendbuf) {
MPIR_gpu_host_free(host_sendbuf, count, datatype);
}
if (host_recvbuf) {
MPIR_gpu_swap_back(host_recvbuf, in_recvbuf, count, datatype);
}
return;
}

/* data will be copied later during request completion */
request->u.nbc.coll.host_sendbuf = host_sendbuf;
request->u.nbc.coll.host_recvbuf = host_recvbuf;
request->u.nbc.coll.user_recvbuf = in_recvbuf;
request->u.nbc.coll.count = count;
request->u.nbc.coll.datatype = datatype;
MPIR_Datatype_add_ref_if_not_builtin(datatype);
if (host_recvbuf) {
request->u.nbc.coll.user_recvbuf = in_recvbuf;
request->u.nbc.coll.count = count;
request->u.nbc.coll.datatype = datatype;
MPIR_Datatype_add_ref_if_not_builtin(datatype);
}
}
12 changes: 8 additions & 4 deletions src/mpi/request/mpir_request.c
Original file line number Diff line number Diff line change
Expand Up @@ -78,12 +78,16 @@ int MPIR_Request_completion_processing(MPIR_Request * request_ptr, MPI_Status *
case MPIR_REQUEST_KIND__COLL:
{
MPII_Coll_req_t *coll = &request_ptr->u.nbc.coll;

if (coll->host_sendbuf) {
MPIR_gpu_host_free(coll->host_sendbuf, coll->count, coll->datatype);
}

if (coll->host_recvbuf) {
MPIR_Localcopy(coll->host_recvbuf, coll->count, coll->datatype,
coll->user_recvbuf, coll->count, coll->datatype);
MPIR_gpu_swap_back(coll->host_recvbuf, coll->user_recvbuf,
coll->count, coll->datatype);
MPIR_Datatype_release_if_not_builtin(coll->datatype);
}
MPIR_Coll_host_buffer_free(coll->host_sendbuf, coll->host_recvbuf);
MPIR_Datatype_release_if_not_builtin(coll->datatype);

MPIR_Request_extract_status(request_ptr, status);
mpi_errno = request_ptr->status.MPI_ERROR;
Expand Down
16 changes: 15 additions & 1 deletion src/mpid/ch4/generic/am/mpidig_am_recv_utils.h
Original file line number Diff line number Diff line change
Expand Up @@ -19,13 +19,24 @@
rreq->status.MPI_SOURCE, rreq->status.MPI_TAG, \
(int) data_sz, (int) in_data_sz)

MPL_STATIC_INLINE_PREFIX void MPIDIG_recv_set_buffer_attr(MPIR_Request * rreq)
{
MPL_pointer_attr_t attr;
MPIR_GPU_query_pointer_attr(MPIDIG_REQUEST(rreq, buffer), &attr);

MPIDIG_rreq_async_t *p = &(MPIDIG_REQUEST(rreq, req->recv_async));
p->is_device_buffer = (attr.type == MPL_GPU_POINTER_DEV);
}

/* caching recv buffer information */
MPL_STATIC_INLINE_PREFIX void MPIDIG_recv_type_init(MPI_Aint in_data_sz, MPIR_Request * rreq)
{
MPIDIG_rreq_async_t *p = &(MPIDIG_REQUEST(rreq, req->recv_async));
p->recv_type = MPIDIG_RECV_DATATYPE;
p->in_data_sz = in_data_sz;

MPIDIG_recv_set_buffer_attr(rreq);

MPI_Aint max_data_size;
MPIR_Datatype_get_size_macro(MPIDIG_REQUEST(rreq, datatype), max_data_size);
max_data_size *= MPIDIG_REQUEST(rreq, count);
Expand All @@ -48,6 +59,8 @@ MPL_STATIC_INLINE_PREFIX void MPIDIG_recv_init(int is_contig, MPI_Aint in_data_s
p->iov_ptr = data;
p->iov_num = data_sz;
}

MPIDIG_recv_set_buffer_attr(rreq);
}

MPL_STATIC_INLINE_PREFIX void MPIDIG_recv_finish(MPIR_Request * rreq)
Expand All @@ -62,7 +75,7 @@ MPL_STATIC_INLINE_PREFIX void MPIDIG_recv_finish(MPIR_Request * rreq)
* Providing helper routine keeps the internal of MPIDIG_rreq_async_t here.
*/
MPL_STATIC_INLINE_PREFIX void MPIDIG_convert_datatype(MPIR_Request * rreq);
MPL_STATIC_INLINE_PREFIX void MPIDIG_get_recv_data(int *is_contig, void **p_data,
MPL_STATIC_INLINE_PREFIX void MPIDIG_get_recv_data(int *is_contig, int *is_gpu, void **p_data,
MPI_Aint * p_data_sz, MPIR_Request * rreq)
{
MPIDIG_rreq_async_t *p = &(MPIDIG_REQUEST(rreq, req->recv_async));
Expand All @@ -80,6 +93,7 @@ MPL_STATIC_INLINE_PREFIX void MPIDIG_get_recv_data(int *is_contig, void **p_data
*p_data = p->iov_ptr;
*p_data_sz = p->iov_num;
}
*is_gpu = p->is_device_buffer;
}

/* Sometime the transport just need info to make algorithm choice */
Expand Down
1 change: 1 addition & 0 deletions src/mpid/ch4/include/mpidpre.h
Original file line number Diff line number Diff line change
Expand Up @@ -167,6 +167,7 @@ typedef enum {

typedef struct MPIDIG_req_async {
MPIDIG_recv_type recv_type;
bool is_device_buffer;
MPI_Aint in_data_sz;
MPI_Aint offset;
struct iovec *iov_ptr; /* used with MPIDIG_RECV_IOV */
Expand Down
10 changes: 7 additions & 3 deletions src/mpid/ch4/netmod/ofi/ofi_am_events.h
Original file line number Diff line number Diff line change
Expand Up @@ -431,11 +431,15 @@ MPL_STATIC_INLINE_PREFIX int do_long_am_recv(MPI_Aint in_data_sz, MPIR_Request *
/* noncontig data with mostly tiny segments */
mpi_errno = do_long_am_recv_unpack(in_data_sz, rreq, lmt_msg);
} else {
int is_contig;
int is_contig, is_gpu;
void *p_data;
MPI_Aint data_sz;
MPIDIG_get_recv_data(&is_contig, &p_data, &data_sz, rreq);
if (is_contig) {
MPIDIG_get_recv_data(&is_contig, &is_gpu, &p_data, &data_sz, rreq);

if (is_gpu) {
/* Use unpack for GPU buffer */
mpi_errno = do_long_am_recv_unpack(in_data_sz, rreq, lmt_msg);
} else if (is_contig) {
mpi_errno = do_long_am_recv_contig(p_data, data_sz, in_data_sz, rreq, lmt_msg);
} else {
mpi_errno = do_long_am_recv_iov(p_data, data_sz, in_data_sz, rreq, lmt_msg);
Expand Down
29 changes: 8 additions & 21 deletions src/mpid/ch4/src/ch4_impl.h
Original file line number Diff line number Diff line change
Expand Up @@ -1024,23 +1024,12 @@ MPL_STATIC_INLINE_PREFIX int MPIDIG_compute_acc_op(void *source_buf, int source_
/* --END ERROR HANDLING-- */
}

void *in_targetbuf = target_buf;
void *save_targetbuf = NULL;
void *host_targetbuf = NULL;
MPL_pointer_attr_t attr;
MPIR_GPU_query_pointer_attr(target_buf, &attr);
/* FIXME: use typerep/yaksa GPU-aware accumulate when available */
if (attr.type == MPL_GPU_POINTER_DEV) {
MPI_Aint extent, true_extent;
MPI_Aint true_lb;

MPIR_Datatype_get_extent_macro(target_dtp, extent);
MPIR_Type_get_true_extent_impl(target_dtp, &true_lb, &true_extent);
extent = MPL_MAX(extent, true_extent);

host_targetbuf = MPL_malloc(extent * target_count, MPL_MEM_RMA);
MPIR_Assert(host_targetbuf);
MPIR_Localcopy(target_buf, target_count, target_dtp, host_targetbuf, target_count,
target_dtp);

host_targetbuf = MPIR_gpu_host_swap(target_buf, target_count, target_dtp);
if (host_targetbuf) {
save_targetbuf = target_buf;
target_buf = host_targetbuf;
}

Expand Down Expand Up @@ -1142,11 +1131,9 @@ MPL_STATIC_INLINE_PREFIX int MPIDIG_compute_acc_op(void *source_buf, int source_
MPL_free(typerep_vec);
}

if (host_targetbuf) {
target_buf = in_targetbuf;
MPIR_Localcopy(host_targetbuf, target_count, target_dtp, target_buf, target_count,
target_dtp);
MPL_free(host_targetbuf);
if (save_targetbuf) {
MPIR_gpu_swap_back(target_buf, save_targetbuf, target_count, target_dtp);
target_buf = save_targetbuf;
}

fn_exit:
Expand Down