-
Notifications
You must be signed in to change notification settings - Fork 879
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issues running OMPI with address sanitizer #11826
Comments
in function opal_argv_count() int opal_argv_count(char **argv)
{
char **p;
int i;
if (NULL == argv) {
return 0;
}
for (i = 0, p = argv; *p; i++, p++) { //<=
continue;
}
return i;
} it accesses the array until it gets to its end. This causes a heap overflow that prohibits address sanitizer from running. |
When it counts the args (*p) after the third it accesses a memory out on the range. |
@nivShpak It's still not clear to me. Can you please a) post the output of address sanitizer; and b) explain how the added enum member changes the behavior of the loop you posted? |
@jsquyres @devreal The problem is the following. Currently, the size of the array is exactly the number of colors. ompi/opal/runtime/opal_params_core.h Line 50 in 1717930
However, the code snippet provided by @nivShpak for (i = 0, p = argv; *p; i++, p++) { //<= Assumes that there is a NULL element in the end. |
Are you saying that It feels weird to me to add a meaningless enum that has a side effect of lengthening an array that may have a further side effect of fixing a memory error. I think the real questions are:
Or am I missing something? |
Correct
Yes it's not. typedef enum {
...
OPAL_VAR_DUMP_COLOR_KEY_COUNT
} opal_var_dump_color_key_t;
extern char *opal_var_dump_color[OPAL_VAR_DUMP_COLOR_KEY_COUNT]; Maybe using a different name would be better, like If the naming space is precious and you don't want to pollute it with the NULL_TERM element, the other solution would be to define the macro to calculate the array size; typedef enum {
OPAL_VAR_DUMP_COLOR_VAR_NAME = 0,
OPAL_VAR_DUMP_COLOR_VAR_VALUE = 1,
OPAL_VAR_DUMP_COLOR_VALID_VALUES = 2,
OPAL_VAR_DUMP_COLOR_KEY_COUNT
} opal_var_dump_color_key_t;
#define OPAL_VAR_DUMP_COLOR_SIZE (OPAL_VAR_DUMP_COLOR_KEY_COUNT + 1)
extern char *opal_var_dump_color[OPAL_VAR_DUMP_COLOR_SIZE];
No. As long as it's long enough I believe that the static nature of the array allows it to be initialized correctly. |
@nivShpak also, I assume we should add a more meaningful description to the commit message to detail what is the fix. |
I think declaring the array like this is fine: char *opal_var_dump_color[OPAL_VAR_DUMP_COLOR_KEY_COUNT + 1]; It's well known/understood in C that if you have a NULL-terminated array, then you need to have an additional element. I think having an otherwise-meaningless enum would be a little weird, especially since there's no other precedent for that in OMPI (i.e., if the OMPI code base commonly used that convention, that would probably be acceptable -- but as a one-off, it's a little weird). Alternatively, if using |
Ok, @nivShpak please change accordingly.
That sounds like a more disruptive task that @nivShpak don't have cycles for. So I'd stick to the short-term fix and have that other item in the TODO list for later. |
since opal_var_dump_color_keys is used with opal_argv_count() and friends, make sure is is long enough and NULL terminated Refs. open-mpi#11826 Signed-off-by: Gilles Gouaillardet <[email protected]>
I just issued #11828 since I think |
@ggouaillardet I was hoping you guys will let @nivShpak to commit this one as he spent time debugging this. |
My comment was aimed at the fact that all other entries in the enum have explicit values assigned to them, but not the last one. This raises the question of someone having forgotten to assign it - and someone mistakenly assigning it the next integer value in the list. It also opens the door to someone later removing the Explicitly setting the value seems like simple defensive programming. Just a suggestion. |
He will fix this one. He doesn’t have cycles to change the behavior of the function as @jsquyres suggested |
We will update the pr based on the discussion here |
@jsquyres carefully wrote "if using My opinion is this is the right thing to do here, so the correct fix is to make |
I think that the 0-initialization is in C from the very beginning (C89):
Per the link above that cites the standard, it doesn't matter if it's static or just global - the remaining is set to zeros. Even for auto vars if the first few items are initialized "the remainder of the aggregate shall be initialized implicitly the same as objects that have static storage duration". So relying on this should be safe in any circumstance.
I wasn't arguing for or against it. I've just said "So I'd stick to the short-term fix and have that other item in the TODO list for later." referring to #11827. Sorry for the confusion. |
Hi, Re: the root cause: It looks like the culprit are the calls to ompi/opal/runtime/opal_params_core.c Line 243 in a8fbb4d
ompi/opal/runtime/opal_params_core.c Line 477 in a8fbb4d
Both of these calls specify the size of the array, but (apparently, oops!) Line 349 in a8fbb4d
Note that there are also some other calls to |
My $0.02: I think we came to the conclusion here that the "add another enum value" solution that was (at least initially?) in #11827 was a little too indirect, and the commit message and lack of comments didn't make the problem or the solution clear. Upon further discussion, the rationale for the additional enum came out, and alternatives were suggested. @ggouaillardet coded up one of these alternatives, based on perhaps a misunderstanding that NVIDIA didn't have time to pursue this further. I don't think anyone is at fault here. |
since opal_var_dump_color_keys is used with opal_argv_count() and friends, make sure is is long enough and NULL terminated Thanks to Niv Shpak for reporting this and identifying the root cause. Refs. open-mpi#11826 Signed-off-by: Gilles Gouaillardet <[email protected]>
Refs. open-mpi#11826 Signed-off-by: Gilles Gouaillardet <[email protected]>
#11827 is updated per the discussion here incorporating the initialization fix from @ggouaillardet |
Refs. open-mpi#11826 Signed-off-by: Gilles Gouaillardet <[email protected]>
Refs. open-mpi#11826 Signed-off-by: Gilles Gouaillardet <[email protected]>
Refs. open-mpi#11826 Signed-off-by: Gilles Gouaillardet <[email protected]>
Refs. open-mpi#11826 Signed-off-by: Gilles Gouaillardet <[email protected]>
I created #11829 as another alternative. It does away with (@nivShpak could you confirm if it solves the initial issue?) |
I ran into this issue again yesterday and today. After reviewing the proposed solutions, I'm going to merge #11828, @ggouaillardet's solution.
|
since opal_var_dump_color_keys is used with opal_argv_count() and friends, make sure is is long enough and NULL terminated Thanks to Niv Shpak for reporting this and identifying the root cause. Refs. open-mpi#11826 Signed-off-by: Gilles Gouaillardet <[email protected]>
since opal_var_dump_color_keys is used with opal_argv_count() and friends, make sure is is long enough and NULL terminated Thanks to Niv Shpak for reporting this and identifying the root cause. Refs. open-mpi#11826 Signed-off-by: Gilles Gouaillardet <[email protected]> (cherry picked from commit 7b325e3)
fixed with #12030 closing |
since opal_var_dump_color_keys is used with opal_argv_count() and friends, make sure is is long enough and NULL terminated Thanks to Niv Shpak for reporting this and identifying the root cause. Refs. open-mpi#11826 Signed-off-by: Gilles Gouaillardet <[email protected]>
mpicc and ompirun fail to run when compiled with "-fsanitize=address" due to a memory overflow in OPAL arg parsing.
The text was updated successfully, but these errors were encountered: