Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues running OMPI with address sanitizer #11826

Closed
nivShpak opened this issue Jul 17, 2023 · 26 comments · Fixed by #12030
Closed

Issues running OMPI with address sanitizer #11826

nivShpak opened this issue Jul 17, 2023 · 26 comments · Fixed by #12030
Assignees
Labels

Comments

@nivShpak
Copy link

mpicc and ompirun fail to run when compiled with "-fsanitize=address" due to a memory overflow in OPAL arg parsing.

@devreal
Copy link
Contributor

devreal commented Jul 17, 2023

@nivShpak Could you please provide the error you encounter? It's not clear what #11827 really fixes.

@nivShpak
Copy link
Author

nivShpak commented Jul 17, 2023

in function opal_argv_count()

int opal_argv_count(char **argv)
{
    char **p;
    int i;

    if (NULL == argv) {
        return 0;
    }

    for (i = 0, p = argv; *p; i++, p++) {  //<=
        continue;
    }

    return i;
}

it accesses the array until it gets to its end. This causes a heap overflow that prohibits address sanitizer from running.
When I try to compile an ompi application using mpicc the mpicc is crushing at that point.
the fix avoids this overflow access so we can debug with -fsanitize=address flag.

@jsquyres
Copy link
Member

@nivShpak I agree with @devreal -- I don't see how the code snipit you showed above (opal_argv_count()) is related to the enum that you added in #11827. Can you explain in more detail? Thanks.

@nivShpak
Copy link
Author

When it counts the args (*p) after the third it accesses a memory out on the range.
This access fails the program if it was compiled with address sanitizer
In order to run an OMPI app with address sanitizer, we need to compile ompi with it, but mpicc and prirun cannot run with address sanitizer because of this.

@devreal
Copy link
Contributor

devreal commented Jul 17, 2023

@nivShpak It's still not clear to me. Can you please a) post the output of address sanitizer; and b) explain how the added enum member changes the behavior of the loop you posted?

@artpol84
Copy link
Contributor

artpol84 commented Jul 17, 2023

@jsquyres @devreal
we encountered the issue when trying to use Address Sanitizer as @nivShpak mentioned.

The problem is the following. Currently, the size of the array is exactly the number of colors.

extern char *opal_var_dump_color[OPAL_VAR_DUMP_COLOR_KEY_COUNT];

However, the code snippet provided by @nivShpak

  for (i = 0, p = argv; *p; i++, p++) {  //<=

Assumes that there is a NULL element in the end.
The idea of the fix is to introduce one extra element at the end of the array to address that.

@jsquyres
Copy link
Member

Are you saying that opal_var_dump_color[OPAL_VAR_DUMP_COLOR_KEY_COUNT - 1] != NULL? If so, isn't that what should be fixed?

It feels weird to me to add a meaningless enum that has a side effect of lengthening an array that may have a further side effect of fixing a memory error. I think the real questions are:

  • Is opal_var_dump_color[] not long enough?
    • If it's not long enough, add 1 to the length -- don't add a meaningless enum.
  • Is opal_var_dump_color[] not initialized properly?
    • If it's not initialized properly, then initialize it properly.

Or am I missing something?

@artpol84
Copy link
Contributor

artpol84 commented Jul 17, 2023

Are you saying that opal_var_dump_color[OPAL_VAR_DUMP_COLOR_KEY_COUNT - 1] != NULL?

Correct

It feels weird to me to add a meaningless enum that has a side effect of lengthening an array that may have a further side effect of fixing a memory error. I think the real questions are:

  • Is opal_var_dump_color[] not long enough?

Yes it's not.
We chose to use ENUM because the enum element is currently used to identify its length:

typedef enum {
...
    OPAL_VAR_DUMP_COLOR_KEY_COUNT
} opal_var_dump_color_key_t;
extern char *opal_var_dump_color[OPAL_VAR_DUMP_COLOR_KEY_COUNT];

Maybe using a different name would be better, like OPAL_VAR_DUMP_COLOR_VALID_NULL_TERM would make it cleaner as it will explain that there's actually an element in the array that has to be there to ensure correctness.
As ENUM is used for indexing it might be a good idea to cover all of the cases explicitly for maintenance purpose

If the naming space is precious and you don't want to pollute it with the NULL_TERM element, the other solution would be to define the macro to calculate the array size;

typedef enum {
    OPAL_VAR_DUMP_COLOR_VAR_NAME = 0,
    OPAL_VAR_DUMP_COLOR_VAR_VALUE = 1,
    OPAL_VAR_DUMP_COLOR_VALID_VALUES = 2,
    OPAL_VAR_DUMP_COLOR_KEY_COUNT
} opal_var_dump_color_key_t;

#define OPAL_VAR_DUMP_COLOR_SIZE (OPAL_VAR_DUMP_COLOR_KEY_COUNT + 1)
extern char *opal_var_dump_color[OPAL_VAR_DUMP_COLOR_SIZE];
  • If it's not long enough, add 1 to the length -- don't add a meaningless enum.
  • Is opal_var_dump_color[] not initialized properly?

No. As long as it's long enough I believe that the static nature of the array allows it to be initialized correctly.

@artpol84
Copy link
Contributor

@nivShpak also, I assume we should add a more meaningful description to the commit message to detail what is the fix.
The confusion here is confirming that.

@jsquyres
Copy link
Member

jsquyres commented Jul 17, 2023

I think declaring the array like this is fine:

char *opal_var_dump_color[OPAL_VAR_DUMP_COLOR_KEY_COUNT + 1];

It's well known/understood in C that if you have a NULL-terminated array, then you need to have an additional element. I think having an otherwise-meaningless enum would be a little weird, especially since there's no other precedent for that in OMPI (i.e., if the OMPI code base commonly used that convention, that would probably be acceptable -- but as a one-off, it's a little weird).

Alternatively, if using opal_argv_count() is the Wrong Thing, and we should instead be explicitly passing around the length of the array, or know that the length of the array is OPAL_VAR_DUMP_COLOR_KEY_COUNT, that would also be fine. The usage would need to be updated in all the appropriate places.

@artpol84
Copy link
Contributor

I think declaring the array like this is fine:

char *opal_var_dump_color[OPAL_VAR_DUMP_COLOR_KEY_COUNT + 1];

Ok, @nivShpak please change accordingly.

Alternatively, if using opal_argv_count() is the Wrong Thing, and we should instead be explicitly passing around the length of the array, or know that the length of the array is OPAL_VAR_DUMP_COLOR_KEY_COUNT, that would also be fine. The usage would need to be updated in all the appropriate places.

That sounds like a more disruptive task that @nivShpak don't have cycles for. So I'd stick to the short-term fix and have that other item in the TODO list for later.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Jul 18, 2023
since opal_var_dump_color_keys is used with opal_argv_count()
and friends, make sure is is long enough and NULL terminated

Refs. open-mpi#11826

Signed-off-by: Gilles Gouaillardet <[email protected]>
@ggouaillardet
Copy link
Contributor

I just issued #11828 since I think opal_var_dump_color should be long enough and NULL terminated

@artpol84
Copy link
Contributor

@ggouaillardet I was hoping you guys will let @nivShpak to commit this one as he spent time debugging this.
The fix is trivial, but he's the one to discover that and to bring this up

@artpol84
Copy link
Contributor

I just issued #11828 since I think opal_var_dump_color should be long enough and NULL terminated

This is exactly what #11827 was fixing

@ggouaillardet
Copy link
Contributor

I thought @nivShpak did not have the cycles to fix the root cause.

#11827 relies on the compiler initializing the last element to NULL. That might be true since C99 but I do not think we rely on this in other parts of the code.

@rhc54
Copy link
Contributor

rhc54 commented Jul 18, 2023

My comment was aimed at the fact that all other entries in the enum have explicit values assigned to them, but not the last one. This raises the question of someone having forgotten to assign it - and someone mistakenly assigning it the next integer value in the list. It also opens the door to someone later removing the static declarative and suddenly letting that entry default to whatever was in the memory space.

Explicitly setting the value seems like simple defensive programming. Just a suggestion.

@artpol84
Copy link
Contributor

I thought @nivShpak did not have the cycles to fix the root cause.

#11827 relies on the compiler initializing the last element to NULL. That might be true since C99 but I do not think we rely on this in other parts of the code.

He will fix this one. He doesn’t have cycles to change the behavior of the function as @jsquyres suggested

@artpol84
Copy link
Contributor

We will update the pr based on the discussion here

@ggouaillardet
Copy link
Contributor

@jsquyres carefully wrote "if using opal_argv_count() is the Wrong Thing"

My opinion is this is the right thing to do here, so the correct fix is to make opal_var_dump_color NULL-terminated.

@artpol84
Copy link
Contributor

artpol84 commented Jul 18, 2023

#11827 relies on the compiler initializing the last element to NULL. That might be true since C99 but I do not think we rely on this in other parts of the code.

I think that the 0-initialization is in C from the very beginning (C89):
https://stackoverflow.com/questions/16015656/are-global-variables-always-initialized-to-zero-in-c

It also opens the door to someone later removing the static declarative and suddenly letting that entry default to whatever was in the memory space.

Per the link above that cites the standard, it doesn't matter if it's static or just global - the remaining is set to zeros. Even for auto vars if the first few items are initialized "the remainder of the aggregate shall be initialized implicitly the same as objects that have static storage duration". So relying on this should be safe in any circumstance.
However, I agree it's not bad when it's explicit.

@jsquyres carefully wrote "if using opal_argv_count() is the Wrong Thing"

I wasn't arguing for or against it. I've just said "So I'd stick to the short-term fix and have that other item in the TODO list for later." referring to #11827. Sorry for the confusion.

@gkatev
Copy link
Contributor

gkatev commented Jul 18, 2023

Hi, Re: the root cause:

It looks like the culprit are the calls to opal_argv_join_range(), in these spots:

string = opal_argv_join_range(opal_var_dump_color_keys, 0, OPAL_VAR_DUMP_COLOR_KEY_COUNT, ',');

char *valid_keys = opal_argv_join_range(key_names, 0, key_count, ',');

Both of these calls specify the size of the array, but (apparently, oops!) opal_argv_join_range() calls opal_argv_count() nontheless for error checking. I suppose this is reasonable for opal_argv_join_range()'s implementation to do, as it's meant to be used with NULL-terminated arrays (?).

if (NULL == argv || NULL == argv[0] || (int) start >= opal_argv_count(argv)) {

opal_var_dump_color_keys wasn't initially meant to be a NULL-terminated array or be used with argv functions, but perhaps it's a good fit. Personally I think #11828 is okay as a solution. An alternative is two replace both of the above calls to opal_argv_join_range() with an opal_asprintf loop; I could whip up a patch.

Note that there are also some other calls to opal_argv_* functions inside parse_color_string(), besides join_range, but these operate not on the key names array, but on arrays returned from opal_argv_split().

@jsquyres
Copy link
Member

My $0.02: I think we came to the conclusion here that the "add another enum value" solution that was (at least initially?) in #11827 was a little too indirect, and the commit message and lack of comments didn't make the problem or the solution clear. Upon further discussion, the rationale for the additional enum came out, and alternatives were suggested. @ggouaillardet coded up one of these alternatives, based on perhaps a misunderstanding that NVIDIA didn't have time to pursue this further.

I don't think anyone is at fault here.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Jul 18, 2023
since opal_var_dump_color_keys is used with opal_argv_count()
and friends, make sure is is long enough and NULL terminated

Thanks to Niv Shpak for reporting this and identifying the root cause.

Refs. open-mpi#11826

Signed-off-by: Gilles Gouaillardet <[email protected]>
nivShpak pushed a commit to nivShpak/ompi that referenced this issue Jul 18, 2023
@artpol84
Copy link
Contributor

#11827 is updated per the discussion here incorporating the initialization fix from @ggouaillardet

nivShpak pushed a commit to nivShpak/ompi that referenced this issue Jul 18, 2023
nivShpak pushed a commit to nivShpak/ompi that referenced this issue Jul 18, 2023
nivShpak pushed a commit to nivShpak/ompi that referenced this issue Jul 18, 2023
nivShpak pushed a commit to nivShpak/ompi that referenced this issue Jul 18, 2023
@gkatev
Copy link
Contributor

gkatev commented Jul 19, 2023

I created #11829 as another alternative. It does away with opal_argv_join_range(). With these calls gone, I don't see another reason that opal_var_dump_color_keys[] need be converted to a NULL-terminated array, solving the dilemma for the most elegant way of doing it.

(@nivShpak could you confirm if it solves the initial issue?)

@jsquyres
Copy link
Member

jsquyres commented Oct 29, 2023

I ran into this issue again yesterday and today. After reviewing the proposed solutions, I'm going to merge #11828, @ggouaillardet's solution.

  • It doesn't add a new #define name
  • It NULL-terminates the array, which is a good defensive-programming measure (slightly better than just removing the current need to have the array be NULL-terminated)
  • It is only 1 commit (vs. 2)
  • It has a clear PR title and commit message (which is good for history)
  • It properly credits the original submitter and proposed fix

jsquyres pushed a commit to ggouaillardet/ompi that referenced this issue Oct 29, 2023
since opal_var_dump_color_keys is used with opal_argv_count()
and friends, make sure is is long enough and NULL terminated

Thanks to Niv Shpak for reporting this and identifying the root cause.

Refs. open-mpi#11826

Signed-off-by: Gilles Gouaillardet <[email protected]>
jsquyres pushed a commit to jsquyres/ompi that referenced this issue Oct 29, 2023
since opal_var_dump_color_keys is used with opal_argv_count()
and friends, make sure is is long enough and NULL terminated

Thanks to Niv Shpak for reporting this and identifying the root cause.

Refs. open-mpi#11826

Signed-off-by: Gilles Gouaillardet <[email protected]>
(cherry picked from commit 7b325e3)
@jsquyres jsquyres linked a pull request Oct 29, 2023 that will close this issue
@janjust
Copy link
Contributor

janjust commented Nov 9, 2023

fixed with #12030 closing

@janjust janjust closed this as completed Nov 9, 2023
bosilca pushed a commit to bosilca/ompi that referenced this issue Feb 14, 2024
since opal_var_dump_color_keys is used with opal_argv_count()
and friends, make sure is is long enough and NULL terminated

Thanks to Niv Shpak for reporting this and identifying the root cause.

Refs. open-mpi#11826

Signed-off-by: Gilles Gouaillardet <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
8 participants