
Fix dist tests #147

Merged: 44 commits merged from fix_dist_tests into IntelligentSoftwareSystems:master on May 2, 2020

Conversation
Conversation

@l-hoang (Member) commented Apr 24, 2020

Reducing the amount of compute the dist tests do; PR opened here to run the CI tests.

@roshandathathri (Member)

LGTM

@l-hoang (Member, Author) commented Apr 24, 2020

Actually, there are still some failures I need to resolve, according to some of these CI test results.

Have we found a way around the -env flag yet, Roshan? It's causing some configs to fail completely.

@roshandathathri (Member)

@ddn0 has pushed a fix for the -env flag (not sure if it has been merged to master yet).

@insertinterestingnamehere (Member)

Someone ran into this over at dmlc/cxxnet#211 (comment). It looks like with Open MPI the -x flag can be used instead of -env. Offhand I'm not sure about any formatting differences between the two.
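For reference, the two launchers do spell this differently (a sketch; the variable name and test binary are placeholders, not taken from this CI setup):

  # MPICH/Hydra: -env takes the name and the value as separate arguments
  mpiexec -n 2 -env MY_VAR 1 ./dist-test-app

  # Open MPI: -x takes NAME=VALUE (or just NAME to forward the caller's value)
  mpirun -np 2 -x MY_VAR=1 ./dist-test-app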

@ddn0 (Contributor) commented Apr 24, 2020

Yeah, the fix (just merged into master) is to not set anything at all: 97e9839

Spooky

@insertinterestingnamehere (Member)

Okay, just noticed that Donald fixed this already. That looks good.

@l-hoang (Member, Author) commented Apr 24, 2020

Cool; I'll merge that, apply some more fixes, and try CI again.

@l-hoang (Member, Author) commented Apr 24, 2020

The -env fix revealed another class of errors that I'll be looking into. Some single-process runs are failing as well, so that's good (easier to debug).

@insertinterestingnamehere (Member)

Great, thanks for looking into these failures!

@l-hoang (Member, Author) commented Apr 24, 2020

Did some more investigating.

It seems to me the issue is that the machine runs out of memory to support the number of processes spawning in parallel. Each process takes up 1GB (which seems to be reserved by Galois; I tried a shared-memory app as well), so with multiple tests running in parallel, each potentially using 2-3 processes, memory runs out and segfaults occur.

I tested some of this by ssh'ing into the machine and spawning runs under those conditions. Not having enough memory can either crash the run or even hang it.

Any idea where I can change the amount Galois preallocates when it initializes the system, so I can verify this hypothesis? I'm not too familiar with the memory system.
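For scale, a back-of-the-envelope check of the hypothesis, plus one way to bound the peak by capping CTest's test-level parallelism (the job count and test-name filter here are assumptions, not the actual CI settings):

  # 4 tests in parallel x 3 MPI processes each x ~1GB per process = ~12GB,
  # which can easily exceed a small CI machine's RAM. Running the dist
  # tests serially keeps the peak closer to 3GB:
  ctest -j 1 -R dist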

@insertinterestingnamehere (Member)

I don't know what could be causing that off the top of my head. 1GB is a massive memory footprint though, so we should try to get that down.

@l-hoang (Member, Author) commented Apr 28, 2020

There are currently two issues:

  1. Stats generation occasionally fails, non-deterministically; this is exceedingly rare, but needs to be fixed. It's not just a dist-tests issue either. @roshandathathri is currently on that.

  2. Figured out yesterday that ThreadPool is allocating a decent chunk of memory, which is causing the memory issues. Looking into that now.

Gurbinder Singh Gill and others added 13 commits April 28, 2020 13:44

I used the more modern PEP 518 (pyproject.toml) packaging standard
rather than setuptools, requirements.txt, etc. directly.

Instead of using Cython.Build.cythonize and distutils.Extension in
setup.py, a typical method to create Cython extensions, I opted to use
the scikit-build package, which handles gluing setup.py with CMake.

In particular, some changes to the directory structure, imports and
Cython modules themselves needed to be made to work with scikit-build
because it uses the cython executable rather than the
Cython.Build.cythonize API. This seems a net positive anyway because one
can reproduce the build steps manually for debugging purposes with:

  cython -3 --cplus -I .. -I .. file.pyx

Development can be done in the traditional Python way with some
adjustments for the underlying CMake build system:

  python setup.py build_ext --inplace -- -DCMAKE_OPTIONS=...
  # or
  export GALOIS_CMAKE_ARGS=-DCMAKE_OPTIONS=...
  pip install -e . --no-build-isolation

The Python directory structure was reorganized to ease packaging as
well. All Galois Python packages are now under the package galois. The
libgalois_shmem bindings are in galois.shmem. To avoid conflicts with
existing CMake target names, the test applications are under
galois._bfs, ... with import aliases galois.bfs, ... respectively.

Need a better way to initialize the shared memory runtime.
Using explicit "new" as default constructors are not being
called.

- Rather than create new symbols like USE_GPU, follow the pattern of
  other libgpu/libdist code, which uses ENABLE_HETERO_GALOIS
  (-D__GALOIS_HET_CUDA__=1) to signal GPU configurations.

- Namespace MAX_SIZE macro to avoid clobbering MAX_SIZE symbols defined
  elsewhere.

- Depend on libraries to provide their include directories and other
  configuration rather than obsolete app function.

- Move third-party dependencies to distinguished external directory
  rather than burying them in our own code.

- Delete unreferenced and non-building code.

Without prefixes it is difficult to determine where to find the source
of includes. It also makes installation difficult because there may be
filename conflicts among installations (e.g., two different headers
both named util.h).

Add "pangolin/" as a directory prefix to all pangolin includes. I didn't
go further to distinguish GPU and CPU versions of pangolin, though that
is an obvious next step. The pangolin library relies on include tricks
to work, so further refactoring is deferred until the library is more
separated from its applications.

@l-hoang (Member, Author) commented Apr 29, 2020

For multihost runs on CentOS, doing this fixes it:
https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX#running-open-mpi-with-ucx
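
Roughly, that page's recommendation amounts to selecting UCX explicitly via MCA parameters (a sketch, assuming an Open MPI built with UCX support; the process count and binary are placeholders):

  # tell Open MPI to use the UCX point-to-point and one-sided components
  mpirun -np 2 --mca pml ucx --mca osc ucx ./dist-test-app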

Still unclear how to get it fixed on Alpine, but I think the issue is probably similar, as both fail the moment MPI is initialized.

@insertinterestingnamehere any idea how to deal with this?

@l-hoang (Member, Author) commented Apr 30, 2020

Will wait for Ian to look into the weird malloc/MPI issue.

Note that CentOS-8-clang passing is a lie: if you check the logs in question, there are quite a few failures, but the fix for that is mentioned in another comment.

@insertinterestingnamehere (Member)

The MPI crash is not likely to be something I can figure out quickly. If you're not seeing a solution, I propose we stop building/testing dist-galois in those two build configs for the time being so we can get this merged. Thoughts?

@roshandathathri (Member)

Two questions:

(1) Are all dist-tests passing in some configuration?

(2) In those two build configs in which dist-tests are failing, are all dist-tests failing?

If (1) is true, then I'm fine with merging this to master.

If (2) is true, then I'm fine with not running dist-tests in those configs. However, let's continue to check that all the dist apps build fine in those configs, if that is feasible.

@l-hoang (Member, Author) commented Apr 30, 2020

@roshandathathri
(1) Yes.
(2) Yes, except on Alpine where 1 process runs pass if I recall correctly.

I'm fine with merging this in since it seems the major issues were resolved.

@roshandathathri (Member)

LGTM

@l-hoang (Member, Author) commented May 1, 2020

I think Donald put himself on as a reviewer, so I'll wait for that + for the CI checks.

@ddn0 (Contributor) left a comment

One of these days, it would be good to get into the practice of git rebase --interactive to clean up all the unnecessary WIP/revert/etc. commits. If you want to go through the trouble of taking the final diff and rebasing onto master (+ force push), that would be great (to me). This is up to you, since we haven't reached a group consensus that having a cleaner history is important.
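
A minimal sketch of that cleanup flow (assuming the branch names used in this PR):

  # squash/drop the WIP and revert commits interactively, then force-push
  git checkout fix_dist_tests
  git rebase --interactive master
  git push --force-with-lease origin fix_dist_tests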

(Two review threads on libgalois/src/PerThreadStorage.cpp, since resolved and outdated.)
@roshandathathri (Member)

+1 I'm all in for cleaner history

@l-hoang (Member, Author) commented May 1, 2020

Will make the suggested changes.

If I understand what you're saying about the rebase, then I guess you want something like this:

  1. take diff from this PR
  2. apply to master
  3. commit to master
  4. merge back to dist-dev

This will make it so master's history is clean; a command-level sketch follows below.
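
In command form, that plan might look like this (a sketch; git merge --squash is one way to apply the PR's final diff as a single commit — not necessarily how we'd actually do it):

  git checkout master
  git merge --squash fix_dist_tests   # stage the PR's final diff
  git commit -m "Fix dist tests"
  git checkout dist-dev
  git merge master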

The only problem is I already merged these changes into dist-dev after yesterday's meeting about workflow, so once dist-dev gets merged into master I don't see much of a difference (I think?).

@roshandathathri (Member)

Hehe, yes. I think it's fine for now.

@ddn0 (Contributor) commented May 1, 2020

> The only problem is I already merged these changes into dist-dev after yesterday's meeting about workflow, so once dist-dev gets merged into master I don't see much of a difference (I think?).

If this is already merged to some "visible" branch, then yes, it is probably too much to fix this now. The time to fix it was when the change was merged into dist-dev. Though if one follows the tree analogy, we really should be merging dist-dev (unless fix_dist_tests is dist-dev?).

@l-hoang (Member, Author) commented May 1, 2020

fix_dist_tests was forked from master, unfortunately, so following the tree analogy it's actually master first, then dist-dev.

And yes, the cleanup should have been done when I merged to dist-dev yesterday. Haven't heard of rebase --interactive before; will have to try it out.

For now I'll just follow the tree I guess.

@ddn0 (Contributor) commented May 1, 2020

> Haven't heard of rebase --interactive before

it's 🔥

@l-hoang (Member, Author) commented May 1, 2020

Guess I'll have to try it. :P

@l-hoang (Member, Author) commented May 2, 2020

Once these CI tests pass, I'll merge into master and percolate the changes down into dist-dev.

The llvm separation is done in dist-dev; a pull request for that will open soon after this PR is merged.

@l-hoang merged commit 2e01b91 into IntelligentSoftwareSystems:master on May 2, 2020
@l-hoang deleted the fix_dist_tests branch on May 2, 2020 03:15
@l-hoang mentioned this pull request on May 7, 2020