-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix dist tests #147
Fix dist tests #147
Conversation
LGTM |
Actually, still some failures I need to resolve according to some of these CI test results. Have we found a way around the -env flag yet Roshan? It's causing some configs to completely fail. |
@ddn0 has pushed a fix for env flag (not sure if it has been merged to master yet). |
Someone ran into this over at dmlc/cxxnet#211 (comment). It looks like with openmpi the |
Yeah the fix (just merged into master) is to not set anything at all: 97e9839 Spooky |
Okay, just noticed that Donald fixed this already. That looks good. |
Cool; will merge and apply some more fixes then and try CI again. |
The -env fix revealed another class of errors that I'll be looking into. Some single process runs are failing as well, so that's good (easier to debug). |
Great, thanks for looking in to these failures! |
Did some more investigating. Seems to me like the issue is that the machine runs out of memory to support the amount of processes spawning in parallel. Each process takes up 1GB (seems to be preserved by Galois; tried a shared memory app as well), so if you have multiple tests running in parallel + each one potentially using 2-3 processes, memory runs out and segfaults occur. I tested some of this by ssh'ing into the machine and spawning runs with conditions. Not having enough memory can either crash or even hang. Any idea where I can change the amount Galois preallocs when it initializes the system to verify this hypothesis? Not too familiar with the memory system. |
I don't know what could be causing that off the top of my head. 1GB is a massive memory footprint though, so we should try to get that down. |
There are currently 2 issues:
|
I used the more modern PEP 518 (pyproject.toml) packaging standard rather than setuptools, requirements.txt, etc. directly. Instead of using Cython.Build.cythonize and distuils.Extensions in setup.py, a typical method to create cython extensions, I opted to use the scikit-build package, which handles gluing setup.py with CMake. In particular, some changes to the directory structure, imports and Cython modules themselves needed to be made to work with scikit-build because it uses the cython executable rather than the Cython.Build.cythonize API. This seems a net positive anyway because one can reproduce the build steps manually for debugging purposes with: cython -3 --cplus -I .. -I .. file.pyx Development can be done in the traditional Python way with some adjustments for the underlying CMake build system: python setup.py build_ext --inplace -- -DCMAKE_OPTIONS=... # or export GALOIS_CMAKE_ARGS=-DCMAKE_OPTIONS=... pip install -e . --no-build-isolation The python directory structure was reorganized to ease packaging as well. All Galois Python packages are now under the package galois. The libgalois_shem bindings are in galois.shmem. To avoid conflicts with existing CMake target names, the test applications are under galois._bfs, ... with import aliases galois.bfs, ... respectively.
Need a better way to initialize shared memory runtime. Using explicit "new" as Default constructors are not being called.
- Rather than create new symbols like USE_GPU follow the pattern of other libgpu/libdist which uses ENABLE_HETRO_GALOIS (-D__GALOIS_HET_CUDA__=1) to signal GPU configurations. - Namespace MAX_SIZE macro to avoid clobbering MAX_SIZE symbols defined elsewhere. - Depend on libraries to provide their include directories and other configuration rather than obsolete app function. - Move third-party dependencies to distinguished external directory rather than burying them in our own code. - Delete unreferenced and non-building code.
Without prefixes it is difficult to determine where to find the source of includes. Also it makes installation difficult because there there may be filename conflicts among installations (e.g., util.h vs util.h). Add "pangolin/" as a directory prefix to all pangolin includes. I didn't go further to distinguish GPU and CPU versions of pangolin; though that is an obvious next step. The pangolin library relies on include tricks to work so defer further refactoring until the library is more separated from its applications.
For multihost runs on CentOS, doing this fixes it: Still unclear how to get it fixed on Alpine, but I think the issue is probably similar as both fail the moment MPI is initialized. @insertinterestingnamehere any idea how do deal with this? |
Will wait for Ian to look into the weird malloc/MPI issue. Note that CentOS-8-clang passing is a lie: if you check the logs in question there's quite a bit of failures, but the fix for that is mentioned in another comment. |
The MPI crash is not likely to be something I can figure out quickly. If you're not seeing a solution, I propose we stop building/testing dist-galois in those two build configs for the time being so we can get this merged. Thoughts? |
Two questions: (1) Are all dist-tests passing in some configuration? (2) In those two build configs in which dist-tests are failing, are all dist-tests failing? If (1) is true, then I'm fine with merging this to master. If (2) is true, then I'm fine with not running dist-tests in those configs. However, let's continue to check if all the dist apps build fine in those configs if that is feasible. |
@roshandathathri I'm fine with merging this in since it seems the major issues were resolved. |
LGTM |
I think Donald put himself on as a reviewer, so will wait for that + for CI to check. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of these days, it would be good to get in the practice of git rebase --interactive
to clean up all the unnecessary WIP/reverts/etc. If you want to go through the trouble of taking the final diff and rebasing onto master (+ force push) that would be great (to me). This is up to you since we haven't gotten to a group consensus that having a cleaner history is important.
+1 I'm all in for cleaner history |
Will make the suggested changes. If I understand what you're saying about the rebase, then I guess you want something like this:
This will make it so master's history is clean. The only problem is I already merged these changes into dist-dev after yesterday's meeting about workflow, so once dist-dev gets merged into master I don't see much of a difference (I think?). |
Hehe, yes. I think it's fine for now. |
If this is already merged to some "visible" branch, then yes, it is probably too much to fix this. The time to fix this was when this change was merged into dist-dev. Though if one follows the tree analogy, we really should be merging dist-dev (unless fix_dist_tests is dist-dev?). |
fix_dist_tests was forked from master unfortunately, so following the tree analogy it's actually master then dist-dev. And yes, the cleanup should have been done when I merged to dist-dev yesterday. Haven't heard of rebase --interactive before; will have to try it out. For now I'll just follow the tree I guess. |
it's 🔥 |
Guess I'll have to try it. :P |
Once these CI tests pass will merge into master and percolate down into dist-dev. llvm separation is done in dist-dev; pull request for that will be opening soon after this PR is merged. |
reducing amount of compute dist tests do; PR here to run CI tests