ci: de-duplicate deps by docs/requirements.txt and tests/requirements.txt and update CI images #519
Conversation
:)
@martindurant I think I figured it out: it was a stray newline character. I've updated the top comment with links to PRs etc. to fix it.
@martindurant @jacobtomlinson @jcrist this is ready for review now. I've updated the original PR description to be more succinct about the changes.
@jcrist I've verified that the recently released skein 0.8.2 wheel functions as it should as part of this PR!
@martindurant @jacobtomlinson @jcrist I'd love to get this merged and iterate from there; this includes too many changes as it is. The next work item is to make the tests more reliable: the main tests, the PBS tests, and now also the Hadoop tests. I don't know what makes them unreliable, other than a guess that it at least partially involves the OOM killer sending SIGKILL, causing exit code 137 in the PBS test, which then shows symptoms like "ValueError, 404 NOT FOUND" etc., which is what I also see in the main/Hadoop tests.
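For reference on the 137: shells report a process killed by signal N as exit code 128 + N, and SIGKILL is signal 9, which is what the OOM killer sends. A small illustrative sketch (not part of the test suite):

```python
import signal
import subprocess

# A bash process that SIGKILLs itself, standing in for the OOM killer.
proc = subprocess.run(["bash", "-c", "kill -9 $$"])
print(proc.returncode)       # -9: Python reports the raw signal number
print(128 + signal.SIGKILL)  # 137: what a shell or CI log reports instead
```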
I have looked through much of the YARN stuff, and I have very little problem with anything, it all looks solid. Of course, there are loads of configs, so I could easily have missed something.
@@ -45,24 +45,4 @@
      <value>org.apache.hadoop.security.AuthenticationFilterInitializer</value>
    </property>

    <property>
      <name>hadoop.http.authentication.type</name>
You disabled HTTP auth altogether? I agree with that.
I saw it failing anyhow, so I figured removing it reduced complexity without downside.
I'm confused. This is the default value anyhow, so I don't think this changed much, but what does this config really do? I think it influences something we may not even be using?
(Outdated, resolved review thread on continuous_integration/docker/hadoop/files/etc/hadoop/conf.kerberos/mapred-site.xml.)
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>16</value>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
You probably know, but YARN doesn't actually respect these as limits; they're only used for internal bookkeeping, so it doesn't matter if the worker actually has this much memory available.
Ooh, that is critical knowledge. I wanted to stay within memory limits and constrain this to avoid crashing things during tests. I had seen a warning about possible "thrashing" of workers or similar.
Ideas on how to make us limit ourselves better?
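For what it's worth, YARN's NodeManager can be made to kill containers that exceed their allocation via its container memory checks; a hedged sketch of the relevant yarn-site.xml properties, offered as a possibility rather than something this PR configures:

```xml
<!-- Hedged suggestion, not part of this PR: with these checks enabled the
     NodeManager terminates containers that exceed their memory allocation. -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Virtual memory allowed per unit of physical memory allocated -->
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
```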
Is this config also pointless?
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>32</value>
</property>
<property>
<name>yarn.resource-types.memory-mb.increment-allocation</name>
<value>${yarn.scheduler.minimum-allocation-mb}</value>
</property>
I did get tripped up by this once, where the default total memory resource of the cluster was 4096 and the minimum allocation was 1024, so you would get no more than 4 containers no matter what the actual memory requirements of the requests were. The default minimum still seems to be 1024: https://hadoop.apache.org/docs/r3.0.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
I would make these numbers small, or else just wait and see if something ever goes wrong.
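To make that arithmetic concrete (illustrative values, not this repo's actual config): with 4096 MB of cluster memory and a 1024 MB minimum allocation, every request is rounded up to at least 1024 MB, so at most 4096 / 1024 = 4 containers can run no matter how little each one asks for:

```xml
<!-- Illustrative values only: a 128 MB request is rounded up to 1024 MB,
     so this cluster caps out at 4096 / 1024 = 4 containers. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
```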
@martindurant I found that in the test_yarn_backend.py file, the YarnBackend is configured with memory for the workers. I lowered it from 512 to 128 and things may have improved; not sure yet, intermittency =/
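A minimal sketch of the kind of change described, assuming dask-gateway's traitlets-style configuration; the exact trait names and values used in test_yarn_backend.py may differ:

```python
from traitlets.config import Config

# Hypothetical illustration only: request less memory per scheduler/worker
# container so the test cluster stays within what CI actually has.
c = Config()
c.YarnClusterConfig.scheduler_memory = "128 M"  # assumed trait name
c.YarnClusterConfig.worker_memory = "128 M"     # lowered from e.g. 512 M
```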
@martindurant status update: this PR no longer introduces a regression in the amount of intermittent failures! By lowering the memory requirements, we resolved the regression seen previously. So far the Hadoop tests have succeeded 4/4 since the memory limits were tightened, and the PBS test has succeeded 3/4, the one failure erroring during pip install while setting up the test container due to memory issues. That is the pre-existing intermittent error documented in #422. @martindurant I consider this ready for merge at this point; if you agree, let's do it!
@martindurant thanks for the review efforts!!! 🎉 ❤️ Knowing you were around giving it a look made a big difference to my motivation to get this done!
I'm not sure when I'll have the time to look through again :| If you think things are in a good state, I trust you!
Thank you! I think this is good to go then, based on my understanding that @jcrist also thought these were acceptable changes from a brief inspection, as discussed in the video meeting we scheduled. Btw, did you receive an email about that meeting? I sent one to the address declared on your LinkedIn.
Important
While this is a big PR, it is about CI: it only touches the two setup.py files in a way that is directly relevant to end use. Overall this reduces complexity, but since I've added a lot of comments and documentation there are a lot of additional lines.

Summary
Closes #516 (De-duplicate package dependencies)
I thought it was a maintenance problem that so many of the same dependencies were declared in multiple locations. I got confused by that overall and decided to address it before digging into #514 (ci: use chartpress to build/test/publish images and Helm chart).
I've created docs/requirements.txt and tests/requirements.txt, which alongside the dependencies declared in the setup.py files are now the only places we declare Python dependencies.
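As a rough illustration (hypothetical CI commands, not necessarily the exact ones in this PR), the split means an install step only needs the package itself plus the two files:

```bash
# Hypothetical CI install step: runtime deps come from setup.py via the
# package install, docs/test-only deps from the two requirements files.
pip install -e .
pip install -r docs/requirements.txt -r tests/requirements.txt
```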
Closes #401 (paywalling of cloudera repo)
I wanted to rebuild the images that no longer included duplicated deps, so I took on #401 (paywalling of cloudera repo) here as well, which was a major undertaking. We now test against Hadoop 3 / Yarn 2 instead of Hadoop 2 / Yarn 1.
This PR unblocks work on #498 (Drop support for python 3.7), which would otherwise break our Yarn/Slurm/PBS tests. This is because if we start requiring a newer version than Python 3.7, we would still be stuck running the Yarn/Slurm/PBS tests in our old images, which only had Python 3.7.
This PR helped me debug #522 (CI failures for main/k8s tests has been introduced without changes in this repo) and create #525 (maint: avoid regression/breaking change in click and declare our dependency to the library explicitly), as I could update the click pinnings in fewer places without missing one, which is what initially made me draw the wrong conclusions.

Intermittent failures
This PR initially introduced intermittent failures for the Hadoop test, but these are now gone after restricting the memory settings of the schedulers/workers started during tests. See #422 for details on the pre-existing intermittent test failures.