
ci: de-duplicate deps by docs/requirements.txt and tests/requirements.txt and update CI images #519

Merged: 10 commits merged into dask:main on Apr 16, 2022

Conversation

@consideRatio (Collaborator) commented Mar 27, 2022:

Important

While this is a big PR, it is about CI: only the two setup.py files it touches are directly related to end use. Overall, this reduces complexity, but since I've added a lot of comments and documentation, there are a lot of additional lines.





Intermittent failures

This PR initially introduced intermittent failures for the hadoop test, but they are now gone after restricting memory settings on the schedulers/workers started during tests. See #422 for details on pre-existing intermittent test failures, though.

@consideRatio marked this pull request as draft on March 27, 2022 21:29
@consideRatio force-pushed the pr/de-duplicate-dependencies branch 3 times, most recently from 1b420ef to 07c076d, on April 3, 2022 23:41
@consideRatio changed the title from "maint: de-duplicate declared Python dependencies" to "maint: de-duplicate deps by docs/requirements.txt and tests/requirements.txt and update CI images" on Apr 3, 2022
@consideRatio marked this pull request as ready for review on April 4, 2022 00:02
@consideRatio force-pushed the pr/de-duplicate-dependencies branch 4 times, most recently from 763f2e4 to 188b488, on April 4, 2022 04:11
@consideRatio marked this pull request as draft on April 4, 2022 04:13
@consideRatio force-pushed the pr/de-duplicate-dependencies branch from 188b488 to 9ec4a39 on April 4, 2022 05:41
@martindurant (Member) commented:

> which was a major undertaking.

:)
I will look it over, not sure how long it will take or how authoritative I can be!

@consideRatio (Collaborator, Author) commented Apr 4, 2022:

@martindurant I think I figured it out: it was a newline character in CLASSPATH, as set by skein after reading what `yarn classpath` reported, which made at least the last entry in the CLASSPATH (but not all entries) fail to function properly. Figuring this out took me a full day of effort ;D

I've updated the top comment with links to PRs etc to fix it.
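
For the curious, here is a minimal sketch of the kind of sanitization that guards against this. It is illustrative only, not the actual fix, and it assumes the colon-separated output format of `yarn classpath`:

    import subprocess

    # Read the classpath as reported by `yarn classpath`; its output can end
    # with a newline that silently corrupts the last CLASSPATH entry.
    raw = subprocess.check_output(["yarn", "classpath"], text=True)

    # Strip whitespace (including newlines) from each entry and drop empties,
    # so that every entry resolves properly.
    classpath = ":".join(p.strip() for p in raw.split(":") if p.strip())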

@consideRatio force-pushed the pr/de-duplicate-dependencies branch from 9ec4a39 to 7fd663a on April 5, 2022 13:22
@consideRatio marked this pull request as ready for review on April 5, 2022 13:25
@consideRatio force-pushed the pr/de-duplicate-dependencies branch 6 times, most recently from a20b908 to 20183f2, on April 5, 2022 16:26
@consideRatio (Collaborator, Author) commented:

@martindurant @jacobtomlinson @jcrist this is ready for review now. I've updated the original PR description to be more succinct about the changes.

@consideRatio (Collaborator, Author) commented:

@jcrist I've verified, as part of this PR, that the recently released skein 0.8.2 wheel functions as it should!

@consideRatio changed the title from "maint: de-duplicate deps by docs/requirements.txt and tests/requirements.txt and update CI images" to "ci: de-duplicate deps by docs/requirements.txt and tests/requirements.txt and update CI images" on Apr 5, 2022
@consideRatio force-pushed the pr/de-duplicate-dependencies branch from b8d57a8 to d6cdf33 on April 6, 2022 09:39
@consideRatio (Collaborator, Author) commented:

@martindurant @jacobtomlinson @jcrist I'd love to get this merged and iterate from there; this includes too many changes as it is.

The next work item is to make the tests more reliable: the main tests, the PBS tests, and now also the hadoop tests. I don't know what makes them unreliable, beyond a guess that it at least partially involves the OOM killer sending SIGKILL (causing exit code 137) in the PBS test, which then shows symptoms like "ValueError, 404 NOT FOUND" etc., the same symptoms I also see in the main/hadoop tests.
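
(Side note on reading that exit code: 137 is 128 plus SIGKILL's signal number 9, which is how an OOM kill typically shows up. A one-line check in Python:)

    import signal

    # Exit code 137 = 128 + 9, i.e. the process was terminated by SIGKILL,
    # the signal the kernel's OOM killer delivers.
    assert 128 + signal.SIGKILL == 137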

@martindurant (Member) left a comment:

I have looked through much of the YARN stuff and have very little problem with anything; it all looks solid. Of course, there are loads of configs, so I could easily have missed something.

Resolved review threads:
continuous_integration/docker/base/Dockerfile (two threads)
continuous_integration/docker/hadoop/_print_logs.sh (outdated)
continuous_integration/docker/hadoop/_test.sh
    @@ -45,24 +45,4 @@
            <value>org.apache.hadoop.security.AuthenticationFilterInitializer</value>
        </property>

        <property>
            <name>hadoop.http.authentication.type</name>
@martindurant (Member) commented on this diff:

You disabled HTTP auth altogether? I agree with that.

@consideRatio (Collaborator, Author) replied:

I saw it failing anyhow, so I figured removing it reduced complexity without downside.

@consideRatio (Collaborator, Author) replied:

I'm confused: this is the default value anyhow, so this didn't change much, I think. But what does this config really do? I think it influences something that we may not be using?

    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>16</value>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
@martindurant (Member) commented on this diff:

You probably know, but YARN doesn't actually respect these as limits; they're only used for internal bookkeeping, so it doesn't matter if the worker actually has this much memory available.

@consideRatio (Collaborator, Author) replied:

Ooh, that is critical knowledge. I wanted to stay within memory limits and constrain this to avoid crashing things during tests; I had seen a warning about possible "thrashing" of workers or similar.

Any ideas on how we can limit ourselves better?
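
(One possible direction, noted as an aside rather than something this PR configures: YARN's NodeManager can enforce per-container physical-memory limits via the standard yarn.nodemanager.pmem-check-enabled option in yarn-site.xml, killing containers that exceed their allocation.)

    <!-- Sketch of an assumed yarn-site.xml addition: have the NodeManager
         kill containers that exceed their allocated physical memory. -->
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>true</value>
    </property>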

@consideRatio (Collaborator, Author) replied:

Is this config also pointless?

    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>32</value>
    </property>

    <property>
        <name>yarn.resource-types.memory-mb.increment-allocation</name>
        <value>${yarn.scheduler.minimum-allocation-mb}</value>
    </property>

@martindurant (Member) replied:

I did get tripped up by this once, where the default total memory resource of the cluster was 4096 and the minimum allocation was 1024, so you would get no more than 4 containers no matter what the actual memory requirements of the requests were. The default minimum still seems to be 1024: https://hadoop.apache.org/docs/r3.0.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

I would make these numbers small, or else just wait and see if something ever goes wrong.

@consideRatio (Collaborator, Author) replied:

@martindurant I found that in the test_yarn_backend.py file, the YarnBackend is configured with memory for the workers. I lowered it from 512 to 128, and things may have improved; not sure yet, given the intermittency =/
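
(For illustration only, the kind of change described above might look roughly like this in a dask-gateway test configuration; the trait name YarnClusterConfig.worker_memory is an assumption here, not a quote of the actual diff.)

    # Hypothetical sketch: lowering the worker memory used by the Yarn tests.
    # `c` is the config object provided by the dask-gateway config loader;
    # the trait name is assumed, and the real change lives in test_yarn_backend.py.
    c.YarnClusterConfig.worker_memory = "128 M"  # lowered from "512 M"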

@consideRatio force-pushed the pr/de-duplicate-dependencies branch from f2de44d to 9cd7c9e on April 15, 2022 17:55
@consideRatio (Collaborator, Author) commented Apr 15, 2022:

@martindurant status update:

This PR no longer introduces a regression in the amount of intermittent failures! By lowering the memory requirements, we resolved the regression seen previously.

So far the hadoop tests have succeeded 4/4 since the memory limits were made stricter, and the PBS test has succeeded 3/4 since then; the one failure errored during pip install while setting up for testing within the container, due to memory issues. This is the previously documented intermittent error from #422.

@martindurant I consider this ready for merge at this point. If you agree, let's do it!

@consideRatio (Collaborator, Author) commented:

@martindurant thanks for the review efforts!!! 🎉 ❤️ Knowing you were around giving it a look made a big difference to my motivation to get this done!

@consideRatio force-pushed the pr/de-duplicate-dependencies branch from b8a515a to 02996f6 on April 15, 2022 21:38
@martindurant (Member) commented:

I'm not sure when I'll have the time to look through again :| If you think things are in a good state, I trust you!

@consideRatio force-pushed the pr/de-duplicate-dependencies branch from 71a8d4a to 9041d13 on April 16, 2022 13:04
@consideRatio (Collaborator, Author) commented Apr 16, 2022:

> I'm not sure when I'll have the time to look through again :| If you think things are in a good state, I trust you!

Thank you! I think this is good to go then, based on my understanding that @jcrist also thought these were acceptable changes on brief inspection, as discussed in the video meeting we scheduled.

Btw, did you receive an email about this meeting? I sent one to the address you have listed on LinkedIn.

@consideRatio merged commit e054a65 into dask:main on Apr 16, 2022
Successfully merging this pull request may close these issues:

De-duplicate package dependencies
paywalling of cloudera repo