jdk11u Alpine linux build failure: gpg keyserver timeout #3518

Closed
andrew-m-leonard opened this issue Nov 1, 2023 · 12 comments · Fixed by #3544
Labels
alpine-linux: Issues that affect or relate to the Alpine LINUX OS
buildbreak: High priority issues that cause build breaks in jenkins or build scripts
x-linux: Issues that affect or relate to the x64/x32 LINUX OS

Comments

@andrew-m-leonard
Contributor

https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-alpine-linux-x64-temurin/277/console

18:05:42  GNUPGHOME=/tmp/.gpg-temp.194
18:05:42  gpg: keybox '/tmp/.gpg-temp.194/pubring.kbx' created
18:06:51  gpg: keyserver receive failed: Operation timed out

andrew-m-leonard added the buildbreak label on Nov 1, 2023
github-actions bot added the alpine-linux and x-linux labels on Nov 1, 2023

@adamfarley
Contributor

PR created here to extend the timeout. If this failure is a simple reaction to a higher server load causing our requests to take too long, this could give us the necessary tolerance for that slowness.

@sxa
Member

sxa commented Nov 3, 2023

Hmm we're getting quite a bit of variance in the workspace prep phase in those build jobs overall:

[sxa@fedora shm]$ for A in `seq 275 280`; do curl -s https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-alpine-linux-x64-temurin/$A/consoleText | egrep 'Configuring workspace|Initiating build'; done
[2023-10-26T18:05:40.893Z] build.sh : 18:05:40 : Configuring workspace inc. clone and cacerts generation ...
[2023-10-26T18:06:46.113Z] build.sh : 18:06:45 : Initiating build ...
[2023-10-28T17:06:45.353Z] build.sh : 17:06:45 : Configuring workspace inc. clone and cacerts generation ...
[2023-10-28T17:09:51.271Z] build.sh : 17:09:50 : Initiating build ...
[2023-10-31T18:05:36.287Z] build.sh : 18:05:36 : Configuring workspace inc. clone and cacerts generation ...
[2023-11-02T18:05:45.878Z] build.sh : 18:05:45 : Configuring workspace inc. clone and cacerts generation ...
[2023-11-02T18:13:15.347Z] build.sh : 18:13:15 : Initiating build ...
[2023-11-03T14:02:26.710Z] build.sh : 14:02:26 : Configuring workspace inc. clone and cacerts generation ...
[2023-11-03T14:03:56.815Z] build.sh : 14:03:56 : Initiating build ...
[2023-11-03T14:39:08.143Z] build.sh : 14:39:07 : Configuring workspace inc. clone and cacerts generation ...
[2023-11-03T14:40:35.632Z] build.sh : 14:40:35 : Initiating build ...
[sxa@fedora shm]$ 
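
The variance is easier to compare as durations. The following is a minimal sketch, not part of the build scripts: `prep_durations` is a hypothetical helper name, and it assumes the filtered console lines are fed on stdin in exactly the format shown above, with both lines of a pair falling on the same day.

```shell
#!/bin/sh
# prep_durations (hypothetical helper): read the "Configuring workspace" and
# "Initiating build" lines on stdin and print the seconds between each pair.
# It uses the HH:MM:SS field printed by build.sh (4th whitespace-separated
# field), so it assumes a pair never crosses midnight.
prep_durations() {
  awk '
    # Convert HH:MM:SS to seconds since midnight.
    function secs(t,   p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
    /Configuring workspace/ {
      # A new start line while one is still pending means the previous
      # build never reached "Initiating build".
      if (pending) printf "%s no Initiating build line\n", stamp
      start = secs($4); stamp = $1; pending = 1
    }
    /Initiating build/ && pending {
      printf "%s %ds\n", stamp, secs($4) - start
      pending = 0
    }
    END { if (pending) printf "%s no Initiating build line\n", stamp }
  '
}
```

Applied to the output above, the workspace prep phase took 65s, 185s, 450s, 90s and 88s for the builds that reached "Initiating build", and the 2023-10-31 build has no "Initiating build" line at all.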

@sxa
Member

sxa commented Nov 13, 2023

@adamfarley Can you take a look at https://ci.adoptium.net/job/build-scripts-pr-tester/job/build-test/job/jobs/job/jdk8u/job/jdk8u-alpine-linux-x64-temurin/139/console which was a test on one of my PRs.

This run did happen with your timeout PR fix in it (you can see the output from uptime), but it looks like it still failed after just over a minute instead of five. It does show the machine was heavily loaded at the time, though.

23:40:38  GNUPGHOME=/tmp/.gpg-temp.347
23:40:38   23:40:37 up 395 days, 12:30,  load average: 51.04, 15.16, 6.02
23:40:38  gpg: keybox '/tmp/.gpg-temp.347/pubring.kbx' created
23:41:47  gpg: keyserver receive failed: Operation timed out

@andrew-m-leonard
Contributor Author

Wondering if this is a firewall issue? Can a different port be used?
Or maybe use an alternate keyserver?

@adamfarley
Contributor

The man page for gpg says that "The keyserver hkp://keys.gnupg.net uses round robin DNS to give a different keyserver each time you use it.".

So perhaps we could rerun the gpg command in the event of a timeout, using that gnupg keyserver to prevent us from rerunning on the same, overburdened(?) keyserver.

@sxa - What do you think?

@adamfarley
Contributor

Ok, have added some code to run the gpg command up to 10 times in the event of a failure, using the hkp keyserver I mentioned in the previous comment so the build doesn't fail if we get an overburdened keyserver.

PR: #3544
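
For readers who have not opened the PR, the general shape of a bounded retry is sketched below. This is an illustration only and is not the code from the PR: the `retry` helper is a hypothetical name, and `KEYSERVER` and `KEY_ID` in the usage comment are placeholders for whatever the build scripts actually pass to gpg.

```shell
#!/bin/sh
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts. Returns 0 on success, or the
# exit status of the last failed attempt.
retry() {
  max=$1; delay=$2; shift 2
  attempt=1
  while :; do
    "$@" && return 0
    status=$?   # exit status of the failed command
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts (last exit status $status)" >&2
      return "$status"
    fi
    echo "attempt $attempt of $max failed (exit status $status); retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage (placeholders, not the real values from the build scripts):
#   retry 10 30 gpg --keyserver "$KEYSERVER" --recv-keys "$KEY_ID"
```

A retry only helps if the failure is transient; it does not explain why the timeout happens in the first place.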

@sxa
Member

sxa commented Nov 28, 2023

> so the build doesn't fail if we get an overburdened keyserver.

Have you seen anything suggesting that it is due to an overburdened keyserver and not our machines? That is, does it happen when the load of the build machine is low, rather than very high as in my earlier comment? We should have more data points now that we've added the uptime display. Also, is it definitely specific to the Alpine jobs, or is it being seen elsewhere?

> Wondering if this is a firewall issue?

A firewall should reject the connection immediately instead of timing out, so I would suggest that is unlikely, especially if it's limited to our Alpine environments and is showing variations in timing.

@adamfarley
Contributor

adamfarley commented Nov 28, 2023

> ...does it happen when the load of the build machine is low instead of very high...?

Doesn't look like it. Current data points:

| Job | Result | Load Averages |
| --- | --- | --- |
| pr-tester/jdk8u-alpine-linux-x64-temurin/141 | Pass | 1.56, 0.70, 0.30 |
| pr-tester/jdk8u-alpine-linux-x64-temurin/140 | Pass | 2.37, 0.91, 0.37 |
| pr-tester/jdk8u-alpine-linux-x64-temurin/139 | Fail | 51.04, 15.16, 6.02 |
| pr-tester/jdk11u-alpine-linux-x64-temurin/148 | Pass | 2.37, 0.91, 0.37 |
| pr-tester/jdk11u-alpine-linux-x64-temurin/147 | Pass | 40.88, 21.71, 9.18 |
| pr-tester/jdk17u-alpine-linux-x64-temurin/146 | Fail | 2.37, 0.91, 0.37 |
| pr-tester/jdk11u-alpine-linux-x64-temurin/145 | Pass | 51.04, 15.16, 6.02 |
| pr-tester/jdk8u-alpine-linux-x64-temurin/369 | Pass | 5.33, 2.68, 1.87 |
| pr-tester/jdk8u-alpine-linux-x64-temurin/369 | Pass | 8.89, 2.30, 0.83 |
| pr-tester/jdk8u-alpine-linux-x64-temurin/369 | Pass | 10.23, 4.21, 2.37 |
| pr-tester/jdk8u-alpine-linux-x64-temurin/369 | Pass | 5.30, 1.72, 0.66 |
| pr-tester/jdk8u-alpine-linux-x64-temurin/369 | Pass | 8.39, 6.05, 5.04 |
| build-scripts/jobs/jdk/jdk-alpine-linux-x64-temurin/262 | Fail | 1.03, 0.29, 0.15 |
| build-scripts/jobs/jdk/jdk-alpine-linux-x64-temurin/261 | Pass | 40.43, 12.60, 7.56 |

So the failure happens at both high and low load averages.
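
To make that comparison repeatable as more runs come in, here is a rough sketch of a tally. `tally_by_load` is a hypothetical helper, the threshold is an arbitrary choice, and the input is assumed to be whitespace-separated rows of the form `job result load1, load5, load15`.

```shell
#!/bin/sh
# tally_by_load THRESHOLD: read "job result load1, load5, load15" rows on
# stdin and count Pass/Fail separately for rows whose 1-minute load average
# is below THRESHOLD ("low") or at/above it ("high").
tally_by_load() {
  awk -v limit="$1" '
    {
      load = $3 + 0                       # "51.04," converts to 51.04
      bucket = (load >= limit) ? "high" : "low"
      count[bucket " " $2]++
    }
    END {
      printf "low  load: %d pass, %d fail\n", count["low Pass"], count["low Fail"]
      printf "high load: %d pass, %d fail\n", count["high Pass"], count["high Fail"]
    }
  '
}
```

With a threshold of 10, the 14 data points listed above come out as 7 passes and 2 failures at low load, and 4 passes and 1 failure at high load. That is too small a sample to prove anything, but it shows no obvious link between load and failure.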

> We should have more data points now that we've added the uptime display. Also is it definitely specific to the Alpine jobs or is it being seen elsewhere?

Not sure. Will take a look. Update: Haven't found any failures on other platforms, even the ones that run on the same machine (x64 Linux). Maybe the failures are related to differences in the Alpine container (networking, different gpg versions, etc.)?

I'm also starting some "weekly pipeline" grinders with flawed javaopts (so they die if they get to the config step), so we can try to brute-force a reproduction while using debug options in gpg and the surrounding code. Grinder link

@sxa
Member

sxa commented Nov 28, 2023

OK that's useful - doesn't seem load-related then ... But also doesn't make sense that it would be just down to that platform - and we build others in docker containers ... And that includes Alpine on aarch64 now.

@adamfarley
Contributor

I've modified the new PR to rerun the command 10 times with intervals.

The previous attempted fixes were the hkp://keys.gnupg.net keyserver (which didn't work on all platforms for some reason), and the "array of keyservers". These have been rejected due to their changes to our security profile, and also the apparent imminent retirement of the SKS network.

@adamfarley
Contributor

Also, one theory is that this is an Alpine-specific issue which is down to timeout settings in their networking setup, as gpg may simply be trying to parse a return code given to it by the OS function it's calling.

Plus, it seems to be exactly 70 seconds between an execution and a timeout, and I'm not seeing anything in the gpg setup that happens after 70 seconds.
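
The two logs quoted earlier in this thread show 69 seconds between the keybox line and the timeout at one-second timestamp granularity (18:05:42 to 18:06:51, and 23:40:38 to 23:41:47), which fits the roughly 70 seconds described here. One way to measure the interval directly, rather than reading it off console timestamps, is to wrap the command in a timer. This is a sketch only: `timed` is a hypothetical helper, `KEYSERVER` and `KEY_ID` are placeholders, and `date +%s` is assumed to be available (it is on GNU coreutils and BusyBox, but it is not required by POSIX).

```shell
#!/bin/sh
# timed CMD...: run CMD, report its exit status and wall-clock duration in
# whole seconds on stderr, and return CMD's exit status.
timed() {
  t0=$(date +%s)
  "$@" && rc=0 || rc=$?
  t1=$(date +%s)
  echo "exit=$rc elapsed=$((t1 - t0))s: $*" >&2
  return "$rc"
}

# Usage (placeholders, not the real values from the build scripts):
#   timed gpg --keyserver "$KEYSERVER" --recv-keys "$KEY_ID"
```

If the elapsed time on failure is always the same, that points at a fixed timeout somewhere in the stack rather than a slow server.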

@adamfarley
Contributor

Also also, I ran a job a few days back to try and reproduce the bug by rerunning the recv-keys command over and over again. It didn't work, sadly. Ran for 8 hours without any errors, and got killed due to the extra-long job.
