Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jenkins losing connection to GCE VM / GCE VM shutting down #467

Open
j-koehler opened this issue Jun 6, 2024 · 3 comments
Open

Jenkins losing connection to GCE VM / GCE VM shutting down #467

j-koehler opened this issue Jun 6, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@j-koehler
Copy link

Jenkins and plugins versions report

Environment

Jenkins: 2.452.1
OS: Linux - 5.10.0-22-amd64
Java: 17.0.11 - Eclipse Adoptium (OpenJDK 64-Bit Server VM)

PrioritySorter:5.1.0
adoptopenjdk:1.5
amazon-ecs:1.49
ansicolor:1.0.4
ant:497.v94e7d9fffa_b_9
antisamy-markup-formatter:162.v0e6ec0fcfcf6
apache-httpcomponents-client-4-api:4.5.14-208.v438351942757
apache-httpcomponents-client-5-api:5.3.1-1.0
asm-api:9.7-33.v4d23ef79fcc8
authentication-tokens:1.113.v81215a_241826
aws-credentials:231.v08a_59f17d742
aws-java-sdk:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-api-gateway:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-autoscaling:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-cloudformation:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-cloudfront:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-codebuild:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-codedeploy:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-ec2:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-ecr:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-ecs:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-efs:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-elasticbeanstalk:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-elasticloadbalancingv2:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-iam:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-kinesis:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-lambda:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-logs:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-minimal:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-organizations:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-secretsmanager:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-sns:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-sqs:1.12.696-451.v0651a_da_9ca_ec
aws-java-sdk-ssm:1.12.696-451.v0651a_da_9ca_ec
basic-branch-build-strategies:81.v05e333931c7d
blueocean:1.27.12
blueocean-autofavorite:1.2.5
blueocean-bitbucket-pipeline:1.27.12
blueocean-commons:1.27.12
blueocean-config:1.27.12
blueocean-core-js:1.27.12
blueocean-dashboard:1.27.12
blueocean-display-url:2.4.2
blueocean-events:1.27.12
blueocean-git-pipeline:1.27.12
blueocean-github-pipeline:1.27.12
blueocean-i18n:1.27.12
blueocean-jira:1.27.12
blueocean-jwt:1.27.12
blueocean-personalization:1.27.12
blueocean-pipeline-api-impl:1.27.12
blueocean-pipeline-editor:1.27.12
blueocean-pipeline-scm-api:1.27.12
blueocean-rest:1.27.12
blueocean-rest-impl:1.27.12
blueocean-web:1.27.12
bootstrap5-api:5.3.3-1
bouncycastle-api:2.30.1.78.1-233.vfdcdeb_0a_08a_a_
branch-api:2.1169.va_f810c56e895
browserstack-integration:1.2.13
build-name-setter:2.4.2
build-timeout:1.32
caffeine-api:3.1.8-133.v17b_1ff2e0599
checks-api:2.2.0
cloud-stats:336.v788e4055508b_
cloudbees-bitbucket-branch-source:886.v44cf5e4ecec5
cloudbees-disk-usage-simple:203.v3f46a_7462b_1a_
cloudbees-folder:6.928.v7c780211d66e
cobertura:1.17
code-coverage-api:4.99.0
command-launcher:107.v773860566e2e
commons-httpclient3-api:3.1-3
commons-lang3-api:3.13.0-62.v7d18e55f51e2
commons-text-api:1.11.0-109.vfe16c66636eb_
conditional-buildstep:1.4.3
config-file-provider:973.vb_a_80ecb_9a_4d0
configuration-as-code:1810.v9b_c30a_249a_4c
copyartifact:722.v0662a_9b_e22a_c
coverage:1.14.0
credentials:1337.v60b_d7b_c7b_c9f
credentials-binding:677.vdc9d38cb_254d
dashboard-view:2.508.va_74654f026d1
data-tables-api:2.0.7-1
dependency-check-jenkins-plugin:5.5.0
dependency-track:4.3.1
display-url-api:2.204.vf6fddd8a_8b_e9
docker-commons:439.va_3cb_0a_6a_fb_29
docker-compose-build-step:1.0
docker-java-api:3.3.4-86.v39b_a_5ede342c
docker-plugin:1.6.1
docker-slaves:1.0.7
docker-workflow:580.vc0c340686b_54
durable-task:555.v6802fe0f0b_82
echarts-api:5.5.0-1
email-ext:1814.v404722f34263
embeddable-build-status:487.va_0ef04c898a_2
envinject:2.908.v66a_774b_31d93
envinject-api:1.199.v3ce31253ed13
extended-read-permission:53.v6499940139e5
external-monitor-job:215.v2e88e894db_f8
favorite:2.208.v91d65b_7792a_c
file-operations:214.v2e7dc7f25757
font-awesome-api:6.5.2-1
forensics-api:2.4.0
git:5.2.2
git-client:4.7.0
git-server:126.v0d945d8d2b_39
github:1.39.0
github-api:1.318-461.v7a_c09c9fa_d63
github-branch-source:1789.v5b_0c0cea_18c3
gitlab-api:5.3.0-91.v1f9a_fda_d654f
gitlab-branch-source:704.vc7f1202d7e14
gitlab-plugin:1.8.1
google-compute-engine:4.563.vfa_446a_7e00a_d
google-oauth-plugin:1.330.vf5e86021cb_ec
google-play-android-publisher:4.2
gradle:2.12
gson-api:2.11.0-41.v019fcf6125dc
h2-api:11.1.4.199-30.v1c64e772f3a_c
handy-uri-templates-2-api:2.1.8-30.v7e777411b_148
htmlpublisher:1.34
http_request:1.18
instance-identity:185.v303dc7c645f9
ionicons-api:74.v93d5eb_813d5f
jackson2-api:2.17.0-379.v02de8ec9f64c
jacoco:3.3.6
jakarta-activation-api:2.1.3-1
jakarta-mail-api:2.1.3-1
javadoc:243.vb_b_503b_b_45537
javax-activation-api:1.2.0-7
javax-mail-api:1.6.2-10
jaxb:2.3.9-1
jdk-tool:73.vddf737284550
jenkins-design-language:1.27.12
jenkins-jira-plugin:4.0.0
jersey2-api:2.42-147.va_28a_44603b_d5
jira:3.13
jira-ext:114.v7b_8b_1d4274c6
jira-steps:2.0.165.v8846cf59f3db
jjwt-api:0.11.5-112.ve82dfb_224b_a_d
jnr-posix-api:3.1.19-2
job-dsl:1.87
job-restrictions:0.8
joda-time-api:2.12.7-29.v5a_b_e3a_82269a_
jquery3-api:3.7.1-2
jsch:0.2.16-86.v42e010d9484b_
json-api:20240303-41.v94e11e6de726
json-path-api:2.9.0-58.v62e3e85b_a_655
junit:1265.v65b_14fa_f12f0
ldap:725.v3cb_b_711b_1a_ef
lockable-resources:1255.vf48745da_35d0
mailer:472.vf7c289a_4b_420
mask-passwords:173.v6a_077a_291eb_5
matrix-auth:3.2.2
matrix-project:830.v7ea_da_561b_a_34
maven-plugin:3.23
mercurial:1260.vdfb_723cdcc81
metrics:4.2.21-451.vd51df8df52ec
mina-sshd-api-common:2.12.1-101.v85b_e08b_780dd
mina-sshd-api-core:2.12.1-101.v85b_e08b_780dd
next-build-number:1.8
nexus-artifact-uploader:2.14
nexus-task-runner:0.9.2
nodejs:1.6.1
oauth-credentials:0.653.v14cf2088e950
okhttp-api:4.11.0-172.vda_da_1feeb_c6e
pam-auth:1.10
parameterized-trigger:806.vf6fff3e28c3e
performance:960.v701db_0d66e1b_
pipeline-aws:1.45
pipeline-build-step:540.vb_e8849e1a_b_d8
pipeline-github-lib:61.v629f2cc41d83
pipeline-graph-analysis:216.vfd8b_ece330ca_
pipeline-groovy-lib:710.v4b_94b_077a_808
pipeline-input-step:495.ve9c153f6067b_
pipeline-maven:1416.v54f125b_3e1c7
pipeline-maven-api:1416.v54f125b_3e1c7
pipeline-milestone-step:119.vdfdc43fc3b_9a_
pipeline-model-api:2.2198.v41dd8ef6dd56
pipeline-model-definition:2.2198.v41dd8ef6dd56
pipeline-model-extensions:2.2198.v41dd8ef6dd56
pipeline-rest-api:2.34
pipeline-stage-step:312.v8cd10304c27a_
pipeline-stage-tags-metadata:2.2198.v41dd8ef6dd56
pipeline-stage-view:2.34
pipeline-utility-steps:2.16.2
plain-credentials:182.v468b_97b_9dcb_8
plot:2.1.12
plugin-util-api:4.1.0
prism-api:1.29.0-15
publish-over:0.22
publish-over-ftp:1.17
publish-over-ssh:1.25
pubsub-light:1.18
rebuild:332.va_1ee476d8f6d
resource-disposer:0.23
run-condition:1.7
s3:466.vf5b_3db_8e3eb_2
scm-api:690.vfc8b_54395023
script-security:1336.vf33a_a_9863911
snakeyaml-api:2.2-111.vc6598e30cc65
snyk-security-scanner:4.0.3
sonar:2.17.2
sse-gateway:1.26
ssh-agent:367.vf9076cd4ee21
ssh-credentials:337.v395d2403ccd4
ssh-slaves:2.966.vea_5112da_0a_a_1
sshd:3.322.v159e91f6a_550
structs:337.v1b_04ea_4df7c8
swarm:3.46
thinBackup:2.0
timestamper:1.27
token-macro:400.v35420b_922dcb_
trilead-api:2.142.v748523a_76693
variant:60.v7290fc0eb_b_cd
view-job-filters:377.v66f4b_796e5fa_
workflow-aggregator:596.v8c21c963d92d
workflow-api:1316.v33eb_726c50b_a_
workflow-basic-steps:1058.vcb_fc1e3a_21a_9
workflow-cps:3894.3896.vca_2c931e7935
workflow-durable-task-step:1353.v1891a_b_01da_18
workflow-job:1400.v7fd111b_ec82f
workflow-multibranch:783.va_6eb_ef636fb_d
workflow-scm-step:427.v4ca_6512e7df1
workflow-step-api:657.v03b_e8115821b_
workflow-support:907.v6713a_ed8a_573
ws-cleanup:0.46

What Operating System are you using (both controller, and any agents involved in the problem)?

Debian 12

Reproduction steps

Use GCP for (oneshot-)agents

Expected Results

  • build is running
  • build is completing
  • agent VM is shutting down

Actual Results

  • build is running
  • build is sometimes failing with "Remote call on ... failed. The channel is closing down or has closed down":
[2024-05-28T08:24:30.268Z] Cannot contact gcp-rre-debian12-wrgi83: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@790c3757:gcp-rre-debian12-wrgi83": Remote call on gcp-rre-debian12-wrgi83 failed. The channel is closing down or has closed down
[2024-05-28T08:25:56.170Z] Could not connect to gcp-rre-debian12-wrgi83 to send interrupt signal to process

Anything else?

I'm not sure where to look for this error, not blaming this project by itself but it's very hard to grasp.

We were using Jenkins 2.440.2 with GCE Plugin 4.563.vfa_446a_7e00a_d before without any of these problems. After upgrading to 2.452.1 (including all of the plugins including GCE Plugin (to 4.573.v7dcd6a_37a_ee2) to problems began to start with strange errors like pasted into the actual results.

It did not happen every time, but quite often (maybe 20% yes, 80% no). Rolling back the plugin to 4.563.vfa_446a_7e00a_d (including restarting Jenkins to apply) did not help.

In GCP there are multiple "DELETE" statements for one machine finally resulting in an error but unfortunately I don't know what's "normal" (the usual problem when you look closer at the logs once problems begin to arise):

2024-05-28 10:09:07.764 CEST Compute Engine insert europe-west3-c:gcp-rre-debian12-wrgi83 ...
2024-05-28 10:09:14.732 CEST Compute Engine insert europe-west3-c:gcp-rre-debian12-wrgi83 ...
2024-05-28 10:23:56.154 CEST Compute Engine delete europe-west3-c:gcp-rre-debian12-wrgi83 ...
2024-05-28 10:24:42.072 CEST Compute Engine delete europe-west3-c:gcp-rre-debian12-wrgi83 ...
2024-05-28 10:25:55.799 CEST Compute Engine delete europe-west3-c:gcp-rre-debian12-wrgi83 ...

Are you interested in contributing a fix?

No response

@j-koehler j-koehler added the bug Something isn't working label Jun 6, 2024
@j-koehler
Copy link
Author

j-koehler commented Jun 18, 2024

Since the issue is still around even after:

  • rolling back GCE Plugin to 4.563.vfa_446a_7e00a_d
  • upgrading to Jenkins 2.452.2 and upgrading all plugins (GCE now at 4.575.v6969b_7c435eb_)

I checked the GCP logs some more. It looks like for whatever reason GCP is receiving an DELETE for the VM while the job is still running.

Here are the logs for a successful jobs (I added the NOTICE/ERROR depending on the icon)

NOTICE 2024-06-18 08:15:15.251 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-di1edw ...
NOTICE 2024-06-18 08:15:21.166 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-di1edw ...
NOTICE 2024-06-18 08:34:54.780 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-di1edw ...
NOTICE 2024-06-18 08:35:40.626 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-di1edw ...

and here the logs for a failing job:

NOTICE 2024-06-18 07:46:45.363 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
NOTICE 2024-06-18 07:47:00.887 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
NOTICE 2024-06-18 08:04:13.367 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
NOTICE 2024-06-18 08:04:59.185 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
ERROR 2024-06-18 08:05:02.081 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...

In Jenkins itself it looks like this (note: times are UTC here):

...
[2024-06-18T06:04:43.784Z] PASS src/view/store/tracking/suspendData/tracking.suspend.data.setSuspendDataIfSectionIsVisited.epic.test.ts
[2024-06-18T06:04:44.494Z] Cannot contact gcp-rre-unittest-debian12-jkt5ag: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@1362ebbd:gcp-rre-unittest-debian12-jkt5ag": Remote call on gcp-rre-unittest-debian12-jkt5ag failed. The channel is closing down or has closed down
[2024-06-18T06:05:02.226Z] Could not connect to gcp-rre-unittest-debian12-jkt5ag to send interrupt signal to process

So from my perspective (without much insight) it looks like the delete at 2024-06-18 08:04:13.367 CEST is causing the trouble. It looks like it sends a delete, GCP is starting to shut down. It loses connection, wants to cleanup (the VMs are configured as "one shot" instances, so it sends another delete at 2024-06-18 08:04:59.185 which then causes the error at 2024-06-18 08:05:02.081 (since the VM is already gone).

There is nothing unusual at the 06:04:13 (aka 2024-06-18 08:04:13.367 CEST in GCP). Just some PASSes, not even a single entry for the exact 06:04:13 second.

I don't know who (which plugin) might cause this. Can this GCE plugin even cause this?

@j-koehler
Copy link
Author

j-koehler commented Aug 26, 2024

Some update. We made the log level more verbose and added a dedicated logger for this.

Looks like the plugin is sometimes "forgetting" about the VM:

Plugin log:

Aug 26, 2024 7:00:46 AM INFO com.google.jenkins.plugins.computeengine.InstanceConfiguration provision
Sent insert request for instance configuration [Debian12 agent for RRE unittests]
Aug 26, 2024 7:00:46 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineComputerLauncher launch
Launch will wait 300000 for operation operation-1724648445057-6208f01ee1a77-f70c6933-7f833c72 to complete...
Aug 26, 2024 7:00:46 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud lambda$getPlannedNodeFuture$0
Waiting 300000ms for node gcp-rre-unittest-debian12-xqphpu to connect
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Launching instance: gcp-rre-unittest-debian12-xqphpu
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
bootstrap
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Getting keypair...
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Using autogenerated ssh keypair
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Authenticating as jenkins
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
No public address found. Fall back to internal address.
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Connecting to 192.168.75.37 on port 22, with timeout 10000.
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Failed to connect via ssh: There was a problem while connecting to 192.168.75.37:22
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Waiting for SSH to come up. Sleeping 5.
Aug 26, 2024 7:01:17 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
No public address found. Fall back to internal address.
Aug 26, 2024 7:01:17 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Connecting to 192.168.75.37 on port 22, with timeout 10000.
Aug 26, 2024 7:01:17 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Connected via SSH.
Aug 26, 2024 7:01:17 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Verifying: java -fullversion
Aug 26, 2024 7:01:18 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Copying agent.jar to: /tmp
Aug 26, 2024 7:01:18 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Launching Jenkins agent via plugin SSH: java -jar /tmp/agent.jar
Aug 26, 2024 7:01:26 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud lambda$getPlannedNodeFuture$0
40479ms elapsed waiting for node gcp-rre-unittest-debian12-xqphpu to connect
Aug 26, 2024 7:04:12 AM INFO com.google.jenkins.plugins.computeengine.CleanLostNodesWork terminateInstance
Remote instance gcp-rre-unittest-debian12-xqphpu not found locally, removing it

At the same time the VM was doing work:

...
07:02:21  Agent: gcp-rre-unittest-debian12-xqphpu
...
07:03:22  + yarn run test:ci
07:03:22  yarn run v1.22.22
07:03:22  $ yarn generate && craco test --coverage
07:03:22  $ yarn dependency test && yarn create-plugin-list && yarn create-view-model && yarn create-template-list && yarn create-themes && yarn bundle-messages
07:03:22  $ cross-env TS_NODE_PROJECT=./tsconfig.buildConf.json node -r ts-node/register build-config/dependency.ts test
07:03:25  Successfully checked view/edit dependencies array in package.json.
07:03:25  $ cross-env TS_NODE_PROJECT=./tsconfig.buildConf.json node -r ts-node/register build-config/createPluginList.ts
07:03:27  $ cross-env TS_NODE_PROJECT=./tsconfig.buildConf.json node -r ts-node/register build-config/createViewModel.ts
07:03:30  $ cross-env TS_NODE_PROJECT=./tsconfig.buildConf.json node -r ts-node/register build-config/createTemplateList.ts
07:03:32  $ cross-env TS_NODE_PROJECT=./tsconfig.buildConf.json node -r ts-node/register build-config/createThemes.ts
07:04:56  Cannot contact gcp-rre-unittest-debian12-xqphpu: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@70fb0402:gcp-rre-unittest-debian12-xqphpu": Remote call on gcp-rre-unittest-debian12-xqphpu failed. The channel is closing down or has closed down
07:05:02  Agent gcp-rre-unittest-debian12-xqphpu was deleted; cancelling node body
07:05:02  Could not connect to gcp-rre-unittest-debian12-xqphpu to send interrupt signal to process

I guess no developer is ever reading this, but I wanted to add this information.

@Dohbedoh
Copy link

@j-koehler Is it possible that you have several Cloud configured with the same <instanceId>...</instanceId> ? You may run a groovy script like the following to double check on this from the Manage Jenkins > Script Console:

import com.google.jenkins.plugins.computeengine.ComputeEngineCloud
import jenkins.model.Jenkins

jenkins.model.Jenkins.get().clouds.findAll {cloud -> cloud instanceof com.google.jenkins.plugins.computeengine.ComputeEngineCloud }.each { cloud -> 
    println "- cloud: " + cloud.cloudName;
	println "  instanceId: " + cloud.instanceId;
    println "  instanceConfigurations:"
    cloud.configurations.each { configuration -> 
      println "  - namePrefix:" + configuration.namePrefix
      println "    jenkins_cloud_id: " + configuration.googleLabels["jenkins_cloud_id"]
    }
}
return

This is something I recently discovered in an environment showing the same symptoms.
This can be quite possible if for example you use configuration as code to define your GCE cloud, and for example copy/paste a configuration over to add additional ones..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants