Spot VM evictions are not reported to Jenkins, so builds hang and status not reported #323

Open
tchrischan opened this issue Nov 4, 2021 · 4 comments

tchrischan commented Nov 4, 2021

Version report

Jenkins and plugins versions report:

Result
Jenkins: 2.303.2
OS: Linux - 4.19.0-16-cloud-amd64
---
ace-editor:1.1
ant:1.11
antisamy-markup-formatter:2.1
apache-httpcomponents-client-4-api:4.5.13-1.0
authentication-tokens:1.4
authorize-project:1.4.0
azure-acs:1.0.4
azure-ad:180.v8b1e80e6f242
azure-artifact-manager:86.va2aa4b1038c7
azure-commons:1.1.3
azure-container-registry-tasks:0.6.5
azure-credentials:182.v3ccd4a755864
azure-iot-edge:2.0.0
azure-sdk:23.v5682688d0eef
azure-vm-agents:783.v58077630847d
basic-branch-build-strategies:1.3.2
blueocean:1.24.3
blueocean-autofavorite:1.2.4
blueocean-bitbucket-pipeline:1.24.3
blueocean-commons:1.24.6
blueocean-config:1.24.3
blueocean-core-js:1.24.3
blueocean-dashboard:1.24.3
blueocean-display-url:2.4.0
blueocean-events:1.24.3
blueocean-git-pipeline:1.24.3
blueocean-github-pipeline:1.24.3
blueocean-i18n:1.24.3
blueocean-jira:1.24.3
blueocean-jwt:1.24.3
blueocean-personalization:1.24.3
blueocean-pipeline-api-impl:1.24.3
blueocean-pipeline-editor:1.24.3
blueocean-pipeline-scm-api:1.24.3
blueocean-rest:1.24.6
blueocean-rest-impl:1.24.3
blueocean-web:1.24.3
bootstrap4-api:4.5.3-1
bootstrap5-api:5.1.0-3
bouncycastle-api:2.20
branch-api:2.6.5
build-timeout:1.20
caffeine-api:2.9.2-29.v717aac953ff3
checks-api:1.7.2
cloud-stats:0.27
cloudbees-bitbucket-branch-source:2.9.6
cloudbees-folder:6.15
cmakebuilder:2.6.3
cobertura:1.16
code-coverage-api:1.4.1
command-launcher:1.5
configuration-as-code:1.54
copyartifact:1.46
credentials:2.6.1
credentials-binding:1.27
data-tables-api:1.10.25-3
discard-old-build:1.05
display-url-api:2.3.5
docker-build-step:2.6
docker-commons:1.17
docker-java-api:3.1.5.2
docker-plugin:1.2.1
docker-workflow:1.25
durable-task:1.35
echarts-api:5.1.2-11
email-ext:2.81
extended-read-permission:3.2
favorite:2.3.2
font-awesome-api:5.15.4-1
forensics-api:1.3.0
git:4.9.0
git-client:3.10.0
git-server:1.9
github:1.32.0
github-api:1.123
github-branch-source:2.10.2
github-checks:1.0.8
github-oauth:0.33
github-pr-coverage-status:2.1.1
github-pullrequest:0.2.8
global-slack-notifier:1.5
google-oauth-plugin:1.0.2
gradle:1.36
handlebars:1.1.1
handy-uri-templates-2-api:2.1.8-1.0
htmlpublisher:1.25
icon-shim:2.0.3
influxdb:3.0.2
jackson2-api:2.12.3
jaxb:2.3.0.1
jdk-tool:1.4
jenkins-design-language:1.24.3
jira:3.1.3
jjwt-api:0.11.2-9.c8b45b8bb173
jobConfigHistory:2.27
jquery-detached:1.2.1
jquery3-api:3.6.0-2
jsch:0.1.55.2
junit:1.48
kubernetes:1.28.5
kubernetes-cd:2.3.1
kubernetes-client-api:4.11.1
kubernetes-credentials:0.7.0
ldap:2.3
llvm-cov:1.0.0
lockable-resources:2.10
mailer:1.34
mapdb-api:1.0.9.0
matrix-auth:2.6.6
matrix-project:1.19
mercurial:2.12
metrics:4.0.2.7
momentjs:1.1.1
multibranch-build-strategy-extension:1.0.10
oauth-credentials:0.4
okhttp-api:3.14.9
pam-auth:1.6
pipeline-build-step:2.13
pipeline-github-lib:1.0
pipeline-graph-analysis:1.10
pipeline-input-step:2.12
pipeline-milestone-step:1.3.1
pipeline-model-api:1.7.2
pipeline-model-declarative-agent:1.1.1
pipeline-model-definition:1.7.2
pipeline-model-extensions:1.7.2
pipeline-rest-api:2.19
pipeline-stage-step:2.5
pipeline-stage-tags-metadata:1.7.2
pipeline-stage-view:2.19
plain-credentials:1.7
plugin-usage-plugin:1.1
plugin-util-api:2.4.0
popper-api:1.16.0-7
popper2-api:2.9.3-1
pubsub-light:1.13
resource-disposer:0.14
scm-api:2.6.5
script-security:1.78
slack:2.48
snakeyaml-api:1.29.1
sse-gateway:1.24
ssh-agent:1.23
ssh-credentials:1.19
ssh-slaves:1.32.0
sshd:3.0.3
structs:1.23
subversion:2.13.2
timestamper:1.11.8
token-macro:266.v44a80cf277fd
trilead-api:1.0.13
variant:1.4
windows-azure-storage:355.v4da08e72a251
windows-slaves:1.7
workflow-aggregator:2.6
workflow-api:2.46
workflow-basic-steps:2.23
workflow-cps:2.93
workflow-cps-global-lib:2.17
workflow-durable-task-step:2.37
workflow-job:2.41
workflow-multibranch:2.26
workflow-scm-step:2.13
workflow-step-api:2.24
workflow-support:3.8
ws-cleanup:0.38

Reproduction steps

  • Configure cloud VMs with the Spot instance box checked.
  • Run some builds.
  • Eventually, a build will run far past its expected completion time because the VM was evicted.

Results

Expected result:

The build should be reported as FAILURE. At least if it is marked failed, we can have the pipeline re-run it. Ideally, eviction would deallocate the VM and Jenkins could allocate a new spot VM with the same disk and restart the failed stage.

Actual result:

The build hangs indefinitely until it is aborted. The logs report the following:
Connection was broken

java.io.EOFException
	at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2872)
	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3367)
	at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:936)
	at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:379)
	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
	at hudson.remoting.Command.readFrom(Command.java:142)
	at hudson.remoting.Command.readFrom(Command.java:128)
	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
Caused: java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:75)

domazaris commented Oct 3, 2022

Bumping this issue.

I believe this is due to this check during cleanup:

// If the machine is not idle, don't do anything.
// Could have been taken offline by the plugin while still running
// builds.
if (!azureComputer.isIdle()) {
    continue;
}

As far as I can tell, even if a spot node has been evicted, this check skips the agent for as long as it still has jobs assigned to it, so the agent is never removed and the job hangs indefinitely (until someone manually deletes the agent from Jenkins).

We potentially need to move the existence check from further down to above this idle check:

// Check if the virtual machine exists.  If not, it could have been
// deleted in the background.  Remove from Jenkins if that is the case.
if (!AzureVMManagementServiceDelegate.virtualMachineExists(agentNode)) {

Unless there is a reason to keep a spot node around in Jenkins even if it has been deleted in Azure?
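
To make the ordering argument concrete, here is a minimal, self-contained sketch. It does not use the plugin's real classes: `Agent`, `removedWithIdleCheckFirst`, and `removedWithExistenceCheckFirst` are hypothetical names, and the two boolean fields stand in for `azureComputer.isIdle()` and `AzureVMManagementServiceDelegate.virtualMachineExists(agentNode)`. It only illustrates why a busy agent whose VM is gone is never reached when the idle check comes first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the cleanup loop; not the azure-vm-agents API.
class EvictionCleanupSketch {

    /** Stand-in for an agent as seen by the cleanup task (hypothetical). */
    static final class Agent {
        final String name;
        final boolean idle;          // false while a build is still assigned
        final boolean existsInAzure; // false once the VM has been evicted and deleted

        Agent(String name, boolean idle, boolean existsInAzure) {
            this.name = name;
            this.idle = idle;
            this.existsInAzure = existsInAzure;
        }
    }

    /** Order quoted in the issue: the idle check runs first, so busy agents are skipped. */
    static List<String> removedWithIdleCheckFirst(List<Agent> agents) {
        List<String> removed = new ArrayList<>();
        for (Agent agent : agents) {
            if (!agent.idle) {
                continue; // a busy agent never reaches the existence check
            }
            if (!agent.existsInAzure) {
                removed.add(agent.name);
            }
        }
        return removed;
    }

    /** Proposed order: the existence check runs first, so evicted agents are removed even when busy. */
    static List<String> removedWithExistenceCheckFirst(List<Agent> agents) {
        List<String> removed = new ArrayList<>();
        for (Agent agent : agents) {
            if (!agent.existsInAzure) {
                removed.add(agent.name);
                continue;
            }
            if (!agent.idle) {
                continue; // VM still exists and is busy: leave it alone
            }
            // Remaining idle-agent cleanup is out of scope for this sketch.
        }
        return removed;
    }
}
```

With one busy evicted agent, one idle evicted agent, and one busy live agent, the first method removes only the idle evicted agent, while the second also removes the busy evicted one. Whether removing a busy agent is safe in the real plugin is exactly the open question raised here.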

Edit: Formatting, grammar

timja (Member) commented Oct 3, 2022

@jglick I think you were doing some work in this area to make it easier to handle spot evictions in cloud providers

Any tips?

jglick (Member) commented Oct 4, 2022

Well, you can use the new

retry(count: 2, conditions: [agent()]) {
  node(…) {
    //
  }
}

idiom, which will retry the node block if it gets killed for a recognized reason—cases where the behavior otherwise is that the build fails/aborts with an agent-related error. If the cloud plugin fails to properly terminate the node to begin with then this will not work. Normally the channel pinger ought to abort on the controller side at some point even if the cloud plugin does nothing special, though.
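
The retry semantics described above can be modeled in plain Java. This is only an illustration, not how the Pipeline `retry` step is implemented: `AgentRetrySketch`, `AgentLostException`, and `retryOnAgentLoss` are hypothetical names. The point it makes is that a retry can only happen once the body actually fails with a recognized agent error; a body that never returns is never retried.

```java
import java.util.concurrent.Callable;

// Hypothetical model of "retry only on agent-related failures"; not Jenkins code.
class AgentRetrySketch {

    /** Hypothetical marker for a failure attributed to the agent going away. */
    static final class AgentLostException extends Exception {
        AgentLostException(String message) {
            super(message);
        }
    }

    /**
     * Runs the body up to {@code count} times, retrying only when it fails
     * with AgentLostException. Any other exception propagates immediately.
     */
    static <T> T retryOnAgentLoss(int count, Callable<T> body) throws Exception {
        if (count < 1) {
            throw new IllegalArgumentException("count must be at least 1");
        }
        AgentLostException last = null;
        for (int attempt = 1; attempt <= count; attempt++) {
            try {
                return body.call();
            } catch (AgentLostException e) {
                last = e; // recognized agent failure: try again if attempts remain
            }
        }
        throw last; // every attempt ended with an agent failure
    }
}
```

If `body.call()` blocks forever because nothing ever reports the agent as lost, the loop never reaches the `catch` branch, which corresponds to the hang reported in this issue.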


domazaris commented Oct 4, 2022

Yes, I was very happy with the new agent/retry functionality. This normally works really well, but when a long-running sh command is in progress and a spot node gets reclaimed/removed by Azure, the job hangs (seemingly) indefinitely. I have tried leaving such jobs for multiple hours before manually deleting the agent via Jenkins. Once I have manually deleted the agent in Jenkins, the job retries and resumes correctly, saving a rebuild of the whole job.

timja removed the bug label Oct 3, 2024