Azure VMs get stuck in offline after a job, but not deprovisioning #588

Open
alexholliz opened this issue Nov 22, 2024 · 1 comment

@alexholliz

Jenkins and plugins versions report

Environment
Jenkins: 2.479.1
OS: Linux - 5.15.0-1074-azure
Java: 21.0.5 - Ubuntu (OpenJDK 64-Bit Server VM)
---
Office-365-Connector:5.0.0
ace-editor:1.1
ant:511.v0a_a_1a_334f41b_
antisamy-markup-formatter:162.v0e6ec0fcfcf6
apache-httpcomponents-client-4-api:4.5.14-208.v438351942757
asm-api:9.7.1-97.v4cc844130d97
authentication-tokens:1.119.v50285141b_7e1
azure-ad:531.v13107da_f2635
azure-artifact-manager:154.vb6e2724f0095
azure-commons:1.1.3
azure-container-agents:253.vd2f5cd5c5040
azure-credentials:312.v0f3973cd1e59
azure-keyvault:251.vcfe31c013dc7
azure-sdk:184.v1f2c161c9777
azure-vm-agents:968.v7885b_e6e6fb_b_
blueocean:1.27.16
blueocean-bitbucket-pipeline:1.27.16
blueocean-commons:1.27.16
blueocean-config:1.27.16
blueocean-core-js:1.27.16
blueocean-dashboard:1.27.16
blueocean-display-url:2.4.3
blueocean-events:1.27.16
blueocean-git-pipeline:1.27.16
blueocean-github-pipeline:1.27.16
blueocean-i18n:1.27.16
blueocean-jwt:1.27.16
blueocean-personalization:1.27.16
blueocean-pipeline-api-impl:1.27.16
blueocean-pipeline-editor:1.27.16
blueocean-pipeline-scm-api:1.27.16
blueocean-rest:1.27.16
blueocean-rest-impl:1.27.16
blueocean-web:1.27.16
bootstrap5-api:5.3.3-1
bouncycastle-api:2.30.1.78.1-248.ve27176eb_46cb_
branch-api:2.1200.v4b_a_3da_2eb_db_4
build-timeout:1.33
caffeine-api:3.1.8-133.v17b_1ff2e0599
checks-api:2.2.1
cloud-stats:336.v788e4055508b_
cloudbees-bitbucket-branch-source:912.v3b_f74026941c
cloudbees-folder:6.959.v4ed5cc9e2dd4
command-launcher:116.vd85919c54a_d6
commons-lang3-api:3.17.0-84.vb_b_938040b_078
commons-text-api:1.12.0-129.v99a_50df237f7
copyartifact:757.v05365583a_455
credentials:1389.vd7a_b_f5fa_50a_2
credentials-binding:687.v619cb_15e923f
display-url-api:2.209.v582ed814ff2f
docker-commons:445.v6b_646c962a_94
durable-task:577.v2a_8a_4b_7c0247
echarts-api:5.5.1-4
eddsa-api:0.3.0-4.v84c6f0f4969e
email-ext:1855.vd9e491cb_de1e
embeddable-build-status:487.va_0ef04c898a_2
favorite:2.221.v19ca_666b_62f5
font-awesome-api:6.6.0-2
git:5.6.0
git-client:6.1.0
github:1.40.0
github-api:1.321-478.vc9ce627ce001
github-branch-source:1807.v50351eb_7dd13
gradle:2.13.1
gson-api:2.11.0-85.v1f4e87273c33
handy-uri-templates-2-api:2.1.8-30.v7e777411b_148
htmlpublisher:1.37
instance-identity:201.vd2a_b_5a_468a_a_6
ionicons-api:74.v93d5eb_813d5f
jackson2-api:2.17.0-379.v02de8ec9f64c
jakarta-activation-api:2.1.3-1
jakarta-mail-api:2.1.3-1
javax-activation-api:1.2.0-7
javax-mail-api:1.6.2-10
jaxb:2.3.9-1
jdk-tool:80.v8a_dee33ed6f0
jenkins-design-language:1.27.16
jjwt-api:0.11.5-112.ve82dfb_224b_a_d
jobConfigHistory:1283.veb_dfb_00b_5ec0
joda-time-api:2.13.0-93.v9934da_29b_a_e9
jquery3-api:3.7.1-2
jsch:0.2.16-86.v42e010d9484b_
json-api:20240303-101.v7a_8666713110
json-path-api:2.9.0-118.v7f23ed82a_8b_8
junit:1309.v0078b_fecd6ed
ldap:770.vb_455e934581a_
mailer:489.vd4b_25144138f
matrix-auth:3.2.3
matrix-project:840.v812f627cb_578
metrics:4.2.21-458.vcf496cb_839e4
mina-sshd-api-common:2.14.0-133.vcc091215a_358
mina-sshd-api-core:2.14.0-133.vcc091215a_358
momentjs:1.1.1
okhttp-api:4.11.0-181.v1de5b_83857df
p4:1.16.0
pam-auth:1.11
pipeline-build-step:540.vb_e8849e1a_b_d8
pipeline-github-lib:61.v629f2cc41d83
pipeline-graph-analysis:216.vfd8b_ece330ca_
pipeline-graph-view:382.vb_9a_27b_7b_ea_71
pipeline-groovy-lib:744.v5b_556ee7c253
pipeline-input-step:495.ve9c153f6067b_
pipeline-milestone-step:119.vdfdc43fc3b_9a_
pipeline-model-api:2.2218.v56d0cda_37c72
pipeline-model-definition:2.2218.v56d0cda_37c72
pipeline-model-extensions:2.2218.v56d0cda_37c72
pipeline-rest-api:2.34
pipeline-stage-step:312.v8cd10304c27a_
pipeline-stage-tags-metadata:2.2218.v56d0cda_37c72
pipeline-stage-view:2.34
plain-credentials:183.va_de8f1dd5a_2b_
plugin-util-api:5.1.0
popper2-api:2.11.6-5
pubsub-light:1.18
resource-disposer:0.25
scm-api:698.v8e3b_c788f0a_6
script-security:1369.v9b_98a_4e95b_2d
snakeyaml-api:2.3-123.v13484c65210a_
sse-gateway:1.27
ssh-credentials:343.v884f71d78167
ssh-slaves:2.973.v0fa_8c0dea_f9f
sshd:3.330.vc866a_8389b_58
structs:338.v848422169819
support-core:1523.v5486c8d6da_f3
timestamper:1.28
token-macro:400.v35420b_922dcb_
trilead-api:2.147.vb_73cc728a_32e
variant:60.v7290fc0eb_b_cd
windows-azure-storage:454.v573775e9feef
workflow-aggregator:600.vb_57cdd26fdd7
workflow-api:1336.vee415d95c521
workflow-basic-steps:1058.vcb_fc1e3a_21a_9
workflow-cps:3993.v3e20a_37282f8
workflow-durable-task-step:1378.v6a_3e903058a_3
workflow-job:1468.vcf4f5ee92395
workflow-multibranch:795.ve0cb_1f45ca_9a_
workflow-scm-step:427.v4ca_6512e7df1
workflow-step-api:678.v3ee58b_469476
workflow-support:932.vb_555de1b_a_b_94
ws-cleanup:0.48

What Operating System are you using (both controller, and any agents involved in the problem)?

Controller:
OS: Ubuntu 20.04.6 LTS
JDK: 21.0.5

Agents:
OS: Microsoft Windows Server 2022-datacenter-g2
Launch Method: SSH
JDK: 21.0.5 +11 LTS (From Temurin package)

Reproduction steps

  1. Configure an agent template with the "Azure VM Idle Retention Strategy" and a timeout of 5 minutes
  2. Wait for a build to run
  3. Wait for 5 minutes after the build completes

Expected Results

The agent should be marked as "Offline" and then deprovisioned in Azure via the Azure VM Agents Plugin.

Actual Results

The agent instead stays in the "Offline" state in Jenkins even though the actual VM in Azure has been deleted. This causes the plugin to hit its max agents limit, so jobs queue indefinitely and no new VMs are started to process them.
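To make the failure mode concrete, here is a toy model of how stale offline nodes could exhaust the agent limit. It is only a sketch, not the plugin's actual code: the `Node` class, the `can_provision` function, and the rule that every Jenkins node record counts toward the limit (whether or not its VM still exists) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    online: bool
    vm_exists: bool  # whether the backing VM still exists in Azure


def can_provision(nodes, max_agents):
    """Assumed rule: every node record counts toward the limit, online or not."""
    return len(nodes) < max_agents


# Two agents whose VMs were deleted in Azure but whose node records remain.
nodes = [
    Node("win-agent-1", online=False, vm_exists=False),
    Node("win-agent-2", online=False, vm_exists=False),
]

print(can_provision(nodes, max_agents=2))  # False
```

Under that assumed rule, the two dead records fill both slots, so queued jobs would never trigger a new VM until the records are removed.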

Anything else?

The only lead I've been able to track down is a pair of bulleted error messages on the "Clouds" overview page:

* Provisioning activity was completed before reaching OPERATING phase without reporting a problem
* Provisioning interrupted by restart

I don't see any errors or stack traces in the com.microsoft.azure.vmagent logger, so there's nothing to dig into there. Looking through the deployments in the Azure Portal, the deprovisioning deployment is received and carried out successfully. Something is missing between the plugin checking the state of a VM in Azure and reporting it to the provisioned nodes in Jenkins.
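The missing step described above amounts to reconciling Jenkins node records against the VMs that still exist in Azure. A minimal sketch of that comparison follows. The function name `find_orphaned_nodes` and both input shapes (a mapping of node name to online flag, and a list of VM names) are hypothetical and do not reflect the plugin's or Azure's real APIs.

```python
def find_orphaned_nodes(jenkins_nodes, azure_vm_names):
    """Return names of offline Jenkins nodes whose VM no longer exists in Azure.

    jenkins_nodes: dict mapping node name -> True if online (hypothetical shape)
    azure_vm_names: iterable of VM names currently present in Azure
    """
    existing = set(azure_vm_names)
    return sorted(
        name
        for name, online in jenkins_nodes.items()
        if not online and name not in existing
    )


# Illustrative data only; these names are made up.
jenkins_nodes = {
    "win-agent-1": False,  # offline, VM already deleted
    "win-agent-2": True,   # online and healthy
    "win-agent-3": False,  # offline, but the VM still exists
}
azure_vm_names = ["win-agent-2", "win-agent-3"]

orphans = find_orphaned_nodes(jenkins_nodes, azure_vm_names)
print(orphans)  # ['win-agent-1']
```

Only nodes that are both offline and missing from Azure are reported, so an agent that is temporarily offline while its VM still exists is left alone.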

Are you interested in contributing a fix?

I mean sure, but Java isn't my main strength

@timja
Member

timja commented Dec 19, 2024

Is this still happening?

Can you provide all the logs you have?

I can't reproduce this. I set one up with a 5-minute idle retention and it was deleted fine:

AzureVMAgent#deprovision: linux-5-min-idlebea3f0 has been deprovisioned. Remove node ...
