Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-rollback of failed upgrade fails with some resources #421

Open
porridge opened this issue Feb 13, 2025 · 3 comments
Open

Auto-rollback of failed upgrade fails with some resources #421

porridge opened this issue Feb 13, 2025 · 3 comments

Comments

@porridge
Copy link
Member

porridge commented Feb 13, 2025

It seems that helm's force option for rollbacks (that this library uses here) does not deal well with some resources.

When using the stackrox operator the error looks like this:

rollback failed: failed to replace object: PersistentVolumeClaim "scanner-v4-db" is invalid: spec: Forbidden: spec is immutable after creation except resources.requests and volumeAttributesClassName for bound claims
  core.PersistentVolumeClaimSpec{
      AccessModes:      {"ReadWriteOnce"},
      Selector:         nil,
      Resources:        {Requests: {s"storage": {i: {...}, s: "50Gi", Format: "BinarySI"}}},
-     VolumeName:       "pvc-a78c78d0-a692-47d6-b01a-be7f8d6ebd4a",
+     VolumeName:       "",
-     StorageClassName: &"standard-csi",
+     StorageClassName: nil,
      VolumeMode:       &"Filesystem",
      DataSource:       nil,
      ... // 2 identical fields
  }
: original upgrade error: cannot patch "scanner-v4-matcher-config" with kind ConfigMap:  "" is invalid: patch: Invalid value: "
{
  "apiVersion": "v1",
  "data": {
    "config.yaml": {
      "matcher": {
        "vulnerabilities_url": "https://central.stackrox.svc/api/extensions/scannerdefinitions?version=dev"
      }
    }
  },
  "kind": "ConfigMap",
  "metadata": {
[... snip ...]
  }
}": json: cannot unmarshal object into Go struct field ConfigMap.data of type string 

It may be easier to understand what is going on on the following example, since the issue also happens with helm CLI. It's just that while the CLI warns to exercise caution with this option, the helm-operator-plugins library uses force automatically and unconditionally.

Let's say I have a chart that contains just a PVC such as this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc1
spec:
  accessModes:
  - ReadWriteOnce
{{- if .Values.badness }}
  storageClassName:
    nonsense: yes
{{- else }}
  storageClassName: standard
{{- end }}
  resources:
    requests:
      storage: 500Mi

Then I:

helm install --debug repro1 .

This works fine. Now I create any pod that uses the PVC to make it flip to a Bound state.

Then I try to apply a broken upgrade:

$ helm upgrade repro1 . --set badness=true
Error: UPGRADE FAILED: cannot patch "pvc1" with kind PersistentVolumeClaim:  "" is invalid: patch: Invalid value: "{\"apiVersion\":\"v1\",\"kind\":\"PersistentVolumeClaim\",\"metadata\":{\"annotations\":{\"meta.helm.sh/release-name\":\"repro1\",\"meta.helm.sh/release-namespace\":\"default\",\"pv.kubernetes.io/bind-completed\":\"yes\",\"pv.kubernetes.io/bound-by-controller\":\"yes\",\"volume.beta.kubernetes.io/storage-provisioner\":\"rancher.io/local-path\",\"volume.kubernetes.io/selected-node\":\"kind-control-plane\",\"volume.kubernetes.io/storage-provisioner\":\"rancher.io/local-path\"},\"creationTimestamp\":\"2025-02-13T12:00:25Z\",\"finalizers\":[\"kubernetes.io/pvc-protection\"],\"labels\":{\"app.kubernetes.io/managed-by\":\"Helm\"},\"managedFields\":[{\"manager\":\"helm\",\"operation\":\"Update\",\"apiVersion\":\"v1\",\"time\":\"2025-02-13T12:00:25Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:meta.helm.sh/release-name\":{},\"f:meta.helm.sh/release-namespace\":{}},\"f:labels\":{\".\":{},\"f:app.kubernetes.io/managed-by\":{}}},\"f:spec\":{\"f:accessModes\":{},\"f:resources\":{\"f:requests\":{\".\":{},\"f:storage\":{}}},\"f:storageClassName\":{},\"f:volumeMode\":{}}}},{\"manager\":\"kube-scheduler\",\"operation\":\"Update\",\"apiVersion\":\"v1\",\"time\":\"2025-02-13T12:00:38Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:annotations\":{\"f:volume.kubernetes.io/selected-node\":{}}}}},{\"manager\":\"kube-controller-manager\",\"operation\":\"Update\",\"apiVersion\":\"v1\",\"time\":\"2025-02-13T12:00:41Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:annotations\":{\"f:pv.kubernetes.io/bind-completed\":{},\"f:pv.kubernetes.io/bound-by-controller\":{},\"f:volume.beta.kubernetes.io/storage-provisioner\":{},\"f:volume.kubernetes.io/storage-provisioner\":{}}},\"f:spec\":{\"f:volumeName\":{}}}},{\"manager\":\"kube-controller-manager\",\"operation\":\"Update\",\"apiVersion\":\"v1\",\"time\":\"2025-02-13T12:00:41Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:status\":{\"f:accessModes\":{},\"f:capacity\":{\".\":{},\"f:storage\":{}},\"f:phase\":{}}},\"subresource\":\"status\"}],\"name\":\"pvc1\",\"namespace\":\"default\",\"resourceVersion\":\"764382\",\"uid\":\"75dd22fc-06ea-4ee0-b9e5-76eb054c3c62\"},\"spec\":{\"accessModes\":[\"ReadWriteOnce\"],\"resources\":{\"requests\":{\"storage\":\"500Mi\"}},\"storageClassName\":{\"nonsense\":true},\"volumeMode\":\"Filesystem\",\"volumeName\":\"pvc-75dd22fc-06ea-4ee0-b9e5-76eb054c3c62\"},\"status\":{\"phase\":\"Bound\",\"accessModes\":[\"ReadWriteOnce\"],\"capacity\":{\"storage\":\"500Mi\"}}}": json: cannot unmarshal object into Go struct field PersistentVolumeClaimSpec.spec.storageClassName of type string

This is expected. Now in this case the library would try to roll back, but with the force option this fails:

$ helm rollback repro1 1 --force
Error: failed to replace object: PersistentVolumeClaim "pvc1" is invalid: spec: Forbidden: spec is immutable after creation except resources.requests and volumeAttributesClassName for bound claims
  core.PersistentVolumeClaimSpec{
  	AccessModes:      {"ReadWriteOnce"},
  	Selector:         nil,
  	Resources:        {Requests: {s"storage": {i: {...}, s: "500Mi", Format: "BinarySI"}}},
- 	VolumeName:       "pvc-75dd22fc-06ea-4ee0-b9e5-76eb054c3c62",
+ 	VolumeName:       "",
  	StorageClassName: &"standard",
  	VolumeMode:       &"Filesystem",
  	... // 3 identical fields
  }

And then we fall into the loop as described in #224

@porridge
Copy link
Member Author

Before I try patching the library to see whether removing Force helps: what is the reason for using this option in the first place? 🤔

@joelanford
Copy link
Member

Hey thanks for the report @porridge!

The intention behind use of the Force option is to allow helm to delete and re-create objects, if necessary, in order to allow the rollback to succeed. It seems we didn't envision a scenario where Force would cause a rollback to fail when it would have otherwise succeeded without Force.

I was actually discussing a similar problem that we discovered in operator-controller just today in operator-framework/operator-controller#1776 (specifically: operator-framework/operator-controller#1776 (comment)), where we are concerned about how to safely and properly recover from a helm install/upgrade that was left in the pending state.

We thus far have opted to specifically disable the auto-uninstall/auto-rollback behavior because we are concerned that rolling back (I guess uninstalling a failed release would probably be fine) could actually exacerbate problems if a partial upgrade has performed some migration that the previous version cannot handle.

@porridge
Copy link
Member Author

Hey thanks for the report @porridge!

The intention behind use of the Force option is to allow helm to delete and re-create objects, if necessary

Oh, this sounds risky w.r.t. storage-related resources 😬 I mean, deleting and recreating a PVC if something goes wrong could be a disaster. 🙈
+1 to removing the auto-rollback here too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants