Inconsistent behavior with Image pull secrets and external secrets #8357

Open
stephenc opened this issue Oct 31, 2024 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@stephenc

Expected Behavior

When I apply a single YAML file that includes an ExternalSecret and a TaskRun that uses the secret the ExternalSecret will create, the status of the TaskRun should be deterministic, especially when default-imagepullbackoff-timeout is left at its default value of 0.

When default-imagepullbackoff-timeout is left at its default value of 0, the two cases should behave the same way. Either entrypoint inference should fail only after allowing a 30-40s back-off when the TaskRun specifies image pull secrets and some of those secrets do not yet exist, or, when entrypoint inference is not required, the initial image pull failure should result in an immediate failure.

When default-imagepullbackoff-timeout is set to a non-zero value, entrypoint inference should not fail immediately but should retry up to the configured timeout, especially when the TaskRun includes a pod template that specifies image pull secrets and some of those secrets have not yet been created. When entrypoint inference is not required, the back-off can take up to 30s longer than the configured timeout in my observations.

At the very least, the documentation for default-imagepullbackoff-timeout should mention that there can be roughly 30s of additional grace, though it would be best if everything behaved consistently, ideally respecting the configured back-off.
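
For context, here is a minimal sketch of how the timeout discussed above might be set. It assumes the key lives in the `config-defaults` ConfigMap in the `tekton-pipelines` namespace, alongside Tekton's other defaults. The helper names `build_backoff_patch` and `set_backoff_timeout` are made up for illustration.

```shell
# Build a JSON merge patch that sets default-imagepullbackoff-timeout.
# $1 is a duration string such as "60s".
build_backoff_patch() {
  printf '{"data":{"default-imagepullbackoff-timeout":"%s"}}' "$1"
}

# Apply the patch (assumption: the key is read from the config-defaults
# ConfigMap in the tekton-pipelines namespace). Needs kubectl and cluster access.
set_backoff_timeout() {
  kubectl patch configmap config-defaults -n tekton-pipelines \
    --type merge -p "$(build_backoff_patch "$1")"
}
```

Calling `set_backoff_timeout 60s` would match the 60s configuration used for the observations in this report.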

Actual Behavior

  • If the TaskRun step that uses the image pull secret requires inference of the entrypoint, then the TaskRun fails immediately without waiting for the secret to be provisioned, ignoring any configured default-imagepullbackoff-timeout.
  • If the TaskRun step that uses the image pull secret has the entrypoint explicitly stated, then the TaskRun (in Tekton v0.65) may allow an additional back-off retry period, typically 30-40s. After that it either fails, if the external secret has not been provisioned, or succeeds, if the ExternalSecret controller has managed to get the secret provisioned.

For example, the following task runs were all created at the same time in a Tekton cluster with default-imagepullbackoff-timeout: 60s

Screenshot 2024-10-31 at 10 51 47

Notice how the two task runs that infer the entrypoint both fail immediately and ignore the image pull back-off. The two task runs with an explicit entrypoint fail no sooner than 60s after starting, and sometimes 30-40s later than that.

For example, in this case the completion time of the two explicit-entrypoint task runs was approx 60s after start:
Screenshot 2024-10-31 at 10 56 17

But I have also had cases where the TaskRun that does not require an image pull secret took >90s while the one that required the image pull secret took ~60s

Steps to Reproduce the Problem

Note this is not a full reproducer as I am simplifying down from a more complex case. For the issue as described above you should be able to do something similar to this (or exclude external secrets and just manually create the secrets less than 30s after applying the TaskRun):

$ cat > infer-entrypoint.yaml <<EOF
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: image-pull-secret
spec:
  refreshInterval: "1h"
  secretStoreRef:
    name: my-secret-store
    kind: SecretStore
  target:
    name: does-not-exist-yet
    creationPolicy: Owner
    template:
      type: kubernetes.io/dockerconfigjson # Secret type is set on the target template
  data:
    - secretKey: .dockerconfigjson
      remoteRef:
        key: /path/to/your/secret # Path to the image pull secret in the secret store
  dataFrom: []
---
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: infer-entrypoint-external-secret
spec:
  taskSpec:
    steps:
      - name: print-message
        image: my-registry.example.com/some-image:latest
  podTemplate:
    imagePullSecrets:
      - name: does-not-exist-yet
EOF
$ kubectl apply -f infer-entrypoint.yaml 

Almost always this will result in a TaskRun that fails immediately, though if you are lucky, and depending on the latency to the secret store, you may get a successful run. The time between the StartTime and the CompletionTime is typically minimal, e.g. I got:

Status:
  Completion Time:  2024-10-31T10:25:35Z
  Conditions:
    Last Transition Time:  2024-10-31T10:25:35Z
    Message:               failed to create task run pod "infer-entrypoint-external-secret": translating TaskSpec to Pod: Get "https://my-registry.example.com/v2/": dial tcp: lookup my-registry.example.com on 10.43.0.10:53: no such host. Maybe invalid TaskSpec
    Reason:                PodCreationFailed
    Status:                False
    Type:                  Succeeded
  Pod Name:                
  Provenance:
    Feature Flags:
      Await Sidecar Readiness:                true
      Coschedule:                             workspaces
      Disable Affinity Assistant:             false
      Disable Creds Init:                     false
      Disable Inline Spec:                    
      Enable API Fields:                      beta
      Enable Artifacts:                       false
      Enable CEL In When Expression:          false
      Enable Concise Resolver Syntax:         false
      Enable Keep Pod On Cancel:              false
      Enable Kubernetes Sidecar:              false
      Enable Param Enum:                      false
      Enable Provenance In Status:            true
      Enable Step Actions:                    false
      Enforce Nonfalsifiability:              none
      Max Result Size:                        4096
      Require Git SSH Secret Known Hosts:     false
      Result Extraction Method:               termination-message
      Running In Env With Injected Sidecars:  true
      Send Cloud Events For Runs:             false
      Set Security Context:                   false
      Verification No Match Policy:           ignore
  Start Time:                                 2024-10-31T10:25:35Z
  Task Spec:
    Steps:
      Compute Resources:
      Image:  my-registry.example.com/some-image:latest
      Name:   print-message
Events:
  Type     Reason         Age                From     Message
  ----     ------         ----               ----     -------
  Normal   Started        24s (x2 over 24s)  TaskRun  
  Warning  Failed         24s (x2 over 24s)  TaskRun  failed to create task run pod "infer-entrypoint-external-secret": translating TaskSpec to Pod: Get "https://my-registry.example.com/v2/": dial tcp: lookup my-registry.example.com on 10.43.0.10:53: no such host. Maybe invalid TaskSpec
  Warning  InternalError  24s (x2 over 24s)  TaskRun  1 error occurred:
           * failed to create task run pod "infer-entrypoint-external-secret": translating TaskSpec to Pod: Get "https://my-registry.example.com/v2/": dial tcp: lookup my-registry.example.com on 10.43.0.10:53: no such host. Maybe invalid TaskSpec

Notice how the difference between the StartTime 2024-10-31T10:25:35Z and the CompletionTime 2024-10-31T10:25:35Z is negligible (in the worst case I have observed a 1s difference between the two).

On the other hand, with:

$ cat > infer-entrypoint.yaml <<EOF
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: image-pull-secret
spec:
  refreshInterval: "1h"
  secretStoreRef:
    name: my-secret-store
    kind: SecretStore
  target:
    name: does-not-exist-yet
    creationPolicy: Owner
    template:
      type: kubernetes.io/dockerconfigjson # Secret type is set on the target template
  data:
    - secretKey: .dockerconfigjson
      remoteRef:
        key: /path/to/your/secret # Path to the image pull secret in the secret store
  dataFrom: []
---
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: explicit-entrypoint-external-secret
spec:
  taskSpec:
    steps:
      - name: print-message
        image: my-registry.example.com/some-image:latest
        command:
          - echo
        args:
          - "Hello, world!"
  podTemplate:
    imagePullSecrets:
      - name: does-not-exist-yet
EOF
$ kubectl apply -f infer-entrypoint.yaml 

On Tekton v0.65.0 you get something like this (in the case where I have rigged the ExternalSecret to fail to provision the secret):

Status:
  Completion Time:  2024-10-31T10:26:17Z
  Conditions:
    Last Transition Time:  2024-10-31T10:26:17Z
    Message:               the step "print-message" in TaskRun "explicit-entrypoint-external-secret" failed to pull the image "". The pod errored with the message: "Back-off pulling image "my-registry.example.com/some-image:latest"."
    Reason:                TaskRunImagePullFailed
    Status:                False
    Type:                  Succeeded
  Pod Name:                explicit-entrypoint-external-secret-pod
  Provenance:
    Feature Flags:
      Await Sidecar Readiness:                true
      Coschedule:                             workspaces
      Disable Affinity Assistant:             false
      Disable Creds Init:                     false
      Disable Inline Spec:                    
      Enable API Fields:                      beta
      Enable Artifacts:                       false
      Enable CEL In When Expression:          false
      Enable Concise Resolver Syntax:         false
      Enable Keep Pod On Cancel:              false
      Enable Kubernetes Sidecar:              false
      Enable Param Enum:                      false
      Enable Provenance In Status:            true
      Enable Step Actions:                    false
      Enforce Nonfalsifiability:              none
      Max Result Size:                        4096
      Require Git SSH Secret Known Hosts:     false
      Result Extraction Method:               termination-message
      Running In Env With Injected Sidecars:  true
      Send Cloud Events For Runs:             false
      Set Security Context:                   false
      Verification No Match Policy:           ignore
  Start Time:                                 2024-10-31T10:25:35Z
  Steps:
    Container:  step-print-message
    Name:       print-message
    Terminated:
      Exit Code:         1
      Finished At:       2024-10-31T10:26:17Z
      Message:           Step print-message terminated as pod explicit-entrypoint-external-secret-pod is terminated
      Reason:            TaskRunImagePullFailed
      Started At:        <nil>
    Termination Reason:  TaskRunImagePullFailed
  Task Spec:
    Steps:
      Args:
        Hello, world!
      Command:
        echo
      Compute Resources:
      Image:  my-registry.example.com/some-image:latest
      Name:   print-message
Events:
  Type     Reason           Age                  From     Message
  ----     ------           ----                 ----     -------
  Normal   Started          2m49s                TaskRun  
  Normal   Pending          2m49s                TaskRun  Pending
  Normal   Pending          2m49s                TaskRun  pod status "PodReadyToStartContainers":"False"; message: ""
  Normal   Pending          2m48s                TaskRun  pod status "Ready":"False"; message: "containers with unready status: [step-print-message]"
  Normal   PullImageFailed  2m35s                TaskRun  build step "step-print-message" is pending with reason "failed to pull and unpack image \"my-registry.example.com/some-image:latest\": failed to resolve reference \"my-registry.example.com/some-image:latest\": failed to do request: Head \"https://my-registry.example.com/v2/some-image/manifests/latest\": dial tcp: lookup my-registry.example.com: no such host"
  Normal   PullImageFailed  2m7s                 TaskRun  build step "step-print-message" is pending with reason "Back-off pulling image \"my-registry.example.com/some-image:latest\""
  Warning  Failed           2m7s (x2 over 2m7s)  TaskRun  the step "print-message" in TaskRun "explicit-entrypoint-external-secret" failed to pull the image "". The pod errored with the message: "Back-off pulling image "my-registry.example.com/some-image:latest"."
  Warning  InternalError    2m6s (x2 over 2m6s)  TaskRun  pods "explicit-entrypoint-external-secret-pod" not found

Of note here is that the StartTime 2024-10-31T10:25:35Z is approx 40s before the CompletionTime 2024-10-31T10:26:17Z because Tekton has allowed a back-off on the image pull. This is despite my having left default-imagepullbackoff-timeout at its default configuration, i.e. fail fast.
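
The gap between the two timestamps can be checked mechanically. The following is a small sketch, assuming GNU `date` (for the `-d` option); `elapsed_seconds` is a made-up helper name, and the timestamps are copied from the two status dumps in this report.

```shell
# Print the number of seconds between two RFC 3339 timestamps.
# Requires GNU date, which understands the -d option.
elapsed_seconds() {
  start_epoch=$(date -u -d "$1" +%s)
  end_epoch=$(date -u -d "$2" +%s)
  echo $(( end_epoch - start_epoch ))
}

# StartTime and CompletionTime of the inferred-entrypoint TaskRun.
elapsed_seconds 2024-10-31T10:25:35Z 2024-10-31T10:25:35Z
# StartTime and CompletionTime of the explicit-entrypoint TaskRun.
elapsed_seconds 2024-10-31T10:25:35Z 2024-10-31T10:26:17Z
```

This prints 0 for the inferred-entrypoint run and 42 for the explicit-entrypoint run. Against a live cluster, the two values could be read with something like `kubectl get taskrun <name> -o jsonpath='{.status.startTime} {.status.completionTime}'`.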

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.30.4+k3s1
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

v0.65.0
stephenc added the kind/bug label Oct 31, 2024
@vdemeester
Member

  • If the TaskRun step that uses the image pull secret requires inference of the entrypoint, then the TaskRun will fail immediately without waiting for the secret to be provisioned, ignoring any configured default-imagepullbackoff-timeout

  • If the TaskRun step that uses the image pull secret has the entrypoint explicitly stated, then the TaskRun (in Tekton v0.65) may allow additional back-off retry typically of 30-40s before failing the TaskRun if the external secret has not been provisioned or succeeding if the ExternalSecret controller has managed to get the secret provisioned

It makes sense (the failure, I mean), because if the entrypoint is not specified (using command: or script:), the controller will be the one trying to resolve the image first. And I guess it doesn't take default-imagepullbackoff-timeout into consideration at all 😓.

It is indeed an inconsistency we should fix.

cc @afrittoli

@stephenc
Author

stephenc commented Nov 4, 2024

The additional 30-40s grace also makes sense to me as just a timing issue: a reconcile gets in just ahead of the timeout, so the next check only happens 30-40s later. I think it could overcomplicate the code to anticipate the back-off expiration and schedule the re-reconcile for that moment in the absence of any other changes. So for that case I think a docs update on the config field, indicating that there can be additional grace, would be good.
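
The timing argument above can be illustrated with a toy model. This is not Tekton's reconciler code: `first_check_after_timeout` is a made-up helper and the check times are invented. It only shows that a TaskRun can be failed no earlier than the first check that runs at or after the timeout.

```shell
# Print the first check time (seconds after start) that is at or past
# the timeout. $1 is the timeout; the remaining arguments are the times
# at which a check runs, in ascending order.
first_check_after_timeout() {
  timeout=$1
  shift
  for t in "$@"; do
    if [ "$t" -ge "$timeout" ]; then
      echo "$t"
      return 0
    fi
  done
  return 1  # no check ran at or after the timeout
}

# Hypothetical check times: one check lands at 58s, just ahead of a 60s
# timeout, so the expiry is only noticed at the next check, at 92s.
first_check_after_timeout 60 0 1 14 42 58 92
```

With these invented numbers the run is failed at 92s rather than 60s, which is the kind of extra grace described above.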
