Expected Behavior
When I apply a single YAML file that includes an ExternalSecret and a TaskRun that uses the secret to be created by that ExternalSecret, the status of the TaskRun should be deterministic, especially if default-imagepullbackoff-timeout is left at its default value of 0.
When default-imagepullbackoff-timeout is left at its default value of 0, the behavior should be consistent: either entrypoint inference should also fail only after allowing a 30-40s back-off when the TaskRun specifies image pull secrets and some of those secrets do not yet exist, or, when entrypoint inference is not required, the initial image pull failure should result in an immediate failure.
When default-imagepullbackoff-timeout is set to a non-zero value, entrypoint inference should not fail immediately but should retry up to the configured timeout, especially when the TaskRun includes a pod template that specifies image pull secrets and some of those secrets have not yet been created. When entrypoint inference is not required, the back-off can take up to 30s longer than the configured value in my observations.
At a minimum, the documentation for default-imagepullbackoff-timeout should mention that there may be around 30s of additional grace, though it would be best if everything behaved consistently, ideally respecting the configured back-off.
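For readers who want to try the non-zero case, here is a minimal sketch of how the setting might be applied. It assumes a default install, where the option lives in the config-defaults ConfigMap in the tekton-pipelines namespace. Both names are assumptions about the cluster, not something stated in this report.

```yaml
# Sketch only: assumes the default Tekton install layout.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults        # assumed ConfigMap name
  namespace: tekton-pipelines  # assumed install namespace
data:
  # "0" (the default) means fail fast; a duration such as "60s"
  # asks Tekton to tolerate ImagePullBackOff for roughly that long.
  default-imagepullbackoff-timeout: "60s"
```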
Actual Behavior
If the TaskRun step that uses the image pull secret requires inference of the entrypoint, the TaskRun fails immediately without waiting for the secret to be provisioned, ignoring any configured default-imagepullbackoff-timeout.
If the TaskRun step that uses the image pull secret states its entrypoint explicitly, the TaskRun (in Tekton v0.65) may allow additional back-off retries, typically 30-40s, before either failing because the external secret has not been provisioned or succeeding because the ExternalSecret controller managed to provision the secret in time.
For example, the following task runs were all created at the same time in a Tekton cluster with default-imagepullbackoff-timeout: 60s
Notice how the two task runs that infer the entrypoint both fail immediately and ignore the image pull back-off. The two task runs with an explicit entrypoint fail at least 60s after starting, but this can be a further 30-40s later.
For example, in this case the completion time of the two explicit-entrypoint task runs was approx 60s after start.
But I have also had cases where the TaskRun that does not require an image pull secret took >90s while the one that required the image pull secret took ~60s
Steps to Reproduce the Problem
Note this is not a full reproducer as I am simplifying down from a more complex case. For the issue as described above you should be able to do something similar to this (or exclude external secrets and just manually create the secrets less than 30s after applying the TaskRun):
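A minimal sketch of such a manifest follows. It is reconstructed from the TaskRun name, image and step name visible in the status output in this report, not copied from the original reproducer. The secret name registry-credentials, the store my-secret-store and the remote key docker-config-example are hypothetical placeholders, and the ExternalSecret API version may differ with your External Secrets Operator release.

```yaml
# Hypothetical reproducer sketch; names marked "hypothetical" are placeholders.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: registry-credentials          # hypothetical
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: my-secret-store             # hypothetical
  target:
    name: registry-credentials        # Secret that the TaskRun references
    template:
      type: kubernetes.io/dockerconfigjson
      data:
        .dockerconfigjson: "{{ .dockerconfig | toString }}"
  data:
    - secretKey: dockerconfig
      remoteRef:
        key: docker-config-example    # hypothetical
---
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: infer-entrypoint-external-secret
spec:
  podTemplate:
    imagePullSecrets:
      - name: registry-credentials    # does not exist yet at apply time
  taskSpec:
    steps:
      # No command or script, so the controller must resolve the image
      # itself to infer the entrypoint.
      - name: print-message
        image: my-registry.example.com/some-image:latest
```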
Almost always this will result in a TaskRun that fails immediately, though if you are lucky, and depending on the latency to the secret store, you may get a successful run. The time between the StartTime and the CompletionTime is typically minimal, e.g. I got:
Status:
  Completion Time: 2024-10-31T10:25:35Z
  Conditions:
    Last Transition Time: 2024-10-31T10:25:35Z
    Message: failed to create task run pod "infer-entrypoint-external-secret": translating TaskSpec to Pod: Get "https://my-registry.example.com/v2/": dial tcp: lookup my-registry.example.com on 10.43.0.10:53: no such host. Maybe invalid TaskSpec
    Reason: PodCreationFailed
    Status: False
    Type: Succeeded
  Pod Name:
  Provenance:
    Feature Flags:
      Await Sidecar Readiness: true
      Coschedule: workspaces
      Disable Affinity Assistant: false
      Disable Creds Init: false
      Disable Inline Spec:
      Enable API Fields: beta
      Enable Artifacts: false
      Enable CEL In When Expression: false
      Enable Concise Resolver Syntax: false
      Enable Keep Pod On Cancel: false
      Enable Kubernetes Sidecar: false
      Enable Param Enum: false
      Enable Provenance In Status: true
      Enable Step Actions: false
      Enforce Nonfalsifiability: none
      Max Result Size: 4096
      Require Git SSH Secret Known Hosts: false
      Result Extraction Method: termination-message
      Running In Env With Injected Sidecars: true
      Send Cloud Events For Runs: false
      Set Security Context: false
      Verification No Match Policy: ignore
  Start Time: 2024-10-31T10:25:35Z
  Task Spec:
    Steps:
      Compute Resources:
      Image: my-registry.example.com/some-image:latest
      Name: print-message
Events:
  Type     Reason         Age                From     Message
  ----     ------         ----               ----     -------
  Normal   Started        24s (x2 over 24s)  TaskRun
  Warning  Failed         24s (x2 over 24s)  TaskRun  failed to create task run pod "infer-entrypoint-external-secret": translating TaskSpec to Pod: Get "https://my-registry.example.com/v2/": dial tcp: lookup my-registry.example.com on 10.43.0.10:53: no such host. Maybe invalid TaskSpec
  Warning  InternalError  24s (x2 over 24s)  TaskRun  1 error occurred:
           * failed to create task run pod "infer-entrypoint-external-secret": translating TaskSpec to Pod: Get "https://my-registry.example.com/v2/": dial tcp: lookup my-registry.example.com on 10.43.0.10:53: no such host. Maybe invalid TaskSpec
Notice how the difference between the StartTime 2024-10-31T10:25:35Z and the CompletionTime 2024-10-31T10:25:35Z is negligible (in the worst case I have observed a 1s difference between the two).
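For comparison, here is a sketch of an explicit-entrypoint variant. It is inferred from the TaskRun name and the Command and Args fields that appear in the second status output in this report, not taken from the original manifest. The secret name registry-credentials is a hypothetical placeholder.

```yaml
# Hypothetical sketch of the explicit-entrypoint case.
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: explicit-entrypoint-external-secret
spec:
  podTemplate:
    imagePullSecrets:
      - name: registry-credentials    # hypothetical; not yet provisioned
  taskSpec:
    steps:
      # command is set, so the controller does not need to resolve the
      # image to infer an entrypoint; the kubelet performs the pull.
      - name: print-message
        image: my-registry.example.com/some-image:latest
        command: ["echo"]
        args: ["Hello, world!"]
```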
On the other hand, with the entrypoint stated explicitly, on Tekton v0.65.0 you get something like this (in the case where I have rigged the ExternalSecret to have an issue provisioning the secret):
Status:
  Completion Time: 2024-10-31T10:26:17Z
  Conditions:
    Last Transition Time: 2024-10-31T10:26:17Z
    Message: the step "print-message" in TaskRun "explicit-entrypoint-external-secret" failed to pull the image "". The pod errored with the message: "Back-off pulling image "my-registry.example.com/some-image:latest"."
    Reason: TaskRunImagePullFailed
    Status: False
    Type: Succeeded
  Pod Name: explicit-entrypoint-external-secret-pod
  Provenance:
    Feature Flags:
      Await Sidecar Readiness: true
      Coschedule: workspaces
      Disable Affinity Assistant: false
      Disable Creds Init: false
      Disable Inline Spec:
      Enable API Fields: beta
      Enable Artifacts: false
      Enable CEL In When Expression: false
      Enable Concise Resolver Syntax: false
      Enable Keep Pod On Cancel: false
      Enable Kubernetes Sidecar: false
      Enable Param Enum: false
      Enable Provenance In Status: true
      Enable Step Actions: false
      Enforce Nonfalsifiability: none
      Max Result Size: 4096
      Require Git SSH Secret Known Hosts: false
      Result Extraction Method: termination-message
      Running In Env With Injected Sidecars: true
      Send Cloud Events For Runs: false
      Set Security Context: false
      Verification No Match Policy: ignore
  Start Time: 2024-10-31T10:25:35Z
  Steps:
    Container: step-print-message
    Name: print-message
    Terminated:
      Exit Code: 1
      Finished At: 2024-10-31T10:26:17Z
      Message: Step print-message terminated as pod explicit-entrypoint-external-secret-pod is terminated
      Reason: TaskRunImagePullFailed
      Started At: <nil>
    Termination Reason: TaskRunImagePullFailed
  Task Spec:
    Steps:
      Args:
        Hello, world!
      Command:
        echo
      Compute Resources:
      Image: my-registry.example.com/some-image:latest
      Name: print-message
Events:
  Type     Reason           Age                  From     Message
  ----     ------           ----                 ----     -------
  Normal   Started          2m49s                TaskRun
  Normal   Pending          2m49s                TaskRun  Pending
  Normal   Pending          2m49s                TaskRun  pod status "PodReadyToStartContainers":"False"; message: ""
  Normal   Pending          2m48s                TaskRun  pod status "Ready":"False"; message: "containers with unready status: [step-print-message]"
  Normal   PullImageFailed  2m35s                TaskRun  build step "step-print-message" is pending with reason "failed to pull and unpack image \"my-registry.example.com/some-image:latest\": failed to resolve reference \"my-registry.example.com/some-image:latest\": failed to do request: Head \"https://my-registry.example.com/v2/some-image/manifests/latest\": dial tcp: lookup my-registry.example.com: no such host"
  Normal   PullImageFailed  2m7s                 TaskRun  build step "step-print-message" is pending with reason "Back-off pulling image \"my-registry.example.com/some-image:latest\""
  Warning  Failed           2m7s (x2 over 2m7s)  TaskRun  the step "print-message" in TaskRun "explicit-entrypoint-external-secret" failed to pull the image "". The pod errored with the message: "Back-off pulling image "my-registry.example.com/some-image:latest"."
  Warning  InternalError    2m6s (x2 over 2m6s)  TaskRun  pods "explicit-entrypoint-external-secret-pod" not found
Of note here is that the StartTime 2024-10-31T10:25:35Z is approx 40s before the CompletionTime 2024-10-31T10:26:17Z because Tekton has allowed a back-off on the image pull. This is despite my having left default-imagepullbackoff-timeout at its default configuration, i.e. fail fast.
Additional Info
Kubernetes version:
Output of kubectl version:
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.30.4+k3s1
Tekton Pipeline version:
Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'
v0.65.0
If the TaskRun step that uses the image pull secret requires inference of the entrypoint, then the TaskRun will fail immediately without waiting for the secret to be provisioned, ignoring any configured default-imagepullbackoff-timeout
If the TaskRun step that uses the image pull secret has the entrypoint explicitly stated, then the TaskRun (in Tekton v0.65) may allow additional back-off retry typically of 30-40s before failing the TaskRun if the external secret has not been provisioned or succeeding if the ExternalSecret controller has managed to get the secret provisioned
It makes sense (the failure, I mean), because if the entrypoint is not specified (using command: or script:), the controller will be the one trying to resolve the image first. And I guess it does not take default-imagepullbackoff-timeout into consideration at all 😓.
The additional 30-40s grace also makes sense to me as just a timing issue: the reconcile gets in just ahead of the timeout, so the next check is 30-40s later. I think it could overcomplicate the code to anticipate the back-off expiration and schedule the re-reconcile for then in the absence of any other changes, so for that case I think a docs update on the config field to indicate that there can be additional grace would be good.