Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alerts/Github - is it possible to separate add additional metadata? #589

Closed
audunsolemdal opened this issue Jul 31, 2023 · 29 comments · Fixed by #1068
Closed

Alerts/Github - is it possible to separate add additional metadata? #589

audunsolemdal opened this issue Jul 31, 2023 · 29 comments · Fixed by #1068

Comments

@audunsolemdal
Copy link

So my goal is to run a github action step to check if fluxcd reconciliation for a kustomization towards a specificcluster was successful. There are three different kustomizations in three different clusters which may post reconcilation successful status to my commit on github. The kustomizations have the same exact name It does not seem like modifying the eventMetadata or summary fields in the Alert resource adds any extra metadata to the github status.

I guess a workaround is to name the kustomizations tstsp1-github-info-${cluster} rather than just tstsp1-github-info, but I am wondering if there are other methods to inject this metadata into the Github status? The description or context fields would be sufficient to modify in my use case.

Say I have a repo with the following structure

orgX/wl-tstsp1:

fluxcd/
  dev/
    deployment.yaml
  test/
    deployment.yaml
  prod/
    deployment.yaml

Each folder is connected to the namespace called tstsp1 in three different cluster (dev/test/prod)
From each of the clusters there is a alert updating the github status

---
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Provider
metadata:
  name: tstsp1-github
  namespace: flux-system
spec:
  type: github
  address: https://github.com/orgX/wl-tstsp1
  secretRef:
    name: gh-flux-pat
---
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Alert
metadata:
  name: "tstsp1-github-info"
  namespace: flux-system
spec:
  providerRef:
    name: tstsp1-github
  eventSeverity: info
  summary: "dev01"
  eventMetadata:
    cluster: dev01
  eventSources:
    - kind: Kustomization
      name: '*'
      namespace: tstsp1

I end up with a status looking somewhat like the below:

gh api /repos/orgX/wl-tstsp1/statuses/88efe1eeb3e2ad9724acc6ff7c06324f6e613a66
[
  {
    "url": "https://api.github.com/repos/orgX/wl-tstsp1/statuses/88efe1eeb3e2ad9724acc6ff7c06324f6e613a66",
    "avatar_url": "https://avatars.githubusercontent.com/u/18458716?v=4",
    "id": 24342874371,
    "node_id": "SC_kwDOHpAnes8AAAAFqvLJAw",
    "state": "success",
    "description": "reconciliation succeeded",
    "target_url": null,
    "context": "kustomization/tstsp1/025ecd77",
    "created_at": "2023-07-31T10:38:27Z",
    "updated_at": "2023-07-31T10:38:27Z",
    "creator": {
      ...
    }
  }
]   
@makkes
Copy link
Member

makkes commented Aug 1, 2023

The context field of the status is constructed from the involved object and the UID of the provider (.metadata.uid of each Provider object) to make it unique across clusters but I reckon you might want something more easily trackable to the cluster the status came from.

The commit status API accepts the parameters state, description, context and target_url. The first three are already used by Flux but maybe we can make the context configurable to include an additional field derived from Alert.spec.Summary (which ends up in the event metadata).

@audunsolemdal
Copy link
Author

I'll be happy with any solution that doesn't involve renaming my kustomizations per cluster.
I also tried setting

eventMetadata:
  context: ${cluster}

to no avail, my understanding is that certain fields as context cannot be overridden https://github.com/fluxcd/notification-controller/pull/529/files#diff-9c161f071659a6d134b4cfdd27ad66fc88d93b5dfb102d7bb5a768b9e9e483eaR81 as you mentioned they are used by flux already.

@audunsolemdal
Copy link
Author

Looks like this PR described the same use case #340

@makkes
Copy link
Member

makkes commented Aug 1, 2023

Looks like this PR described the same use case #340

I don't think a new field is necessary here as we can rather fetch the needed information from the Alert object's .spec.summary field as suggested here.

@somtochiama
Copy link
Member

maybe we can make the context configurable to include an additional

@makkes maybe we should include it in the description field instead?

@makkes
Copy link
Member

makkes commented Aug 1, 2023

maybe we can make the context configurable to include an additional

@makkes maybe we should include it in the description field instead?

Would be fine from my perspective.

@gdasson
Copy link
Contributor

gdasson commented Dec 29, 2023

@makkes : Please assign this issue to me. I can work on it.

@hirenko-v
Copy link

is there any workaround to achieve this?

@hirenko-v
Copy link

I see that ut uses uid of the provider in commit status like this:

kustomization/tf-controller/25c8a95e
is it possible to set or override generated uid 25c8a95e?

@makkes
Copy link
Member

makkes commented Jun 5, 2024

I see that ut uses uid of the provider in commit status like this:

kustomization/tf-controller/25c8a95e is it possible to set or override generated uid 25c8a95e?

No.

@hirenko-v
Copy link

@makkes
So basicalliy there is no way to have some additional data like cluster name in commit check except setting it in Kustomization(or another resource) name, right?

@makkes
Copy link
Member

makkes commented Jun 5, 2024

@makkes So basicalliy there is no way to have some additional data like cluster name in commit check except setting it in Kustomization(or another resource) name, right?

#589 (comment)

@devnev
Copy link

devnev commented Oct 2, 2024

@makkes the summary field is human-readable text, not an identifier. For this issue, it would be preferable to to change the context value of the github status to make it predictable for each cluster so that other CI steps can wait on the status to be set.

@stefanprodan
Copy link
Member

We've implemented RFC-0008 in Flux 2.5 which allow custom metadata like GitHub deployment ID to be injected in Flux events.

@audunsolemdal
Copy link
Author

After bumping to Flux 2.5 and testing it seems that this issue should be reopened. While it is now possible to add custom metadata to flux events, this issue is about adding extra metadata to alerts targetting Notification providers of type: github, although the issue title could be more clear on this

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  annotations:
    event.toolkit.fluxcd.io/context: mycustomcontext # ignored in github commit status
gh api /repos/myorg/myrepo/commits/f50c693/status
{
  "state": "success",
  "statuses": [
    {
      (...)
      *context": "kustomization/tstsp1/025ecd77", # unable to override this property
    }
  ],

I don't see how this can be done with the current Github commit status API limitations, other than allowing us to override the context parameter as mentioned in this comment #589 (comment)

@matheuscscp matheuscscp reopened this Feb 27, 2025
@matheuscscp
Copy link
Member

matheuscscp commented Feb 27, 2025

One way to implement this would be an optional field in the Provider API called spec.commitStatusIDExpr, a CEL expression to compose the commit status ID for commit status providers from the event metadata, and also possibly the involved Alert and Provider objects, e.g.:

...
spec:
  commitStatusIDExpr: "'my-kustomization/' + event.metadata.foo + '/' + alert.metadata.name + '/' + provider.metadata.name"

Here event.metadata would already contain the custom fields injected via the implementation of RFC 0008. Would this suffice for your use case? Also you would need to be careful with how you compose this, right now it's unique per (providerUID, event.involvedObject).

@audunsolemdal
Copy link
Author

Here event.metadata would already contain the custom fields injected via the implementation of RFC 0008. Would this suffice for your use case?

Yes, I believe so. If I understand you correctly we could then make the end result to be be changed in a manner like this?

"context": "kustomization/my-kustomization/025ecd77"
-->
"context": "kustomization/my-kustomization/foo-event-metadata/025ecd77"

@matheuscscp
Copy link
Member

Exactly, the expression for that would be:

(event.involvedObject.kind + '/' + event.involvedObject.name + '/' + event.metadata.foo + '/' + provider.metadata.uid.split('-')[0]).lowerAscii()

See my CEL Playground test.

@audunsolemdal
Copy link
Author

Exactly, the expression for that would be:

(event.involvedObject.kind + '/' + event.involvedObject.name + '/' + event.metadata.foo + '/' + provider.metadata.uid.split('-')[0]).lowerAscii()

See my CEL Playground test.

Yes, this would suffice for the use case described in this issue.

@kathleenfrench
Copy link
Contributor

kathleenfrench commented Mar 11, 2025

@matheuscscp i'd be happy to help work on this, but was curious about what this would mean for updates to the provider API -- would we want to add a status subresource to the provider type with owned conditions to reconcile it similar to the receiver resource and its validating of CEL expressions? or just emit warnings if the expression encounters errors at runtime when an alert fires w/ a compatible provider (that uses the commit status util func) and fall back to the existing behavior without reflecting it in the conditions?

also, opened this: fluxcd/pkg#882

@matheuscscp
Copy link
Member

@kathleenfrench That is a great design question, it turns out that we recently went through the process of cleaning up the reconcilers for Alert and Provider because at the time it didn't seem that those APIs needed any more state. But those controllers are still there even if without any custom/special API logic, so we can more easily go back to having some API logic there if it makes sense, I personally don't see a problem with that.

It totally makes sense to me to do the same as what we did in Receiver, because we can't benefit for example from the Kubernetes API server built-in CEL validations in the CRD to check if this expression is valid, I don't think CEL has the ability to compile CEL, at least not yet. So we can't go with the ideal approach of blocking creations/updates of objects entirely by validating the expression at the time when the object is sent to the Kubernetes API. We are left with this async validation through the controller. At least this gives some good feedback to users when the expression is invalid, better than unseen error/warning logs in the controller, because we won't stop sending the event to other matching Alert/Provider objects that are correctly configured and fail catastrophically.

I'd be for reintroducing the status subresource for Provider. This is the status for Receiver:

// ReceiverStatus defines the observed state of the Receiver.
type ReceiverStatus struct {
	meta.ReconcileRequestStatus `json:",inline"`

	// Conditions holds the conditions for the Receiver.
	// +optional
	Conditions []metav1.Condition `json:"conditions,omitempty"`

	// WebhookPath is the generated incoming webhook address in the format
	// of '/hook/sha256sum(token+name+namespace)'.
	// +optional
	WebhookPath string `json:"webhookPath,omitempty"`

	// ObservedGeneration is the last observed generation of the Receiver object.
	// +optional
	ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}

For Provider we obviously don't need WebhookPath but I think we need all the rest. Not entirely sure about ObservedGeneration though, as that field is supposed to be in the Conditions, but we do have a standard in Flux where we initialize .status.observedGeneration with -1 to work well with kstatus, so I'll let @stefanprodan comment on that.

In the event server handler we can drop the alert sooner by looking at the condition like we did for Receiver here:

if receiver.Spec.Suspend || !conditions.IsReady(&receiver) {

This would avoid evaluating the expression if we already know it's invalid from the object status.

We also need flux reconcile alert-provider in the CLI if we go with this.

And finally I think it would also be nice to add Provider to the allowed kinds in the .spec.resources[].kind field of the Receiver API to allow for reconciling the Provider validation on webhooks as well, but this is tricky because the type CrossNamespaceObjectReference is shared with the Alert API for the field .spec.eventSources, and I don't know if there's harm on doing this, from the 13 Flux APIs the only missing APIs here are the notification ones.

@stefanprodan does any of this make sense?

@kathleenfrench
Copy link
Contributor

kathleenfrench commented Mar 11, 2025

all of that makes sense to me, though i do have a question about:

because we can't benefit for example from the Kubernetes API server built-in CEL validations in the CRD to check if this expression is valid, I don't think CEL has the ability to compile CEL, at least not yet. So we can't go with the ideal approach of blocking creations/updates of objects entirely by validating the expression at the time when the object is sent to the Kubernetes API.

we probably do want a mechanism for blocking creation/updates entirely based on certain known conditions -- i've not worked with CEL previously, but my understanding is we do need knowledge of the variables that will be evaluated in order to compile the expression, which means won't we need to explicitly limit what can be referenced in the expression itself? i.e. event would be supported, but in your example using provider we would need to also forward provider.GetObjectMeta() to notifier.NewFactory in order to access that metadata (not currently implemented, we just send string(providerUID))

so i think my first question is should event and provider (metadata) be the the only two supported variables? should it just be event? the downside to just using event is we can't enforce a unique ctx/status that way (as is done via the provider UID now), which puts the onus on the end-user to ensure they're supplying contextually-unique statuses for their purposes. one other measure could be just superficially appending the truncated provider UID to any custom status, but i suspect that runs counter to the point so it might just be something that needs to be called out explicitly in documentation. i also saw a comment that included alert. but i think ultimately what people are most interested in is what can be pulled from the event metadata.

We are left with this async validation through the controller.

from my reading even this is fairly limited, though, no? we can go as far as validating that the expression compiles, but if -- for instance -- someone uses event.metadata.mycustomfield in their expression and that field is not forwarded in the event due to some misconfiguration, then that wouldn't be caught by validations that can be performed by the provider reconciliation so overall readiness wouldn't be representative as it would only error at runtime -- so for those instances is that best handled by just falling back to existing behavior and logging? it seems like trying to reflect that kind of issue on the conditions of the provider would be a bit fraught, and i'm not sure if dropping alerts is desirable behavior either if the alternative is falling back to the status quo.

@matheuscscp
Copy link
Member

matheuscscp commented Mar 11, 2025

What I meant by CEL compiling CEL is about these XValidations here, which are CEL expressions:

https://github.com/fluxcd/source-controller/blob/9dedcede9d47d925caea7251e8ad073c5bb93dc1/api/v1/bucket_types.go#L49-L53

These annotations in the spec struct generate these in the CRD:

https://github.com/fluxcd/source-controller/blob/9dedcede9d47d925caea7251e8ad073c5bb93dc1/config/crd/bases/source.toolkit.fluxcd.io_buckets.yaml#L219-L234

I meant we can't have one of those CEL expressions in the struct annotations to validate .spec.commitStatusIDExpr, I don't think there's a CEL expression that is equivalent to the Go code we are about to write for compiling .spec.commitStatusIDExpr.

we do need knowledge of the variables that will be evaluated

It depends on the use case. For example, in the Receiver case we know all the top-level variables we will feed into the data that will be evaluated by the expression, i.e. req and res, like this:

func newFilterExpression(s string) (*cel.Expression, error) {
return cel.NewExpression(s,
cel.WithCompile(),
cel.WithOutputType(types.BoolType),
cel.WithStructVariables("res", "req"))
}

But in the Custom CEL Health Checks feature we just shipped in kustomize-controller unfortunately we don't know all the fields a custom resource could have, so we build the expression without WithStructVariables() (and consequently also without WithCompile(), and because of that also without WithOutputType()) because if the expression references a variable that was not declared, the compilation will fail. That's unfortunate because then we can't predict the output type, and the validation of the expression is not the best, in a case like this you will only know if the expression is good when you evaluate it with some specific data input. See this:

https://github.com/fluxcd/pkg/blob/8b1f852228b117123efcc7b5708ce112801416b6/runtime/cel/status_evaluator.go#L38-L47

But in the case of .spec.commitStatusIDExpr we know all the inputs like in Receiver, the inputs for this use case are:

  • event
  • alert
  • provider

So I think we can probably build the expression with something like this:

func newCommitStatusIDExpression(s string) (*cel.Expression, error) {
	return cel.NewExpression(s,
		cel.WithCompile(),
		cel.WithOutputType(types.StringType),
		cel.WithStructVariables("event", "alert", "provider"))
}

we would need to also forward provider.GetObjectMeta() to notifier.NewFactory in order to access that metadata

Btw because of this comment I noticed that we need to swap the order of these lines:

sender, token, err := createNotifier(ctx, s.kubeClient, provider)
if err != nil {
return nil, nil, "", 0, fmt.Errorf("failed to initialize notifier for provider '%s': %w", provider.Name, err)
}
notification := *event.DeepCopy()
s.combineEventMetadata(ctx, &notification, alert)

combineEventMetadata() has to run before we create the notifier and the resulting event has to be forwarded as well. So an alternative would be building the commit status ID outside if specified and then only forward it instead of these two new inputs to the notifier factory.

Passing the built commit status ID when present could also be an opportunity to refactor the notifier factory constructor, I really don't like this constructor taking all these inputs at the same time the way it does, not all of them are present i.e. some are actually optional. Explicit arguments are for mandatory parameters and not optional ones. Then inside createNotifier() we declare a slice of options at the beginning and then go checking all the API fields and conditionally add the options appropriately.

i also saw a comment that included alert. but i think ultimately what people are most interested in is what can be pulled from the event metadata.

Fair point! But at the place of the code where we should compile+evaluate the expression all the three objects are easily available i.e. inside the function getNotificationParams() I referenced above, so it should be easy to make all the three involved objects available to the expression.

from my reading even this is fairly limited, though, no? we can go as far as validating that the expression compiles, but if -- for instance -- someone uses event.metadata.mycustomfield in their expression and that field is not forwarded in the event due to some misconfiguration, then that wouldn't be caught by validations that can be performed by the provider reconciliation so overall readiness wouldn't be representative as it would only error at runtime -- so for those instances is that best handled by just falling back to existing behavior and logging? it seems like trying to reflect that kind of issue on the conditions of the provider would be a bit fraught, and i'm not sure if dropping alerts is desirable behavior either if the alternative is falling back to the status quo.

You're right, the validation wouldn't be perfect anyway, but I think it still gives a bit of a head start, and also we did this already in three places:

  • Receiver resource filter expression
  • Kustomization custom health checks
  • flux-operator's ResourceSet dependency check expression

So for consistency it would make sense to do the same for Provider as well

@matheuscscp
Copy link
Member

i'm not sure if dropping alerts is desirable behavior either if the alternative is falling back to the status quo.

This is a fair point btw, I'm really unsure about what's best here, would folks in this thread prefer the status quo if the expression evaluation results in an error? We would log it as an error anyway of course, but maybe it's easier to see a commit status missing than a commit status with wrong ID?

@audunsolemdal What would you prefer?

@matheuscscp
Copy link
Member

I usually prefer failing catastrophically rather than making the best effort, to force people to fix their mistakes sooner rather than later

@kathleenfrench
Copy link
Contributor

@matheuscscp given the scope of changes to the provider api does this warrant a new v1beta4 package? or can changes target v1beta3?

@matheuscscp
Copy link
Member

Maybe, there's also other changes in Provider we are considering for the next minor release that could also warrant v1beta4 or even v1, but you don't need to worry about that in your PR, we will do this API version bump later if needed 👍

@matheuscscp
Copy link
Member

matheuscscp commented Mar 12, 2025

Hey @kathleenfrench we discussed this in the flux dev meeting today and reached two conclusions:

  • We don't need the status part. We could fail to deliver the alert for a number of other reasons and we handle all of them the same way, with an error log at the server entrypoint, so we really just need to bubble up the error with a good error message and that should be enough. This eliminates a lot of complexity in the implementation, we don't need flux reconcile alert-provider anymore, and all the discussion about the allowed kinds is also gone.
  • The name of the field should be commitStatusExpr, the ID part makes it not so good.

By the way this also settles the discussion around dropping vs using a fallback, we should simply drop the alert if the configuration is wrong, we error out and let the error log be printed, that's it. Also the field is optional, so users that don't use it obviously won't be affected, so it's all good

@kathleenfrench
Copy link
Contributor

sounds good -- i'll get to work on something 👍🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
9 participants