Adding ObservedGeneration in the CRD #650

Open
Shashankft9 opened this issue Oct 30, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@Shashankft9

Currently, it's hard to tell whether the SolrCloud deployment has changed from state "x" to state "y", or whether it has changed at all, after I make changes in the CR.
Perhaps if we added an observedGeneration field and/or something like lastTransitionTime, the CR could tell more about the current state of the SolrCloud? This would be a good basis for a requirement we have, where we track the transition of Solr from state "x" to state "y".

Expected status:

status:
  backupRestoreReady: false
  internalCommonAddress: http://test-solrcloud-common.solr-cluster
  podSelector: solr-cloud=test,technology=solr-cloud
  readyReplicas: 3
  replicas: 3
  **observedGeneration: 1**
  solrNodes:
  - internalAddress: http://test-solrcloud-0.test-solrcloud-headless.solr-cluster:8983
    name: test-solrcloud-0
    nodeName: gccdelinttexw2
    ready: true
    specUpToDate: true
    version: 9.2.0
    **lastTransitionTime: 2023-10-27T12:53:56.989Z**
  - internalAddress: http://test-solrcloud-1.test-solrcloud-headless.solr-cluster:8983
    name: test-solrcloud-1
    nodeName: gccdelinttexw3
    ready: true
    specUpToDate: true
    version: 9.2.0
    **lastTransitionTime: 2023-10-27T12:54:56.989Z**
  - internalAddress: http://test-solrcloud-2.test-solrcloud-headless.solr-cluster:8983
    name: test-solrcloud-2
    nodeName: gccdelinttexw1
    ready: true
    specUpToDate: true
    version: 9.2.0
    **lastTransitionTime: 2023-10-27T12:55:56.989Z**
  upToDateNodes: 3
  version: 9.2.0
  zookeeperConnectionInfo:
    chroot: /test
    externalConnectionString: N/A
    internalConnectionString: test-solrcloud-zookeeper-0.test-solrcloud-zookeeper-headless.solr-cluster.svc.cluster.local:2181,test-solrcloud-zookeeper-1.test-solrcloud-zookeeper-headless.solr-cluster.svc.cluster.local:2181,test-solrcloud-zookeeper-2.test-solrcloud-zookeeper-headless.solr-cluster.svc.cluster.local:2181
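
For context, a rough sketch of how these proposed fields could be declared on the status types (this is not the actual Solr Operator code; the type names, field names, and markers below just follow common kubebuilder conventions):

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type SolrCloudStatus struct {
    // ...existing fields...

    // ObservedGeneration is the most recent metadata.generation that the
    // operator has reconciled into the resources it manages.
    // +optional
    ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}

type SolrNodeStatus struct {
    // ...existing fields...

    // LastTransitionTime records when this node last changed its readiness
    // or spec-up-to-date state.
    // +optional
    LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty"`
}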
@gerlowskija
Contributor

Hey @Shashankft9, @RavinaChidambaram - can you elaborate a little on your broader use case? What sort of changes/states are you trying to monitor that are tough to track with the existing 'status'?

@Shashankft9
Author

Hello @gerlowskija, we use the status of the SolrCloud CR as the source of truth for notifying the provisioner system (UI) that "whatever was put in the CR" has been "successfully deployed", so that the provisioner system gets an ack.
Not just for Solr - we use the same mechanism for other operator-based services: databases, streaming, and more.

In this case, without observedGeneration, it's hard for our automation running in the Kubernetes cluster to identify whether a patch made to the SolrCloud has actually been deployed in the running instance, which stops our automation from sending an ack signal to the provisioner system.
In the same way, lastTransitionTime adds another layer of proof in the status about what's going on at the CR and operator level - I believe this is not implemented in the PR yet, but if you agree with the intention here, we can add that as well!

@gerlowskija
Contributor

it's hard for our automation running in the Kubernetes cluster to identify whether a patch made to the SolrCloud has actually been deployed in the running instance

I guess this is the bit I'm particularly curious about. The main thing I can think of that wouldn't appear in the status would be pod or statefulset-level settings like env-vars, resources, etc - is that representative of the sort of things you don't have a good way to poll for? Or is there another common example?

My worry I guess is that the observedGeneration/lastTransitionTime approach doesn't give enough granularity to be useful in practice. It wouldn't give any way to distinguish between multiple resource changes made in quick succession. Or between a truly substantive "status" update and a (e.g.) pod-readiness blip (which also impacts SolrCloudStatus afaict). Even for a substantive update, many spec changes require multiple steps and update "status" multiple times throughout - so in practice there can be a big gap between "the status changed" and "the user-requested change is complete".

It feels like observedGeneration might send a wrong (or at least incomplete?) signal to your provisioner/UI much of the time. Have you thought much about those scenarios? Is there maybe something I'm missing?

I'm not against the approach btw - I just want to make sure it'll work before we build.

@Shashankft9
Author

I hear your concern; maybe I can give some examples of status in other CRs that help explain where I am coming from.

Example 1:
The status of a Strimzi Kafka CR looks something like this:

status:
  clusterId: ETqrvQ33S9uL5gshnirFPQ
  conditions:
  - lastTransitionTime: "2024-09-27T07:23:13.389Z"
    message: Exceeded timeout of 300000ms while waiting for Deployment resource dev-hsp-01-entity-operator
      in namespace hsp-01 to be ready
    reason: TimeoutException
    status: "True"
    type: NotReady
  listeners:
  - addresses:
    - host: hsp-01.sample.com
      port: 30442
    bootstrapServers: hsp-01.sample.com:30442
    name: external
    type: external
  observedGeneration: 1

Now, the idea is that this observedGeneration should always match metadata.generation. Rather than giving granular insight, it gives me an overall view: when I make a change in the spec (which increments the metadata.generation value), I can see that the operator worked on it and, in this case, surfaced an error - at any point in time T, I can verify that the operator actually acted on the last CR change by comparing the two values. Of course, it doesn't give me all the details of what changed in between, but it does tell me whether, if in one edit I changed values x, y, and z, the operator successfully processed them all or not.
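
For illustration, a consumer could make that comparison roughly like this (just a sketch using the Kubernetes unstructured helpers; not specific to any one operator):

import (
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// specProcessed reports whether the controller has acted on the latest spec,
// i.e. status.observedGeneration has caught up with metadata.generation.
func specProcessed(u *unstructured.Unstructured) bool {
    observed, found, err := unstructured.NestedInt64(u.Object, "status", "observedGeneration")
    if err != nil || !found {
        return false
    }
    return observed == u.GetGeneration()
}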

Example 2:
We can look at something more complex like a Knative Service CR status, where a change in the spec actually causes the Knative controllers to update 4-5 other CRs in the cluster, which is all captured in a status like this:

status:
  address:
    url: http://healthchecks.metacontroller.svc.cluster.local
  conditions:
  - lastTransitionTime: "2024-03-14T09:31:36Z"
    status: "True"
    type: ConfigurationsReady
  - lastTransitionTime: "2024-03-14T09:31:37Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-03-14T09:31:37Z"
    status: "True"
    type: RoutesReady
  latestCreatedRevisionName: healthchecks-00006
  latestReadyRevisionName: healthchecks-00006
  observedGeneration: 6
  traffic:
  - latestRevision: true
    percent: 100
    revisionName: healthchecks-00006
  url: http://healthchecks.metacontroller.svc.cluster.local

Now, looking at this status, I know the Knative controllers worked on the last CR spec change, because the observedGeneration is equal to what's in metadata.generation.

I may not be able to articulate it well, but I recently saw a good talk on Kubernetes CRD status design; just the first few minutes, where Scott covers the TL;DR, might help - https://youtu.be/iNp-fsffIgQ?t=70

is that representative of the sort of things you don't have a good way to poll for?

We wanted to make the "ack-to-provisioner" process event-driven instead of polling, because if we go the polling route we will have to write custom logic for each operator-based service, versus implementing something at the CR status level, which will save us a lot of code maintenance since most of the other operators already follow this pattern, I think.

It feels like observedGeneration might send a wrong (or at least incomplete?) signal to your provisioner/UI much of the time. Have you thought much about those scenarios? Is there maybe something I'm missing?

Yes, we have thought about this. To handle it, we have a check where we only send the ack for success scenarios right now, specifically when a condition like this becomes true:
are all the condition statuses true && is the observedGeneration equal to metadata.generation && is the observedGeneration greater than the last observedGeneration for which we sent an ack
This takes care of scenarios where the instance is created for the first time and/or when it is resized.
Now, this logic doesn't really use lastTransitionTime, but I added it to the issue because, generally speaking, the timestamp helps with other day-to-day operational activities.
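
For illustration, that gate could look roughly like this (a sketch only, assuming the CR exposes metav1.Condition-style conditions and that the caller tracks the last generation it acked):

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// shouldAck mirrors the check above: every condition reports True, the
// operator has observed the latest spec, and this generation has not been
// acked yet. Parameter names are illustrative.
func shouldAck(generation, observedGeneration, lastAckedGeneration int64, conditions []metav1.Condition) bool {
    for _, c := range conditions {
        if c.Status != metav1.ConditionTrue {
            return false
        }
    }
    return observedGeneration == generation &&
        observedGeneration > lastAckedGeneration
}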

Apologies if I have strayed from what you actually asked.

@gerlowskija
Contributor

gerlowskija commented Jan 6, 2025

Not at all - I appreciate the very detailed response @Shashankft9! The YT link was especially helpful; it made a good starting point for me. Thanks for your patience while I read up on this stuff!

I think I was initially under the misapprehension that 'observedGeneration' was "just" a counter that would increase at each "status" update. Having it driven by the metadata.Generation value seen on existing objects makes much more sense and alleviates many of my concerns.

It's also clearly a best practice, given that it's in StatefulSet and other native resource types.
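
For what it's worth, the producer side of this pattern is just copying metadata.generation into status at the end of a successful reconcile. A minimal sketch with controller-runtime (the import path and the ObservedGeneration field below are assumptions; the field is exactly what this issue proposes to add, so this only compiles once it exists):

import (
    "context"

    solrv1beta1 "github.com/apache/solr-operator/api/v1beta1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// recordObservedGeneration marks the spec revision that was just reconciled.
// metadata.generation is bumped by the API server on every spec change, so
// copying it into status is the "this spec has been acted on" signal that
// consumers compare against.
func recordObservedGeneration(ctx context.Context, c client.Client, sc *solrv1beta1.SolrCloud) error {
    sc.Status.ObservedGeneration = sc.Generation // field proposed in this issue
    return c.Status().Update(ctx, sc)
}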

I'm +1 to the approach around observedGeneration, now that I've got a better grasp of things. No concerns with lastTransitionTime either, so long as folks don't misunderstand it as being sufficient. I'll review the PR shortly and we can continue discussion there.
