Adding ObservedGeneration in the CRD #650

Open
Shashankft9 opened this issue Oct 30, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@Shashankft9

Currently, it's hard to tell whether the SolrCloud deployment has changed from state "x" to state "y", or whether it has changed at all, after I make changes in the CR.
Perhaps if we added an observedGeneration field and/or something like lastTransitionTime, the CR could tell more about the current state of the SolrCloud? This would be a good basis for a requirement we have, where we track the transition of Solr from state "x" to state "y".

Expected status:

status:
  backupRestoreReady: false
  internalCommonAddress: http://test-solrcloud-common.solr-cluster
  podSelector: solr-cloud=test,technology=solr-cloud
  readyReplicas: 3
  replicas: 3
  **observedGeneration: 1**
  solrNodes:
  - internalAddress: http://test-solrcloud-0.test-solrcloud-headless.solr-cluster:8983
    name: test-solrcloud-0
    nodeName: gccdelinttexw2
    ready: true
    specUpToDate: true
    version: 9.2.0
    **lastTransitionTime: 2023-10-27T12:53:56.989Z**
  - internalAddress: http://test-solrcloud-1.test-solrcloud-headless.solr-cluster:8983
    name: test-solrcloud-1
    nodeName: gccdelinttexw3
    ready: true
    specUpToDate: true
    version: 9.2.0
    **lastTransitionTime: 2023-10-27T12:54:56.989Z**
  - internalAddress: http://test-solrcloud-2.test-solrcloud-headless.solr-cluster:8983
    name: test-solrcloud-2
    nodeName: gccdelinttexw1
    ready: true
    specUpToDate: true
    version: 9.2.0
    **lastTransitionTime: 2023-10-27T12:55:56.989Z**
  upToDateNodes: 3
  version: 9.2.0
  zookeeperConnectionInfo:
    chroot: /test
    externalConnectionString: N/A
    internalConnectionString: test-solrcloud-zookeeper-0.test-solrcloud-zookeeper-headless.solr-cluster.svc.cluster.local:2181,test-solrcloud-zookeeper-1.test-solrcloud-zookeeper-headless.solr-cluster.svc.cluster.local:2181,test-solrcloud-zookeeper-2.test-solrcloud-zookeeper-headless.solr-cluster.svc.cluster.local:2181
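
For context, a rough sketch of how these proposed fields could be declared on the status types (this is not the actual Solr Operator code; the type names, field names, and markers below just follow common kubebuilder conventions):

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type SolrCloudStatus struct {
    // ...existing fields...

    // ObservedGeneration is the most recent metadata.generation that the
    // operator has reconciled into the resources it manages.
    // +optional
    ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}

type SolrNodeStatus struct {
    // ...existing fields...

    // LastTransitionTime records when this node last changed its readiness
    // or spec-up-to-date state.
    // +optional
    LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty"`
}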
@gerlowskija
Contributor

Hey @Shashankft9, @RavinaChidambaram - can you elaborate a little on your broader use case? What sort of changes/states are you trying to monitor that are tough to track with the existing 'status'?

@Shashankft9
Author

Hello @gerlowskija, we use the status of the SolrCloud CR as the source of truth for notifying the provisioner system (UI) that "whatever was put in the CR" has been "successfully deployed", so that the provisioner system gets an ack.
Not just for Solr - we use the same mechanism for other operator-based services: databases, streaming, and more.

In this case, without observedGeneration, it's hard for our automation running in the Kubernetes cluster to identify whether a patch made to the SolrCloud has actually been deployed in the running instance, which stops our automation from sending an ack signal to the provisioner system.
In the same way, lastTransitionTime adds another layer of proof in the status about what's going on at the CR and operator level - I believe this is not implemented in the PR yet, but if you agree with the intention here, we can add that as well!

@gerlowskija
Contributor

it's hard for our automation running in the Kubernetes cluster to identify whether a patch made to the SolrCloud has actually been deployed in the running instance

I guess this is the bit I'm particularly curious about. The main thing I can think of that wouldn't appear in the status would be pod or statefulset-level settings like env-vars, resources, etc - is that representative of the sort of things you don't have a good way to poll for? Or is there another common example?

My worry I guess is that the observedGeneration/lastTransitionTime approach doesn't give enough granularity to be useful in practice. It wouldn't give any way to distinguish between multiple resource changes made in quick succession. Or between a truly substantive "status" update and a (e.g.) pod-readiness blip (which also impacts SolrCloudStatus afaict). Even for a substantive update, many spec changes require multiple steps and update "status" multiple times throughout - so in practice there can be a big gap between "the status changed" and "the user-requested change is complete".

It feels like observedGeneration might send a wrong (or at least incomplete?) signal to your provisioner/UI much of the time. Have you thought much about those scenarios? Is there maybe something I'm missing?

I'm not against the approach btw - I just want to make sure it'll work before we build.

@Shashankft9
Author

I hear your concern; maybe I can give some examples of status in other CRs that help explain where I am coming from.

Example 1:
The status of a Strimzi Kafka CR looks something like this:

status:
  clusterId: ETqrvQ33S9uL5gshnirFPQ
  conditions:
  - lastTransitionTime: "2024-09-27T07:23:13.389Z"
    message: Exceeded timeout of 300000ms while waiting for Deployment resource dev-hsp-01-entity-operator
      in namespace hsp-01 to be ready
    reason: TimeoutException
    status: "True"
    type: NotReady
  listeners:
  - addresses:
    - host: hsp-01.sample.com
      port: 30442
    bootstrapServers: hsp-01.sample.com:30442
    name: external
    type: external
  observedGeneration: 1

Now, the idea is that this observedGeneration should always match metadata.generation. Rather than giving granular insight, it gives me an overall view: when I make a change in the spec (which increments the metadata.generation value), I can see that the operator worked on it and, in this case, surfaced an error - at any point in time T, I can verify that the operator actually acted on the last CR change by comparing the two values. Of course, it doesn't give me all the details of what changed in between, but it does tell me whether, if in one edit I changed values x, y, and z, the operator successfully processed them all or not.
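
For illustration, a consumer could make that comparison roughly like this (just a sketch using the Kubernetes unstructured helpers; not specific to any one operator):

import (
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// specProcessed reports whether the controller has acted on the latest spec,
// i.e. status.observedGeneration has caught up with metadata.generation.
func specProcessed(u *unstructured.Unstructured) bool {
    observed, found, err := unstructured.NestedInt64(u.Object, "status", "observedGeneration")
    if err != nil || !found {
        return false
    }
    return observed == u.GetGeneration()
}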

Example 2:
We can look at something more complex like a Knative Service CR status, where a change in the spec actually causes the Knative controllers to update 4-5 other CRs in the cluster, which is all captured in a status like this:

status:
  address:
    url: http://healthchecks.metacontroller.svc.cluster.local
  conditions:
  - lastTransitionTime: "2024-03-14T09:31:36Z"
    status: "True"
    type: ConfigurationsReady
  - lastTransitionTime: "2024-03-14T09:31:37Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-03-14T09:31:37Z"
    status: "True"
    type: RoutesReady
  latestCreatedRevisionName: healthchecks-00006
  latestReadyRevisionName: healthchecks-00006
  observedGeneration: 6
  traffic:
  - latestRevision: true
    percent: 100
    revisionName: healthchecks-00006
  url: http://healthchecks.metacontroller.svc.cluster.local

Now, looking at this status, I know the Knative controllers worked on the last CR spec change, because the observedGeneration is equal to what's in metadata.generation.

I may not be able to articulate it well, but I recently saw a good talk on Kubernetes CRD status design; just the first few minutes, where Scott covers the TL;DR, might help - https://youtu.be/iNp-fsffIgQ?t=70

is that representative of the sort of things you don't have a good way to poll for?

We wanted to make the "ack-to-provisioner" process event-driven instead of polling, because if we go the polling route we will have to write custom logic for each operator-based service, versus implementing something at the CR status level, which will save us a lot of code maintenance since most of the other operators already follow this pattern, I think.

It feels like observedGeneration might send a wrong (or at least incomplete?) signal to your provisioner/UI much of the time. Have you thought much about those scenarios? Is there maybe something I'm missing?

Yes, we have thought about this. To handle it, we have a check where we only send the ack for success scenarios right now, specifically when a condition like this becomes true:
are all the condition statuses true && is the observedGeneration equal to metadata.generation && is the observedGeneration greater than the last observedGeneration for which we sent an ack
This takes care of scenarios where the instance is created for the first time and/or when it is resized.
Now, this logic doesn't really use lastTransitionTime, but I added it to the issue because, generally speaking, the timestamp helps with other day-to-day operational activities.
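
For illustration, that gate could look roughly like this (a sketch only, assuming the CR exposes metav1.Condition-style conditions and that the caller tracks the last generation it acked):

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// shouldAck mirrors the check above: every condition reports True, the
// operator has observed the latest spec, and this generation has not been
// acked yet. Parameter names are illustrative.
func shouldAck(generation, observedGeneration, lastAckedGeneration int64, conditions []metav1.Condition) bool {
    for _, c := range conditions {
        if c.Status != metav1.ConditionTrue {
            return false
        }
    }
    return observedGeneration == generation &&
        observedGeneration > lastAckedGeneration
}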

Apologies if I have strayed from what you actually asked.

@gerlowskija
Contributor

gerlowskija commented Jan 6, 2025

Not at all - I appreciate the very detailed response @Shashankft9! The YT link was especially helpful; it made a good starting point for me. Thanks for your patience while I read up on this stuff!

I think I was initially under the misapprehension that 'observedGeneration' was "just" a counter that would increase at each "status" update. Having it driven by the metadata.Generation value seen on existing objects makes much more sense and alleviates many of my concerns.

It's also clearly a best practice, given that it's in StatefulSet and other native resource types.
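
For what it's worth, the producer side of this pattern is just copying metadata.generation into status at the end of a successful reconcile. A minimal sketch with controller-runtime (the import path and the ObservedGeneration field below are assumptions; the field is exactly what this issue proposes to add, so this only compiles once it exists):

import (
    "context"

    solrv1beta1 "github.com/apache/solr-operator/api/v1beta1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// recordObservedGeneration marks the spec revision that was just reconciled.
// metadata.generation is bumped by the API server on every spec change, so
// copying it into status is the "this spec has been acted on" signal that
// consumers compare against.
func recordObservedGeneration(ctx context.Context, c client.Client, sc *solrv1beta1.SolrCloud) error {
    sc.Status.ObservedGeneration = sc.Generation // field proposed in this issue
    return c.Status().Update(ctx, sc)
}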

I'm +1 to the approach around observedGeneration, now that I've got a better grasp of things. No concerns with lastTransitionTime either, so long as folks don't misunderstand it as being sufficient. I'll review the PR shortly and we can continue discussion there.
