CFP-36975: Configuration Profiles #67
base: main
Conversation
* `--enable-cilium-clusterwide-network-policy=false`
* `--identity-allocation-mode=crd`
* `--disable-endpoint-crd=true`
* `--enable-cilium-endpoint-slice=false`
`--enable-cilium-endpoint-slice=true`?
We're disabling the CEP CRD just above, so CES shouldn't work anyway. We want to disable it explicitly since we already know of an incompatibility: if CES is enabled but CEP is disabled, cilium-agent will not start.
Enabling CES is a no-op in this profile, since no Cilium endpoints would be created.
Yes, that's true, but the CES controller will still run inside cilium-operator, and cilium-agent needs to be compatible with the disabled-CEP + enabled-CES config (cilium/cilium#36726).
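For context, a minimal sketch of how this flag set could be expressed as a Helm values profile file; the `profiles/high-scale.yaml` path and the use of `extraArgs` for flags without a dedicated chart value are assumptions, not the final CFP layout:

```yaml
# profiles/high-scale.yaml -- hypothetical layout, not the final CFP design
identityAllocationMode: crd   # identities stored in CRDs
ciliumEndpointSlice:
  enabled: false              # CES off, consistent with the disabled CEP CRD
extraArgs:                    # agent flags assumed to lack a dedicated chart value
  - --disable-endpoint-crd=true
  - --enable-cilium-clusterwide-network-policy=false
```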
```
# Or:
# --set profile=high-scale (if we add a 'profile' value to the main chart)
```

The main `values.yaml` might include a top-level `profile` key, allowing users to select a profile with `--set profile=high-scale`. This would then conditionally include the settings from the corresponding profile file.
I'd prefer the first option (`--values profile/high-scale.yaml`), since it gives people the option to override specific settings if they know what they're doing (just add more overrides after the `--values` or `--set`).
I agree, it makes more sense. I will keep `--set profile=` in the design as an alternative for now, until we get broader feedback.
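To make the two options concrete, a hedged sketch of both installation paths (the chart reference, profile path, and `profile` key are assumptions carried over from the discussion):

```sh
# Option 1 (preferred above): profile as a standalone values file,
# with user overrides layered on top of it
helm install cilium cilium/cilium --namespace kube-system \
  --values profile/high-scale.yaml \
  --set operator.replicas=3   # example override, applied after the profile

# Option 2 (kept as an alternative): top-level 'profile' key in the main chart
helm install cilium cilium/cilium --namespace kube-system \
  --set profile=high-scale
```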
## Open Issues

* Onboard initial profiles. This requires community input.
* Determine the best way to handle profile updates and versioning. How do we ensure that users can safely upgrade Cilium while using a specific profile?
One idea would be to couple profile changes with Cilium releases (so there'd be a v1.19 of the profiles for the Cilium v1.19 release). Then we could enable new optimizations in the high-scale profile starting from some Cilium version, with the caveat that we won't enable anything that could break on upgrade from the previous version.
An example would be cilium-operator-managed identities: we'd want that to eventually become part of the high-scale profile, but we might need to start with identityAllocationMode=both in version N, then switch to identityAllocationMode=operator in version N+1.
That seems like a good solution, considering that the configuration depends on the Cilium version. Thanks for the suggestion.
Backward compatibility is important here: we need to ensure that upgrades don't break any previous profile (or at least a reasonable number of past versions).
This means users can apply the new profile config themselves after they upgrade; after that, we can consider whether automating this is required.
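A sketch of how version-coupled profiles could stage the identity migration mentioned above (directory names and the two-step staging are assumptions):

```yaml
# profiles/v1.N/high-scale.yaml -- hypothetical: transitional release
identityAllocationMode: both      # agent and operator both allocate identities

# profiles/v1.N+1/high-scale.yaml -- hypothetical: migration complete
identityAllocationMode: operator  # operator-managed identities only
```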
* Onboard initial profiles. This requires community input.
* Determine the best way to handle profile updates and versioning. How do we ensure that users can safely upgrade Cilium while using a specific profile?
* Develop detailed test plans for each profile.
I like that this proposal could streamline the process of testing new optimizations. Right now it's quite expensive to add new E2E tests for optimizations that are available but not enabled because we need to create/test a new GitHub workflow for each one. It would be nice to enable something in a profile and have the existing test suites cover that.
## Example Profile

* **High-Scale:** Optimized for large-scale clusters with a high number (thousands) of nodes and a high pod churn rate (hundreds per second). Limited to a set of basic networking and K8s features: pod connectivity, K8s Services, and basic observability.
Should we name the profile `basic-networking` instead of `high-scale`?
Yes, based on recent discussion I agree that `basic-networking` would make more sense. I'd still want to get more input from the contributors and stakeholders of this effort before making the change; we can agree on the name before applying it in the CFP.
We should continue that discussion in the separate issue for introducing that profile (cilium/cilium#37510), while this CFP defines in detail what profiles are and how to create them.
* **Separate Helm Charts:** Creating separate Helm charts for each profile would lead to significant code duplication and maintenance overhead.

## Open Issues
Should we also consider adding guidelines on how users can upgrade from the `basic-networking` profile to a network-policy-enabled or other profile without much disruption?
I agree. It should be discussed and covered in the issue for the `basic-networking` / `high-scale` profile: cilium/cilium#37510
The main goals of this proposal are:
* Define a process for creating a new Configuration Profile.
* Establish a testing framework to ensure the stability and correctness of each profile.
* Enable the community to propose and contribute new profiles.
Also worth adding: defining which features work with this profile.
* `--enable-policy=never`
* `--enable-k8s-networkpolicy=false`
* `--enable-cilium-network-policy=false`
* `--enable-cilium-clusterwide-network-policy=false`
Are we considering adding a new field, `disable-remote-pod-watcher`, to disable the pod watcher in Cilium?
Yes, we'd need to. To be discussed in cilium/cilium#37510
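If such a flag is added, one hedged way to carry it in the profile would be via the chart's `extraArgs`; the flag itself does not exist yet, and its name is taken from the comment above:

```yaml
# Hypothetical, pending the outcome of cilium/cilium#37510
extraArgs:
  - --disable-remote-pod-watcher=true   # proposed flag, not yet implemented
```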
I think the proposal makes sense. My one main comment is around the testing below: mainly a desire for more "layered" testing, so that we have unit tests, component tests, integration tests, and end-to-end tests, increasing in scope in that order. The more we invest, particularly in the middle categories, the more payoff I think we'll get in the reliability of the implementation and of the e2e test suites we run on top.
* **Integration/E2E Tests:** A new test suite will be created to validate the functionality and stability of each profile. These tests will run against a Kubernetes cluster with Cilium installed using the profile's Helm values. A subset of existing tests will run together with new per-profile tests that cover key features and use cases relevant to the profile.
* **Continuous Integration:** Profile tests will be integrated into the Cilium CI pipeline to ensure that changes to Cilium do not break existing profiles.
I think we'll need to discuss these aspects a bit more. We've relied far too heavily on end-to-end testing in the past, and I want to see us seriously investigate narrower testing to validate the behavior, like the hive scripttest described here. The downsides of e2e are that it's expensive (in time, money, CPU, CO₂, etc.), it inherently tends towards multiple-owner tests, which makes it hard to pin down who can help fix failures, and the corresponding complexity tends to introduce a high rate of unreliability, which then exacerbates the first two properties.
It's not obvious to me what kinds of coverage would specifically motivate additional e2e testing, but we can review that on a case-by-case basis for the new profiles.
On the other hand, I also wonder whether these configuration profiles may provide an alternative to the generated "matrix" tests we have in some of the existing e2e workflows. If this is a path to reducing the burden of those existing test suites, that would be interesting.
As for continuous integration, if we decide to create new workflows to test a specific profile, then it'll likely make sense to set those up on a cron job and assign an owner group responsible for monitoring and triaging failures. For more granular testing, it may make sense to include those in pre-merge testing.
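As an illustration of the cron-based setup suggested here, a hedged GitHub Actions sketch; the workflow name, schedule, and steps are all assumptions, and cluster provisioning is elided:

```yaml
# .github/workflows/profile-high-scale.yaml -- hypothetical
name: Profile conformance (high-scale)
on:
  schedule:
    - cron: "0 4 * * *"   # nightly run, monitored by the owning team
  workflow_dispatch: {}   # manual re-runs for triage
jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # (cluster provisioning elided)
      - name: Install Cilium with the profile values
        run: |
          helm install cilium cilium/cilium --namespace kube-system \
            --values profile/high-scale.yaml
      # ...per-profile test suite would run here...
```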
This issue has some related discussion and background which might help inform where I'm coming from: cilium/cilium#37837
Proposal based on "Cilium feature enablement: personas and profiles for Cilium users/operators" (cilium/cilium#36975).