CFP-36975: Configuration Profiles #67
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,121 @@ | ||
# Cilium Feature Proposal: Configuration Profiles | ||
|
||
## Author(s) | ||
|
||
Dorde Lapcevic ([email protected]) | ||
|
||
## Date | ||
|
||
2025-02-07 | ||
|
||
## Status | ||
|
||
Implementable | ||
|
||
## Abstract | ||
|
||
This proposal introduces the concept of "Configuration Profiles" to Cilium. Profiles are pre-defined, documented, and tested sets of Cilium configuration options designed to address specific operational needs and use cases. This aims to simplify deployment, improve predictability, and enhance the user experience by providing well-defined operational modes. | ||
|
||
## Motivation | ||
|
||
Currently, Cilium offers numerous feature flags that operators configure individually. This granular control, while powerful, presents several challenges: | ||
|
||
* **Combinatorial Complexity:** The interaction between numerous flags can lead to unexpected behavior and makes exhaustive testing difficult. Certain combinations may be subtly broken or have performance implications that are not immediately apparent. | ||
* **Operational Expertise Required:** Operators need a deep understanding of individual flags and their interplay to configure Cilium optimally for their environment. | ||
* **Lack of Clear Guidance:** New users may struggle to understand the best configuration for their needs, leading to suboptimal deployments. | ||
* **Implicit Knowledge:** Experienced operators often develop a set of preferred flags based on their specific environment (scale, performance requirements, network topology, etc.). This knowledge is often implicit and not easily shared. | ||
|
||
Configuration Profiles address these issues by: | ||
|
||
* **Simplifying Deployment:** Profiles provide a "one-click" (or "one-command") deployment option for common use cases. | ||
* **Improving Predictability:** Each profile will be thoroughly tested and documented, ensuring consistent behavior and performance. | ||
* **Providing Clear Guidance:** Documentation will clearly explain the purpose, benefits, limitations, and enabled/disabled features of each profile. | ||
* **Best Practices:** Profiles encapsulate expert knowledge and best practices for specific operational scenarios. | ||
* **Targeted Testing:** By testing at the profile level (a set of features) instead of only individual flags, we increase coverage and reduce the risk of unexpected interactions. | ||
|
||
## Goals | ||
|
||
The main goals of this proposal are: | ||
* Define a process for creating a new Configuration Profile. | ||
* Establish a testing framework to ensure the stability and correctness of each profile. | ||
* Enable the community to propose and contribute new profiles. | ||
|
||
## Non-Goals | ||
|
||
* Completely eliminate individual feature flags. Advanced users will still be able to customize their deployments beyond the pre-defined profiles. Profiles are intended to be a starting point, not a restriction. | ||
* Create a profile for every possible combination of flags. The focus is on common, well-understood use cases. | ||
* Guarantee that profiles will cover 100% of user needs. Edge cases will always exist; the aim is to serve the majority of users. | ||
|
||
## Example Profile | ||
|
||
* **High-Scale:** Optimized for large-scale clusters with a high number (thousands) of nodes and a high pod churn rate (hundreds per second). Limited to a set of basic networking and K8s features: pod connectivity, K8s Services, and basic observability. | ||
Comment: should we name profile as
Reply: Yes, based on recent discussion I agree that it would make more sense to use We should continue that discussion in the separate issue for introducing that profile, while in this CFP we define in detail what profiles are and how to create them. cilium/cilium#37510
||
* `--enable-policy=never` | ||
* `--enable-k8s-networkpolicy=false` | ||
* `--enable-cilium-network-policy=false` | ||
* `--enable-cilium-clusterwide-network-policy=false` | ||
Comment: are we consider adding new field
Reply: Yes, we'd need to. To be discussed in cilium/cilium#37510
||
* `--identity-allocation-mode=crd` | ||
* `--disable-endpoint-crd=true` | ||
* `--enable-cilium-endpoint-slice=false` | ||
Comment: --enable-cilium-endpoint-slice=true?
Reply: We're disabling CEP CRD just above, so CES shouldn't work anyway. We want to disable it explicitly since we know there might be incompatibilities already, that if CES is enabled, but CEP is disabled, cilium-agent will not start.
Comment: enabling ces is no-op in this profile since no cilium endpoints would be created.
Reply: Yes, that's true, but the CES controller will still run inside cilium-operator, and cilium-agent needs to be compatible with disable CEP + enable CES config (cilium/cilium#36726).
||
|
||
Note: This is only an example profile. It's not ready for use. | ||
|
||
## Creating a Profile | ||
|
||
* Provide clear documentation for each profile, including: | ||
  * Purpose and use case | ||
  * Benefits and limitations (trade-offs) | ||
  * Specific configuration options (flags) enabled/disabled | ||
  * Installation instructions | ||
  * Supported Cilium versions | ||
  * Known issues | ||
* Implement profile selection via Helm charts. | ||
|
||
## Implementation | ||
|
||
The implementation will leverage Cilium's Helm chart capabilities: | ||
|
||
1. **Helm Chart Values:** Each profile will be defined as a set of Helm values overrides. These overrides will configure the necessary Cilium feature flags. For example, the values file for the "High-Scale" profile might include: | ||
|
||
```yaml | ||
# profiles/high-scale.yaml | ||
disableEndpointCRD: true | ||
policyEnforcementMode: "never" | ||
# ... other relevant settings ... | ||
``` | ||
|
||
2. **Profile Selection:** Users will select a profile during installation by referencing the appropriate values file: | ||
|
||
```bash | ||
helm install cilium cilium/cilium --version 1.17.x \ | ||
--namespace kube-system \ | ||
--values profiles/high-scale.yaml | ||
# Or: | ||
# --set profile=high-scale (if we add a 'profile' value to the main chart) | ||
``` | ||
The main `values.yaml` might include a top-level `profile` key, allowing users to select a profile with `--set profile=high-scale`. This would then conditionally include the settings from the corresponding profile file. | ||
Comment: I'd prefer the first option (
Reply: I agree. It makes more sense. I will keep the
||
|
||
3. **Documentation:** Each profile will have a dedicated page in the Cilium documentation, detailing its characteristics, installation instructions, incompatible features and functionality, and relevant configuration options. This documentation will be part of the Cilium documentation repository. | ||
|
||
4. **Testing:** | ||
* **Unit Tests:** The Helm chart rendering will be unit-tested to ensure that the correct flags are set for each profile. | ||
* **Integration/E2E Tests:** A new test suite will be created to validate the functionality and stability of each profile. These tests will run against a Kubernetes cluster with Cilium installed using the profile's Helm values. A subset of existing tests will run together with new per-profile tests that cover key features and use cases relevant to the profile. | ||
* **Continuous Integration:** Profile tests will be integrated into the Cilium CI pipeline to ensure that changes to Cilium do not break existing profiles. | ||
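The Helm rendering check described above could be sketched as a small shell helper. This is only an illustration: `check_flag` is a hypothetical name, and it assumes the rendered `cilium-config` ConfigMap uses keys that mirror the agent flag names (for example `disable-endpoint-crd` and `enable-policy`). The sample manifest text is made up for the example.

```bash
#!/bin/sh
# Hypothetical helper: read a rendered manifest on stdin and assert that it
# contains a line of the form `key: "value"` (quotes optional).
check_flag() {
  key=$1
  want=$2
  if grep -Eq "^[[:space:]]*${key}:[[:space:]]*\"?${want}\"?[[:space:]]*\$"; then
    printf 'ok: %s=%s\n' "$key" "$want"
  else
    printf 'FAIL: %s is not %s\n' "$key" "$want" >&2
    return 1
  fi
}

# Illustrative stand-in for the rendered ConfigMap data (assumed key names).
rendered='  disable-endpoint-crd: "true"
  enable-policy: "never"'

printf '%s\n' "$rendered" | check_flag disable-endpoint-crd true
printf '%s\n' "$rendered" | check_flag enable-policy never
```

In a real test the input would come from something like `helm template cilium cilium/cilium --values profiles/high-scale.yaml` instead of the inline sample.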
Comment on lines +101 to +102:
I think we'll need to discuss these aspects a bit more. We've relied far too heavily on end-to-end testing in the past and I want to see us seriously investigate more narrow testing to validate the behavior - like the hive scriptest described here. The downsides of e2e are that it's expensive (in time, money, CPU, CO₂ etc), inherently tends towards multiple-owner tests which makes it hard to pin down who can help fix the failures, and the corresponding complexity tends to introduce a high rate of unreliability, which then exacerbates the first two properties. It's not obvious to me what kind of coverage specifically would motivate additional e2e testing, but we can review that on a case-by-case basis for the new profiles.
On the other hand I also separately wonder whether these configuration profiles may provide an alternative to the generated "matrix" tests we have in some of the existing e2e workflows. If this is a path to reduce the burden of those existing testsuites then that would be interesting.
As for continuous integration, if we decide to create new workflows to test a specific profile then I think it'll likely make sense to set those up on a cron job and assign an owner group for those workflows so that they are responsible for monitoring & triaging failures. I think that for more granular testing it may make sense to include those in pre-merge testing.
Comment: This issue has some related discussion and background which might help inform where I'm coming from: cilium/cilium#37837
||
|
||
## Alternatives Considered | ||
|
||
* **Individual Flags:** As discussed in the "Motivation" section, this approach is complex and error-prone. | ||
* **Separate Helm Charts:** Creating separate Helm charts for each profile would lead to significant code duplication and maintenance overhead. | ||
|
||
## Open Issues | ||
|
||
Comment: should also consider adding guidelines on how users can upgrade from
Reply: I agree. It should be discussed and covered in the issue for
||
* Onboard initial profiles. This requires community input. | ||
* Determine the best way to handle profile updates and versioning. How do we ensure that users can safely upgrade Cilium while using a specific profile? | ||
Comment: One idea would be to couple profile changes with Cilium releases (so there'd be a v1.19 of the profiles for Cilium v1.19 release). Then we could enable new optimizations in the high-scale profile starting from some Cilium version, with the caveat that we won't enable anything that could break on upgrade from the previous version. An example would be cilium-operator managed identities: we'd want that to eventually become part of high-scale profile, but might need to start with identityAllocationMode=both in version N, then identityAllocationMode=operator in version N+1.
Reply: That seems like a good solution, considering that configuration depends on the Cilium version. Thanks for the suggestion. Backward compatibility is important here, because we need to ensure that upgrades don't break any previous profile (or at least up to a reasonable number of versions).
||
* Develop detailed test plans for each profile. | ||
Comment: I like that this proposal could streamline the process of testing new optimizations. Right now it's quite expensive to add new E2E tests for optimizations that are available but not enabled because we need to create/test a new GitHub workflow for each one. It would be nice to enable something in a profile and have the existing test suites cover that.
||
* Consider how to allow users to extend profiles (create custom profiles based on existing ones) without modifying the core Helm charts. | ||
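One possible approach to the last open issue, offered only as a sketch, is to layer a user-owned values file after the profile file. Helm gives precedence to the right-most `--values` file, so later files override the profile without any change to the chart. The helper name `profile_install_cmd` and the file `my-overrides.yaml` are hypothetical; the function only prints the command so that it can be inspected before running.

```bash
#!/bin/sh
# Hypothetical helper: print a helm command that layers user overrides on top
# of a profile. Later --values files take precedence over earlier ones, so the
# profile file comes first. Paths containing spaces are not handled here.
profile_install_cmd() {
  profile_file=$1
  shift
  cmd="helm install cilium cilium/cilium --namespace kube-system --values $profile_file"
  for overlay in "$@"; do
    cmd="$cmd --values $overlay"
  done
  printf '%s\n' "$cmd"
}

# Example: high-scale profile plus one user-owned overlay (hypothetical file).
profile_install_cmd profiles/high-scale.yaml my-overrides.yaml
```

Keeping overrides in a separate file leaves the profile file itself unmodified, which may make it easier to pick up profile updates on upgrade.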
|
||
## Next Steps | ||
|
||
Proceed with implementing one of the initial candidate profiles: | ||
* Write initial documentation. | ||
* Create the necessary Helm chart values file. | ||
* Create a test suite. |
Comment: defining which features work with this profile