CFP-36975: Configuration Profiles #67

dlapcevic · 2025-02-07T16:41:32Z

Proposal based on Cilium feature enablement: personas and profiles for Cilium users/operators (cilium/cilium#36975)

wedaly · 2025-02-10T20:51:16Z

cilium/CFP-36975-configuration-profiles.md

+    *   `--enable-cilium-clusterwide-network-policy=false`
+    *   `--identity-allocation-mode=crd`
+    *   `--disable-endpoint-crd=true`
+    *   `--enable-cilium-endpoint-slice=false`


--enable-cilium-endpoint-slice=true?

We're disabling CEP CRD just above, so CES shouldn't work anyway. We want to disable it explicitly since we know there might be incompatibilities already, that if CES is enabled, but CEP is disabled, cilium-agent will not start.

enabling ces is no-op in this profile since no cilium endpoints would be created.

Yes, that's true, but the CES controller will still run inside cilium-operator, and cilium-agent needs to be compatible with disable CEP + enable CES config (cilium/cilium#36726).

wedaly · 2025-02-10T21:01:08Z

cilium/CFP-36975-configuration-profiles.md

+      # Or:
+      # --set profile=high-scale  (if we add a 'profile' value to the main chart)
+    ```
+    The main `values.yaml` might include a top-level `profile` key, allowing users to select a profile with `--set profile=high-scale`.  This would then conditionally include the settings from the corresponding profile file.


I'd prefer the first option (--values profile/high-scale.yaml) since it makes it easier for people the option to override specific settings if they know what they're doing (just add more overrides after using --values or --set)

I agree. It makes more sense. I will keep the --set. profile= in the design as an alternative for now, until we get broader feedback.

cilium/CFP-36975-configuration-profiles.md

wedaly · 2025-02-10T21:02:48Z

cilium/CFP-36975-configuration-profiles.md

+## Open Issues
+
+*   Onboard initial profiles. This requires community input.
+*   Determine the best way to handle profile updates and versioning.  How do we ensure that users can safely upgrade Cilium while using a specific profile?


One idea would be to couple profile changes with Cilium releases (so there'd be a v1.19 of the profiles for Cilium v1.19 release). Then we could enable new optimizations in the high-scale profile starting from some Cilium version, with the caveat that we won't enable anything that could break on upgrade from the previous version.

An example would be cilium-operator managed identities: we'd want that to eventually become part of high-scale profile, but might need to start with identityAllocationMode=both in version N, then identityAllocationMode=operator in version N+1.

That seems like a good solution, considering that configuration depends on the Cilium version. Thanks for the suggestion.

Backward compatibility is important here, because we need to ensure that upgrades don't break any previous profile (or at least up to a reasonable number of versions).
This means that users can apply the new Profile config themselves after they upgrade. After that, we can think if automating this is required.

wedaly · 2025-02-10T21:04:18Z

cilium/CFP-36975-configuration-profiles.md

+
+*   Onboard initial profiles. This requires community input.
+*   Determine the best way to handle profile updates and versioning.  How do we ensure that users can safely upgrade Cilium while using a specific profile?
+*   Develop detailed test plans for each profile.


I like that this proposal could streamline the process of testing new optimizations. Right now it's quite expensive to add new E2E tests for optimizations that are available but not enabled because we need to create/test a new GitHub workflow for each one. It would be nice to enable something in a profile and have the existing test suites cover that.

tamilmani1989 · 2025-02-14T03:45:23Z

cilium/CFP-36975-configuration-profiles.md

+
+## Example Profile
+
+*   **High-Scale:**  Optimized for large-scale clusters, with high number (thousands) of nodes and high pod churn rate (hundreds per second). Limited to a set of basic networking and K8s features: pod connectivity, K8s Service and basic observability.


should we name profile as basic-networking instead of high-scale?

Yes, based on recent discussion I agree that it would make more sense to use basic-networking. I'd still want to get more input from the contributors and stakeholders of this effort before making the change. We can agree on the name before applying it in the CFP.

We should continue that discussion in the separate issue for introducing that profile, while in this CFP we define in detail what profiles are and how to create them. cilium/cilium#37510

tamilmani1989 · 2025-02-18T17:45:17Z

cilium/CFP-36975-configuration-profiles.md

+    *   `--enable-cilium-clusterwide-network-policy=false`
+    *   `--identity-allocation-mode=crd`
+    *   `--disable-endpoint-crd=true`
+    *   `--enable-cilium-endpoint-slice=false`


enabling ces is no-op in this profile since no cilium endpoints would be created.

tamilmani1989 · 2025-02-18T17:48:30Z

cilium/CFP-36975-configuration-profiles.md

+*   **Separate Helm Charts:**  Creating separate Helm charts for each profile would lead to significant code duplication and maintenance overhead.
+
+## Open Issues
+


should also consider adding guidelines on how users can upgrade from basic-networking profile to network policy enabled or other profiles without much disruption?

I agree. It should be discussed and covered in the issue for basic-networking / high-scale profile - cilium/cilium#37510

tamilmani1989 · 2025-02-18T17:49:58Z

cilium/CFP-36975-configuration-profiles.md

+The main goals of this proposal are:
+*   Define a process for creating a new Configuration Profile
+*   Establish a testing framework to ensure the stability and correctness of each profile.
+*   Enable the community to propose and contribute new profiles.


defining what features works with this profile

tamilmani1989 · 2025-02-18T17:51:29Z

cilium/CFP-36975-configuration-profiles.md

+    *   `--enable-policy=never`
+    *   `--enable-k8s-networkpolicy=false`
+    *   `--enable-cilium-network-policy=false`
+    *   `--enable-cilium-clusterwide-network-policy=false`


are we consider adding new field disable-remote-pod-watcher to disable pod watcher in cilium?

Yes, we'd need to. To be discussed in cilium/cilium#37510

joestringer

I think the proposal makes sense. One main comment around the testing below, which is mainly to state a desire for more "layered" testing so we have unit tests, component tests, integration tests and end-to-end, and these increase in scope in that order. The more we invest particularly in the middle categories of testing, the more payoff I think we'll get in the reliability of the implementation and the reliability of the e2e testsuites we run on top.

joestringer · 2025-03-11T22:24:07Z

cilium/CFP-36975-configuration-profiles.md

+    *   **Integration/E2E Tests:**  A new test suite will be created to validate the functionality and stability of each profile. These tests will run against a Kubernetes cluster with Cilium installed using the profile's Helm values. A subset of existing tests will run together with new per-profile tests that cover key features and use cases relevant to the profile.
+    *   **Continuous Integration:** Profile tests will be integrated into the Cilium CI pipeline to ensure that changes to Cilium do not break existing profiles.


I think we'll need to discuss these aspects a bit more. We've relied far too heavily on end-to-end testing in the past and I want to see us seriously investigate more narrow testing to validate the behavior - like the hive scriptest described here. The downsides of e2e are that it's expensive (in time, money, CPU, CO₂ etc), inherently tends towards multiple-owner tests which makes it hard to pin down who can help fix the failures, and the corresponding complexity tends to introduce a high rate of unreliability, which then exacerbates the first two properties.

It's not obvious to me what kind of coverage specifically that motivate additional e2e testing, but we can review that on a case-by-case basis for the new profiles.

On the other hand I also separately wonder whether these configuration profiles may provide an alternative to the generated "matrix" tests we have in some of the existing e2e workflows. If this is a path to reduce the burden of those existing testsuites then that would be interesting.

As for continuous integration, if we decide to create new workflows to test a specific profile then I think it'll likely make sense to set those up on a cron job and assign an owner group for those workflows so that they are responsible for monitoring & triaging failures. I think that for more granular testing it may make sense to include those in pre-merge testing.

This issue has some related discussion and background which might help inform where I'm coming from: cilium/cilium#37837

dlapcevic added the sig/scalability label Feb 7, 2025

dlapcevic mentioned this pull request Feb 9, 2025

Introduce High-Scale Configuration Profile (WIP) cilium/cilium#37510

Open

wedaly reviewed Feb 10, 2025

View reviewed changes

dlapcevic requested review from joestringer, julianwiedmann, bowei, borkmann, tamilmani1989 and hemanthmalla February 12, 2025 12:21

CFP-36975: Configuration Profiles

b040f82

dlapcevic force-pushed the cilium-profiles branch from 0405186 to b040f82 Compare February 12, 2025 12:29

tamilmani1989 reviewed Feb 18, 2025

View reviewed changes

joestringer approved these changes Mar 11, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CFP-36975: Configuration Profiles #67

CFP-36975: Configuration Profiles #67

dlapcevic commented Feb 7, 2025

wedaly Feb 10, 2025

dlapcevic Feb 12, 2025

tamilmani1989 Feb 18, 2025

dlapcevic Feb 27, 2025

wedaly Feb 10, 2025

dlapcevic Feb 12, 2025

wedaly Feb 10, 2025

dlapcevic Feb 14, 2025 •

edited

Loading

wedaly Feb 10, 2025

tamilmani1989 Feb 14, 2025

dlapcevic Feb 27, 2025 •

edited

Loading

tamilmani1989 Feb 18, 2025

tamilmani1989 Feb 18, 2025

dlapcevic Feb 27, 2025

tamilmani1989 Feb 18, 2025

tamilmani1989 Feb 18, 2025

dlapcevic Feb 27, 2025 •

edited

Loading

joestringer left a comment

joestringer Mar 11, 2025

joestringer Mar 11, 2025


		## Example Profile

		* High-Scale: Optimized for large-scale clusters, with high number (thousands) of nodes and high pod churn rate (hundreds per second). Limited to a set of basic networking and K8s features: pod connectivity, K8s Service and basic observability.

		* Separate Helm Charts: Creating separate Helm charts for each profile would lead to significant code duplication and maintenance overhead.

		## Open Issues

		* Integration/E2E Tests: A new test suite will be created to validate the functionality and stability of each profile. These tests will run against a Kubernetes cluster with Cilium installed using the profile's Helm values. A subset of existing tests will run together with new per-profile tests that cover key features and use cases relevant to the profile.
		* Continuous Integration: Profile tests will be integrated into the Cilium CI pipeline to ensure that changes to Cilium do not break existing profiles.

CFP-36975: Configuration Profiles #67

Are you sure you want to change the base?

CFP-36975: Configuration Profiles #67

Conversation

dlapcevic commented Feb 7, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dlapcevic Feb 14, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dlapcevic Feb 27, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dlapcevic Feb 27, 2025 • edited Loading

Choose a reason for hiding this comment

joestringer left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dlapcevic Feb 14, 2025 •

edited

Loading

dlapcevic Feb 27, 2025 •

edited

Loading

dlapcevic Feb 27, 2025 •

edited

Loading