Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CFP-36975: Configuration Profiles #67

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dlapcevic
Copy link
Contributor

Proposal based on Cilium feature enablement: personas and profiles for Cilium users/operators (cilium/cilium#36975)

* `--enable-cilium-clusterwide-network-policy=false`
* `--identity-allocation-mode=crd`
* `--disable-endpoint-crd=true`
* `--enable-cilium-endpoint-slice=false`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--enable-cilium-endpoint-slice=true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're disabling CEP CRD just above, so CES shouldn't work anyway. We want to disable it explicitly since we know there might be incompatibilities already, that if CES is enabled, but CEP is disabled, cilium-agent will not start.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enabling ces is no-op in this profile since no cilium endpoints would be created.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's true, but the CES controller will still run inside cilium-operator, and cilium-agent needs to be compatible with disable CEP + enable CES config (cilium/cilium#36726).

# Or:
# --set profile=high-scale (if we add a 'profile' value to the main chart)
```
The main `values.yaml` might include a top-level `profile` key, allowing users to select a profile with `--set profile=high-scale`. This would then conditionally include the settings from the corresponding profile file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer the first option (--values profile/high-scale.yaml) since it makes it easier for people the option to override specific settings if they know what they're doing (just add more overrides after using --values or --set)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. It makes more sense. I will keep the --set. profile= in the design as an alternative for now, until we get broader feedback.

## Open Issues

* Onboard initial profiles. This requires community input.
* Determine the best way to handle profile updates and versioning. How do we ensure that users can safely upgrade Cilium while using a specific profile?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One idea would be to couple profile changes with Cilium releases (so there'd be a v1.19 of the profiles for Cilium v1.19 release). Then we could enable new optimizations in the high-scale profile starting from some Cilium version, with the caveat that we won't enable anything that could break on upgrade from the previous version.

An example would be cilium-operator managed identities: we'd want that to eventually become part of high-scale profile, but might need to start with identityAllocationMode=both in version N, then identityAllocationMode=operator in version N+1.

Copy link
Contributor Author

@dlapcevic dlapcevic Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like a good solution, considering that configuration depends on the Cilium version. Thanks for the suggestion.

Backward compatibility is important here, because we need to ensure that upgrades don't break any previous profile (or at least up to a reasonable number of versions).
This means that users can apply the new Profile config themselves after they upgrade. After that, we can think if automating this is required.


* Onboard initial profiles. This requires community input.
* Determine the best way to handle profile updates and versioning. How do we ensure that users can safely upgrade Cilium while using a specific profile?
* Develop detailed test plans for each profile.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that this proposal could streamline the process of testing new optimizations. Right now it's quite expensive to add new E2E tests for optimizations that are available but not enabled because we need to create/test a new GitHub workflow for each one. It would be nice to enable something in a profile and have the existing test suites cover that.


## Example Profile

* **High-Scale:** Optimized for large-scale clusters, with high number (thousands) of nodes and high pod churn rate (hundreds per second). Limited to a set of basic networking and K8s features: pod connectivity, K8s Service and basic observability.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we name profile as basic-networking instead of high-scale?

Copy link
Contributor Author

@dlapcevic dlapcevic Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, based on recent discussion I agree that it would make more sense to use basic-networking. I'd still want to get more input from the contributors and stakeholders of this effort before making the change. We can agree on the name before applying it in the CFP.

We should continue that discussion in the separate issue for introducing that profile, while in this CFP we define in detail what profiles are and how to create them. cilium/cilium#37510

* `--enable-cilium-clusterwide-network-policy=false`
* `--identity-allocation-mode=crd`
* `--disable-endpoint-crd=true`
* `--enable-cilium-endpoint-slice=false`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enabling ces is no-op in this profile since no cilium endpoints would be created.

* **Separate Helm Charts:** Creating separate Helm charts for each profile would lead to significant code duplication and maintenance overhead.

## Open Issues

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should also consider adding guidelines on how users can upgrade from basic-networking profile to network policy enabled or other profiles without much disruption?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. It should be discussed and covered in the issue for basic-networking / high-scale profile - cilium/cilium#37510

The main goals of this proposal are:
* Define a process for creating a new Configuration Profile
* Establish a testing framework to ensure the stability and correctness of each profile.
* Enable the community to propose and contribute new profiles.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defining what features works with this profile

* `--enable-policy=never`
* `--enable-k8s-networkpolicy=false`
* `--enable-cilium-network-policy=false`
* `--enable-cilium-clusterwide-network-policy=false`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we consider adding new field disable-remote-pod-watcher to disable pod watcher in cilium?

Copy link
Contributor Author

@dlapcevic dlapcevic Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we'd need to. To be discussed in cilium/cilium#37510

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the proposal makes sense. One main comment around the testing below, which is mainly to state a desire for more "layered" testing so we have unit tests, component tests, integration tests and end-to-end, and these increase in scope in that order. The more we invest particularly in the middle categories of testing, the more payoff I think we'll get in the reliability of the implementation and the reliability of the e2e testsuites we run on top.

Comment on lines +101 to +102
* **Integration/E2E Tests:** A new test suite will be created to validate the functionality and stability of each profile. These tests will run against a Kubernetes cluster with Cilium installed using the profile's Helm values. A subset of existing tests will run together with new per-profile tests that cover key features and use cases relevant to the profile.
* **Continuous Integration:** Profile tests will be integrated into the Cilium CI pipeline to ensure that changes to Cilium do not break existing profiles.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll need to discuss these aspects a bit more. We've relied far too heavily on end-to-end testing in the past and I want to see us seriously investigate more narrow testing to validate the behavior - like the hive scriptest described here. The downsides of e2e are that it's expensive (in time, money, CPU, CO₂ etc), inherently tends towards multiple-owner tests which makes it hard to pin down who can help fix the failures, and the corresponding complexity tends to introduce a high rate of unreliability, which then exacerbates the first two properties.

It's not obvious to me what kind of coverage specifically that motivate additional e2e testing, but we can review that on a case-by-case basis for the new profiles.

On the other hand I also separately wonder whether these configuration profiles may provide an alternative to the generated "matrix" tests we have in some of the existing e2e workflows. If this is a path to reduce the burden of those existing testsuites then that would be interesting.

As for continuous integration, if we decide to create new workflows to test a specific profile then I think it'll likely make sense to set those up on a cron job and assign an owner group for those workflows so that they are responsible for monitoring & triaging failures. I think that for more granular testing it may make sense to include those in pre-merge testing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue has some related discussion and background which might help inform where I'm coming from: cilium/cilium#37837

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants