Set bulk flag to true on intended pools for performance gain #2980

Open
wants to merge 2 commits into base: main

Conversation

@malayparida2000 (Contributor) commented Jan 28, 2025

The bulk flag makes the autoscaler start a pool with the maximum number of PGs, and the autoscaler then decreases the PG count only if the PG usage starts to skew too much. This could improve performance for ODF users due to the greater amount of parallelism during reads/writes on large clusters.

For now, we are going to set the bulk flag on the default cephBlockPool & replica-1 cephBlockPools, the default cephFS metadata & data pools, and the default cephObjectStore metadata & data pools.

The flag is set to true only during new pool creation, as setting it on existing pools can cause data movement.

Ref: https://issues.redhat.com/browse/RHSTOR-6774
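
A minimal sketch of the intent described above, for illustration only (the helper name setBulkOnNewPool and the isNewPool flag are not from the PR; the operator's real types and flow differ):

package main

import "fmt"

// setBulkOnNewPool adds the "bulk" parameter only when the pool is being
// created for the first time, so existing pools are never mutated and no
// data movement is triggered.
func setBulkOnNewPool(params map[string]string, isNewPool bool) map[string]string {
	if params == nil {
		params = map[string]string{}
	}
	if isNewPool {
		params["bulk"] = "true"
	}
	return params
}

func main() {
	fmt.Println(setBulkOnNewPool(nil, true))                                // map[bulk:true]
	fmt.Println(setBulkOnNewPool(map[string]string{"pg_num": "16"}, false)) // existing pool left unchanged
}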


openshift-ci bot commented Jan 28, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: malayparida2000
Once this PR has been reviewed and has the lgtm label, please assign nb-ohad for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -173,6 +173,7 @@ func (r *StorageClusterReconciler) newCephObjectStoreInstances(initData *ocsv1.S
EnableCrushUpdates: true,
FailureDomain: initData.Status.FailureDomain,
Replicated: generateCephReplicatedSpec(initData, "data"),
Parameters: map[string]string{"bulk": "true"},
},
MetadataPool: cephv1.PoolSpec{

@BlaineEXE Should we be setting the bulk flag on the object and filesystem metadata pools, in addition to the data pools?

@BlaineEXE (Contributor) commented Jan 29, 2025

Anthony and Kyle seemed to be saying yes when we chatted. It sounds a bit counterintuitive to me as well. My best guess is that the performance gains from increased parallelism must end up being a bigger positive effect than the performance loss from having info split into more chunks.

It might be good if we could test the performance effects of bulk on the meta pools to see if there's an impact either way.

@mmgaggle (Contributor) commented Mar 6, 2025

Having more PGs will help ensure more uniform data distribution and will mean there are more targets (PGs) across which to perform work. This is detailed here:

https://docs.ceph.com/en/pacific/rados/operations/placement-groups/#object-distribution-within-a-pool

As long as there are one or two orders of magnitude more Placement Groups than OSDs, the distribution should be even. For instance, 256 placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs etc.

The number of PGs in a pool does not impact the size of the objects; that is controlled through:

RBD:

  • rbd_default_order
  • --object-size

RGW:

  • rgw_obj_stripe_size

CephFS (file layout):

  • object_size

One way to approach the problem is to think in terms of having a budget for PGs. This budget is guided by mon_target_pg_per_osd, whose default value is 100; personally, I try to aim for 200-250 and might set this to either 200 or 225. Target ratios are used to proportion that budget, and with bulk the PGs will be created on pools according to those proportions. Using the budget analogy, this means you're putting all your capital to work, but it also means that you're now in a Pareto-efficient situation: if you create a new pool, then the absolute number of PGs for pre-existing pools needs to be reduced.

If you know ahead of time what pools you're going to have in the cluster, it's reasonable to "pay in full" for your pools in advance so that you don't need to keep making payments over time (potentially disruptive to your applications). By doing this you can reap the performance advantages immediately. If you don't use bulk you can avoid the need to reduce PGs in existing pools, but you still need to increase PGs as the capacity used grows, which has a similar disruption potential. Since you can't avoid the disruption potential either way in this decision framework, it's better to reap the benefits of a "pay in full" strategy.
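
To make the budget arithmetic concrete, here is a rough sketch in Go. It is not the mgr autoscaler's exact algorithm, and the function names are illustrative, but the numbers line up with the autoscale-status output shown later in this thread assuming a 3-OSD cluster with the default mon_target_pg_per_osd of 100:

package main

import (
	"fmt"
	"math"
)

// pgBudget is the total PG budget: mon_target_pg_per_osd times the OSD count.
func pgBudget(targetPGPerOSD, osds int) float64 {
	return float64(targetPGPerOSD * osds)
}

// poolPGs proportions the budget by the pool's target_size_ratio (normalised
// against the sum of all ratios), divides by the replica size, and rounds to
// the nearest power of two.
func poolPGs(budget, ratio, ratioSum float64, replicaSize int) int {
	raw := budget * (ratio / ratioSum) / float64(replicaSize)
	return 1 << int(math.Round(math.Log2(raw)))
}

func main() {
	// Three pools, each with target_size_ratio 0.49 (effective ratio 0.3333)
	// and 3x replication, on 3 OSDs with the default target of 100 PGs/OSD:
	fmt.Println(poolPGs(pgBudget(100, 3), 0.49, 3*0.49, 3)) // 32, as seen for the data pools
	// Raising mon_target_pg_per_osd to 200 doubles the budget:
	fmt.Println(poolPGs(pgBudget(200, 3), 0.49, 3*0.49, 3)) // 64
}

(The CephFS metadata pool in that output lands at 128 rather than 32 because of its bias of 4.)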

@malayparida2000 (Contributor, Author)

Thanks @mmgaggle for such a detailed analysis.

@malayparida2000 (Contributor, Author)

/cc @travisn @BlaineEXE

@openshift-ci openshift-ci bot requested review from BlaineEXE and travisn January 29, 2025 12:36
@BlaineEXE (Contributor) left a comment

Are there plans to allow this to be configured via any means? I trust Kyle and Anthony implicitly, but we still don't have a measure of how the bulk flag impacts ODF specifically. Further, we don't know whether ODF's common 3-OSD/4-OSD clusters will be impacted positively or negatively by this change. My vague understanding is that 3-/4-OSD clusters are much less common for RHCS than for ODF, and my intuition says that cluster size may play an important part in the performance impacts.

I would recommend having some mechanism for enabling/disabling this behavior. We don't have to document it for users, but I think it will still be useful to us internally. That configurability will make it easier for the performance team to do A:B testing to quantify the performance changes, and it will allow us an easy way to disable this change close to the ODF release if we have to, by changing the "business logic" instead of the code. It'll also give us the ability to disable/enable it for customer environments in cases where that is important to them.

Travis is also asking an important question here. His and my intuition both suggest that the RGW metadata pools might respond to the bulk flag differently than other pools. Thus, I would also suggest allowing the CephObjectStore metadata pools to have bulk enabled/disabled independently from the rest of the cluster. This is essentially for the same reasons as above.

Here's my guess about what we will find with bulk flag performance testing:

  • I think we will find that the bulk flag has no impact on 3-OSD clusters on the whole
  • I suspect there will be noticeable positive impact on clusters with 9-12 or more OSDs
  • There is a risk we find S3 metadata operation slowdown with 3-OSD clusters, possibly resulting in minor S3 I/O loss until the cluster has 6-9 OSDs, where we will break even and then see gradual improvements

@malayparida2000 (Contributor, Author)

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 30, 2025
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 16, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 27, 2025
@malayparida2000 malayparida2000 changed the title Use bulk flag for all odf data pools for performance gain Set bulk flag to true on intended pools for performance gain Feb 28, 2025
// Ensures the bulk flag set during new pool creation is not removed during updates.
preserveBulkFlagParameter(existingParameters, &poolSpec.Parameters)

// set the pg_num & pgp_num parameters only when bulk flag is not set or false
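
For readers without the full diff, a plausible sketch of what the preserveBulkFlagParameter helper referenced above might do; the PR's actual implementation may differ:

// preserveBulkFlagParameter carries an existing "bulk" parameter over into
// the desired pool spec, so that a later update never silently drops a flag
// that was set at pool creation time.
func preserveBulkFlagParameter(existingParameters map[string]string, desiredParameters *map[string]string) {
	val, ok := existingParameters["bulk"]
	if !ok {
		return
	}
	if *desiredParameters == nil {
		*desiredParameters = map[string]string{}
	}
	(*desiredParameters)["bulk"] = val
}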
@malayparida2000 (Contributor, Author)

@travisn what's the behaviour of Ceph when bulk is set to true and pg_num/pgp_num are also set to 16? Do we have to selectively exclude setting these two parameters when the bulk flag is true?

@malayparida2000 (Contributor, Author)

Did some testing on a replica-1 cluster & found that when the bulk flag is true, the pg_num/pgp_num settings are ignored.

~ $  oc get cephblockpool ocs-storagecluster-cephblockpool-us-east-1a -o yaml | yq '.spec.parameters'
bulk: "true"
pg_num: "16"
pgp_num: "16"
sh-5.1$ ceph osd pool autoscale-status
POOL                                           SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK   
.mgr                                         576.5k                3.0         6144G  0.0000                                  1.0       1              on         False  
ocs-storagecluster-cephblockpool             65219k                3.0         6144G  0.3333        0.4900           0.3333   1.0      32              on         True   
ocs-storagecluster-cephfilesystem-metadata   39776                 3.0         6144G  0.3333        0.4900           0.3333   4.0     128              on         True   
ocs-storagecluster-cephfilesystem-data0          0                 3.0         6144G  0.3333        0.4900           0.3333   1.0      32              on         True   
ocs-storagecluster-cephblockpool-us-east-1a     19                 1.0         2048G  0.0000                                  1.0     128              on         True   
ocs-storagecluster-cephblockpool-us-east-1b     19                 1.0         2048G  0.0000                                  1.0     128              on         True   
ocs-storagecluster-cephblockpool-us-east-1c     19                 1.0         2048G  0.0000                                  1.0     128              on         True   

@mmgaggle (Contributor)

I think my preference would be to employ a strategy like the one outlined here that uses target_ratio on each pool instead of defining the number of PGs. I don't think bulk will do anything without either target_ratio or target-size, and if you supplied one of those with a static number of PGs the autoscaler would adjust the quantity.

#2980 (comment)

@malayparida2000 (Contributor, Author)

Right @mmgaggle. Specifying pg_num as 16 was only for a specific kind of replica-1 Ceph block pool where we had issues with autoscaling.
For all the data pools we create, we specify the target size ratio as 0.49.

@malayparida2000 (Contributor, Author) commented Feb 28, 2025

Some basic testing on a fresh 4.19 cluster; I will do detailed testing after some reviews.

Bulk flag presence on pools

~ $ oc get cephblockpool ocs-storagecluster-cephblockpool -o yaml | yq '.spec.parameters'

bulk: "true"

~ $ oc get cephfilesystem ocs-storagecluster-cephfilesystem -o yaml | yq '.spec.dataPools[].parameters'

bulk: "true"

~ $ oc get cephfilesystem ocs-storagecluster-cephfilesystem -o yaml | yq '.spec.metadataPool.parameters'

bulk: "true"

sh-5.1$ ceph osd pool autoscale-status
POOL                                          SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK   
.mgr                                        576.5k                3.0         6144G  0.0000                                  1.0       1              on         False  
ocs-storagecluster-cephblockpool            66071k                3.0         6144G  0.3333        0.4900           0.3333   1.0      32              on         True   
ocs-storagecluster-cephfilesystem-metadata  25586                 3.0         6144G  0.3333        0.4900           0.3333   4.0     128              on         True   
ocs-storagecluster-cephfilesystem-data0         0                 3.0         6144G  0.3333        0.4900           0.3333   1.0      32              on         True   

@malayparida2000 (Contributor, Author) commented Feb 28, 2025

Wrote around 110 GB (10×100 MB, 10×1 GB, 10×10 GB) of data each on RBD & CephFS.

sh-5.1$ ceph osd pool autoscale-status
POOL                                          SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK   
.mgr                                        576.5k                3.0         6144G  0.0000                                  1.0       1              on         False  
ocs-storagecluster-cephblockpool            108.5G                3.0         6144G  0.3333        0.4900           0.3333   1.0      32              on         True   
ocs-storagecluster-cephfilesystem-metadata  162.7M                3.0         6144G  0.3333        0.4900           0.3333   4.0     128              on         True   
ocs-storagecluster-cephfilesystem-data0     108.3G                3.0         6144G  0.3333        0.4900           0.3333   1.0      32              on         True 

@travisn Why do we have a BIAS of 4 for the cephFS metadata pool? As a result it ends up having 4X amount of PGs.

@travisn (Contributor) commented Feb 28, 2025

@travisn Why do we have a BIAS of 4 for the cephFS metadata pool? As a result it ends up having 4X amount of PGs.

This must be a default that is set for the cephfs metadata pool. @batrick Is this expected? I didn't realize either that the cephfs metadata pool should have so many PGs.

cephBlockPool.Spec.PoolSpec = storageCluster.Spec.ManagedResources.CephBlockPools.PoolSpec

// Set default values in the poolSpec as necessary
setDefaultDataPoolSpec(&cephBlockPool.Spec.PoolSpec, storageCluster)
cephBlockPool.Spec.PoolSpec.EnableRBDStats = true

// The bulk flag is set to true only during new pool creation, as setting it on existing pools can cause data movement.

IMO, it would be helpful to add a note here that explains that this condition should only be possible when the CephBlockPool doesn't already exist, meaning this struct is initialized only in code.

@BlaineEXE (Contributor) left a comment

I only see a single unit test check for this, regarding CephObjectStores. IMO, it would be best to check that bulk is set on new objects and not set on existing objects for all resource kinds.

Comment on lines 49 to 60
expectedCos, err := reconciler.newCephObjectStoreInstances(cr, nil)
expectedCos[0].Spec.MetadataPool.Parameters = map[string]string{
	"bulk": "true",
}
expectedCos[0].Spec.DataPool.Parameters = map[string]string{
	"bulk": "true",
}

assert.NoError(t, err)

Is there a test that checks that bulk isn't applied to already-existing stores? That is also important behavior.

@malayparida2000 (Contributor, Author)

Will add soon

@malayparida2000 (Contributor, Author)

Added detailed unit tests
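
For illustration only; the tests actually added in the PR are not shown in this excerpt. In the same package as the simplified setBulkOnNewPool helper sketched earlier, a new-vs-existing check could look roughly like this:

// TestBulkOnlyOnNewPools checks that bulk is added for a brand-new pool and
// never injected into a pool that already exists without it.
func TestBulkOnlyOnNewPools(t *testing.T) {
	newParams := setBulkOnNewPool(nil, true)
	if newParams["bulk"] != "true" {
		t.Errorf("expected bulk=true on a new pool, got %q", newParams["bulk"])
	}

	existing := map[string]string{"pg_num": "16"}
	updated := setBulkOnNewPool(existing, false)
	if _, ok := updated["bulk"]; ok {
		t.Error("bulk must not be added to an already-existing pool")
	}
}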

@BlaineEXE (Contributor)

Any response to this? #2980 (review)

@malayparida2000 (Contributor, Author) commented Mar 3, 2025

I would recommend having some mechanism for enabling/disabling this behavior.

Now all the metadata/data pools' poolSpecs are configurable via the StorageCluster CR, so bulk can be added to or removed from the poolSpec parameters on demand by customers. This can achieve the A/B testing the perf team might want to do.

Travis is also asking an important question here.

The RGW & CephFS metadata pools would have the bulk flag by default with this change.

@malayparida2000 (Contributor, Author)

@travisn @BlaineEXE Can you please review the changes?
Also, with Kyle's suggestions here #2980 (comment), should we also look at the mon_target_pg_per_osd parameter?

@BlaineEXE (Contributor) commented Mar 10, 2025

@travisn @BlaineEXE Can you please review the changes Also With Kyle's suggestions here #2980 (comment) Should we also look at the mon_target_pg_per_osd parameter.

[reworded and edited for clarity]

The most definitive source for what recommendations should be is from Anthony's proposed updates here: https://github.com/ceph/ceph/pull/60492/files

I see updates to both mon_max_pg_per_osd (500) and mon_target_pg_per_osd (200) in that PR. I think it is safe to say that ODF should update the bootstrapped rook-config-override to use mon_target_pg_per_osd = 200 as long as we can do so only for new ODF clusters -- we don't want existing clusters to be affected.

Here (below) is where ODF currently sets mon_max_pg_per_osd. I think we can add mon_target_pg_per_osd here as well. This value may risk data movement in existing clusters if we are overwriting it in the rook-config-override configmap. Does ocs-operator only set that configmap once during first instantiation, or does it update it every ODF release?

Malay, I think your change to make it so pg_num/pgp_num are not set to 16 when we are using the bulk flag makes sense. I think setting them to 16 is what Kyle is referring to as "paying in advance", and I'm assuming we did this to prevent these from getting super low numbers with the non-bulk autoscale behavior. With bulk and a higher target pg/osd, I assume they won't be necessary.

@malayparida2000 (Contributor, Author)

Does ocs-operator only set that configmap once during first instantiation, or does it update it every ODF release?

Yes, currently the behaviour is that ocs-operator reconciles the configmap & keeps it updated.

I think it is safe to say that ODF should update the bootstrapped rook-config-override to use mon_target_pg_per_osd = 200 as long as we can do so only for new ODF clusters -- we don't want existing clusters to be effected.

Right, we are already setting mon_max_pg_per_osd to 600, so for that we don't need to do anything more. I will work on a separate PR to set mon_target_pg_per_osd = 200.

your change to make it so pg_num/pgp_num are not set to 16 when we are using the bulk flag makes sense

That change actually applies only to replica-1 Ceph block pools; these pools previously had issues with autoscaling, so we used to set their pg_num to 16. Now, with the bulk flag, those settings won't be necessary anymore.
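
For reference, a sketch of what the rook-config-override bootstrap discussed above could end up carrying for new clusters. The mon_max_pg_per_osd value of 600 is what ODF already sets per the comment above; mon_target_pg_per_osd = 200 is the proposed addition, and the Go constant is purely illustrative:

// Hypothetical ceph.conf content for the rook-config-override ConfigMap on
// new clusters only (not the PR's actual code).
const configOverride = `[global]
mon_max_pg_per_osd = 600
mon_target_pg_per_osd = 200
`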
