Set bulk flag to true on intended pools for performance gain #2980

Open
wants to merge 2 commits into base: main

Conversation

@malayparida2000 (Contributor) commented Jan 28, 2025

The bulk flag makes the autoscaler start a pool with the maximum number of PGs, and the autoscaler then decreases the PG count only if the PG usage starts to skew too much. This could improve performance for ODF users due to the greater amount of parallelism during reads/writes on large clusters.

For now, we are going to set the bulk flag on the default cephBlockPool & replica-1 cephBlockPools, the default cephFS metadata & data pools, and the default cephObjectStore metadata & data pools.

The flag is set to true only during new pool creation, as setting it on existing pools can cause data movement.

Ref: https://issues.redhat.com/browse/RHSTOR-6774
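
A minimal sketch of the intent described above, for illustration only (the helper name setBulkOnNewPool and the isNewPool flag are not from the PR; the operator's real types and flow differ):

package main

import "fmt"

// setBulkOnNewPool adds the "bulk" parameter only when the pool is being
// created for the first time, so existing pools are never mutated and no
// data movement is triggered.
func setBulkOnNewPool(params map[string]string, isNewPool bool) map[string]string {
	if params == nil {
		params = map[string]string{}
	}
	if isNewPool {
		params["bulk"] = "true"
	}
	return params
}

func main() {
	fmt.Println(setBulkOnNewPool(nil, true))                                // map[bulk:true]
	fmt.Println(setBulkOnNewPool(map[string]string{"pg_num": "16"}, false)) // existing pool left unchanged
}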


openshift-ci bot commented Jan 28, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: malayparida2000
Once this PR has been reviewed and has the lgtm label, please assign nb-ohad for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -173,6 +173,7 @@ func (r *StorageClusterReconciler) newCephObjectStoreInstances(initData *ocsv1.S
EnableCrushUpdates: true,
FailureDomain: initData.Status.FailureDomain,
Replicated: generateCephReplicatedSpec(initData, "data"),
Parameters: map[string]string{"bulk": "true"},
},
MetadataPool: cephv1.PoolSpec{

@BlaineEXE Should we be setting the bulk flag on the object and filesystem metadata pools, in addition to the data pools?

@BlaineEXE (Contributor) commented Jan 29, 2025

Anthony and Kyle seemed to be saying yes when we chatted. It sounds a bit counterintuitive to me as well. My best guess is that the performance gains from increased parallelism must end up being a bigger positive effect than the performance loss from having info split into more chunks.

It might be good if we could test the performance effects of bulk on the meta pools to see if there's an impact either way.

@mmgaggle (Contributor) commented Mar 6, 2025

Having more PGs will help ensure more uniform data distribution and will mean there are more targets (PGs) across which to perform work. This is detailed here:

https://docs.ceph.com/en/pacific/rados/operations/placement-groups/#object-distribution-within-a-pool

As long as there are one or two orders of magnitude more Placement Groups than OSDs, the distribution should be even. For instance, 256 placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs etc.

The number of PGs in a pool does not impact the size of the objects; that is controlled through:

RBD:

  • rbd_default_order
  • --object-size

RGW:

  • rgw_obj_stripe_size

CephFS (file layout):

  • object_size

One way to approach the problem is to think in terms of having a budget for PGs. This budget is guided by mon_target_pg_per_osd, whose default value is 100; personally, I try to aim for 200-250 and might set this to either 200 or 225. Target ratios are used to proportion that budget, and with bulk the PGs will be created on pools according to those proportions. Using the budget analogy, this means you're putting all your capital to work, but it also means that you're now in a Pareto-efficient situation: if you create a new pool, then the absolute number of PGs for pre-existing pools needs to be reduced.

If you know ahead of time what pools you're going to have in the cluster, it's reasonable to "pay in full" for your pools in advance so that you don't need to keep making payments over time (potentially disruptive to your applications). By doing this you can reap the performance advantages immediately. If you don't use bulk you can avoid the need to reduce PGs in existing pools, but you still need to increase PGs as the capacity used grows, which has a similar disruption potential. Since you can't avoid the disruption potential either way in this decision framework, it's better to reap the benefits of a "pay in full" strategy.
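
To make the budget arithmetic concrete, here is a rough sketch in Go. It is not the mgr autoscaler's exact algorithm, and the function names are illustrative, but the numbers line up with the autoscale-status output shown later in this thread assuming a 3-OSD cluster with the default mon_target_pg_per_osd of 100:

package main

import (
	"fmt"
	"math"
)

// pgBudget is the total PG budget: mon_target_pg_per_osd times the OSD count.
func pgBudget(targetPGPerOSD, osds int) float64 {
	return float64(targetPGPerOSD * osds)
}

// poolPGs proportions the budget by the pool's target_size_ratio (normalised
// against the sum of all ratios), divides by the replica size, and rounds to
// the nearest power of two.
func poolPGs(budget, ratio, ratioSum float64, replicaSize int) int {
	raw := budget * (ratio / ratioSum) / float64(replicaSize)
	return 1 << int(math.Round(math.Log2(raw)))
}

func main() {
	// Three pools, each with target_size_ratio 0.49 (effective ratio 0.3333)
	// and 3x replication, on 3 OSDs with the default target of 100 PGs/OSD:
	fmt.Println(poolPGs(pgBudget(100, 3), 0.49, 3*0.49, 3)) // 32, as seen for the data pools
	// Raising mon_target_pg_per_osd to 200 doubles the budget:
	fmt.Println(poolPGs(pgBudget(200, 3), 0.49, 3*0.49, 3)) // 64
}

(The CephFS metadata pool in that output lands at 128 rather than 32 because of its bias of 4.)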

@malayparida2000 (Contributor, Author)

Thanks @mmgaggle for such a detailed analysis.

@malayparida2000 (Contributor, Author)

/cc @travisn @BlaineEXE

@openshift-ci openshift-ci bot requested review from BlaineEXE and travisn January 29, 2025 12:36
@BlaineEXE (Contributor) left a comment

Are there plans to allow this to be configured via any means? I trust Kyle and Anthony implicitly, but we still don't have a measure of how the bulk flag impacts ODF specifically. Further, we don't know whether ODF's common 3-OSD/4-OSD clusters will be impacted positively or negatively by this change. My vague understanding is that 3-/4-OSD clusters are much less common for RHCS than for ODF, and my intuition says that cluster size may play an important part in the performance impacts.

I would recommend having some mechanism for enabling/disabling this behavior. We don't have to document it for users, but I think it will still be useful to us internally. That configurability will make it easier for the performance team to do A:B testing to quantify the performance changes, and it will allow us an easy way to disable this change close to the ODF release if we have to, by changing the "business logic" instead of the code. It'll also give us the ability to disable/enable it for customer environments in cases where that is important to them.

Travis is also asking an important question here. His and my intuition both suggest that the RGW metadata pools might respond to the bulk flag differently than other pools. Thus, I would also suggest allowing the CephObjectStore metadata pools to have bulk enabled/disabled independently from the rest of the cluster. This is essentially for the same reasons as above.

Here's my guess about what we will find with bulk flag performance testing:

  • I think we will find that the bulk flag has no impact on 3-OSD clusters on the whole
  • I suspect there will be noticeable positive impact on clusters with 9-12 or more OSDs
  • There is a risk we find S3 metadata operation slowdown with 3-OSD clusters, possibly resulting in minor S3 I/O loss until the cluster has 6-9 OSDs, where we will break even and then see gradual improvements

@malayparida2000 (Contributor, Author)

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 30, 2025
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 16, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 27, 2025
@malayparida2000 malayparida2000 changed the title Use bulk flag for all odf data pools for performance gain Set bulk flag to true on intended pools for performance gain Feb 28, 2025
// Ensures the bulk flag set during new pool creation is not removed during updates.
preserveBulkFlagParameter(existingParameters, &poolSpec.Parameters)

// set the pg_num & pgp_num parameters only when bulk flag is not set or false
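
For readers without the full diff, a plausible sketch of what the preserveBulkFlagParameter helper referenced above might do; the PR's actual implementation may differ:

// preserveBulkFlagParameter carries an existing "bulk" parameter over into
// the desired pool spec, so that a later update never silently drops a flag
// that was set at pool creation time.
func preserveBulkFlagParameter(existingParameters map[string]string, desiredParameters *map[string]string) {
	val, ok := existingParameters["bulk"]
	if !ok {
		return
	}
	if *desiredParameters == nil {
		*desiredParameters = map[string]string{}
	}
	(*desiredParameters)["bulk"] = val
}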
@malayparida2000 (Contributor, Author)

@travisn what's the behaviour of Ceph when bulk is set to true and pg_num/pgp_num are also set to 16? Do we have to selectively exclude setting these two parameters when the bulk flag is true?

@malayparida2000 (Contributor, Author)

Did some testing on a replica-1 cluster & found that when the bulk flag is true, the pg_num/pgp_num settings are ignored.

~ $  oc get cephblockpool ocs-storagecluster-cephblockpool-us-east-1a -o yaml | yq '.spec.parameters'
bulk: "true"
pg_num: "16"
pgp_num: "16"
sh-5.1$ ceph osd pool autoscale-status
POOL                                           SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK   
.mgr                                         576.5k                3.0         6144G  0.0000                                  1.0       1              on         False  
ocs-storagecluster-cephblockpool             65219k                3.0         6144G  0.3333        0.4900           0.3333   1.0      32              on         True   
ocs-storagecluster-cephfilesystem-metadata   39776                 3.0         6144G  0.3333        0.4900           0.3333   4.0     128              on         True   
ocs-storagecluster-cephfilesystem-data0          0                 3.0         6144G  0.3333        0.4900           0.3333   1.0      32              on         True   
ocs-storagecluster-cephblockpool-us-east-1a     19                 1.0         2048G  0.0000                                  1.0     128              on         True   
ocs-storagecluster-cephblockpool-us-east-1b     19                 1.0         2048G  0.0000                                  1.0     128              on         True   
ocs-storagecluster-cephblockpool-us-east-1c     19                 1.0         2048G  0.0000                                  1.0     128              on         True   

@mmgaggle (Contributor)

I think my preference would be to employ a strategy like the one outlined here that uses target_ratio on each pool instead of defining the number of PGs. I don't think bulk will do anything without either target_ratio or target-size, and if you supplied one of those with a static number of PGs the autoscaler would adjust the quantity.

#2980 (comment)

@malayparida2000 (Contributor, Author)

Right @mmgaggle. Specifying pg_num as 16 was only for a specific kind of replica-1 Ceph block pool where we had issues with autoscaling.
For all the data pools we create, we specify the target size ratio as 0.49.

@malayparida2000 (Contributor, Author) commented Feb 28, 2025

Some basic testing on a fresh 4.19 cluster; I will do detailed testing after some reviews.

Bulk flag presence on pools

~ $ oc get cephblockpool ocs-storagecluster-cephblockpool -o yaml | yq '.spec.parameters'

bulk: "true"

~ $ oc get cephfilesystem ocs-storagecluster-cephfilesystem -o yaml | yq '.spec.dataPools[].parameters'

bulk: "true"

~ $ oc get cephfilesystem ocs-storagecluster-cephfilesystem -o yaml | yq '.spec.metadataPool.parameters'

bulk: "true"

sh-5.1$ ceph osd pool autoscale-status
POOL                                          SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK   
.mgr                                        576.5k                3.0         6144G  0.0000                                  1.0       1              on         False  
ocs-storagecluster-cephblockpool            66071k                3.0         6144G  0.3333        0.4900           0.3333   1.0      32              on         True   
ocs-storagecluster-cephfilesystem-metadata  25586                 3.0         6144G  0.3333        0.4900           0.3333   4.0     128              on         True   
ocs-storagecluster-cephfilesystem-data0         0                 3.0         6144G  0.3333        0.4900           0.3333   1.0      32              on         True   

@malayparida2000 (Contributor, Author) commented Feb 28, 2025

Wrote around 110 GB (10×100 MB, 10×1 GB, 10×10 GB) of data each on RBD & CephFS.

sh-5.1$ ceph osd pool autoscale-status
POOL                                          SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK   
.mgr                                        576.5k                3.0         6144G  0.0000                                  1.0       1              on         False  
ocs-storagecluster-cephblockpool            108.5G                3.0         6144G  0.3333        0.4900           0.3333   1.0      32              on         True   
ocs-storagecluster-cephfilesystem-metadata  162.7M                3.0         6144G  0.3333        0.4900           0.3333   4.0     128              on         True   
ocs-storagecluster-cephfilesystem-data0     108.3G                3.0         6144G  0.3333        0.4900           0.3333   1.0      32              on         True 

@travisn Why do we have a BIAS of 4 for the cephFS metadata pool? As a result it ends up having 4X amount of PGs.

@travisn (Contributor) commented Feb 28, 2025

@travisn Why do we have a BIAS of 4 for the cephFS metadata pool? As a result it ends up having 4X amount of PGs.

This must be a default that is set for the cephfs metadata pool. @batrick Is this expected? I didn't realize either that the cephfs metadata pool should have so many PGs.

cephBlockPool.Spec.PoolSpec = storageCluster.Spec.ManagedResources.CephBlockPools.PoolSpec

// Set default values in the poolSpec as necessary
setDefaultDataPoolSpec(&cephBlockPool.Spec.PoolSpec, storageCluster)
cephBlockPool.Spec.PoolSpec.EnableRBDStats = true

// The bulk flag is set to true only during new pool creation, as setting it on existing pools can cause data movement.

IMO, it would be helpful to add a note here that explains that this condition should only be possible when the CephBlockPool doesn't already exist, meaning this struct is initialized only in code.

@BlaineEXE (Contributor) left a comment

I only see a single unit test check for this, regarding CephObjectStores. IMO, it would be best to check that bulk is set on new objects and not set on existing objects for all resource kinds.

Comment on lines 49 to 60
expectedCos, err := reconciler.newCephObjectStoreInstances(cr, nil)
expectedCos[0].Spec.MetadataPool.Parameters = map[string]string{
	"bulk": "true",
}
expectedCos[0].Spec.DataPool.Parameters = map[string]string{
	"bulk": "true",
}

assert.NoError(t, err)

Is there a test that checks that bulk isn't applied to already-existing stores? That is also important behavior.

@malayparida2000 (Contributor, Author)

Will add soon

@malayparida2000 (Contributor, Author)

Added detailed unit tests
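
For illustration only; the tests actually added in the PR are not shown in this excerpt. In the same package as the simplified setBulkOnNewPool helper sketched earlier, a new-vs-existing check could look roughly like this:

// TestBulkOnlyOnNewPools checks that bulk is added for a brand-new pool and
// never injected into a pool that already exists without it.
func TestBulkOnlyOnNewPools(t *testing.T) {
	newParams := setBulkOnNewPool(nil, true)
	if newParams["bulk"] != "true" {
		t.Errorf("expected bulk=true on a new pool, got %q", newParams["bulk"])
	}

	existing := map[string]string{"pg_num": "16"}
	updated := setBulkOnNewPool(existing, false)
	if _, ok := updated["bulk"]; ok {
		t.Error("bulk must not be added to an already-existing pool")
	}
}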

@BlaineEXE (Contributor)

Any response to this? #2980 (review)

@malayparida2000 (Contributor, Author) commented Mar 3, 2025

I would recommend having some mechanism for enabling/disabling this behavior.

Now all the metadata/data pools' poolSpecs are configurable via the StorageCluster CR, so bulk can be added to or removed from the poolSpec parameters on demand by customers. This can achieve the A/B testing the perf team might want to do.

Travis is also asking an important question here.

The RGW & CephFS metadata pools would have the bulk flag by default with this change.

@malayparida2000 (Contributor, Author)

@travisn @BlaineEXE Can you please review the changes?
Also, with Kyle's suggestions here #2980 (comment), should we also look at the mon_target_pg_per_osd parameter?

@BlaineEXE (Contributor) commented Mar 10, 2025

@travisn @BlaineEXE Can you please review the changes Also With Kyle's suggestions here #2980 (comment) Should we also look at the mon_target_pg_per_osd parameter.

[reworded and edited for clarity]

The most definitive source for what recommendations should be is from Anthony's proposed updates here: https://github.com/ceph/ceph/pull/60492/files

I see updates to both mon_max_pg_per_osd (500) and mon_target_pg_per_osd (200) in that PR. I think it is safe to say that ODF should update the bootstrapped rook-config-override to use mon_target_pg_per_osd = 200 as long as we can do so only for new ODF clusters -- we don't want existing clusters to be affected.

Here (below) is where ODF currently sets mon_max_pg_per_osd. I think we can add mon_target_pg_per_osd here as well. This value may risk data movement in existing clusters if we are overwriting it in the rook-config-override configmap. Does ocs-operator only set that configmap once during first instantiation, or does it update it every ODF release?

Malay, I think your change to make it so pg_num/pgp_num are not set to 16 when we are using the bulk flag makes sense. I think setting them to 16 is what Kyle is referring to as "paying in advance", and I'm assuming we did this to prevent these from getting super low numbers with the non-bulk autoscale behavior. With bulk and a higher target pg/osd, I assume they won't be necessary.

@malayparida2000 (Contributor, Author)

Does ocs-operator only set that configmap once during first instantiation, or does it update it every ODF release?

Yes, currently the behaviour is that ocs-operator reconciles the configmap & keeps it updated.

I think it is safe to say that ODF should update the bootstrapped rook-config-override to use mon_target_pg_per_osd = 200 as long as we can do so only for new ODF clusters -- we don't want existing clusters to be effected.

Right, we are already setting mon_max_pg_per_osd to 600, so for that we don't need to do anything more. I will work on a separate PR to set mon_target_pg_per_osd = 200.

your change to make it so pg_num/pgp_num are not set to 16 when we are using the bulk flag makes sense

That change actually applies only to replica-1 Ceph block pools; these pools previously had issues with autoscaling, so we used to set their pg_num to 16. Now, with the bulk flag, those settings won't be necessary anymore.
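
For reference, a sketch of what the rook-config-override bootstrap discussed above could end up carrying for new clusters. The mon_max_pg_per_osd value of 600 is what ODF already sets per the comment above; mon_target_pg_per_osd = 200 is the proposed addition, and the Go constant is purely illustrative:

// Hypothetical ceph.conf content for the rook-config-override ConfigMap on
// new clusters only (not the PR's actual code).
const configOverride = `[global]
mon_max_pg_per_osd = 600
mon_target_pg_per_osd = 200
`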
