Set bulk flag to true on intended pools for performance gain #2980
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: malayparida2000.
@@ -173,6 +173,7 @@ func (r *StorageClusterReconciler) newCephObjectStoreInstances(initData *ocsv1.S
				EnableCrushUpdates: true,
				FailureDomain:      initData.Status.FailureDomain,
				Replicated:         generateCephReplicatedSpec(initData, "data"),
				Parameters:         map[string]string{"bulk": "true"},
			},
			MetadataPool: cephv1.PoolSpec{
@BlaineEXE Should we be setting the bulk flag on the object and filesystem metadata pools, in addition to the data pools?
Anthony and Kyle seemed to be saying yes when we chatted. It sounds a bit counterintuitive to me as well. My best guess is that the performance gains from increased parallelism must end up being a bigger positive effect than the performance loss from having info split into more chunks.
It might be good if we could test the performance effects of bulk on the meta pools to see if there's an impact either way.
Having more PGs will help ensure more uniform data distribution and will mean there are more targets (PGs) on which to perform work. This is detailed here:
As long as there are one or two orders of magnitude more Placement Groups than OSDs, the distribution should be even. For instance, 256 placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs etc.
The number of PGs for a pool does not impact the size of the objects; that is controlled through:
- RBD: rbd_default_order / --object-size
- RGW: rgw_obj_stripe_size
- CephFS (file layout): object_size
One way to approach the problem is to think in terms of having a budget for PGs. This budget is guided by mon_target_pg_per_osd, whose default value is 100. Personally, I try to aim for 200-250 and might set this to either 200 or 225. Target ratios are used to proportion that budget, and with bulk the PGs will be created on pools according to those proportions. Using the budget analogy, this means you're putting all your capital to work, but it also means that you're now in a Pareto-efficient situation: if you create a new pool, the absolute number of PGs for pre-existing pools needs to be reduced.
If you know ahead of time what pools you're going to have in the cluster, it's reasonable to "pay in full" for your pools in advance so that you don't need to keep making payments over time (potentially disruptive to your applications). By doing this you can reap the performance advantages immediately. If you don't use bulk you can avoid the need to reduce PGs in existing pools, but you still need to increase PGs as the capacity used grows, which has a similar disruption potential. In this decision framework you can't avoid the disruption potential either way, so it's better to reap the benefits of a "pay in full" strategy.
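To make the budget analogy concrete, here is a small illustrative Go sketch (not from the PR; the OSD count, pool names, and ratios are made up) of how a PG budget guided by mon_target_pg_per_osd could be apportioned by target ratios and rounded to a power of two:

```go
package main

import (
	"fmt"
	"math"
)

// Illustrative only: approximates how a PG budget
// (osds * mon_target_pg_per_osd) might be apportioned across pools
// by target ratio and rounded to the nearest power of two. The real
// pg_autoscaler logic is more involved; numbers here are hypothetical.
func main() {
	const osds = 10
	const targetPGPerOSD = 200 // mon_target_pg_per_osd (the default is 100)
	budget := float64(osds * targetPGPerOSD)

	ratios := map[string]float64{
		"rbd-data":    0.49,
		"cephfs-data": 0.49,
		"cephfs-meta": 0.02,
	}
	for pool, ratio := range ratios {
		raw := budget * ratio / 3 // divide by replica size 3
		pgNum := int(math.Exp2(math.Round(math.Log2(raw))))
		fmt.Printf("%-12s ratio=%.2f -> pg_num≈%d\n", pool, ratio, pgNum)
	}
}
```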
Thanks @mmgaggle for such a detailed analysis.
(force-pushed from b521570 to bd84812)
/cc @travisn @BlaineEXE
Are there plans to allow this to be configured via any means? I trust Kyle and Anthony implicitly, but we still don't have a measure of how the bulk flag impacts ODF specifically. Further, we don't know whether ODF's common 3-OSD/4-OSD clusters will be impacted positively or negatively by this change. My vague understanding is that 3-/4-OSD clusters are much less common for RHCS than for ODF, and my intuition says that cluster size may play an important part in the performance impacts.
I would recommend having some mechanism for enabling/disabling this behavior. We don't have to document it for users, but I think it will still be useful to us internally. That configurability will make it easier for the performance team to do A:B testing to quantify the performance changes, and it will give us an easy way to disable this change close to ODF release if we have to, by changing the "business logic" instead of the code. It'll also give us the ability to disable/enable it for customer environments in cases where that is important to them.
Travis is also asking an important question here. His and my intuition both suggest that the RGW metadata pools might respond to the bulk flag differently than other pools. Thus, I would also suggest allowing the CephObjectStore metadata pools to have bulk enabled/disabled independently from the rest of the cluster, essentially for the same reasons above.
Here's my guess about what we will find with bulk flag performance testing:
- I think we will find that the bulk flag has no impact on 3-OSD clusters on the whole
- I suspect there will be noticeable positive impact on clusters with 9-12 or more OSDs
- There is a risk we find S3 metadata operation slowdown with 3-OSD clusters, possibly resulting in minor S3 I/O loss until the cluster has 6-9 OSDs, where we will break even and then see gradual improvements
/hold
(force-pushed from bd84812 to e0450e0, then to 16ea320)
	// Ensures the bulk flag set during new pool creation is not removed during updates.
	preserveBulkFlagParameter(existingParameters, &poolSpec.Parameters)

	// set the pg_num & pgp_num parameters only when bulk flag is not set or false
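The implementations aren't shown in this excerpt; here is a hypothetical Go sketch of what these two pieces might look like, inferred only from the comments above (not the PR's actual code):

```go
// Hypothetical sketch inferred from the diff comments; not the PR's
// actual implementation. preserveBulkFlagParameter carries a bulk flag
// that was set at pool creation forward into the desired spec, so a
// later update doesn't silently drop it (removing bulk from an existing
// pool could shrink PGs and cause data movement).
func preserveBulkFlagParameter(existing map[string]string, desired *map[string]string) {
	v, ok := existing["bulk"]
	if !ok {
		return
	}
	if *desired == nil {
		*desired = map[string]string{}
	}
	if _, set := (*desired)["bulk"]; !set {
		(*desired)["bulk"] = v
	}
}

// bulkEnabled mirrors the gating described in the comment above: when
// bulk is "true" the autoscaler owns PG counts, so static
// pg_num/pgp_num values should not be applied.
func bulkEnabled(params map[string]string) bool {
	return params["bulk"] == "true"
}
```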
@travisn what's the behaviour of ceph when bulk is set to true and pg_num/pgp_num are also set to 16? Do we have to selectively exclude setting these two parameters when the bulk flag is true?
Did some testing on a replica-1 cluster and found that when the bulk flag is true, the pg_num/pgp_num settings are ignored:
~ $ oc get cephblockpool ocs-storagecluster-cephblockpool-us-east-1a -o yaml | yq '.spec.parameters'
bulk: "true"
pg_num: "16"
pgp_num: "16"
sh-5.1$ ceph osd pool autoscale-status
POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE BULK
.mgr 576.5k 3.0 6144G 0.0000 1.0 1 on False
ocs-storagecluster-cephblockpool 65219k 3.0 6144G 0.3333 0.4900 0.3333 1.0 32 on True
ocs-storagecluster-cephfilesystem-metadata 39776 3.0 6144G 0.3333 0.4900 0.3333 4.0 128 on True
ocs-storagecluster-cephfilesystem-data0 0 3.0 6144G 0.3333 0.4900 0.3333 1.0 32 on True
ocs-storagecluster-cephblockpool-us-east-1a 19 1.0 2048G 0.0000 1.0 128 on True
ocs-storagecluster-cephblockpool-us-east-1b 19 1.0 2048G 0.0000 1.0 128 on True
ocs-storagecluster-cephblockpool-us-east-1c 19 1.0 2048G 0.0000 1.0 128 on True
I think my preference would be to employ a strategy like the one outlined here that uses target_ratio on each pool instead of defining the number of PGs. I don't think bulk will do anything without either target_ratio or target-size, and if you supplied one of those with a static number of PGs, the autoscaler would adjust the quantity.
Right @mmgaggle. Specifying the PG num as 16 was only for a specific kind of replica-1 ceph block pool where we had issues with autoscaling.
For all the data pools we create, we specify the target size ratio as 0.49.
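Put together, a data pool spec carrying both knobs discussed here might look like the following sketch (field names taken from the diff above and rook's cephv1 types; the values are the defaults mentioned in this thread, not authoritative recommendations):

```go
import cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"

// Sketch combining the target size ratio (apportions the PG budget)
// with the bulk parameter (pre-creates PGs up front). Values mirror
// the defaults mentioned in this thread.
var dataPoolSpec = cephv1.PoolSpec{
	Replicated: cephv1.ReplicatedSpec{
		Size:            3,
		TargetSizeRatio: 0.49,
	},
	Parameters: map[string]string{"bulk": "true"},
}
```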
Some basic testing on a fresh 4.19 cluster; will do detailed testing after some reviews. Bulk flag presence on pools:

Wrote around 110 GB (10 × 100 MB, 10 × 1 GB, 10 × 10 GB) of data each on RBD & CephFS.

@travisn Why do we have a BIAS of 4 for the CephFS metadata pool? As a result it ends up having 4× the number of PGs.
	cephBlockPool.Spec.PoolSpec = storageCluster.Spec.ManagedResources.CephBlockPools.PoolSpec

	// Set default values in the poolSpec as necessary
	setDefaultDataPoolSpec(&cephBlockPool.Spec.PoolSpec, storageCluster)
	cephBlockPool.Spec.PoolSpec.EnableRBDStats = true

	// The bulk flag is set to true only during new pool creation, as setting it on existing pools can cause data movement.
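As a rough illustration of the create-only gating named in the comment above, a hypothetical helper (the name, signature, and error-handling shape are assumptions, not the PR's code):

```go
import apierrors "k8s.io/apimachinery/pkg/api/errors"

// Hypothetical illustration of create-only gating: bulk is requested
// only when the pool lookup returned NotFound, i.e. the pool CR is
// being created for the first time. Flipping bulk on an existing pool
// can trigger PG changes and data movement, so existing pools are
// left untouched.
func bulkParamsForPool(getErr error, params map[string]string) map[string]string {
	if !apierrors.IsNotFound(getErr) {
		return params // existing pool: leave parameters as they are
	}
	if params == nil {
		params = map[string]string{}
	}
	params["bulk"] = "true"
	return params
}
```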
IMO, it would be helpful to add a note here that explains that this condition should only be possible when the CephBlockPool doesn't already exist, meaning this struct is initialized only in code.
I only see a single unit test check for this, regarding CephObjectStores. IMO, it would be best to check that bulk is set on new objects and not set on existing objects for all resource kinds.
	expectedCos, err := reconciler.newCephObjectStoreInstances(cr, nil)
	expectedCos[0].Spec.MetadataPool.Parameters = map[string]string{
		"bulk": "true",
	}
	expectedCos[0].Spec.DataPool.Parameters = map[string]string{
		"bulk": "true",
	}

	assert.NoError(t, err)
Is there a test that checks that bulk isn't applied to already-existing stores? That is also important behavior.
Will add soon
Added detailed unit tests
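One hypothetical shape such a check could take, reusing the bulkParamsForPool sketch from earlier in this thread (illustrative only; not the PR's actual tests):

```go
import (
	"testing"

	"github.com/stretchr/testify/assert"
)

// Hypothetical test shape, not the PR's actual test code: a nil
// get-error simulates a pool that already exists, which must not
// gain the bulk flag on reconcile.
func TestBulkNotSetOnExistingPool(t *testing.T) {
	existing := map[string]string{"pg_num": "16"}
	got := bulkParamsForPool(nil, existing)
	assert.NotContains(t, got, "bulk")
}
```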
Any response to this? #2980 (review)
Now all the metadata/data pools' poolSpecs are configurable via the storageCluster CR, so bulk can be added/removed from the poolSpec parameters on demand by customers. This can achieve the A/B testing the perf team might want to do.
The RGW & CephFS metadata pools will have the bulk flag by default with this change.
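As an illustration, an opt-out through the override path visible in the diff above might look like this (a sketch; the exact CR surface is an assumption):

```go
// Sketch of an opt-out via the StorageCluster override path shown in
// the diff above (exact API shape assumed): disabling bulk on the
// default block pool, e.g. for A/B performance testing.
storageCluster.Spec.ManagedResources.CephBlockPools.PoolSpec.Parameters = map[string]string{
	"bulk": "false",
}
```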
(force-pushed from bef21d4 to 006b056)
@travisn @BlaineEXE Can you please review the changes?
[reworded and edited for clarity] The most definitive source for what the recommendations should be is Anthony's proposed updates here: https://github.com/ceph/ceph/pull/60492/files

I see updates to both. Here (below) is where ODF currently sets

Malay, I think your change to make it so
Yes, currently the behaviour is that the ocs-operator reconciles the configmap & keeps it updated.
Right, we are already setting the
That change is actually only for replica-1 ceph block pools; earlier these pools had issues with autoscaling, so we used to set the PG nums to 16. Now, with the bulk flag, that won't be necessary anymore.
The bulk flag makes the autoscaler start with the maximum number of PGs, and the autoscaler then decreases the PG count only if the PG usage starts to skew too much. This could improve performance for ODF users due to the greater amount of parallelism during reads/writes on large clusters.
We are going to set the bulk flag for now on the default cephBlockPool & replica-1 cephBlockpools, default cephFS metadata pool & data pools, default cephObjectStore metadata pool & data pool.
The flag is set to true only during new pool creation, as setting it on existing pools can cause data movement.
Ref: https://issues.redhat.com/browse/RHSTOR-6774