[ceph-osd] improve osdfull reason and suggest workaround
There's been evidence that bdev_async_discard was the root
cause of this issue, so anyone encountering this problem
should disable it.

Signed-off-by: Ponnuvel Palaniyappan <[email protected]>
pponnuvel committed Oct 4, 2024
1 parent cd66455 commit ffadd2e
Showing 2 changed files with 22 additions and 10 deletions.
16 changes: 11 additions & 5 deletions hotsos/defs/scenarios/storage/ceph/ceph-mon/osd_unusual_raw.yaml
@@ -9,11 +9,17 @@ conclusions:
raises:
type: CephOSDWarning
message: >-
-Found OSD(s) {bad_osds} with larger raw usage size than the combined
-data+meta+omap usage. While a certain discrepancy is to be expected due to
-Ceph's using space not accounted by data+meta+omap columns, these are more
-than {limit}% and potentially indicate a bug in Ceph. If these OSDs appear
-full or misbehave, please restart them and possibly file a bug in Ceph tracker.
+Found OSD(s) {bad_osds} with larger raw usage size than data+meta+omap
+combined. While a discrepancy is to be expected due to Ceph using space
+not accounted by data+meta+omap columns, these are greater than {limit}%
+and likely indicates high discard ops sent to the disk which is often
+the case for workloads with frequent rewrites.
+If these OSDs appear full or misbehave please restart them.
+If the problem persists (i.e. OSD restarts do not help) you should disable
+bdev_async_discard for OSDs. For charmed Ceph, this option is controlled
+via bdev-enable-discard flag which should be set to 'disable'.
format-dict:
bad_osds: '@checks.osds_have_unusual_raw_usage.requires.value_actual:comma_join'
limit: hotsos.core.plugins.storage.ceph.CephCluster.OSD_DISCREPANCY_ALLOWED
@@ -115,8 +115,14 @@ data-root:
- sos_commands/systemd/systemctl_list-unit-files
raised-issues:
CephOSDWarning: >-
-Found OSD(s) osd.2 with larger raw usage size than the combined
-data+meta+omap usage. While a certain discrepancy is to be expected due to
-Ceph's using space not accounted by data+meta+omap columns, these are more
-than 5% and potentially indicate a bug in Ceph. If these OSDs appear
-full or misbehave, please restart them and possibly file a bug in Ceph tracker.
+Found OSD(s) osd.2 with larger raw usage size than data+meta+omap
+combined. While a discrepancy is to be expected due to Ceph using space
+not accounted by data+meta+omap columns, these are greater than 5%
+and likely indicates high discard ops sent to the disk which is often
+the case for workloads with frequent rewrites.
+If these OSDs appear full or misbehave please restart them.
+If the problem persists (i.e. OSD restarts do not help) you should disable
+bdev_async_discard for OSDs. For charmed Ceph, this option is controlled
+via bdev-enable-discard flag which should be set to 'disable'.
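
For readers unfamiliar with the check behind this warning, the following is a minimal Python sketch of the idea the message describes: flag an OSD when its raw usage exceeds the sum of its data, meta and omap usage by more than an allowed percentage. It is not the hotsos implementation. The `OSDUsage` class, the helper names, the sample figures, and the choice to measure the excess relative to data+meta+omap are all assumptions made for illustration; the limit of 5 is inferred from the "5%" shown in the expected test output of this commit.

```python
from dataclasses import dataclass

# Assumed value: the expected test output in this commit renders the limit as 5%.
OSD_DISCREPANCY_ALLOWED = 5


@dataclass
class OSDUsage:
    """Hypothetical per-OSD usage figures, all in the same unit (e.g. GiB)."""
    name: str
    raw_use: float
    data: float
    omap: float
    meta: float


def has_unusual_raw_usage(osd, limit_pct=OSD_DISCREPANCY_ALLOWED):
    """Return True if raw usage exceeds data+meta+omap by more than limit_pct.

    Assumption: the excess is measured relative to the accounted total.
    """
    accounted = osd.data + osd.meta + osd.omap
    if accounted <= 0:
        # Nothing is accounted for, so any raw usage counts as a discrepancy.
        return osd.raw_use > 0
    excess_pct = (osd.raw_use - accounted) / accounted * 100
    return excess_pct > limit_pct


def find_bad_osds(osds, limit_pct=OSD_DISCREPANCY_ALLOWED):
    """Return the names of OSDs whose raw usage looks unusual."""
    return [o.name for o in osds if has_unusual_raw_usage(o, limit_pct)]


# Made-up sample figures: osd.0 is 2% over its accounted usage, osd.2 is 50% over.
osds = [
    OSDUsage("osd.0", raw_use=102.0, data=95.0, omap=1.0, meta=4.0),
    OSDUsage("osd.2", raw_use=150.0, data=95.0, omap=1.0, meta=4.0),
]
bad = find_bad_osds(osds)
print(", ".join(bad))
```

The comma-joined result stands in for the `{bad_osds}` placeholder in the message template, and `OSD_DISCREPANCY_ALLOWED` for `{limit}`.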
