
Cache unexpectedly stops working #1588

Open
jvinolas opened this issue Nov 23, 2024 · 24 comments
Labels
bug Something isn't working

Comments

@jvinolas

Description

The cache had been working for some days, then we hit two high-iowait events that left it not working. The cache was caching and we could see dirty lines, but suddenly it stopped working: high iowait, and no core-dump messages in dmesg, only hung-task info messages.

The first event happened with a large amount of dirty data in the cache while switching from wb to wt. We tried to switch back to wb, but the high iowait from the first switch persisted and the cache never responded again. The only solution was to flush the cache, stop it, force-recreate it and start it again.
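
For reference, the recovery amounted to roughly the following (a sketch using the cache id and devices from the configuration below; exact flag names should be checked against `casadm --help`):

```sh
# Flush remaining dirty data, stop the cache, then force-recreate it and re-add the core.
casadm --flush-cache --cache-id 1
casadm --stop-cache --cache-id 1
casadm --start-cache --cache-device /dev/nvme0n1p1 --cache-mode wb \
       --cache-line-size 64 --cache-id 1 --force
casadm --add-core --cache-id 1 --core-device /dev/dm-4
```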

The second event happened around 3 AM today while we were rsyncing files to this cache; it suddenly failed, with high iowait again and hung-task info messages in dmesg.

Expected Behavior

Cache behaviour should be reasonably predictable and should not stop working.

Actual Behavior

High iowait, and the cache does not seem to work anymore.

Steps to Reproduce

We could reproduce the first event:

  1. High amount of dirty data, cache being written to, and switching from wb to wt.

Context

As the cache mount point is exported over NFS with clients connected to it, we had to stop all clients, stop the cache, reformat it and start everything again.

Possible Fix

Logs

This is what suddenly happened this morning while only doing rsyncs:

image_2024-11-23_08-36-41

[Sat Nov 23 03:00:03 2024] workqueue: blk_mq_run_work_fn hogged CPU for >10000us 5 times, consider switching to WQ_UNBOUND
[Sat Nov 23 03:38:06 2024] workqueue: do_global_cleanup [dm_bufio] hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
[Sat Nov 23 04:44:07 2024] workqueue: blk_mq_run_work_fn hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
[Sat Nov 23 05:32:23 2024] workqueue: do_global_cleanup [dm_bufio] hogged CPU for >10000us 11 times, consider switching to WQ_UNBOUND
[Sat Nov 23 07:11:04 2024] workqueue: do_global_cleanup [dm_bufio] hogged CPU for >10000us 19 times, consider switching to WQ_UNBOUND

And now it happened again:

image_2024-11-23_14-56-12

image_2024-11-23_14-56-38

This time with no log in dmesg.

Configuration files

Your Environment

  • OpenCAS version (commit hash or tag): c142610
  • Operating System: Ubuntu 24.04.1 LTS (Noble Numbat)
  • Kernel version: 6.9.12-060912-generic #202408281052 SMP PREEMPT_DYNAMIC
  • Cache device type (NAND/Optane/other): ram disk (wa) over nvme (wb)
  • Core device type (HDD/SSD/other): dm_vdo over an mdadm raid0 over iSCSI disks
  • Cache configuration:
# casadm -P -i 1
Cache Id                  1
Cache Size                877915904 [4KiB Blocks] / 3348.98 [GiB]
Cache Device              /dev/nvme0n1p1
Exported Object           -
Core Devices              1
Inactive Core Devices     0
Write Policy              wb
Cleaning Policy           alru
Promotion Policy          always
Cache line size           64 [KiB]
Metadata Memory Footprint 4.0 [GiB]
Dirty for                 16090 [s] / 4 [h] 28 [m] 10 [s]
Status                    Running
# casadm -P -i 2
Cache Id                  2
Cache Size                65453856 [4KiB Blocks] / 249.69 [GiB]
Cache Device              /dev/ram0
Exported Object           -
Core Devices              1
Inactive Core Devices     0
Write Policy              wa
Cleaning Policy           alru
Promotion Policy          always
Cache line size           64 [KiB]
Metadata Memory Footprint 319.3 [MiB]
Dirty for                 0 [s] / Cache clean
Status                    Running
# casadm -L
type    id   disk             status    write policy   device
cache   1    /dev/nvme0n1p1   Running   wb             -
└core   1    /dev/dm-4        Active    -              /dev/cas1-1
cache   2    /dev/ram0        Running   wa             -
└core   1    /dev/cas1-1      Active    -              /dev/cas2-1
# casadm --get-param --name cleaning-alru -i 1
╔═════════════════════════╤═══════╗
║ Parameter name          │ Value ║
╠═════════════════════════╪═══════╣
║ Wake up time [s]        │     5 ║
║ Stale buffer time [s]   │     1 ║
║ Flush max buffers       │  9750 ║
║ Activity threshold [ms] │     0 ║
╚═════════════════════════╧═══════╝
# casadm --get-param --name cleaning-alru -i 2
╔═════════════════════════╤═══════╗
║ Parameter name          │ Value ║
╠═════════════════════════╪═══════╣
║ Wake up time [s]        │    20 ║
║ Stale buffer time [s]   │   120 ║
║ Flush max buffers       │   100 ║
║ Activity threshold [ms] │ 10000 ║
╚═════════════════════════╧═══════╝
    • Cache mode: wa ramdisk over wb nvme
    • Cache line size: 64
    • Promotion policy: always
    • Cleaning policy: alru
    • Sequential cutoff policy: full
# casadm --get-param --name seq-cutoff -i 1 -j 1
╔═════════════════════════════════════════════════════╤═══════╗
║ Parameter name                                      │ Value ║
╠═════════════════════════════════════════════════════╪═══════╣
║ Sequential cutoff threshold [KiB]                   │  1024 ║
║ Sequential cutoff policy                            │ full  ║
║ Sequential cutoff promotion request count threshold │     8 ║
╚═════════════════════════════════════════════════════╧═══════╝
# casadm --get-param --name seq-cutoff -i 2 -j 1
╔═════════════════════════════════════════════════════╤═══════╗
║ Parameter name                                      │ Value ║
╠═════════════════════════════════════════════════════╪═══════╣
║ Sequential cutoff threshold [KiB]                   │  1024 ║
║ Sequential cutoff policy                            │ full  ║
║ Sequential cutoff promotion request count threshold │     8 ║
╚═════════════════════════════════════════════════════╧═══════╝
  • Other (e.g. lsblk, casadm -P, casadm -L)
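
For completeness, the two-level stack shown by `casadm -L` above would have been assembled roughly like this (a sketch reconstructed from the listing; device paths and cache-line size are taken from the output above):

```sh
# Level 1: NVMe write-back cache in front of the dm_vdo/mdadm core device
casadm --start-cache --cache-device /dev/nvme0n1p1 --cache-mode wb --cache-line-size 64 --cache-id 1
casadm --add-core --cache-id 1 --core-device /dev/dm-4        # exported as /dev/cas1-1

# Level 2: brd ramdisk write-around cache on top of the first exported object
casadm --start-cache --cache-device /dev/ram0 --cache-mode wa --cache-line-size 64 --cache-id 2
casadm --add-core --cache-id 2 --core-device /dev/cas1-1      # exported as /dev/cas2-1
```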
@jvinolas added the bug (Something isn't working) label on Nov 23, 2024
@jvinolas
Author

jvinolas commented Nov 23, 2024

Could it be related to having the ramdisk at /dev/ram0? Here you state that it should not be used: http://open-cas.com/guide_advanced_options.html#multi-level-caching, even though recent kernel documentation says it is not a problem.

EDIT: We removed the brd ramdisk cache device and are now running the same rsyncs... we'll see if that was the cause. Also, no errors are shown in casadm -P, so we don't know why the cache decided to 'deactivate' itself.
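
Removing the second-level ramdisk cache amounted to roughly this (a sketch; cache id 2 as shown in the casadm -L output above):

```sh
casadm --flush-cache --cache-id 2     # cache 2 runs in wa mode, so it should already be clean
casadm --stop-cache --cache-id 2
modprobe -r brd                       # remove the brd module backing /dev/ram0
```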

@jvinolas
Author

jvinolas commented Nov 23, 2024

Without the RAM cache on top of the NVMe, just the NVMe in wb mode with the same parameters, it happened again:

imatge

imatge

No logs in dmesg at all this time either.

Cache Id                  1
Cache Size                877915904 [4KiB Blocks] / 3348.98 [GiB]
Cache Device              /dev/nvme0n1p1
Exported Object           -
Core Devices              1
Inactive Core Devices     0
Write Policy              wb
Cleaning Policy           alru
Promotion Policy          always
Cache line size           64 [KiB]
Metadata Memory Footprint 4.0 [GiB]
Dirty for                 13014 [s] / 3 [h] 36 [m] 54 [s]
Status                    Running

╔══════════════════╤═══════════╤═══════╤═════════════╗
║ Usage statistics │   Count   │   %   │   Units     ║
╠══════════════════╪═══════════╪═══════╪═════════════╣
║ Occupancy        │ 877915904 │ 100.0 │ 4KiB Blocks ║
║ Free             │         0 │   0.0 │ 4KiB Blocks ║
║ Clean            │ 642185680 │  73.1 │ 4KiB Blocks ║
║ Dirty            │ 235730224 │  26.9 │ 4KiB Blocks ║
╚══════════════════╧═══════════╧═══════╧═════════════╝

╔══════════════════════╤════════╤═══════╤══════════╗
║ Request statistics   │ Count  │   %   │ Units    ║
╠══════════════════════╪════════╪═══════╪══════════╣
║ Read hits            │   1039 │   0.2 │ Requests ║
║ Read partial misses  │      0 │   0.0 │ Requests ║
║ Read full misses     │   2309 │   0.4 │ Requests ║
║ Read total           │   3348 │   0.6 │ Requests ║
╟──────────────────────┼────────┼───────┼──────────╢
║ Write hits           │ 154914 │  29.7 │ Requests ║
║ Write partial misses │     14 │   0.0 │ Requests ║
║ Write full misses    │ 345926 │  66.2 │ Requests ║
║ Write total          │ 500854 │  95.9 │ Requests ║
╟──────────────────────┼────────┼───────┼──────────╢
║ Pass-Through reads   │      0 │   0.0 │ Requests ║
║ Pass-Through writes  │  17963 │   3.4 │ Requests ║
║ Serviced requests    │ 504202 │  96.6 │ Requests ║
╟──────────────────────┼────────┼───────┼──────────╢
║ Total requests       │ 522165 │ 100.0 │ Requests ║
╚══════════════════════╧════════╧═══════╧══════════╝

╔══════════════════════════════════╤════════════╤═══════╤═════════════╗
║ Block statistics                 │   Count    │   %   │   Units     ║
╠══════════════════════════════════╪════════════╪═══════╪═════════════╣
║ Reads from core(s)               │       2561 │   0.0 │ 4KiB Blocks ║
║ Writes to core(s)                │  745094229 │ 100.0 │ 4KiB Blocks ║
║ Total to/from core(s)            │  745096790 │ 100.0 │ 4KiB Blocks ║
╟──────────────────────────────────┼────────────┼───────┼─────────────╢
║ Reads from cache                 │  644049412 │  42.2 │ 4KiB Blocks ║
║ Writes to cache                  │  881997809 │  57.8 │ 4KiB Blocks ║
║ Total to/from cache              │ 1526047221 │ 100.0 │ 4KiB Blocks ║
╟──────────────────────────────────┼────────────┼───────┼─────────────╢
║ Reads from exported object(s)    │       4661 │   0.0 │ 4KiB Blocks ║
║ Writes to exported object(s)     │  983061770 │ 100.0 │ 4KiB Blocks ║
║ Total to/from exported object(s) │  983066431 │ 100.0 │ 4KiB Blocks ║
╚══════════════════════════════════╧════════════╧═══════╧═════════════╝

╔════════════════════╤═══════╤═════╤══════════╗
║ Error statistics   │ Count │  %  │ Units    ║
╠════════════════════╪═══════╪═════╪══════════╣
║ Cache read errors  │     0 │ 0.0 │ Requests ║
║ Cache write errors │     0 │ 0.0 │ Requests ║
║ Cache total errors │     0 │ 0.0 │ Requests ║
╟────────────────────┼───────┼─────┼──────────╢
║ Core read errors   │     0 │ 0.0 │ Requests ║
║ Core write errors  │     0 │ 0.0 │ Requests ║
║ Core total errors  │     0 │ 0.0 │ Requests ║
╟────────────────────┼───────┼─────┼──────────╢
║ Total errors       │     0 │ 0.0 │ Requests ║
╚════════════════════╧═══════╧═════╧══════════╝

The pattern seems to repeat about 3 hours after we start the rsyncs...

This is the current server status, still doing the rsyncs with the cache suddenly not working:

imatge

@jvinolas
Author

Could it be due to unresponsiveness of the core device (high iowait) and these buffers filling up?

imatge

Some dmesg messages, if that is the case, would be helpful. We have now enabled kernel debug-level messages (loglevel 8) and will see if anything shows up.

@jvinolas
Author

jvinolas commented Nov 24, 2024

It is reproducible, now with

╔═════════════════════════╤═══════╗
║ Parameter name          │ Value ║
╠═════════════════════════╪═══════╣
║ Wake up time [s]        │     5 ║
║ Stale buffer time [s]   │     1 ║
║ Flush max buffers       │   500 ║
║ Activity threshold [ms] │    25 ║
╚═════════════════════════╧═══════╝
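
These values would have been set with something like the following (a sketch; the long option names are assumptions based on casadm's cleaning-alru parameters and should be checked against `casadm --set-param --help`):

```sh
# Tune the ALRU cleaning policy on cache 1 to the values shown in the table above.
casadm --set-param --name cleaning-alru --cache-id 1 \
       --wake-up 5 --staleness-time 1 --flush-max-buffers 500 --activity-threshold 25
```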

imatge

@robertbaldyga
Member

Hi @jvinolas! Do you trace cache occupancy over time? Can the failure be correlated with the moment the occupancy hits 100%?

@jvinolas
Author

jvinolas commented Nov 25, 2024

Well, it seems so... what's the solution? I mean, it goes into pass-through as I assume, but is there any way we can avoid that? Or even any way we could reactivate it without having to stop and recreate the cache?

imatge

We are not using any IO class right now. Does the default IO class configuration solve that problem?

@jvinolas
Author

We tried the default IO class configuration, also without success in evicting old data...

@jvinolas
Author

Looking at the code, could this function be having problems on multi-socket (multi-CPU) servers? Ours is dual-socket and we also have numad active. We'll try disabling numad to see whether it interferes.

image

@robertbaldyga
Member

@jvinolas I think we got a reproduction of that behavior. I'll let you know once we identify the root cause.

@jvinolas
Author

Thanks. Meanwhile, is there any setup that at least gives us write caching and does not grow occupancy beyond the dirty data, so we don't get stuck when the cache reaches full occupancy?

@robertbaldyga
Member

I think the problem is not only the occupancy, but also the fact that it's 100% dirty data, which means that the cache needs to perform on-demand cleaning in order to be able to perform any eviction. If you need it to work as a write buffer, it may be worth trying the ACP cleaning policy. For a long, continuous stream of intensive writes it will not help much, but if your workload has periodic bursts of writes, maybe ACP will be able to clean the cache between the bursts.

@robertbaldyga
Member

robertbaldyga commented Nov 29, 2024

Also, if there is any chance that your workload contains long ranges of sequential writes, setting the sequential cutoff policy to never may help.
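
Applying both suggestions would look roughly like this (a sketch; verify the option names against `casadm --set-param --help` for your version):

```sh
casadm --set-param --name cleaning --cache-id 1 --policy acp
casadm --set-param --name seq-cutoff --cache-id 1 --core-id 1 --policy never
```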

@jvinolas
Author

we'll try and see, thanks

@jvinolas
Author

jvinolas commented Dec 3, 2024

With write-back, at 23:10 we set up ACP and seq-cutoff never; the dirty data stays lower, as it should. And it seems that eviction worked correctly after we hit 100% occupancy doing some rsyncs: dirty data is still being handled, so the cache keeps working and did not deactivate itself as it did with ALRU:

imatge

@jvinolas
Author

jvinolas commented Dec 9, 2024

We found that if we switch to pt and back to wb, the ACP policy no longer does on-demand cleaning, so we have to stop and start the cache again. For now we keep it in wb.
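
The switch in question was done roughly like this (a sketch; `--set-cache-mode`/`-Q` is the documented way to change the cache mode):

```sh
casadm --set-cache-mode --cache-mode pt --cache-id 1   # temporarily go pass-through
casadm --set-cache-mode --cache-mode wb --cache-id 1   # back to write-back; after this ACP no longer cleans on demand
```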

@jvinolas
Author

jvinolas commented Dec 11, 2024

We are hitting more or less the same problem when occupancy reaches 100%: low performance... Would wo mode avoid having clean lines, at the cost of not caching reads? If we put a RAM cache in wa mode in front of the NVMe in wo mode, would that do the trick and keep occupancy at just the dirty lines in the underlying NVMe cache, thus avoiding the bug when hitting 100% occupancy? A sketch of that idea is below.
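
The stack being asked about would look roughly like this (a sketch of the idea only, not something we have tested; devices and cache-line size taken from the earlier configuration):

```sh
casadm --set-cache-mode --cache-mode wo --cache-id 1          # NVMe cache: write-only
casadm --start-cache --cache-device /dev/ram0 --cache-mode wa --cache-line-size 64 --cache-id 2
casadm --add-core --cache-id 2 --core-device /dev/cas1-1      # RAM wa cache in front of the NVMe wo cache
```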

This shows how the cache is not working well after occupancy hits 100%

imatge

UPDATE: Applying seq-cutoff never (we forgot to re-apply it after the last cache stop) seems to bring back performance

imatge

UPDATE 2: It worked for about 24 minutes, then performance dropped again...

imatge

@tb3088

tb3088 commented Feb 6, 2025

Has anyone tried WB with the ACP cleaning policy? It seems to me that ALRU is entirely too lazy in the face of exceeding 90% of max (65536). Frankly, I'd put the 'go into panic (ACP)' mode at about 75-80% utilization. The whole point of a cache is to absorb writes, and for CAS to fall on its face because it didn't stay well clear of 100% used seems extremely short-sighted.

Over in ZFS-land, they default to keeping 30% of the L2arc or 'special device' cache reserved for "metadata" occupancy instead of letting user data crowd out everything else and neuter the whole point of having the cache.

@robertbaldyga
Member

@tb3088 WB+ALRU is more focused on a caching scenario, where the main benefit comes from write/read hits served by the cache device under circumstances where the majority of the active work set fits into the cache. In that situation aggressive cleaning may actually worsen performance, as the backend device would be busy handling cleaning requests instead of serving the small fraction of the remaining requests that happen to be cache misses.

Your use looks more like a write-buffer scenario, where the main benefit comes not from cache hits, but from accommodating a write burst in the faster block device and then swiftly flushing it to the backend before the next write burst occurs. In that case ACP should indeed be a better choice.

@robertbaldyga
Member

We are hitting more or less the same problem when occupancy gets 100%, low performance... Will wo mode avoid having clean lines at the cost of not caching reads? If we got a ram cache in wa mode in front of the nvme wo will it do the trick and keep occupancy just at the dirty line in the underlying nvme cache, thus avoiding the bug when hitting 100% occupancy?

This shows how the cache is not working well after occupancy hits 100%

imatge

UPDATED: Aplying seq-cutoff to never (we forgot after last cache stop) seems to bring back performance

imatge

UPDATE2: It worked for about 24 minutes, then came back to low performance...

imatge

@jvinolas A few questions:

  1. Does this still produce hung task info in the dmesg, or just causes slow down after some time?
  2. Can you estimate what is the size of the write burst compared to the cache size?
  3. We just noticed SMP PREEMPT_DYNAMIC - do you have preemption enabled in the kernel?
  4. In the following case, do you call casadm -Q with -f flag?

We found that if we change to pt and back to wb the acp policy does not do on-demand cleaning, so we have to stop and start it again. For now we keep it in wb.

@tb3088

tb3088 commented Feb 18, 2025

WB+ALRU is more focused on caching scenario, where the main benefit comes from write/read hits served by the cache device under circumstances where majority of the active work set fits into the cache.

That's what WT caching is for. But if it's CAS's intent to (re)define WB this way, it NEEDS to be documented, because no storage professional considers that the standard working definition, nor would they size their WB with any regard for the size of the (read) working set. Again, that's what WT is for.

The whole point of WB is to absorb WRITEs. That there might be sufficient slack such that hot reads stick around, having dodged eviction from cache pressure, is a happy circumstance.

Therefore, a WB policy that by default allows a state where writes cannot be absorbed is WRONG. ALRU sits on dirty buffers for 2 freaking minutes before they are even eligible for de-staging? That's insane on its face. But even if I concede that the policy wants to write as little as possible until cache pressure demands more aggression, there has to be an inflection point where it goes into "try to stay ahead of incoming writes" mode. I'm not saying you couldn't still end up in a situation where incoming writes overwhelm the ability to de-stage and we hit 100% used/dirty and performance craters.

@robertbaldyga
Member

robertbaldyga commented Feb 21, 2025

@tb3088

WB+ALRU is more focused on caching scenario, where the main benefit comes from write/read hits served by the cache device under circumstances where majority of the active work set fits into the cache.

that's what WT caching is for. But if it's CAS' intent to (re)define WB this way it NEEDS to be documented. Because no storage professional considers that the standard working definition, nor would they size their WB with any regard for size of (read) working set. Again, that's what WT is for.

The whole point of WB is absorb WRITEs. That there might be sufficient slack such that hot-reads stick around having dodged eviction from cache pressure is happy circumstance.

Absorbing WRITEs = cache write hits. We agree on that one: that is what WB is supposed to do.
With no write hits under constant I/O pressure, there is no benefit from the WB cache.
With a significant number of write hits, the WB cache absorbs a lot of writes, and the role of the async cleaning policy is marginal.
With no write hits under changing I/O pressure, there is no benefit from the WB cache as such, but a write buffer may help, and then an aggressive cleaning policy is important.

@jvinolas
Author


@jvinolas A few questions:

1. Does this still produce hung task info in the dmesg, or just causes slow down after some time?

2. Can you estimate what is the size of the write burst compared to the cache size?

3. We just noticed `SMP PREEMPT_DYNAMIC` - do you have preemption enabled in the kernel?

4. In the following case, do you call `casadm -Q` with `-f` flag?

We found that if we change to pt and back to wb the acp policy does not do on-demand cleaning, so we have to stop and start it again. For now we keep it in wb.

Hi,

I'm sorry, but we removed the cache as it was unusable once it reached fully dirty data. I'll look at the setup scripts and system configuration we used and get back to you with those parameters.

@tb3088

tb3088 commented Feb 21, 2025

With no write hits, under constant I/O pressure there is no benefit from the WB cache.

It is somewhat unusual for a write to be followed by a re-write of the same extent. But (multiple) reads of recently written data are more likely, especially if the writing process is doing O_DIRECT. Otherwise the client should rarely re-read it, since it should still be resident in the buffer cache.

The big win of WB, regardless of hits on the now cached-on-write elements, and of a pure write-buffer mode, is the ability to return success to the writer so that the writing process is not constrained to the speed of the cached device, e.g. SSD on top of HDD. As long as the de-stage rate is not completely out of whack with the incoming write rate, everyone enjoys "blazing fast" writes to the target.

What the OP is running into is that the de-stage logic (ALRU) in its default form is not compatible with his write activity rate, and it appears ALRU doesn't self-heal very well, or quickly. Similar to what ZFS resorted to, CAS should provide a tunable that always maintains X amount of 'write buffer' (or, if you prefer, a designated amount of space to aggressively evict by LRU when faced with incoming writes), as well as a complementary tunable that puts a floor on how many cached items to preserve no matter how heavy the inbound writes. That way you don't completely obliterate your cached extents and force all reads to hit the source device.

The Linux LVM/WB cache also has a pathologically bad behavior pattern: it allows even "cold" READ data to fill the cache so that there is no room for a flurry of writes, and the user is basically stuck in write-through mode with no hope of recovery.

@robertbaldyga
Member

With no write hits, under constant I/O pressure there is no benefit from the WB cache.

It is somewhat unusual for a write to be followed by a re-write of the same extent. But (multiple) read of the recently written does happen with more likelihood, largely if the writing process is doing O_DIRECT. Otherwise the client should rarely re-read it since it (should be) resident in the buffer cache.

That strongly depends on the size of the active work set. If it fits into the buffer cache, then indeed there is very limited benefit from using a storage cache.

The big win for WB regardless of hits on now cached-on-write elements, and of a pure write buffer mode, is the ability to return success to the writer such that the writer process is not constrained to the speed of the cached device. Eg. SSD on top of HDD. As long as the de-stage rate is not completely out of whack with the incoming write rate, everyone enjoys "blazing fast" writes to the target.

Here we are talking about "speed" in terms of latency, not throughput. That is more important for compute-side workloads, which may not be able to generate a huge queue depth and thus become latency-bound. A storage-side workload, which typically comprises a large number of parallel requests from multiple clients using the storage system at the same time, would most likely be throughput-bound, so latency improvements, although not completely insignificant, are typically not the major concern.

What OP is running into is the de-stage logic (ALRU) in its default form is not compatible with his write activity rate. And it appears the ALRU doesn't self-heal very well, nor quickly. Similar to what ZFS resorted to, CAS should provide a tunable that always maintains X amount of 'write buffer' or if you prefer, a designated amount of space to aggressively evict LRU to hit, when faced with incoming writes, as well as a complementary tunable that puts a floor on how much cached items to preserve no matter how heavy the inbound writes. That way you don't completely obliterate your cached extents and force all reads to hit the source device.

The workload described by the OP appears to be a write-buffer use case, and using ACP has been suggested.

Limiting the amount of dirty data in the cache should be possible using the IO classification mechanism. I don't know why we are missing this one simple classification rule (IO direction). It should be fairly easy to add.
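
As a rough illustration of the existing mechanism (not the missing io-direction rule), an IO class configuration is a CSV loaded with `casadm --io-class --load-config`; the class rules and allocation values below are illustrative assumptions only:

```sh
# Sketch only: column layout follows the stock /etc/opencas/ioclass-config.csv;
# the rules and allocation values are examples, not a recommendation.
cat > /tmp/ioclass-config.csv <<'EOF'
IO class id,IO class name,Eligibility priority,Allocation
0,unclassified,22,1.00
1,metadata&done,1,1.00
EOF
casadm --io-class --load-config --cache-id 1 --file /tmp/ioclass-config.csv
casadm --io-class --list --cache-id 1
```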
