Cache does not work unexpectedly #1588
Comments
Could it be related to having the ramdisk in /dev/ram0? Here you state not to use it: http://open-cas.com/guide_advanced_options.html#multi-level-caching, even though recent kernel documentation says it is not a problem. EDIT: We removed the brd ramdisk cache device and are now running the same rsyncs... we'll see if that was it. Also, no errors are shown in casadm -P, so we don't know why the cache decided to 'deactivate' itself.
Without the RAM cache on top of the NVMe, just the NVMe in WB mode with the same parameters, it happened again. No logs in dmesg at all this time either.
The pattern that seems to be repeating is that it happens about 3 hours after we start the rsyncs... This is the actual server status while the rsyncs are still running and the cache has suddenly stopped working:
Hi @jvinolas! Do you trace cache occupancy over time? Can the failure be correlated with the moment the occupancy hits 100%?
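One possible way to trace occupancy over time (a sketch of ours, not something from the thread): poll the casadm statistics periodically and append them to a log. Cache id 1 and the output path are assumptions, and the exact casadm flags may differ between versions.

```sh
# Sketch: sample cache stats (occupancy, dirty, etc.) every 60 s so a failure
# can later be correlated with the moment occupancy hits 100%.
# Cache id 1 and the log path are assumptions.
while true; do
    echo "$(date -Is),$(casadm --stats --cache-id 1 --output-format csv | tail -n 1)" \
        >> /var/log/cas_occupancy.csv
    sleep 60
done
```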
We tried the default IO class, also without success in evicting old data...
@jvinolas I think we got a reproduction of that behavior. I'll let you know once we identify the root cause.
Thanks. Meanwhile, is there any setup that gives us at least a write cache and does not let occupancy grow beyond the dirty data, so we don't get stuck when occupancy reaches 100%?
I think the problem is not only the occupancy, but also the fact that it's 100% dirty data, which means that the cache needs to perform on-demand cleaning in order to be able to perform any eviction. If you need it to work as a write buffer it may be worth trying the ACP cleaning policy. For a long, continuous stream of intensive writes it will not help much, but if your workload has periodic bursts of writes, maybe ACP will be able to clean the cache between the bursts.
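For reference, switching the cleaning policy to ACP could look roughly like this (a sketch, assuming cache id 1; flag names follow the Open CAS admin guide and the tuning values are only examples, not recommendations from this thread):

```sh
# Switch the cleaning policy from ALRU to ACP on cache id 1 (assumed id).
casadm --set-param --name cleaning --cache-id 1 --policy acp

# Optionally tune ACP: wake-up interval (ms) and buffers flushed per cycle
# (example values only).
casadm --set-param --name cleaning-acp --cache-id 1 --wake-up 10 --flush-max-buffers 128

# Verify which cleaning policy is active.
casadm --get-param --name cleaning --cache-id 1
```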
Also, if there is any chance that your workload contains long ranges of sequential writes, setting the sequential cutoff policy to ...
we'll try and see, thanks |
We found that if we change to ...
We are hitting more or less the same problem when occupancy reaches 100%: low performance... This shows how the cache is not working well after occupancy hits 100%. UPDATED: Applying seq-cutoff never (we forgot it after the last cache stop) seems to bring back performance. UPDATE2: It worked for about 24 minutes, then performance dropped again...
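For completeness, disabling sequential cutoff after a cache restart could be done roughly like this (a sketch, assuming cache id 1 and core id 1; per the admin guide, omitting --core-id should apply the setting to all cores):

```sh
# Set the sequential cutoff policy to "never" so sequential streams keep
# being cached instead of bypassing the cache (cache id 1 assumed).
casadm --set-param --name seq-cutoff --cache-id 1 --policy never

# Confirm the setting on core 1 (assumed core id).
casadm --get-param --name seq-cutoff --cache-id 1 --core-id 1
```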
Has anyone tried WB with the ACP cleaning policy? It seems to me that the ALRU is entirely too lazy in the face of exceeding 90% of max (65536). Frankly I'd put the 'go into panic (ACP)' mode at about 75-80% utilization. The whole point of a cache is to absorb writes, and for CAS to fall on its face because it didn't stay well clear of 100% used seems extremely short-sighted. Over in ZFS-land, they default to keeping 30% of the L2ARC or 'special device' cache reserved for metadata occupancy instead of letting user data crowd out everything else and neuter the whole point of having the cache.
@tb3088 WB+ALRU is more focused on the caching scenario, where the main benefit comes from write/read hits served by the cache device under circumstances where the majority of the active working set fits into the cache. In that situation aggressive cleaning may actually worsen the performance, as the backend device would be occupied handling the cleaning requests instead of serving the small fraction of the remaining requests that happen to be cache misses. It looks like your use case is more of a write-buffer scenario, where the main benefit comes not from cache hits, but from accommodating a write burst in the faster block device, and then swiftly flushing it to the backend before the next write burst occurs. In that case ACP should indeed be a better choice.
@jvinolas A few questions:
That's what WT caching is for. But if it's CAS' intent to (re)define WB this way, it NEEDS to be documented, because no storage professional considers that the standard working definition, nor would they size their WB with any regard for the size of the (read) working set. Again, that's what WT is for. The whole point of WB is to absorb WRITEs. That there might be sufficient slack such that hot reads stick around, having dodged eviction from cache pressure, is a happy circumstance. Therefore a WB policy that by default allows a state where writes cannot be absorbed is WRONG. The ALRU sits on dirty buffers for 2 freaking minutes before they are even eligible for de-stage? That's insane on its face. But even if I concede the policy wants to write as little as possible till cache pressure demands more aggression, there has to be an inflection point where it goes into "try to stay ahead of incoming writes" mode. I'm not saying you couldn't still end up in a situation where incoming writes overwhelm the ability to de-stage and we hit 100% used/dirty and performance craters.
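If staying on ALRU, the lazy defaults complained about above (e.g. the 2-minute staleness time) can be tightened. A sketch with example values of ours, not recommendations from the thread; cache id 1 is assumed and flag names follow the admin guide:

```sh
# Show the current ALRU parameters (staleness time defaults to 120 s, i.e.
# the "2 minutes" mentioned above).
casadm --get-param --name cleaning-alru --cache-id 1

# Wake the cleaner more often, mark dirty lines stale sooner, and flush more
# buffers per cleaning cycle (example values only).
casadm --set-param --name cleaning-alru --cache-id 1 \
    --wake-up 5 --staleness-time 30 --flush-max-buffers 200 --activity-threshold 1000
```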
Absorbing WRITEs = cache write hits. We agree on that one, that WB is supposed to do that.
Hi, I'm sorry, but we removed the cache as it was unusable when it reached full dirty data. I'll look at the setup scripts and system configuration we used and will tell you those parameters.
It is somewhat unusual for a write to be followed by a re-write of the same extent. But (multiple) reads of recently written data do happen with more likelihood, especially if the writing process is doing O_DIRECT. Otherwise the client should rarely re-read it, since it (should be) resident in the buffer cache. The big win for WB, regardless of hits on now cached-on-write elements, and of a pure write-buffer mode, is the ability to return success to the writer such that the writer process is not constrained to the speed of the cached device, e.g. SSD on top of HDD. As long as the de-stage rate is not completely out of whack with the incoming write rate, everyone enjoys "blazing fast" writes to the target. What OP is running into is that the de-stage logic (ALRU) in its default form is not compatible with his write activity rate, and it appears the ALRU doesn't self-heal very well, nor quickly. Similar to what ZFS resorted to, CAS should provide a tunable that always maintains X amount of 'write buffer', or if you prefer, a designated amount of space to aggressively evict LRU entries from when faced with incoming writes, as well as a complementary tunable that puts a floor on how much cached data to preserve no matter how heavy the inbound writes. That way you don't completely obliterate your cached extents and force all reads to hit the source device. The Linux LVM/WB cache also has a pathologically bad behavior pattern in that it will allow even "cold" READ data to fill the cache so that there is no room for a flurry of writes and the user is basically stuck in write-through mode with no hope of recovery.
That strongly depends on the size of the active work set. If it fits into the buffer cache, then indeed there is very limited benefit from using a storage cache.
Here we talk about the "speed" in terms of latency, not throughput. That is more important for compute-side workloads, which may not be able to generate huge queue depths and thus become latency-bound. The storage-side workload, which typically comprises a large number of parallel requests from multiple clients using the storage system at the same time, would most likely be throughput-bound, so latency improvements, although not completely insignificant, are typically not the major concern.
The workload type described by OP appears to be a write-buffer use case, and using ACP has been suggested. Limiting the amount of dirty data in the cache should be possible using the IO classification mechanism. I don't know why we are missing this one simple classification rule (IO direction). It should be fairly easy to add.
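As a rough illustration of the IO classification mechanism mentioned above (note that, as stated, a per-direction rule is not available; cache id 1 is assumed and the config file path is just the sample shipped with Open CAS, used here as a placeholder):

```sh
# List the IO classes currently configured on cache id 1 (assumed id).
casadm --io-class --list --cache-id 1

# Load a classification/allocation config from a CSV file (placeholder path).
casadm --io-class --load-config --cache-id 1 --file /etc/opencas/ioclass-config.csv
```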
Description
The cache was working for some days, then we got two high-iowait incidents that left the cache not working. The cache is caching and we see dirty data, but suddenly it does not work anymore: high iowait and no dmesg core dump messages, only hung-task info messages.
The first one was while the cache had a lot of dirty data and we were swapping from WB to WT. We tried to bring it back to WB, but we kept the high iowait from the first swap and got no cache response anymore. The only solution was to flush the cache, stop it, force-recreate it and start it again.
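The recovery sequence described above, expressed as casadm commands (a sketch; cache id 1, /dev/nvme0n1 and /dev/sdb are placeholders for the actual cache and core devices):

```sh
# Flush the remaining dirty data to the core device, then stop the cache.
casadm --flush-cache --cache-id 1
casadm --stop-cache --cache-id 1

# Force-recreate the cache in write-back mode; --force discards the old
# metadata on the cache device. Then re-attach the core device.
casadm --start-cache --cache-device /dev/nvme0n1 --cache-mode wb --cache-id 1 --force
casadm --add-core --cache-id 1 --core-device /dev/sdb
```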
The second one was at about 3am today while we were rsyncing files to this cache and it suddenly failed, with high iowait again and hung tasks in the dmesg info messages.
Expected Behavior
Cache behaviour should be somewhat predictable and it should not stop working.
Actual Behavior
High iowait, and the cache does not seem to work anymore.
Steps to Reproduce
We could reproduce the first one:
Context
As the cache mountpoint is being served over NFS with clients connected to it, we had to stop all clients, stop the cache, reformat it again and start everything again.
Possible Fix
Logs
This is what suddenly happened this morning while only doing rsyncs:
And now it happened again:
This time with no log in dmesg.
Configuration files
Your Environment