SSD Cache

ZeusyBoy

New Member
Jul 28, 2024
I've heard a bit on YouTube about SSD caching, and I'm wondering if it's possible in Proxmox. In my case, I have 7 x 600 GB HDDs set up in a RAID 6 array, and would like to set up a 1 TB SSD for cache.
 
It's Linux, so any guide that describes it would work to some extent. It is completely not supported, so you're on your own.
 
Proxmox supports ZFS. ZFS supports a "caching" drive.

Code:
zpool add $ZC_NAME cache nvme-...
zpool add $ZC_NAME log nvme-...
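A quick way to confirm the devices were picked up (using the same $ZC_NAME pool-name placeholder as above; the cache and log vdevs show up in their own sections of the output):

Code:
zpool status $ZC_NAME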
 
It is completely not supported, so you're on your own.

What does that even mean for a non-subscription user, the "not supported"? Of course he cannot get support that he does not pay for. :) Just saying, because to a lot of people something not supported means "it would not work", which is not the case.
 
What does that even mean for a non-subscription user, the "not supported"? Of course he cannot get support that he does not pay for.
He will most certainly not get support for bcachefs even if he pays for it.

:) Just saying, because to a lot of people something not supported means "it would not work", which is not the case.
I wasn't aware of that. I meant 'support' as in 'help'. What word would describe it better? I just used the word like it is used on the English Proxmox VE homepage.
 
I've heard a bit on YouTube about SSD caching, and I'm wondering if it's possible in Proxmox. In my case, I have 7 x 600 GB HDDs set up in a RAID 6 array, and would like to set up a 1 TB SSD for cache.
If your RAID is on a ZFS zpool, you can add a cache disk to your zpool
Code:
zpool add yourpool cache /dev/sdX
Adding/removing a cache vdev is non-destructive.
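Removing it again later is equally harmless, e.g. something like this (pool name and device are the same placeholders as above):

Code:
zpool remove yourpool /dev/sdX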
 
He will most certainly not get support for bcachefs even if he pays for it.

Well, strictly speaking that is also not true (in the "help" sense) as e.g. I would be happy to help here with bcache (not fs).

I wasn't aware of that. I meant 'support' as in 'help'. What word would describe it better?

I am not saying I have a better vocabulary, but I do not like certain (ambiguous) terms. The most neutral term would probably be non-standard - if one wants to emphasise the precariousness of running a setup like that, sure, one can say untested.

When some features are marked as "preview" or "experimental" (lots of words used synonymously, even though to me they mean more "subject to change" than unreliable), there is nothing wrong with using them. If docs explicitly mention something is discouraged, also no issue (at least it implies it is something they'd rather not see you doing, but it's possible to do). If something does not work, to me that's unsupported - e.g. a CPU only supports up to a certain amount of RAM in certain configurations - then fair enough, that's clear cut. There are also things which are e.g. "undocumented" altogether (but work). But if something is a documented and mature solution and one is combining it with another such, I do not see any problem with that.

I just used the word like it is used on the English Proxmox VE homepage.

Then again, maybe it's just me, right? I don't like other words used there either. E.g. the "no-subscription" repo, which does not communicate well that it's actually "testing", while the so-called testing repo should have been called unstable instead. :rolleyes:
 
Well, strictly speaking that is also not true (in the "help" sense) as e.g. I would be happy to help here with bcache (not fs).
Again, I meant the official support. I always look through the enterprise-class glasses.
Sure there are people on the interwebs helping. I ran flashcache years ago, also very successfully with Proxmox VE, without any (software) problems.


Then again, maybe it's just me, right? I don't like other words used there either. E.g. the "no-subscription" repo, which does not communicate well that it's actually "testing", while the so-called testing repo should have been called unstable instead. :rolleyes:
Yes, nomenclature again. I can see your point, yet I would argue that the no-subscription repo is more stable than testing in the Debian sense of the difference. I think they internally have another repository that is actually the "real" unstable.


If your RAID is on a ZFS zpool, you can add a cache disk to your zpool
ZFS L2ARC is not going to be a huge help. That's my experience, and others have reported the same.

The best performance gain with a mix of HDDs and SSDs is to use the SSDs as a special device: put the metadata on there and control, via the dataset property special_small_blocks, which blocks you would also like to be on the SSDs. Then use another device (very fast IOPS) as a SLOG device, e.g. a 16 GB Intel Optane. Use the same redundancy for the special devices as for the data devices; technically this is a RAID0-like setup, and if you lose the special device, everything will be gone.
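A minimal sketch of that layout (pool/dataset names, device paths, and the 64K threshold are placeholders/example values, not a recommendation for your hardware):

Code:
# mirrored special vdev for metadata (and small blocks)
zpool add yourpool special mirror /dev/disk/by-id/nvme-SSD_A /dev/disk/by-id/nvme-SSD_B
# blocks up to this size on the dataset also land on the special vdev
zfs set special_small_blocks=64K yourpool/yourdataset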
 
ZFS L2ARC is not going to be a huge help. That's my experience, and others have reported the same.

I am a bit surprised by this one. I can only imagine this would be because you have lots of random writes going on at all times.

The best performance gain with a mix of HDDs and SSDs is to use the SSDs as a special device: put the metadata on there and control, via the dataset property special_small_blocks, which blocks you would also like to be on the SSDs.

But then again, this needs a mirror of the SSDs; adding a sole one (unlike an L2ARC cache) would be madness.

Then use another device (very fast IOPS) as a SLOG device, e.g. a 16 GB Intel Optane.

Optanes are EOL, and the cost was always such that I wondered if I may as well have had the pool be SSD-only instead. Again, I would consider this only in a mirror.

Use the same redundancy for the special devices as for the data devices; technically this is a RAID0-like setup, and if you lose the special device, everything will be gone.

I might be wrong, but cost-wise nowadays it would still eat away the savings on the 6 spinning drives.
 
ZFS L2ARC is not going to be a huge help. That's my experience, and others have reported the same.
If I had the option of putting in just one SSD, I would do it as a cache and not think too much about it. I don't see any particular disadvantages of this solution.
 
If I had the option of putting in just one SSD, I would do it as a cache and not think too much about it. I don't see any particular disadvantages of this solution.

The other thing is, this is often zero additional cost, as you can e.g. have a 256 GB SSD in a machine idling on nothing but a Debian install that requires 10 GB. The ZFS cache device could be a separate partition of that same SSD, and it can literally fail at any time - it's just a read cache.
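If you do that, it's easy to check whether the read cache is actually earning its keep. The pool name is a placeholder, and the arcstats path assumes OpenZFS on Linux:

Code:
# per-vdev view, the cache partition shows its own alloc and read ops
zpool iostat -v yourpool
# L2ARC hit/miss counters
grep ^l2_ /proc/spl/kstat/zfs/arcstats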
 
Optanes are EOL, and the cost was always such that I wondered if I may as well have had the pool be SSD-only instead. Again, I would consider this only in a mirror.
My NVMe 16 GB Intel Optane costs only 30 euros and is perfectly fast:

Code:
min/avg/max/mdev = 59.5 us / 117.0 us / 226.7 us / 45.6 us

The slowest time on the Optane is on par with the fastest times of my enterprise SSD. This is a huge improvement, which can be seen e.g. when applying Debian updates: each transaction therein is usually synced, so you see an improvement and the update is much faster, albeit there is not much written.
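For reference, numbers in that min/avg/max/mdev format can be produced with ioping; a sync-write latency test against a directory on the device would look roughly like this (the mount point is a hypothetical placeholder):

Code:
# 10 synchronous 4k write requests; against a directory, ioping uses a temporary file
ioping -c 10 -s 4k -W -Y /mnt/optane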

I am a bit surprised on this one. I can only imagine this would be because you have lots of random writes going on at all times.
It wasn't worth it. Compared to bcachefs it was barely noticeable.
 
My NVMe 16 GB Intel Optane costs only 30 euros and is perfectly fast:

Code:
min/avg/max/mdev = 59.5 us / 117.0 us / 226.7 us / 45.6 us

Optane was of course very fast, but it's not made anymore ...

EDIT: Wait a minute, how is that 16G helpful for the OP?
 
Optane was of course very fast, but it's not made anymore ...

EDIT: Wait a minute, how is that 16G helpful for the OP?
SLOG will improve the (sync write) performance of a disk pool significantly, and a small 30-euro Optane is a no-brainer.
 
SLOG will improve the (sync write) performance of a disk pool significantly, and a small 30-euro Optane is a no-brainer.
Ok, I will just break it down - and see where we disagree:

1. I cannot find anything in stock now, but one item that was last selling off in that price range at 16G capacity was:
https://ark.intel.com/content/www/u...es-16gb-m-2-80mm-pcie-3-0-20nm-3d-xpoint.html

2. PCIe 3.0 x2, but the max seq r/w is 900 / 145 MB/s; the IOPS look nice, but nothing special nowadays.

3. The OP mentioned "1TB SSD for cache", so I assumed L2ARC, not really SLOG.

4. For L2ARC, 16G is tiny and even cheaper SSDs will do a much better job in terms of bandwidth AND capacity. It will also increase RAM usage, but that's another story.

Now for the last part, even though I do not think the OP was after this:

5. For SLOG, I do not think I would recommend running it other than in a mirror (feel free to tell me I should not care - I understand power loss is not a problem, but a corrupt one is, as rubbish would get flushed undetected over time); see the sketch below. That means it would occupy 2 M.2 slots and cost 60. But the product has a lower sequential write speed than a modern 7K HDD, so it would be just for the IOPS. But are there really that many random writes?
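Sketch of adding such a mirrored log vdev (pool name and device paths are placeholders):

Code:
zpool add yourpool log mirror /dev/disk/by-id/nvme-OPTANE_A /dev/disk/by-id/nvme-OPTANE_B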
 
But are there really that many random writes?
That depends on your workload, yet I can say that working with a hard disk pool feels much, much faster with a SLOG and a special metadata device. L2ARC was not noticeable in our 100 TB pool, and we decommissioned it in favor of our SLOG/special device, which wasn't available at the time we built this array.

PCIe 3.0 x2, but the max seq r/w is 900 / 145 MB/s; the IOPS look nice, but nothing special nowadays.
Of course it's nothing special nowadays compared to enterprise U.2 drives, yet it is for the price point. If you're running a hard disk pool, you may not have the bucks for fast enterprise NVMe. If you have ... go with it and use two of them for SLOG and special device (partitioned). Sizing the SLOG depends on the sequential write performance * 5 seconds (default flush time); more is never used unless you change settings.
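To put rough, made-up numbers on that formula: if the pool can ingest sync writes at about 200 MB/s, only around 200 MB/s * 5 s = 1 GB can ever sit in the SLOG with the default flush interval; even a 10 GbE client (~1.25 GB/s) caps out near 6-7 GB, so a 16 GB device already has headroom.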

But the product has a lower sequential write speed than a modern 7K HDD, so it would be just for the IOPS. But are there really that many random writes?
It's the IOPS you're after ... and in ZFS, sequential is out of the picture in a fragmented disk pool.
 
I know you have much more experience with ZFS than I do, but let me be a bit picky (or wrong?):
Sizing the SLOG depends on the sequential write performance * 5 seconds (default flush time); more is never used
In my understanding there might be up to three TXGs active - and "active" for me means they occupy the storage and/or(?) RAM for the data they are handling in that moment.

Cited from delphix.com/blog/zfs-fundamentals-transaction-groups :

"... There are three active transaction group states: open, quiescing, or syncing. At any given time, there may be an active txg associated with each state; each active txg may either be processing, or blocked waiting to enter the next state. There may be up to three active txgs, ..."

This looks like it could possibly need space for up to 3 * 5 seconds' worth of writes...
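Either way, the 5-second figure itself is just the zfs_txg_timeout module parameter (assuming OpenZFS on Linux), so it is easy to check what a given box is actually using:

Code:
cat /sys/module/zfs/parameters/zfs_txg_timeout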
 
It's the IOPS you're after ... and in ZFS, sequential is out of the picture in a fragmented disk pool.

Hang on - the random writes into the pool are what the SLOG will be useful for, but the SLOG itself is written as a sequential file object ... so I do not think I am wrong looking at seq numbers on any ZIL candidate device.
 
I know you have much more experience with ZFS than I do, but let me be a bit picky (or wrong?):

In my understanding there might be up to three TXGs active - and "active" for me means they occupy the storage and/or(?) RAM for the data they are handling in that moment.

I have no idea how the implementation looks TODAY, but I remember - and I found it again! [1] - that there's way more going on when it comes to how it's all flushed (within the given timeframe).

[1] https://blogs.oracle.com/solaris/post/the-new-zfs-write-throttle
 
