2-Node Cluster with the least amount of "clusterization" - how?

Iacov

New Member
Jan 24, 2024
hey

I tried to read up on clusters, quorum, votes, QDevices etc., but I still don't know how, and whether, I should create a 2-node cluster.
My pve1 (AMD-based) is my "main server" (funny to call a mini PC that); it is tasked with hosting all the little VMs I need in my home.
I plan on getting a second mini PC (pve2, Intel-based) to be primarily a host for Plex/Jellyfin, but probably within a Proxmox environment.

Ideally I want to be able to manage both pve1 and pve2 over a single GUI and to (manually) move a VM from pve1 to pve2 and vice versa.

I don't need shared storage, HA or probably many other features that come with a cluster.

Will creating a 2-node cluster open up more issues than it's worth for what I actually plan to achieve?
Can I create a 2-node cluster without backing up/deleting/restoring every VM?
Do I have to keep the quorum in mind, or could this somehow be negated for my scaled-down needs? Can I simply use my Synology NAS as the quorum server, or should I re-purpose a Raspberry Pi?
What happens if I ever needed to change one of the cluster nodes? As long as the QDevice and pve1 are online, could pve2 be dropped from the cluster and a new node join? Would I then have to repeat the backup/delete/restore step for the remaining node?
Is there anything else that I should take into consideration, or that is often missed by noobs?

Thank you very much for your time and experience.
 
and being able to (manually) move a VM from pve1 to pve2 and vice versa

I don't need shared storage,
To move a VM from one node to the other you need some common storage. The exception is ZFS, which is local storage but can be "replicated" between nodes, which simulates shared storage.

At my job and at my homelab I use this strategy.

pve1 (AMD-based) is my "main server" ...
I plan on getting a second mini PC (pve2, Intel-based)
"Live migration" between the AMD and Intel CPUs will probably not work well, but "offline" migration (read: VM turned off) will work fine.

Can I create a 2-node cluster without backing up/deleting/restoring every VM?
Yes. The node you add should be "fresh" and empty, i.e. there should be no VMs on it yet.
Do I have to keep the quorum in mind, or could this somehow be negated for my scaled-down needs?
Yes. When the cluster is not quorate it is not usable: for example, you cannot start any VM. (Currently running VMs will stay active, though.)
There are workarounds like "pvecm expected 1", which you should use carefully.
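A rough sketch of how that workaround looks in practice (only use it while the other vote really is gone):

pvecm status      # shows votes, quorum state and the expected-votes value
pvecm expected 1  # temporarily tell corosync that 1 vote is enough to be quorate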

You may run a very small Debian VM on the Synology to create that famous quorum device (QDevice). A Raspberry Pi is also fine!
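The QDevice setup itself is short; roughly like this, with the IP being a placeholder for the Debian VM or the Pi:

# on the external machine (Debian VM on the Synology, or the Raspberry Pi):
apt install corosync-qnetd
# on both PVE nodes:
apt install corosync-qdevice
# then, from one PVE node:
pvecm qdevice setup 192.168.1.50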

What happens if I ever needed to change one of the cluster nodes? As long as the QDevice and pve1 are online, could pve2 be dropped from the cluster and a new node join?
Adding and removing nodes is nothing special and well documented.
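In short, it boils down to something like this (cluster name, hostnames and the IP are just examples):

pvecm create homelab     # run on pve1; existing VMs can stay where they are
pvecm add 192.168.1.10   # run on the empty pve2, pointing at pve1's address
pvecm delnode pve2       # later, run on a remaining node after the old node is permanently shut down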

Would I then have to repeat the backup/delete/restore step for the remaining node?
Why? The cluster configuration is not lost when adding or removing nodes.

Good luck. And have fun!
 
To move a VM from one node to the other you need some common storage. The exception is ZFS, which is local storage but can be "replicated" between nodes, which simulates shared storage.


"Live migration" will probably not work well. But "offline" (read: turned off VM) will work fine.
thank you, this helps a lot

If I understand correctly:
pve1 does not have to be cleared out to create the cluster, but pve2 has to be a clean slate to join, correct?

Do I need ZFS or shared storage if I don't want to live-migrate?
Do I need ZFS storage if I don't want failover?
 
Ideally I want to be able to manage both pve1 and pve2 over a single GUI and to (manually) move a VM from pve1 to pve2 and vice versa

I don't need shared storage, HA or probably many other features that come with a cluster

So maybe it's time to say it outright: using something other than PVE is a good option in these circumstances, because you do not really need a cluster, and PVE forces you to have one if you want to do the above within a GUI.

Will creating a 2-node cluster open up more issues than it's worth for what I actually plan to achieve?

It depends, but if it's only two nodes (and you are not using any of the cluster-only features like HA), at least set one of them (your "main") to have 2 votes.
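A quick sketch of how that can be done, assuming you edit the cluster-wide config by hand:

# edit the cluster-wide copy (it syncs to all nodes): in the nodelist section,
# set "quorum_votes: 2" for pve1 and increase "config_version" by one
cp /etc/pve/corosync.conf /root/corosync.conf.bak
nano /etc/pve/corosync.conf
pvecm status   # afterwards, check that the vote count looks right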

Can I create a 2-node cluster without backing up/deleting/restoring every VM?

You are presumably asking whether 2 existing nodes can be put into a cluster. In a way, yes: one node populated with VMs is fine. From the other node, you can use qm remote-migrate (see [1]) to move everything over to your "main", then let the now-empty node join, then shuffle the VMs back.
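Very roughly, such a remote-migrate call looks like the sketch below. It is still marked experimental and the exact options depend on your PVE version (check [1] and "man qm"); the VM ID, host address, API token, bridge and storage names are all placeholders:

qm remote-migrate 100 100 \
  'host=192.168.1.10,apitoken=PVEAPIToken=root@pam!migrate=<secret>,fingerprint=<target-cert-fingerprint>' \
  --target-bridge vmbr0 --target-storage local-zfs --online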

Do I have to keep the quorum in mind, or could this somehow be negated for my scaled-down needs? Can I simply use my Synology NAS as the quorum server, or should I re-purpose a Raspberry Pi?

I would use a QDevice with the two nodes if you have extra hardware that it can run on.

What happens if I ever needed to change one of the cluster nodes? As long as the QDevice and pve1 are online, could pve2 be dropped from the cluster and a new node join?

You can even just manually override it with pvecm expected 1 and do what you need to do without a quorum, see [2].

Would I then have to repeat the backup/delete/restore step for the remaining node?

I am not sure what this one was about, but backups (on that Synology?) are something you should always have anyhow.

Is there anything else that I should take into consideration, or that is often missed by noobs?

My opinion only: avoid using clusters where you absolutely do not need them. If you want a single control plane, look for some other solution, maybe even something with PVE, like a 3rd-party GUI taking advantage of the API.

[1] https://pve.proxmox.com/pve-docs/qm.1.html
[2] https://pve.proxmox.com/wiki/Cluster_Manager
 
pve1 does not have to be cleared out to create the cluster, but pve2 has to be a clean slate to join, correct?
Yes.
Do I need ZFS or shared storage if I don't want to live-migrate?
Do I need ZFS storage if I don't want failover?
No, you do not need to use ZFS at all.

Did you already find this table? https://pve.proxmox.com/pve-docs/chapter-pvesm.html#_storage_types

For me personally, ZFS is the storage. It has so many advantages that I can live with the existing disadvantages. It gives you guaranteed data integrity, bit-rot protection and self-healing. It offers snapshots, transparent compression and replication through "zfs send/receive". (There is more, e.g. deduplication, which I do not use.)
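To illustrate what that replication looks like under the hood, a minimal sketch (pool/dataset names are examples; inside a PVE cluster the built-in replication does this for you):

zfs snapshot rpool/data/vm-100-disk-0@repl1
zfs send rpool/data/vm-100-disk-0@repl1 | ssh pve2 zfs receive tank/data/vm-100-disk-0
# later runs only ship the delta between two snapshots:
zfs snapshot rpool/data/vm-100-disk-0@repl2
zfs send -i @repl1 rpool/data/vm-100-disk-0@repl2 | ssh pve2 zfs receive tank/data/vm-100-disk-0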

Best regards
 
For me personally, ZFS is the storage. It has so many advantages that I can live with the existing disadvantages. It gives you guaranteed data integrity, bit-rot protection and self-healing. It offers snapshots, transparent compression and replication through "zfs send/receive". (There is more, e.g. deduplication, which I do not use.)

The disadvantage, or rather the thing to pay attention to when setting up a cluster, is shredding NVMe drives, particularly with ZFS write amplification - you will find more on this by searching for those terms in existing forum posts. It's not that you should not use it, just check whether you need to tweak some parameters.

Also, you can consider e.g.:
https://github.com/isasmendiagus/pmxcfs-ram
 
Do I need ZFS or shared storage if I don't want to live-migrate?
Do I need ZFS storage if I don't want failover?

You would mostly appreciate ZFS if you do NOT have, or do not want to use, shared storage - in an actual small (2-3 node) cluster, it would be a pretty silly setup to have, e.g., that one Synology as the shared storage for VMs, as it defeats the purpose. In those cases, it is very handy to have ZFS on the nodes, as you can set up replication very efficiently.
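Within a cluster that replication is a built-in job; roughly like this (VM ID, node name and schedule are examples):

pvesr create-local-job 100-0 pve2 --schedule "*/15"   # replicate VM 100 to pve2 every 15 minutes
pvesr status                                          # check when the last replication ran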
 
Oh, that is interesting, thanks for the link.

But I really try hard to tweak the PVE software stack as little as possible, so I just use hardware that fits my needs. This of course means paying the price for enterprise SSDs/NVMe even for the homelab.

(And just to state the obvious: always use redundant devices: mirrors for VMs and possibly RaidZ1/Z2/Z3 for "normal" data. And also: snapshots and redundant disks do NOT count as backups...)

Just my 2€¢, ymmv :cool:
 
Oh, that is interesting, thanks for the link.

It's something that would best be part of the internal tweaks for pmxcfs. One could argue there's a reason it's not there, but I'd rather say that for large clusters, having that in RAM is even less of a risk, since power loss is not an issue thanks to UPSes.

But I really try hard to tweak the PVE software stack as little as possible, so I just use hardware that fits my needs. This of course means paying the price for enterprise SSDs/NVMe even for the homelab.

It's also becoming less of an issue for a homelab; I have seen 2TB consumer-grade drives with 2PB TBW. That's still less than what e.g. a 1TB enterprise drive offers relative to its capacity, but for a homelab it's becoming acceptable; you're mostly just missing out on the PLP feature.

There were, however, forum posts where people discovered that it was not pmxcfs itself shredding the SSD, but something as benign as excessive logging AND write amplification.

(And just to state the obvious: always use redundant devices: mirrors for VMs and possibly RaidZ1/Z2/Z3 for "normal" data. And also: snapshots and redundant disks do NOT count as backups...)

Just my 2€¢, ymmv :cool:

I actually would argue from the opposite end: AT THE LEAST have backups; then, especially with a cluster, lots can be weathered without the rest. I mean, arguably this setup does not have e.g. ECC RAM, redundant power supplies, etc. anyway.

The OP didn't even seem to want that cluster per se, just migrating between two separate systems. For lots of scenarios, that really is the safer way to set it up.
 
The disadvantage, or rather the thing to pay attention to when setting up a cluster, is shredding NVMe drives, particularly with ZFS write amplification - you will find more on this by searching for those terms in existing forum posts. It's not that you should not use it, just check whether you need to tweak some parameters.

Also, you can consider e.g.:
https://github.com/isasmendiagus/pmxcfs-ram
In your experience, am I safe with NAS-grade NVMe/SATA SSDs?
 
The OP didn't even seem to want that cluster per se, just migrating between two separate systems. For lots of scenarios, that really is the safer way to set it up.
What is the safer way: to go with ZFS or not? I have no first-hand experience with ZFS yet, but the more I read about it, the more it seems to be a better choice than LVM/LVM-thin, right? And if I understood correctly, moving a VM from node1 to node2 with LVM-thin requires a manual transfer of configs, while moving between ZFS pools is pretty "drag and drop" - or have I misunderstood the difference?
 
In your experience, am I safe with NAS-grade NVMe/SATA SSDs?

If you wonder about sudden deaths, I would say that can happen with anything, anytime, and I am not going to endorse enterprise SSDs just because they have high TBW and built-in power-loss protection. From my perspective, for a homelab it makes perfect sense to buy the cheapest anything at any given time (e.g. on sale), with common sense in mind (e.g. do NOT get a DRAM-less SSD). The reason I say that is that you maybe have a UPS anyway, so there goes the need for power-loss protection. Except when your non-redundant power supply dies, etc.

If you wonder how your SSD is doing, set up what you intend to use and check how it fares with a tool like [1]. Check after a day, then after a week, and compare. If you see it's shredding a lot more now that you have a cluster, you can literally calculate how many more months or years the SSD has to go. If you see it's shredding much more on ZFS than it was with LVM, you know you have some tweaking to do.
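A minimal sketch of that check (device paths are examples):

# watch the "data units written" and "percentage used" lines over time
nvme smart-log /dev/nvme0
# smartmontools works as well, also for SATA drives:
smartctl -a /dev/sda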

You are never "safe", not even with a UPS, redundant PSUs, ECC RAM, DC-grade SSDs in a RAID/ZFS mirror, etc., etc. The chances are just lower, and a lot of that mostly makes sense from an availability point of view. The solution in my eyes is to, e.g., have that cluster (with some automation if you really need some sort of HA), but most importantly have backups, so you can simply do a disaster recovery after a disaster, which WILL happen sooner or later.

Everything fails all of the time. I'd rather have e.g. 2 NUC-like machines with new gaming SSDs (1PB+ TBW) in a cluster with a QDevice than a single 10+ year old R720 packed with all the enterprise gear.*

*Except for some special applications when it's actually cost-effective.

[1] https://packages.debian.org/bookworm/nvme-cli
 
What is the safer way: to go with ZFS or not? I have no first-hand experience with ZFS yet, but the more I read about it, the more it seems to be a better choice than LVM/LVM-thin, right? And if I understood correctly, moving a VM from node1 to node2 with LVM-thin requires a manual transfer of configs, while moving between ZFS pools is pretty "drag and drop" - or have I misunderstood the difference?

I was mostly getting at the fact that you would be better off WITHOUT the two machines in a cluster, keeping them standalone.

ZFS is nice to use for lots of specific reasons (e.g. being able to do zfs send/receive with a snapshot - something PVE does not even support natively for backups, but it works nicely with replication, which you would not care about if it's not a cluster anyhow). I do not see any benefit in having the hypervisor itself on ZFS (you will have fun troubleshooting when it does not boot). It's nice for the VMs (zvols) and especially CTs (those are just ordinary datasets - you might need to research ZFS more if interested).

If the only thing you care about is migrating, you really should give the remote-migrate from the post above [1] a try. The idea that one has to have things in a cluster just to be able to migrate is wrong, and any limitations of PVE here are purely a matter of there not having been a need for that feature set (so far).

If you need something drag-and-drop for those scenarios, and that is the reason you are into clustering, I think you really should look at open-source solutions other than PVE. I do not want to sound like I'm bashing it, but it's not really great for you to run 2 nodes just so you can drag-and-drop migrate, because sooner or later you will be back on the forum asking about corosync issues, which will not be a GUI fix. There are solutions out there which do not require you to have a cluster just to migrate between two nodes with a nice GUI.

[1] https://forum.proxmox.com/threads/2...unt-of-clusterization-how.140434/#post-628163
 
The reason I say that is that you maybe have a UPS anyway, so there goes the need for power-loss protection. Except when your non-redundant power supply dies, etc.
The power-loss protection is there to be able to cache and optimize sync writes so those won't shred your NAND. A UPS won't help with DRAM caching, as a consumer SSD missing PLP still can't cache sync writes, so the write amplification will be way higher. The SSD's firmware doesn't know whether there is a UPS or not, so it won't try to cache sync writes in volatile DRAM. Only with an enterprise/datacenter SSD with PLP is the SSD able to optimize sync writes before writing them to the NAND (a few big writes will wear less than many small writes, so it's great to be able to collect and cache them in DRAM and only write them to NAND when there is enough to not waste an erase cycle). The other benefit is that sync-write performance won't drop by multiple orders of magnitude, so IOPS performance won't drop down to HDD levels.

By the way... if you don't plan to move the guests between servers all the time, it's perfectly fine to do a backup + restore to move them. Then all you would be missing is managing both nodes via one web UI (but I don't see the point... it's not that bad to have both web UIs open in the browser and switch between tabs...) and doing every edit of IP sets, security groups and aliases twice, as you have to manually keep them in sync so guests will work after moving them between nodes. Only that last thing really annoys me (running 5 unclustered nodes right now).
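For completeness, such a move between unclustered nodes is roughly the following sketch (storage names, paths and the VM ID are examples):

# on the source node:
vzdump 100 --storage nas-backups --mode snapshot --compress zstd
# make the archive visible to the target node (shared NFS backup store, scp, ...), then there:
qmrestore /mnt/pve/nas-backups/dump/vzdump-qemu-100-<timestamp>.vma.zst 100 --storage local-lvm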
 
The power-loss protection is there to be able to cache and optimize sync writes so those won't shred your NAND.

I do not know where your information comes from (not saying it is wrong, just maybe quote something better than my marketing fluff), but I took it literally, as what it sounded like - protection against power loss, i.e. a data-loss/integrity feature.

See e.g. https://www.kingston.com/en/blog/servers-and-data-centers/ssd-power-loss-protection

A UPS won't help with DRAM caching, as a consumer SSD missing PLP still can't cache sync writes, so the write amplification will be way higher.

See above. There might indeed be some features in DC SSDs I do not know about that nicely reorder the writes before flushing them, but I was not aware that was PLP. When I look at e.g. consumer Micron SSDs (i.e. Crucial), they literally brag about "power loss immunity" even though they lack the built-in capacitors a DC SSD would have, yet still somehow ensure that data is not lost on unexpected shutdowns. Not sure how they do it there.

The SSD's firmware doesn't know whether there is a UPS or not, so it won't try to cache sync writes in volatile DRAM.

I thought the whole point of DRAM was caching writes there for e.g. reordering. That's at the hardware level. Then there are things like ZFS transaction groups, where a sudden/uncontrolled power loss definitely can result in data loss, especially when the ZIL is used, but that's beyond the PLP topic.

Only with an enterprise/datacenter SSD with PLP is the SSD able to optimize sync writes before writing them to the NAND (a few big writes will wear less than many small writes, so it's great to be able to collect and cache them in DRAM and only write them to NAND when there is enough to not waste an erase cycle).

This would need a source for me personally to believe it.

The other benefit is that sync-write performance won't drop by multiple orders of magnitude, so IOPS performance won't drop down to HDD levels.

I have had a few consumer NVMe SSDs (all PCIe 4) that happen to behave quite differently when it comes to this. The Samsung PROs, for example, do better than others.

By the way... if you don't plan to move the guests between servers all the time, it's perfectly fine to do a backup + restore to move them. Then all you would be missing is managing both nodes via one web UI (but I don't see the point... it's not that bad to have both web UIs open in the browser and switch between tabs...) and doing every edit of IP sets, security groups and aliases twice, as you have to manually keep them in sync so guests will work after moving them between nodes. Only that last thing really annoys me (running 5 unclustered nodes right now).

This is the thing with PVE: because they want all cluster nodes to essentially be a monitor of the whole cluster and relay everything to the other nodes if need be, they omitted the industry habit of, e.g., being able to deploy the management interface away from the cluster, which I would frankly prefer. The API would allow for this, but it would need some product development which I think is not going to happen. You might be better off with something like OpenNebula in a 5-node scenario like that, but I can see this remark will not end well for me now. :) I really wish PVE added a standalone monitor SPA as an option.
 
Just to give a specific example (not my favourite, just to compare what I was getting at):

Taking TBW as important, you can get either:

Kingston DC600M, SATA with PLP, which happens to have a TBW of 1.752 PB at roughly 1TB capacity
https://www.kingston.com/en/ssd/dc600m-data-center-solid-state-drive?capacity=960gb

or

Kingston Fury Renegade, a consumer NVMe Gen 4 SSD, which happens to have a TBW of 2 PB at 2TB capacity
https://www.kingston.com/en/ssd/gaming/kingston-fury-renegade-nvme-m2-ssd?partnum=sfyrd/2000g

They both happen to cost about the same (when on discount), and both have an MTBF of 2 million hours. Of course the gaming one does not have e.g. PLP, TCG Opal, etc., but unlike the DC SATA SSD, which peaks at 100k IOPS, you can really get 1 million IOPS out of it.

An average user in a homelab is better off with the gaming one - for the expected workload, so to speak. I still returned that one because it was running at over 70°C even with a passive cooler on. :)

In terms of shredding, the gaming one does better (at least on paper - and no issue RMAing them one after another, right?), at double the capacity.
 
See above. There might indeed be some features in DC SSDs I do not know about that nicely reorder the writes before flushing them, but I was not aware that was PLP. When I look at e.g. consumer Micron SSDs (i.e. Crucial), they literally brag about "power loss immunity" even though they lack the built-in capacitors a DC SSD would have, yet still somehow ensure that data is not lost on unexpected shutdowns. Not sure how they do it there.
Just do two benchmarks yourself, of a consumer and an enterprise SSD with the same type of NAND, same interface, same manufacturer and same size. One benchmark does 4K random async writes, the other does 4K random sync writes. Every consumer SSD will be terribly slow doing the sync writes while being fast doing async writes. The enterprise SSD will be fast on both, and 100 times faster than the consumer SSD when doing sync writes. That's all down to the PLP, as no consumer SSD can cache sync writes in volatile DRAM, while the DRAM cache of the enterprise SSD is effectively non-volatile: it is backed by the PLP, so the DRAM write cache can be quickly dumped to non-volatile SLC cache while running on backup power.
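A rough fio sketch of the two runs, for anyone wanting to reproduce this (test file path and sizes are examples; don't point it at a disk holding data you care about):

# async 4K random writes
fio --name=async4k --filename=/tmp/fio.test --size=1G --bs=4k --rw=randwrite \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based
# "sync" 4K random writes (fsync after every write)
fio --name=sync4k --filename=/tmp/fio.test --size=1G --bs=4k --rw=randwrite \
    --ioengine=libaio --iodepth=1 --direct=1 --fsync=1 --runtime=60 --time_based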
And you probably know that sync writes will shred NAND while async writes are not that bad. With PLP, those heavy sync writes will be handled internally more like async writes by the enterprise SSD and therefore cause less wear.
An SSD without PLP is like running a HW RAID card without BBU+cache. So if you say enterprise SSDs aren't worth the money, would you also argue that a HW RAID without BBU+cache is a better choice for a homelab server?

I thought the whole point of DRAM was caching writes there for e.g. reordering. That's at the hardware level.
But the point is that sync writes have to be safe 100% of the time. You are not allowed to cache them in volatile memory. So the only thing a consumer SSD can do is skip the DRAM cache, write them directly to the SLC cache and reorder the blocks there. And the SLC cache is slow and will wear with every write. It would be way faster, with less wear, if you had PLP, so you could cache and reorder those writes in a DRAM cache that doesn't wear with each write, and then do a single optimized write from DRAM to the NAND.

They both happen to cost about the same (when on discount), and both have an MTBF of 2 million hours. Of course the gaming one does not have e.g. PLP, TCG Opal, etc., but unlike the DC SATA SSD, which peaks at 100k IOPS, you can really get 1 million IOPS out of it.
Don't trust any manufacturer's datasheets; they don't do sync writes. Run your own fio benchmarks. The official Proxmox ZFS benchmark paper, for example, did sync writes, and it's not hard to identify the SSDs without PLP just by looking at the IOPS performance:
https://www.proxmox.com/en/downloads/item/proxmox-ve-zfs-benchmark-2020
It's not that a sync-write benchmark of enterprise SSDs is faster because the NAND would be faster; NAND performance is the same between comparable consumer/enterprise models. It's just that with an enterprise SSD you are benchmarking the SSD's internal DRAM, and with a consumer SSD the NAND, as the consumer SSD simply can't use its DRAM for sync writes.

Consumer SSDs are fine as long as you only do bursts of async sequential writes; then they are even faster than enterprise SSDs. As soon as you get sync or continuous writes, they really suck. So: great for workloads like gaming, but not good for running something like a DB.

This would need a source for me personally to believe it.
See for example here for why DRAM is needed to optimize writes and lower write amplification, and why PLP is needed to protect the contents of the DRAM cache: https://image-us.samsung.com/Samsun...rce_WHP-SSD-POWERLOSSPROTECTION-r1-JUL16J.pdf
The journaling of consumer SSDs will only protect the flash translation layer from corruption. The cached data will still be lost, so a sync write has to skip the DRAM and write directly to the NAND, only returning an ACK once the data has been written to the NAND. An enterprise SSD with PLP is so fast when doing sync writes because it can actually use the DRAM and send an ACK once the data hits the DRAM, even though it has not been written to NAND yet.
 
Hey, thank you for the extensive answer, @Dunuin, first of all.

Just do two benchmarks yourself, of a consumer and an enterprise SSD with the same type of NAND, same interface, same manufacturer and same size.

This is virtually impossible to do, as you must be aware. I had difficulty comparing the Kingstons above: to match the consumer NVMe I would need to take something U.3 from them (I think there's none), or I think they had an NVMe drive meant as a server boot drive, but Gen3 to begin with, and that's about it. Even then, the controller would not be the same, nor the amount of DRAM, etc., etc.

One benchmark does 4K random async writes, the other does 4K random sync writes. Every consumer SSD will be terribly slow doing the sync writes while being fast doing async writes. The enterprise SSD will be fast on both, and 100 times faster than the consumer SSD when doing sync writes. That's all down to the PLP, as no consumer SSD can cache sync writes in volatile DRAM, while the DRAM cache of the enterprise SSD is effectively non-volatile: it is backed by the PLP, so the DRAM write cache can be quickly dumped to non-volatile SLC cache while running on backup power.

Although I do not have a particularly good way to test this, I will take it as something I might agree* with, as it is a reasonable explanation within that hypothetical environment and sounds logical in terms of how sync writes could work on an SSD with capacitors - that is, until I fill up the DRAM, for instance.

I do not want to argue about synthetic tests, however (doing a majority of 4K sync writes is one such test, as is filling up the DRAM), because the entire premise of this thread, and arguably of this entire forum, is that we are talking about homelabs. There is not much in which an average homelab resembles a DC server rack in terms of use case. So everything I posted so far was very much taking that into consideration, which I believe I made clear.

And you probably know that sync writes will shred NAND while async writes are not that bad. With PLP, those heavy sync writes will be handled internally more like async writes by the enterprise SSD and therefore cause less wear.

My use of the term "shred", admittedly inaccurate, was in the sense of "consuming the drive"; it was all about eating into the TBW. In this sense, sync or async does not matter. As I do not want to detract from the original topic, I will stay away from e.g. ZFS or any other specific filesystem, its parameters and how sync/async would impact the drive's lifespan. It will, I am aware of that, but it's not as if a non-PLP drive will be "shredded" much more than a PLP one in a homelab.

An SSD without PLP is like running a HW RAID card without BBU+cache. So if you say enterprise SSDs aren't worth the money, would you also argue that a HW RAID without BBU+cache is a better choice for a homelab server?

For a homelab server in 2024, the RAID-anything debate has long been stale, let alone a HW card. I know it's not your point here, but the example is off, because even if we lived in the 2000s and someone was debating HW RAID, they would be pointed to just use mdadm and an HBA in a homelab anyway.

The reason I mention this is that maybe in a few years' time the whole "enterprise vs consumer" SSD debate will be equally silly.

But the point is that sync writes have to be safe 100% of the time. You are not allowed to cache them in volatile memory. So the only thing a consumer SSD can do is skip the DRAM cache, write them directly to the SLC cache and reorder the blocks there. And the SLC cache is slow and will wear with every write. It would be way faster, with less wear, if you had PLP, so you could cache and reorder those writes in a DRAM cache that doesn't wear with each write, and then do a single optimized write from DRAM to the NAND.

I am allowed to do anything, really. It's a homelab. And I have a backup. Say we are talking about an iSCSI LUN on whatever non-PLP thing is there - the syncs will be async. Talking ZFS, that's terrible for sync-anything really, but nothing prevents one from having the ZIL on a separate SLOG (most of us do not need it); someone might have been using Optane for that for a while (which is now dead for a reason) and still stayed with a consumer SSD and gone crazy on their sync writes too. Had this discussion happened some time ago, someone would have chipped in that they really encourage us to go buy used ZeusRAMs for our homelab.

I think these sort-of-elitist debates have to stop. This is a homelab discussion, and arguably this is essentially a homelab-majority forum.

Don't trust any manufacturer's datasheets; they don't do sync writes. Run your own fio benchmarks.

My homelab is not doing an fio-style barrage of sync writes. Is yours? I just want to be clear that I agree with what you say on a technical level, but it's not relevant in the real homelab world. There are not that many writes going on in a homelab to begin with. Or even on a corporate web server.

The official Proxmox ZFS benchmark paper, for example, did sync writes, and it's not hard to identify the SSDs without PLP just by looking at the IOPS performance:
https://www.proxmox.com/en/downloads/item/proxmox-ve-zfs-benchmark-2020

This is now a stale document: no one uses those consumer drives anymore, the EVOs are not PROs, etc. I don't want to sound like I disregard it because it's Proxmox-branded, but it clearly had an agenda at the time. It is very specific to ZFS and wants to make a point that is valid in a real datacentre, not in a homelab, and not with 2024 (or even Gen3+) drives.

It's not that a sync-write benchmark of enterprise SSDs is faster because the NAND would be faster; NAND performance is the same between comparable consumer/enterprise models. It's just that with an enterprise SSD you are benchmarking the SSD's internal DRAM, and with a consumer SSD the NAND, as the consumer SSD simply can't use its DRAM for sync writes.

This is a skewed interpretation: of course consumer SSDs can and do use DRAM, unless you force them not to. Which is fine if you want to prove a point about a technical difference, but not relevant for a homelab use case.

Consumer SSDs are fine as long as you only do bursts of async sequential writes; then they are even faster than enterprise SSDs.

It seems to me that we agree that for homelab use they might actually even be "faster" after all.

As soon as you get sync or continuous writes, they really suck. So: great for workloads like gaming, but not good for running something like a DB.

Which DBs with an actually heavy write workload are common in homelab use?

See for example here for why DRAM is needed to optimize writes and lower write amplification, and why PLP is needed to protect the contents of the DRAM cache: https://image-us.samsung.com/Samsun...rce_WHP-SSD-POWERLOSSPROTECTION-r1-JUL16J.pdf


*Alright, I am at a loss with this source, even considering your own reasoning above, which, as I had just agreed, sounded reasonable. Because a "technical marketing specialist" at Samsung in 2016 writes, first of all, that:

"Usually SSDs send an acknowledgement to the OS that the data has been written once it has been committed to DRAM. In other words, the OS considers that the data is now safe, even though it hasn’t been written to NAND yet."

So he actually wants to push sales of his PLP SSDs as they are better for data integrity - nothing about performance.

He talks about the FTL, but again only in the context that, should you buy a non-PLP drive, a "power loss may corrupt all data". I do not know if this was maybe true in the early 2010s, but I have never had it happen, and I have some drives with quite high unsafe-shutdown counters.

He then goes on to say that consumer (client) SSDs are actually fine because they flush the FTL regularly and have a journal. He makes a terrible case for his cause, because he says the power loss and the small amount of data loss do not actually matter that much, since it's nothing new (compared to HDDs) and there's filesystem journaling for that in the consumer space - as if there were no journaling in a DC.

The journaling of consumer SSDs will only protect the flash translation layer from corruption. The cached data will still be lost, so a sync write has to skip the DRAM and write directly to the NAND, only returning an ACK once the data has been written to the NAND. An enterprise SSD with PLP is so fast when doing sync writes because it can actually use the DRAM and send an ACK once the data hits the DRAM, even though it has not been written to NAND yet.

The guy was all over the place in his "reasoning", so I am no wiser from his contribution. First of all, the FTL, in my understanding, really only needs to be up to date about what has been written to the actual NAND, so this would support your point that it is a stress test of the SSD's capabilities and, well, makes it slow. I do not call that "shredding", and I do not see how it eats into the TBW somehow faster than with a PLP drive. And again, flushing the FTL often minimises the amount of data not written to NAND, which he then admits would be in a filesystem's journal.

But notably - nowhere did he ever write that only PLP SSDs can satisfy a sync write instantly, purely by writing into DRAM. So this is not the source I had hoped for to confirm that.

Some food for thought would also be: what if I have some ancient PLP drive with rather little DRAM and am actually doing a crazy amount of intermittent sustained writes, and I instead have the option of a modern consumer SSD with more DRAM - and I do not need those writes to be sync at all? But again, not in a homelab, usually.

EDIT: What we got into discussing seems to have already happened in multiple places before:
https://www.reddit.com/r/storage/comments/iop7ef/performance_gain_of_plp_powerlossprotection_drives/

I really wonder what all those tests would have looked like without ZFS on various "gaming" drives.
 
but I'd rather say that for large clusters, having that in RAM is even less of a risk, since power loss is not an issue thanks to UPSes.
For larger clusters (or ANY cluster) this is probably a very bad idea. Cluster information may be updated from other nodes. Since your in-RAM database is updated UNIDIRECTIONALLY (meaning written down but not read back up), you end up with conflicting DBs. You'll have out-of-sync nodes that are unaware of their condition.
 
