Ceph vs ZFS - Which is "best"?

devawpz · Dec 22, 2021

TLDR: Ceph vs ZFS: advantages and disadvantages?

Looking for thoughts on implementing a shared filesystem in a cluster with 3 nodes.

They all serve a mix of websites for clients that should be served with minimal downtime, and some infrastructure nodes such as Ansible and pfSense, as well as some content management systems.

The intention is to have Hight Availability between the nodes, as when one is unreachable, it (seamlessly?) moves the VMs to a failover. For that, it's required to have a shared filesystem, so considering between Ceph and ZFS at the moment.

There's already disks and corresponding volumes on each of the nodes for the OS, general VMs and backups. Each of the nodes also has 2 x 200GB SAS SSDs in Raid0. These are intended to have the VMs which should be in the High Availability configuration.

So I was wondering what would be the advantages of choosing one over the other, are they even possible to compare side by side? I have read a few threads and some documentation, have had a few experiences too, and some opinions coming in. Tradeoffs? Anything you may point out that can be taken as a starting point? Would appreciate any input you may have.

aaron · Dec 22, 2021

If a node dies and the VMs that were on it are configured as HA, it will take 2 or 3 minutes until they are started on one of the remaining nodes, no matter if you use ZFS or Ceph.

Ceph has quite some requirements if you want decent performance. Fast network (only for ceph ideally) with low latency, needs more CPU and memory ressources on the nodes for its services but is a fully clustered storage. That means that all nodes see the same all the time.

ZFS is a local storage so each node has its own. You can use the VM replication to keep a recent version of the VM disks on the other nodes so that a live migration is faster and in case of a node failure, the other nodes have a recent copy of the disk. But there could be potential data loss since they only have the disk since the last replication run.

ZFS is also useful if you have a bit higher latency between the nodes as for Ceph it should definitely be in the sub milli second area.

I hope this helps you a bit further on deciding what works best in your situation.

Check the requirements section for Ceph in the Admin guide. The requirement page of Ceph itself is also linked there.

Henrique Freitas · Aug 24, 2022

aaron said:
If a node dies and the VMs that were on it are configured as HA, it will take 2 or 3 minutes until they are started on one of the remaining nodes, no matter if you use ZFS or Ceph.

Ceph has quite some requirements if you want decent performance. Fast network (only for ceph ideally) with low latency, needs more CPU and memory ressources on the nodes for its services but is a fully clustered storage. That means that all nodes see the same all the time.

ZFS is a local storage so each node has its own. You can use the VM replication to keep a recent version of the VM disks on the other nodes so that a live migration is faster and in case of a node failure, the other nodes have a recent copy of the disk. But there could be potential data loss since they only have the disk since the last replication run.

ZFS is also useful if you have a bit higher latency between the nodes as for Ceph it should definitely be in the sub milli second area.

I hope this helps you a bit further on deciding what works best in your situation.

Check the requirements section for Ceph in the Admin guide. The requirement page of Ceph itself is also linked there.

Hi, I don't have a good experience with CEPH.
The VM's that run use independent database (PSQL) in the VM itself.
I have a 10gb network and SSD's, but it didn't give a good performance, on the contrary, customers complained about slowness using CEPH.
What could cause this slowness in reading?

Translated by Google Translate

Dunuin · Aug 24, 2022

What SSDs do you got? I hope some enterprise/datacenter grade SSDs.

Henrique Freitas · Aug 25, 2022

common Kingston 500MB/S, 450MB/S...

I think I made a mistake =/

Neobin · Aug 25, 2022

Henrique Freitas said:
common Kingston 500MB/S, 450MB/S...

I think I made a mistake =/

Yes. With Ceph [1] as well as ZFS/BTRFS [2], you absolutely want enterprise SSDs; preferably NVMes.

[1] https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-with-nvme.76516
[2] https://forum.proxmox.com/threads/proxmox-ve-zfs-benchmark-with-nvme.80744

Dunuin · Aug 25, 2022

Proxmox VE Ceph Benchmark 2020 paper - page 16:

Can I use consumer or pro-sumer SSDs, as these are much cheaper than enterprise-class SSDs?
No. Never. These SSDs wont provide the required performance, reliability or endurance. See the fio results from before and/or run your own fio tests

Henrique Freitas · Aug 25, 2022

I'll replace it with NVMes soon, and I'll be back to talk about the results.

Thank you all

Dunuin · Aug 25, 2022

Enterprise NVMes for mixed/write intense workloads

Neobin · Aug 26, 2022

Henrique Freitas said:
I'll replace it with NVMes soon, and I'll be back to talk about the results.

Thank you all

Only to be sure: The important point here is enterprise SSDs (highly preferable NVMes) with PLP (Power Loss Protection)!

Upgrading from consumer SATA SSDs to consumer NVMe SSDs will not be satisfying in this regard.

kenneth104 · Aug 26, 2022

Henrique Freitas said:
Hi, I don't have a good experience with CEPH.
The VM's that run use independent database (PSQL) in the VM itself.
I have a 10gb network and SSD's, but it didn't give a good performance, on the contrary, customers complained about slowness using CEPH.
What could cause this slowness in reading?

Translated by Google Translate

if you hope get alot of performance, ceph need many node and many osd(hdd or ssd)
but ,ceph have some "magic technology" can help you pull up performance

like cache tier, can input a cache pool between client and backend
like persistent writeback cache, it's move remote writes to near-end

aaron · Aug 26, 2022

When you think about using cache tiering, be aware of the following: the warnings in the docs, that we do not support it officially and that it is generally not widely used. Therefore, people don't know it that well (should there be problems or bugs).

The recommended way is to run separate pools for different device classes. With that you can have a fast SSD or NVME pool and a slow HDD pool and place VM disks on them as you need it.

kenneth104 · Aug 26, 2022

Henrique Freitas said:
Hi, I don't have a good experience with CEPH.
The VM's that run use independent database (PSQL) in the VM itself.
I have a 10gb network and SSD's, but it didn't give a good performance, on the contrary, customers complained about slowness using CEPH.
What could cause this slowness in reading?

Translated by Google Translate

Yes,like #12 Proxmox Staff say, performance gains are hard to quantify
And database load need a lot of performance

So,If you only have a small cluster, don't use Ceph
Personally recommend to use PostgreSQL distributed, and use local disk, zfs RAIDz1 it's good choice

you can look like this
https://docs.oracle.com/cd/E36784_01/html/E36845/chapterzfs-db3.html

Dunuin · Aug 26, 2022

ZFS raidz1 won't be a good choice for running databases as you need to increase the volblocksize to atleast 16K (atleast as long as using ashift=12) if you don't want to waste alot of capacity because of padding overhead. So a 3 disk raidz1 should be terrible for postgres with its 8K writes because volblocksize has to be atleast 16K. And MySQL with its 16K writes will be terrible when using a 4 or more disk raidz1 where the volblocksize has to be 32K or higher.
For databases you should use a striped mirror.

Henrique Freitas · Nov 1, 2022

As promised, I'm coming back to thank you all, I set up an NVMe (enterprise) pool and the performance was excellent.

Now I have another doubt, the Recovery/Rebalance is not at 100%, what could have happened?

aaron · Nov 2, 2022

Henrique Freitas said:
Now I have another doubt, the Recovery/Rebalance is not at 100%, what could have happened?

Please create a new thread, as this has nothing to do with the original question in this one

khuffmanjr · Jul 23, 2023

Found this thread while researching... Nobody questioned the OP's use of RAID 0? One drive gone and the node is useless. Both drives need to be rebuilt.

Ceph vs ZFS - Which is "best"?

devawpz

Member

aaron

Proxmox Staff Member

Henrique Freitas

Active Member

Dunuin

Distinguished Member

Henrique Freitas

Active Member

Neobin

Distinguished Member

Dunuin

Distinguished Member

Henrique Freitas

Active Member

Dunuin

Distinguished Member

Neobin

Distinguished Member

kenneth104

New Member

aaron

Proxmox Staff Member

kenneth104

New Member

Dunuin

Distinguished Member

Henrique Freitas

Active Member

aaron

Proxmox Staff Member

khuffmanjr

New Member

We value your privacy