Ceph vs ZFS - Which is "best"?

devawpz

Member
Sep 21, 2020
TLDR: Ceph vs ZFS: advantages and disadvantages?

Looking for thoughts on implementing a shared filesystem in a cluster with 3 nodes.

They all serve a mix of client websites that should run with minimal downtime, some infrastructure services such as Ansible and pfSense, and a few content management systems.

The intention is to have High Availability between the nodes, so that when one becomes unreachable, its VMs are (seamlessly?) moved to a failover node. For that, shared storage is required, so I'm weighing Ceph against ZFS at the moment.

There are already disks and corresponding volumes on each of the nodes for the OS, general VMs and backups. Each node also has 2 x 200 GB SAS SSDs in RAID 0; these are intended to hold the VMs that should be in the High Availability configuration.

So I was wondering: what would be the advantages of choosing one over the other, and can they even be compared side by side? I have read a few threads and some documentation, have had a few experiences myself, and opinions are coming in. What are the tradeoffs? Anything you can point out as a starting point? I would appreciate any input you may have.
 
If a node dies and the VMs that were on it are configured as HA, it will take 2 or 3 minutes until they are started on one of the remaining nodes, no matter whether you use ZFS or Ceph.
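
As a rough sketch, "configured as HA" boils down to adding the VM as an HA resource, for example (VMID 100 and the group name are placeholders):

ha-manager groupadd prod --nodes "node1,node2,node3"
ha-manager add vm:100 --state started --group prod

The HA manager then restarts vm:100 on another node of the group after the failed node has been fenced, which is where those 2 or 3 minutes go.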

Ceph has quite a few requirements if you want decent performance: a fast, low-latency network (ideally dedicated to Ceph) and more CPU and memory resources on the nodes for its services. In return it is a fully clustered storage, which means that all nodes see the same data all the time.

ZFS is local storage, so each node has its own. You can use VM replication to keep a recent version of the VM disks on the other nodes, so that a live migration is faster and, in case of a node failure, the other nodes have a recent copy of the disks. But there is potential for data loss, since the other nodes only have the disk state from the last replication run.
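
As a hedged sketch, such a replication job can be created on the CLI with pvesr (VMID 100, target node and schedule are just examples):

# replicate the disks of VM 100 to node2 every 15 minutes
pvesr create-local-job 100-0 node2 --schedule "*/15"

On failover you then lose at most whatever changed since the last successful run, here up to roughly 15 minutes of writes.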

ZFS is also useful if you have a bit higher latency between the nodes, as for Ceph it should definitely be in the sub-millisecond range.

I hope this helps you a bit further on deciding what works best in your situation.

Check the requirements section for Ceph in the Admin guide. The requirements page of Ceph itself is also linked there.
 
Hi, I haven't had a good experience with Ceph.
The VMs each run their own independent database (PostgreSQL) inside the VM itself.
I have a 10 Gb network and SSDs, but the performance was not good; on the contrary, customers complained about slowness with Ceph.
What could cause this slowness in reading?

Translated by Google Translate
 
Proxmox VE Ceph Benchmark 2020 paper, page 16:
Can I use consumer or prosumer SSDs, as these are much cheaper than enterprise-class SSDs?
No. Never. These SSDs won't provide the required performance, reliability or endurance. See the fio results from before and/or run your own fio tests.
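
For context, the fio tests referenced there measure single-threaded 4K synchronous write performance directly on the device; a sketch of such a run (destructive, so only point --filename at an empty disk; /dev/sdX is a placeholder):

fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=journal-test

Consumer SSDs typically collapse under this kind of sync write load, which is close to what Ceph OSDs do for their WAL/journal.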
 

If you want a lot of performance out of Ceph, it needs many nodes and many OSDs (HDDs or SSDs).
But Ceph also has some "magic technology" that can help you pull performance up:

For example cache tiering, which puts a cache pool between the clients and the backend pool.
Or the persistent write-back cache, which moves remote writes closer to the client side.
 
If you think about using cache tiering, be aware of the warnings in the docs: we do not support it officially, and it is generally not widely used, so people don't know it that well should there be problems or bugs.

The recommended way is to run separate pools for different device classes. With that you can have a fast SSD or NVMe pool and a slow HDD pool and place VM disks on them as you need.
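
As a rough sketch of what that looks like on the CLI (rule and pool names, and the PG counts, are just examples):

ceph osd crush rule create-replicated replicated-ssd default host ssd
ceph osd crush rule create-replicated replicated-hdd default host hdd
ceph osd pool create fast-vms 128 128 replicated replicated-ssd
ceph osd pool create slow-vms 128 128 replicated replicated-hdd

Each pool then only uses OSDs of the matching device class, and both can be added as separate RBD storages in Proxmox VE.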
 
Yes, as the Proxmox staff said in #12, the performance gains from cache tiering are hard to quantify, and database loads need a lot of performance.

So if you only have a small cluster, don't use Ceph for this.
Personally I'd recommend distributing PostgreSQL at the database level and using local disks; ZFS raidz1 is a good choice.

You can have a look at this:
https://docs.oracle.com/cd/E36784_01/html/E36845/chapterzfs-db3.html
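
In the spirit of that guide, a minimal sketch of a ZFS dataset tuned for PostgreSQL data files (pool/dataset names are placeholders; PostgreSQL writes 8K pages):

zfs create -o recordsize=8K -o compression=lz4 -o atime=off tank/pgdata

Whether further tuning (logbias, a separate dataset for the WAL, etc.) pays off depends on the workload.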
 
ZFS raidz1 won't be a good choice for running databases, as you need to increase the volblocksize to at least 16K (at least as long as you are using ashift=12) if you don't want to waste a lot of capacity because of padding overhead. So a 3-disk raidz1 should be terrible for PostgreSQL with its 8K writes, because the volblocksize has to be at least 16K. And MySQL with its 16K writes will be terrible on a raidz1 with 4 or more disks, where the volblocksize has to be 32K or higher.
For databases you should use a striped mirror.
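
A rough sketch of such a striped mirror (RAID10-like) pool; device names and the ashift are placeholders for your hardware:

zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

In Proxmox VE the block size used for newly created zvols is a property of the ZFS storage (blocksize), e.g. pvesm set local-zfs --blocksize 8k, which then matches PostgreSQL's 8K writes.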
 
As promised, I'm coming back to thank you all. I set up an (enterprise) NVMe pool and the performance was excellent.

Now I have another question: the Recovery/Rebalance is not at 100%, what could have happened?

[Screenshot attachment: Captura de tela 2022-11-01 100507.png]
 
Found this thread while researching... Nobody questioned the OP's use of RAID 0? One drive gone and the node is useless. Both drives need to be rebuilt.
 
