Well, as so often, the answer is: it depends.
Or, to be more precise: it depends on what workload you are planning to run, which downtimes/risks you can live with and what tradeoffs you are willing to make.
First: ZFS and Ceph both want to manage the disks directly, so if you happen to have a HW RAID controller you need to reconfigure it to "HBA mode"/"IT mode" or something like that.
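A quick way to sanity-check this (just a sketch, not a full procedure) is to verify that the individual disks show up as plain block devices with their real model and serial numbers, instead of hiding behind a single RAID volume:

```bash
# Disks should appear individually, with model and serial visible
lsblk -o NAME,SIZE,MODEL,SERIAL,TYPE

# Shows which storage controller the system sees (RAID vs. plain HBA/SAS)
lspci | grep -i -E 'raid|sas|hba'
```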
Second: Local ZFS will probably be faster than Ceph, but even with Storage Replication (
https://pve.proxmox.com/wiki/Storage_Replication ) you will have a minimal data loss due to its asynchronous nature. Basically, with Storage Replication the VM's disk data is transferred to any other configured node. By default the schedule is every 15 minutes; this can be reduced to one minute. So you need to ask yourself whether this is a loss you can live with or not.
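For illustration, a replication job with the shortest possible schedule could be created on the CLI roughly like this (a sketch; the VM ID 100 and the target node name pve2 are just placeholders):

```bash
# Replicate VM 100 to node "pve2" every minute instead of the 15-minute default
pvesr create-local-job 100-0 pve2 --schedule "*/1"

# Show the state and last sync time of the replication jobs on this node
pvesr status
```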
Third: Ceph is synchronous, but as explained by others, with three nodes you won't have auto-healing and you can only lose one node. So the cluster will continue to run if one node fails or is down due to maintenance (e.g. system updates + reboot), but not if both happen at the same time. Depending on your needs this might still be enough, even if you don't want to invest in a fourth or fifth node. Udo's writeup is great but more aimed at homelabbers with very basic hardware. If you implement Ceph according to the recommendations (
https://pve.proxmox.com/wiki/Deploy...r#_recommendations_for_a_healthy_ceph_cluster ) you should have good performance and no issues with the given constraints of a three-node cluster (aka not allowing failure/downtime of two nodes at the same time).
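To make the one-node-failure constraint concrete, here is a rough sketch of the relevant CLI steps on a PVE cluster with Ceph already installed (the network 10.10.10.0/24, the device /dev/nvme0n1 and the pool name vmpool are just example values):

```bash
# Dedicated Ceph network (run once; a separate internal network
# can additionally be set with --cluster-network)
pveceph init --network 10.10.10.0/24

# On each of the three nodes: create a monitor and one or more OSDs
pveceph mon create
pveceph osd create /dev/nvme0n1

# Replicated pool with 3 copies; I/O continues as long as 2 copies are reachable.
# With only three nodes that means: one node may be down, but a second failure
# (e.g. during the first node's reboot) stops I/O until a node comes back.
pveceph pool create vmpool --size 3 --min_size 2 --add_storages
```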
Fourth: ZFS will probably perform worse than ext4 or XFS, but it also has way more features (see:
https://forum.proxmox.com/threads/f...y-a-few-disks-should-i-use-zfs-at-all.160037/ ). Nonetheless, if you really need it, you could benchmark your workloads and compare ZFS to ext4 or XFS. This might be especially interesting for MariaDB/MySQL/PostgreSQL databases, since XFS has a reputation for better performance with large databases. However, since you then won't have the replication of ZFS or Ceph, you would need to implement it on the application level, e.g. via the internal clustering/replication mechanisms of the SQL databases. Please note that Veeam at the moment doesn't support replication/clusters for MySQL/MariaDB and PostgreSQL (not sure about MS SQL).
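If you go down that route, a simple fio run on a test file, repeated on ZFS, ext4 and XFS, is usually enough for a first comparison. A minimal sketch (path, block size and runtime are just example values, pick them to roughly match your database workload):

```bash
# 8k random writes with some queue depth, repeated on each filesystem under test.
# If --direct=1 is rejected on your ZFS version, drop it and compare without it.
fio --name=db-test --filename=/mnt/testfs/fio.tmp --size=10G \
    --rw=randwrite --bs=8k --iodepth=32 --numjobs=4 \
    --ioengine=libaio --direct=1 --runtime=120 --time_based --group_reporting
```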
Fifth: For both cases (Ceph + ZFS) the Proxmox website has success stories which might be worth a read:
https://www.proxmox.com/en/about/about-us/stories and of course this forum. Several of the professionals here have reported on SMB customers who are happy with very small setups (two nodes + QDevice for ZFS replication, three nodes for Ceph), since they could live with the limitations but couldn't afford more hardware. Here for example is a recent thread on MS SQL Server and Ceph (but with a five-node cluster with four storage nodes):
https://forum.proxmox.com/threads/anyone-using-ceph-mssql-server.104073/#post-823988
And here is a quote from the German forum from
@Falk R. (he is a freelance consultant who supports businesses in their migration to Proxmox VE), translated with DeepL:
"Wenn du heutzutage neue Server kaufst, würde ich für Ceph immer 100G einplanen. Ich baue seit Jahren keine neuen Cluster mit unter 100G für Ceph auf.
Auch wenn viele meinen, man braucht mehr als 3 Nodes, dann ist man sofort bei 5 und das sprengt oft den Rahmen kleinerer Firmen. 3 Nodes reichen locker, wenn man damit leben kann, dass während der 5 Minuten Reboot eines Nodes, kein anderer sterben darf."
"If you're buying new servers these days, I would always plan for 100G for Ceph. I haven't built any new clusters with less than 100G for Ceph in years.
Even though many people think you need more than 3 nodes, you immediately end up with 5, which often exceeds the budget of smaller companies. 3 nodes are easily sufficient if you can live with the fact that no other node is allowed to die during the 5-minute reboot of one node.
Translated with DeepL.com (free version)"
Falk has explained his reasoning for 100G on new clusters several times: in his experience the network and slow (non-NVMe) storage are usually the bottlenecks. So if you need to buy new hardware anyhow, it makes sense to plan with some headroom for further growth and more performance. This means: U.2 NVMe SSDs with power-loss protection, a dedicated 100G network for the Ceph storage, and two further (slower) network links for corosync cluster communication and the regular guest traffic. This fits the official recommendations for a healthy cluster, which suggest at least three network links:
- one very high bandwidth (25+ Gbps) network for Ceph (internal) cluster traffic.
- one high bandwidth (10+ Gbps) network for the Ceph (public) traffic between the Ceph servers and Ceph clients. Depending on your needs this can also be used for the virtual guest traffic and the VM live-migration traffic.
- one medium bandwidth (1 Gbps) network exclusively for the latency-sensitive corosync cluster communication.
https://pve.proxmox.com/wiki/Deploy...r#_recommendations_for_a_healthy_ceph_cluster
Please note that Falk replaced the 25+ Gbps network with 100G. Depending on your workload you might also want to swap the 10 Gbps network for something faster. For corosync, 1 Gbps is usually enough though.
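As an illustration, the separation could look roughly like this in /etc/network/interfaces on each node (a sketch only; the interface names, bridge name and subnets are assumptions, not a recommendation for your specific hardware):

```
# 1 Gbps link, dedicated to corosync
auto eno1
iface eno1 inet static
    address 10.10.1.11/24

# 100 Gbps link, Ceph (internal) cluster network
auto ens1f0
iface ens1f0 inet static
    address 10.10.2.11/24

# 10+ Gbps link, bridged for guests, Ceph public traffic and live migration
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.11/24
    gateway 192.168.1.1
    bridge-ports ens1f1
    bridge-stp off
    bridge-fd 0
```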
I would recommend getting a Proxmox VE partner on board to plan the architecture for your migration according to your business needs:
https://proxmox.com/en/partners/find-partner/explore
It might also be worth booking a training so you don't need to figure everything out from the documentation on your own.