Proxmox Cluster Setup - What is the "best"?

oops404 · New Member · May 27, 2023
Hello all,

I have just built a Proxmox cluster at home with 3 mini PCs. Every node has two 256GB SSDs (ZFS RAID1) and a single 4TB M.2 NVMe SSD for data. I also have an "old" 2-bay DiskStation NAS running 2 HDDs in RAID0 to get 5TB of space, which I mount over NFS and plan to use as the backup target for the containers and the data.

I want to use the local ZFS RAID1 for Proxmox itself and the containers, and then mount the data storage into them. I plan to run a few containers plus a Nextcloud container holding all our documents and pictures from over the years, so the data is important to me.

I planned to use Ceph with the 4TB SSDs of all 3 nodes and a replication factor of 3, but I checked this calculator: ceph-calculator, including its notes ("It's surprisingly easy to get into trouble. Mainly because the default safety mechanisms (nearfull and full ratios) assume that you are running a cluster with at least 7 nodes. For smaller clusters the defaults are too risky."). The values I get there suggest I might need a replication factor of 2 instead, but I'm not happy with only 2 copies. On the other hand, if the cluster runs out of storage while self-healing, as described there, I may be lost if the backup is not working ;-)
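For a rough sanity check, here is the back-of-the-envelope math I get (assuming the stock nearfull warning ratio of 0.85 and one 4TB OSD per node, so the numbers are only approximate):

    raw capacity        = 3 nodes x 4 TB          = 12 TB
    usable at replica 3 = 12 TB / 3               =  4 TB
    comfortable fill    = 4 TB x 0.85 (nearfull)  = ~3.4 TB
    usable at replica 2 = 12 TB / 2 x 0.85        = ~5.1 TB

So with replica 3 I should stay below roughly 3.4TB before Ceph starts warning, with replica 2 below roughly 5.1TB.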

How would you set up your Proxmox cluster?

Or would it be better to make each 4TB SSD a single ZFS pool and then replicate between the nodes with some command? Would it then be possible for a container to fail over and use the local 4TB storage of whichever node it runs on?

In the end I want 4TB or more of usable disk space. More would be better, let's say 6TB would be wonderful, but 4TB is okay ;-)

Any ideas? Some other protocol, or stick with Ceph, and if so with which replication factor? Only 2? What about GlusterFS? :D

Best,
Dominic
 
I have been running a lot of small 3-node clusters for years without any problem (replica 3).

For the monitors, you can lose 1 node.

For the OSDs, you can lose 2 disks at the same time.

For the disks, the real risk is: how many disks can you lose at the same time before the replication has finished repairing?

So if you have 18TB HDDs on a 1Gb network and it takes a week to repair, that is riskier than fast 2TB NVMe drives on a 50Gb network where the repair takes 5 minutes.
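To give a rough feeling for those numbers (pure line-rate math; real recovery is throttled and usually slower):

    1 Gbit/s  ≈ 125 MB/s   ->  refilling 18 TB ≈ 18e12 / 125e6  ≈ 144,000 s ≈ 40 hours (days in practice)
    50 Gbit/s ≈ 6.25 GB/s  ->  refilling  2 TB ≈  2e12 / 6.25e9 ≈ 320 s     ≈ 5 minutes (if the disks keep up)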
 
Thanks for the response.

I would say this is not totally correct, if I understand it right, but I'm new to Ceph and storage topics.

As an example: I have 3x 4TB disks and I use replica 3, so I can use 4TB because there is a copy on every disk. Correct? So in theory two hosts can fail and I still have no data loss. Right?

But it is not that easy ;-) Let's say we have this example and now one 4TB disk dies. What happens? No data loss, there are still 2 copies, but then the self-repair starts, right? So I have 4TB of data that should be replicated with factor 3, but my cluster is no longer 12TB in total; because of the failure it is only 8TB. Now the system tries to get back to 3 replicas, which only works if I use no more than 2.67TB. If I have 3.5TB of data, the disks will run out of space and we get real issues.

So if you have a really fast network, you get into trouble even earlier.

You need more nodes to fix this, but I don't want to pay for them, so I'm thinking about replica 2.

Correct?
 
As an example: I have 3x 4TB disks and I use replica 3, so I can use 4TB because there is a copy on every disk. Correct? So in theory two hosts can fail and I still have no data loss. Right?
Yes, but you'll have downtime if you lose 2 of 3 replicas, as Ceph blocks storage IO. And if something bad happens to the only remaining data disk during recovery, you will have data loss.

But it is not that easy ;-) Let's say we have this example and now one 4TB disk dies. What happens? No data loss, there are still 2 copies, but then the self-repair starts, right?

In your case: no! Recovery only happens if you have disks left to recover onto, and data only recovers according to your CRUSH rule, which usually means on a per-host basis. As you only have one disk per host, Ceph won't recover anything.

So I have 4TB of data that should be replicated with factor 3, but my cluster is no longer 12TB in total; because of the failure it is only 8TB. Now the system tries to get back to 3 replicas, which only works if I use no more than 2.67TB. If I have 3.5TB of data, the disks will run out of space and we get real issues.
No, that's not how a default Proxmox Ceph setup works; the default CRUSH rule replicates across hosts. So if you have 3 hosts, each with one 4TB disk, you can only lose one host, or in your case only one disk, since a disk failure effectively means a host failure in your setup.

Ceph does not create a third copy on the remaining hosts (that makes no sense: why should it, you already have a copy on each of the remaining hosts/disks).
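For reference, this is roughly what the default replicated CRUSH rule looks like if you decompile the CRUSH map (e.g. ceph osd getcrushmap -o map.bin && crushtool -d map.bin -o map.txt); the decisive part is "chooseleaf ... type host", which puts each replica on a different host:

    rule replicated_rule {
        id 0
        type replicated
        step take default
        # pick one leaf (OSD) per host for every replica
        step chooseleaf firstn 0 type host
        step emit
    }

That is why a lost copy can only be rebuilt on a host that does not already hold one.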

So if you have a really fast network, you get into trouble even earlier. You need more nodes to fix this, but I don't want to pay for them, so I'm thinking about replica 2.

Replica 2 is not recommended, because then you can't lose any copy anymore without trouble; you could only get around that by setting min_size 1, which is also NOT recommended and very dangerous, because your last remaining replica would not be protected against any issue that might happen to the data.

No.


I would recommend using ZFS async replication together with Proxmox HA. You need 3 servers for that, because you need a quorum vote for the Proxmox cluster. You can use direct cabling for the replication network, so there is no need for switches. HA works automatically, but you can lose some data on a node failure, depending on when the last sync happened.
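A minimal sketch of how that can look on the CLI (container ID 100, target node pve2 and a ZFS storage are made-up examples here; the exact syntax can differ between PVE versions, and the same can be configured in the GUI under Datacenter -> Replication and Datacenter -> HA):

    # replicate CT 100 to node pve2 every 15 minutes
    pvesr create-local-job 100-0 pve2 --schedule '*/15'

    # put CT 100 under HA management so it gets restarted on another node on failure
    ha-manager add ct:100

With a 15 minute schedule you would lose at most roughly the last 15 minutes of writes if the node dies right before the next sync.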
 
Thank you!! May I ask why not Ceph, if it works like you described? ZFS sync sounds like a solution that does not work so well, but maybe I'm wrong?
 

Can you specify or repeat the question? I did not understand the first part. ZFS sync can mean 5-15 minutes of data loss, depending on how much data you write per minute, what sync interval you have set, and when the last ZFS sync job ran.
 
If Ceph works like you described, I can set the replication factor to 3 and all is fine.

What is better about the solution with ZFS sync? Faster access to the storage, or something else?

I don't find much about how exactly I should configure ZFS sync. Do you maybe have a guide or an example?
 
Or do you mean the function from Proxmox itself?

https://pve.proxmox.com/wiki/PVE-zsync

Or Storage Replication?

Can I use this if the ZFS pool was not created from Proxmox? I created it myself so I could mount it into different containers; that did not work with a ZFS pool that Proxmox had created.
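In case it helps, a pool that was created by hand can normally still be registered as a Proxmox storage afterwards; a sketch (the pool name tank and the storage ID nvme-data are only placeholders):

    # make the existing, hand-created pool 'tank' usable for container and VM disks
    pvesm add zfspool nvme-data -pool tank -content rootdir,images

Whether that fits the mount-into-containers use case is a separate question, but the zfspool storage type itself does not require that the pool was created by Proxmox.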
 
Yeah, SIZE 3 and MIN-SIZE 2 and you are fine. This means each host gets one copy, no matter how many disks you have per host. But if you only have one disk per host, you can only lose one disk in the complete 3-node cluster setup. Why? Because losing 2 disks in a setup with 3 nodes that each have one disk effectively means losing 2 SERVERS. You could still recover from the last copy you have, but you will have downtime.

At least in an enterprise setup this would be a no-go; that's why you usually start with 4 OSDs per node, so you have enough spare space/disks to recreate the copy that went missing because of a disk failure on that specific host.
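For completeness, size/min_size are per-pool settings; a quick sketch with a made-up pool name vmdata (the same options are offered in the PVE GUI when creating a Ceph pool):

    # show and adjust the replication settings of an existing pool
    ceph osd pool get vmdata size
    ceph osd pool set vmdata size 3
    ceph osd pool set vmdata min_size 2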
 
Thank you. I'm building this privately at home, but the data is important to me, yeah. I have three of these old mini computers as my Proxmox cluster:

Fujitsu Q556 i5-6400T, Esprimo Mini-PC


There is one M.2 slot, which I have used for the 4TB SSD in each of the three nodes, and I put 16GB of RAM in each.

So you recommend buying another 4TB drive and adding it to one of the three nodes, correct? It does not feel right that one node would have two disks and the others not.

I'm limited in how I can connect new disks: I have USB 3.0 and SATA. What about mixing it up, e.g. adding 3x 1TB SSDs to each node or something like that?

I'm currently reading about Ceph erasure coding, but it seems that having a minimum of 4 disks is a good idea for that too.

What does downtime mean in this context? I have our private pictures and movies on the disks, so it is not critical if Ceph is self-healing for some time and I can't access the data. I only want it not to be lost.
 
OK, if downtime is not important, then you might go with 3/2 and only one disk per host.
I'm not sure if Ceph is something you want to hassle around with in a home setup, but why not. New things to learn might be cool :)
 

I believe min_size 2 applies to writes, so in a 3-node cluster with 1 OSD each and a 3/2 factor, if 2 OSDs are down you should still be able to read everything, but not able to write... So not a complete downtime, depending on the use case.

But that only applies to the disks themselves going down. You will have complete downtime when 2 Ceph monitors out of 3 are down.
 
Mmmmh. Maybe I'll try a local single-disk ZFS pool with the same name on each node?

Then I'll test whether the data replication works and whether the container failover works.
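A rough sketch of that test with made-up names (pool tank, container 100, nodes pve1/pve2); the device path and the exact flags depend on your hardware and PVE version:

    # on every node: create a single-disk pool with the identical name
    zpool create tank /dev/nvme0n1

    # once the storage and a replication job are set up (see the sketches above),
    # check that the replication actually runs
    pvesr status

    # then test the failover path with a restart migration of CT 100 to pve2
    pct migrate 100 pve2 --restart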
 
