Proxmox Cluster Setup - What is the "best"?

oops404 · New Member · May 27, 2023
Hello all,

I have just built a Proxmox cluster at home with 3 mini PCs. Every node has two 256GB SSDs (ZFS RAID1) and a single 4TB M.2 NVMe SSD for data. I also have an "old" 2-bay DiskStation NAS running 2 HDDs in RAID0 to get 5TB of space, which I mount over NFS and plan to use as the backup target for the containers and the data.

I want to use the local ZFS RAID1 for Proxmox itself and the containers, and then mount the data storage into them. I plan to run a few containers plus a Nextcloud container holding all our documents and pictures from over the years, so the data is important to me.

I planned to use Ceph with the 4TB SSDs of all 3 nodes and a replication factor of 3, but I checked this calculator: ceph-calculator, including its notes ("It's surprisingly easy to get into trouble. Mainly because the default safety mechanisms (nearfull and full ratios) assume that you are running a cluster with at least 7 nodes. For smaller clusters the defaults are too risky."). The values I get there suggest I might need a replication factor of 2 instead, but I'm not happy with only 2 copies. On the other hand, if the cluster runs out of storage while self-healing, as described there, I may be lost if the backup is not working ;-)
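For a rough sanity check, here is the back-of-the-envelope math I get (assuming the stock nearfull warning ratio of 0.85 and one 4TB OSD per node, so the numbers are only approximate):

    raw capacity        = 3 nodes x 4 TB          = 12 TB
    usable at replica 3 = 12 TB / 3               =  4 TB
    comfortable fill    = 4 TB x 0.85 (nearfull)  = ~3.4 TB
    usable at replica 2 = 12 TB / 2 x 0.85        = ~5.1 TB

So with replica 3 I should stay below roughly 3.4TB before Ceph starts warning, with replica 2 below roughly 5.1TB.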

How would you set up your Proxmox cluster?

Or would it be better to make each 4TB SSD a single ZFS pool and then replicate between the nodes with some command? Would it then be possible for a container to fail over and use the local 4TB storage of whichever node it runs on?

In the end I want 4TB or more of usable disk space. More would be better, let's say 6TB would be wonderful, but 4TB is okay ;-)

Any ideas? Some other protocol, or stick with Ceph, and if so with which replication factor? Only 2? What about GlusterFS? :D

Best,
Dominic
 
I have been running a lot of small 3-node clusters for years without any problem (replica 3).

For the monitors, you can lose 1 node.

For the OSDs, you can lose 2 disks at the same time.

For the disks, the real risk is: how many disks can you lose at the same time before the replication has finished repairing?

So if you have 18TB HDDs on a 1Gb network and it takes a week to repair, that is riskier than fast 2TB NVMe drives on a 50Gb network where the repair takes 5 minutes.
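To give a rough feeling for those numbers (pure line-rate math; real recovery is throttled and usually slower):

    1 Gbit/s  ≈ 125 MB/s   ->  refilling 18 TB ≈ 18e12 / 125e6  ≈ 144,000 s ≈ 40 hours (days in practice)
    50 Gbit/s ≈ 6.25 GB/s  ->  refilling  2 TB ≈  2e12 / 6.25e9 ≈ 320 s     ≈ 5 minutes (if the disks keep up)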
 
Thanks for the response.

I would say this is not totally correct, if I understand it right, but I'm new to Ceph and storage topics.

As an example: I have 3x 4TB disks and I use replica 3, so I can use 4TB because there is a copy on every disk. Correct? So in theory two hosts can fail and I still have no data loss. Right?

But it is not that easy ;-) Let's say we have this example and now one 4TB disk dies. What happens? No data loss, there are still 2 copies, but then the self-repair starts, right? So I have 4TB of data that should be replicated with factor 3, but my cluster is no longer 12TB in total; because of the failure it is only 8TB. Now the system tries to get back to 3 replicas, which only works if I use no more than 2.67TB. If I have 3.5TB of data, the disks will run out of space and we get real issues.

So if you have a really fast network, you get into trouble even earlier.

You need more nodes to fix this, but I don't want to pay for them, so I'm thinking about replica 2.

Correct?
 
As an example: I have 3x 4TB disks and I use replica 3, so I can use 4TB because there is a copy on every disk. Correct? So in theory two hosts can fail and I still have no data loss. Right?
Yes, but you'll have downtime if you lose 2 of 3 replicas, as Ceph blocks storage IO. And if something bad happens to the only remaining data disk during recovery, you will have data loss.

But it is not that easy ;-) Let's say we have this example and now one 4TB disk dies. What happens? No data loss, there are still 2 copies, but then the self-repair starts, right?

In your case: no! Recovery only happens if you have disks left to recover onto, and data only recovers according to your CRUSH rule, which usually means on a per-host basis. As you only have one disk per host, Ceph won't recover anything.

So I have 4TB of data that should be replicated with factor 3, but my cluster is no longer 12TB in total; because of the failure it is only 8TB. Now the system tries to get back to 3 replicas, which only works if I use no more than 2.67TB. If I have 3.5TB of data, the disks will run out of space and we get real issues.
No, that's not how a default Proxmox Ceph setup works; the default CRUSH rule replicates across hosts. So if you have 3 hosts, each with one 4TB disk, you can only lose one host, or in your case only one disk, since a disk failure effectively means a host failure in your setup.

Ceph does not create a third copy on the remaining hosts (that makes no sense: why should it, you already have a copy on each of the remaining hosts/disks).
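For reference, this is roughly what the default replicated CRUSH rule looks like if you decompile the CRUSH map (e.g. ceph osd getcrushmap -o map.bin && crushtool -d map.bin -o map.txt); the decisive part is "chooseleaf ... type host", which puts each replica on a different host:

    rule replicated_rule {
        id 0
        type replicated
        step take default
        # pick one leaf (OSD) per host for every replica
        step chooseleaf firstn 0 type host
        step emit
    }

That is why a lost copy can only be rebuilt on a host that does not already hold one.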

So if you have a really fast network, you get into trouble even earlier. You need more nodes to fix this, but I don't want to pay for them, so I'm thinking about replica 2.

Replica 2 is not recommended, because then you can't lose any copy anymore without trouble; you could only get around that by setting min_size 1, which is also NOT recommended and very dangerous, because your last remaining replica would not be protected against any issue that might happen to the data.

No.


I would recommend using ZFS async replication together with Proxmox HA. You need 3 servers for that, because you need a quorum vote for the Proxmox cluster. You can use direct cabling for the replication network, so there is no need for switches. HA works automatically, but you can lose some data on a node failure, depending on when the last sync happened.
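A minimal sketch of how that can look on the CLI (container ID 100, target node pve2 and a ZFS storage are made-up examples here; the exact syntax can differ between PVE versions, and the same can be configured in the GUI under Datacenter -> Replication and Datacenter -> HA):

    # replicate CT 100 to node pve2 every 15 minutes
    pvesr create-local-job 100-0 pve2 --schedule '*/15'

    # put CT 100 under HA management so it gets restarted on another node on failure
    ha-manager add ct:100

With a 15 minute schedule you would lose at most roughly the last 15 minutes of writes if the node dies right before the next sync.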
 
Thank you!! May I ask why not Ceph, if it works like you described? ZFS sync sounds like a solution that does not work so well, but maybe I'm wrong?
 

Can you specify or repeat the question? I did not understand the first part. ZFS sync can mean 5-15 minutes of data loss, depending on how much data you write per minute, what sync interval you have set, and when the last ZFS sync job ran.
 
If Ceph works like you described, I can set the replication factor to 3 and all is fine.

What is better about the solution with ZFS sync? Faster access to the storage, or something else?

I don't find much about how exactly I should configure ZFS sync. Do you maybe have a guide or an example?
 
Or do you mean the function from Proxmox itself?

https://pve.proxmox.com/wiki/PVE-zsync

Or Storage Replication?

Can I use this if the ZFS pool was not created from Proxmox? I created it myself so I could mount it into different containers; that did not work with a ZFS pool that Proxmox had created.
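In case it helps, a pool that was created by hand can normally still be registered as a Proxmox storage afterwards; a sketch (the pool name tank and the storage ID nvme-data are only placeholders):

    # make the existing, hand-created pool 'tank' usable for container and VM disks
    pvesm add zfspool nvme-data -pool tank -content rootdir,images

Whether that fits the mount-into-containers use case is a separate question, but the zfspool storage type itself does not require that the pool was created by Proxmox.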
 
Yeah, SIZE 3 and MIN-SIZE 2 and you are fine. This means each host gets one copy, no matter how many disks you have per host. But if you only have one disk per host, you can only lose one disk in the complete 3-node cluster setup. Why? Because losing 2 disks in a setup with 3 nodes that each have one disk effectively means losing 2 SERVERS. You could still recover from the last copy you have, but you will have downtime.

At least in an enterprise setup this would be a no-go; that's why you usually start with 4 OSDs per node, so you have enough spare space/disks to recreate the copy that went missing because of a disk failure on that specific host.
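For completeness, size/min_size are per-pool settings; a quick sketch with a made-up pool name vmdata (the same options are offered in the PVE GUI when creating a Ceph pool):

    # show and adjust the replication settings of an existing pool
    ceph osd pool get vmdata size
    ceph osd pool set vmdata size 3
    ceph osd pool set vmdata min_size 2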
 
Thank you. I'm building this privately at home, but the data is important to me, yeah. I have three of these old mini computers as my Proxmox cluster:

Fujitsu Q556 i5-6400T, Esprimo Mini-PC


There is one M.2 slot, which I have used for the 4TB SSD in each of the three nodes, and I put 16GB of RAM in each.

So you recommend buying another 4TB drive and adding it to one of the three nodes, correct? It does not feel right that one node would have two disks and the others not.

I'm limited in how I can connect new disks: I have USB 3.0 and SATA. What about mixing it up, e.g. adding 3x 1TB SSDs to each node or something like that?

I'm currently reading about Ceph erasure coding, but it seems that having a minimum of 4 disks is a good idea for that too.

What does downtime mean in this context? I have our private pictures and movies on the disks, so it is not critical if Ceph is self-healing for some time and I can't access the data. I only want it not to be lost.
 
OK, if downtime is not important, then you might go with 3/2 and only one disk per host.
I'm not sure if Ceph is something you want to hassle around with in a home setup, but why not. New things to learn might be cool :)
 

I believe min_size 2 applies to writes, so in a 3-node cluster with 1 OSD each and a 3/2 factor, if 2 OSDs are down you should still be able to read everything, but not able to write... So not a complete downtime, depending on the use case.

But that only applies to the disks themselves going down. You will have complete downtime when 2 Ceph monitors out of 3 are down.
 
Mmmmh. Maybe I'll try a local single-disk ZFS pool with the same name on each node?

Then I'll test whether the data replication works and whether the container failover works.
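A rough sketch of that test with made-up names (pool tank, container 100, nodes pve1/pve2); the device path and the exact flags depend on your hardware and PVE version:

    # on every node: create a single-disk pool with the identical name
    zpool create tank /dev/nvme0n1

    # once the storage and a replication job are set up (see the sketches above),
    # check that the replication actually runs
    pvesr status

    # then test the failover path with a restart migration of CT 100 to pve2
    pct migrate 100 pve2 --restart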
 
