Given my modest hardware, what is the best HA configuration?

danielo515

Hello.
About a year ago I started using Proxmox on a small and cheap Intel NUC with 8 GB of RAM. I was so pleased that I bought a second, more powerful and more capable NUC to start a cluster. I quickly realised that there is little difference between 1 and 2 nodes: the limitations were roughly the same. So I decided to buy a third modest node, this time a used HP EliteDesk. My hardware is as follows:
  • Intel NUC with a single 128GB SATA 3 SSD (very poor, I know) and 8GB of RAM
  • Intel NUC with a 240GB NVMe SSD, a 1TB SATA 3 SSD and 32GB of RAM
  • HP EliteDesk 800 G1 with a single 500GB SSD and 8GB of RAM
Now I have a happy HA cluster... but no. I already discarded the idea of using Ceph because it requires quite powerful hardware, a lot of bandwidth and identical hardware on each node.
Then I realised there is a replication feature, so I tried to use it, but all my nodes were installed with ext4 as the root filesystem (replication needs ZFS), so no luck there either.
Because of this situation I started wondering which combination of filesystems and disks might give me a more fault-tolerant setup.

But before I start doing stupid experiments, I decided to ask if what I want to try makes sense for me.
This is my idea: because my most powerful node also has an extra disk, I will format that one with ZFS so I can use it as a replication target.
Because the other two nodes only have one disk, I'm thinking about reinstalling Proxmox with ZFS on the root partition, one node at a time, moving the running VMs to the more powerful node while I do it.
I am also considering buying another 1TB disk for my first modest node and installing Proxmox on it with ZFS.

Does my plan make any sense? Can you think of a better strategy to take advantage of my modest hardware?
Please note that this is just a small homelab cluster that I use for home automation and fun, so please don't look at it with enterprise level eyes :)

I also have a QNAP NAS with plenty of storage and 16GB of RAM. I use it to mount a CIFS share for backups. Can it help with this plan?
 
If you want to use ZFS with replication, all 3 nodes need a pool with the same name and the replication will keep all 3 synced. In that case you can only use up to 128GB because that is your smallest disk. Also keep in mind that ZFS needs a lot of RAM (4GB of RAM for the ARC should work for your setup), but then 2 of your 3 nodes will only have about 2-3 GB left for VMs.
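If RAM gets tight you can also cap the ARC; a minimal sketch (the 4GB from above, expressed in bytes):

Bash:
# Cap the ZFS ARC at 4GB (value in bytes) so it doesn't compete with the VMs for RAM
echo "options zfs zfs_arc_max=4294967296" > /etc/modprobe.d/zfs.conf
# Apply it right away without rebooting
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
# Make sure the limit is also picked up during early boot
update-initramfs -u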

Another option would be to use NFS as shared storage. In that case your NUC with 32GB RAM or your QNAP could store the VMs for all nodes, but that isn't really fast if you run all VMs over the network, and because everything is stored in one place it isn't good for HA: if the host acting as NFS server fails, all 3 nodes fail.
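For example, adding such an NFS export as cluster-wide storage is a one-liner (storage name, server IP and export path are placeholders for whatever your NAS exports):

Bash:
# Add an NFS export as shared storage for all nodes
# (storage name, IP and export path are placeholders - adjust to your NAS)
pvesm add nfs qnap-nfs --path /mnt/pve/qnap-nfs --server 192.168.0.20 --export /share/proxmox --content images,rootdir
# Check that the storage shows up as active on every node
pvesm status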
 
Thanks for your answer. I can consider upgrading the disk of the EliteDesk to 1TB and putting the 512GB one in the smallest NUC. However, it seems that ZFS can't work in single-disk configurations? So do I need a minimum of 2 disks on each node?
I already tried running VMs over NFS, and it was not a pleasant experience because my NAS needs to be rebooted quite often.
 
With two disks you can benefit from the self-healing features of ZFS, but it will work with single disks as well.
I can totally relate to your considerations, since I also started with a single node, became annoyed because of the down times and ended up with a three-node cluster running Ceph over 10 GbE.
It's hard to give advice on such hardware, since Ceph is out because of missing 10 GbE. You could consider running one of the nodes as a storage node, but that would still inflict down time upon a reboot. A stable kernel could reduce the need for reboots.
For true HA you'll definitely need bigger pipes in your network or a reliable SAN solution.
 
Even if I decide to buy a 10Gb switch, I don't think any of my nodes has enough power to take advantage of it.
I saw that I can create a ZFS pool with just one disk, using the spare disk I have in my more powerful NUC, so that's a first step.
I will try to re-install the EliteDesk with ZFS on the root partition and see if I can schedule replication between those. I'm not sure, however, whether it would be better to use ext4 on one partition for the OS and another partition for the ZFS pool (so the host doesn't have to deal with ZFS all the time).
 
ZFS isn't that compute intensive... your bigger problem is the RAM, and that wouldn't be much better if you used ext4 for your 128/256GB root partition: ZFS would still be using its 4GB of RAM for the data pool.
And again... if you want to use replication, all your nodes need to use the same pool name. If you install Proxmox using ZFS this will always be "rpool", so the pool on your spare disk needs to be named "rpool" too. And depending on your workload, ZFS might kill your SSDs quite fast if they are cheap consumer SSDs, so you should monitor them with smartctl to see how high the wearout is.
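For the spare disk that boils down to something like this (the disk ID is a placeholder, look up the real one under /dev/disk/by-id/):

Bash:
# Create a single-disk pool named "rpool" on the spare SSD
# (disk ID is a placeholder; ashift=12 suits most SSDs)
zpool create -o ashift=12 rpool /dev/disk/by-id/ata-YOUR-SPARE-SSD
# Register it as a ZFS storage in Proxmox so VMs/CTs can use it
pvesm add zfspool rpool --pool rpool --content images,rootdir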
 
Hello again.
I just "suffered" the fact that the pool name can't be customised during the Proxmox installation. I had to remove the ZFS pool from the cluster, remove the ZFS pool on the host and then delete all the disk partitions.
Now I have created a new pool named rpool, scheduled a replication and to my surprise it failed with the following log:


Bash:
2021-05-09 13:47:01 106-0: start replication job
2021-05-09 13:47:01 106-0: guest => CT 106, running => 0
2021-05-09 13:47:01 106-0: volumes => rpool:subvol-106-disk-0
2021-05-09 13:47:01 106-0: (remote_prepare_local_job) storage 'rpool' is not available on node 'proxmox-3'
2021-05-09 13:47:01 106-0: end replication job with error: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=proxmox-3' root@192.168.0.10 -- pvesr prepare-local-job 106-0 rpool:subvol-106-disk-0 --last_sync 0' failed: exit code 255


Why is that? The pool exists on both servers and has the same name:


Bash:
root@proxmox-3:~# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

    NAME                                         STATE     READ WRITE CKSUM
    rpool                                        ONLINE       0     0     0
      ata-ST500LM000-1EJ162-SSHD_W765EYRH-part3  ONLINE       0     0     0

errors: No known data errors
root@proxmox-3:~#



root@proxmox-2:~# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      sdb       ONLINE       0     0     0

errors: No known data errors


Thanks for the advice about disks being killed by ZFS. I'll keep an eye on that.
 
Did you define the storage entry on every node? It's not about the name of the pool but about the name of the storage, which has to be the same.
 
OK, I found what's required.
I have to go to the Datacenter level of the cluster and allow the ZFS storage on both nodes. Now it seems to work.
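For reference, the same can be done from the shell (storage and node names as they appear in this thread):

Bash:
# Make the "rpool" storage available on both nodes
# (same as ticking the nodes under Datacenter -> Storage -> Edit)
pvesm set rpool --nodes proxmox-2,proxmox-3
# Verify that the target node now sees it
ssh root@proxmox-3 pvesm status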
 
I'm not sure if that is the same as what I said (enabling it on all nodes at the Datacenter level).
So it is not required that the ZFS pools are named the same? I had the impression that it was needed.
 
You could probably define storage entries that are restricted to certain nodes in order to use the same storage name for different target zpools.
It's definitely easier if all the pools are named the same.
Like you said, enabling the storage for every node is what it takes.
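Either way, the result ends up in /etc/pve/storage.cfg; a node-restricted ZFS entry looks roughly like this (names taken from this thread):

Bash:
cat /etc/pve/storage.cfg
# should contain something along the lines of:
#
#   zfspool: rpool
#           pool rpool
#           content images,rootdir
#           nodes proxmox-2,proxmox-3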
 
I just opened my newest node, the EliteDesk 800 G1 mini, and found two nice surprises:
  • It has an M.2 slot. It is limited to SATA III, but it allows me to have two disks, which is very nice. I ordered a 500GB M.2 module for a total of 1TB. Of course I don't plan to put both disks in RAID, because the HDD is a spinning one, but having some extra space on the node is always nice, even if it is only for backups, images, etc.
  • It has a single 8GB RAM module, which allows me to upgrade it to 16GB very cheaply, so I also ordered an extra 8GB memory module.
With these new enhancements I now have at least two nodes that can replicate between them, which is one step closer to the HA I was dreaming about.
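For reference, the replication job itself can also be created from the CLI instead of the GUI (guest ID, target node and schedule below are just examples):

Bash:
# Replicate guest 106 to node proxmox-3 every 15 minutes
# (job ID, node name and schedule are just examples)
pvesr create-local-job 106-0 proxmox-3 --schedule "*/15"
# Show all replication jobs and when they last ran
pvesr status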
 
You seem to be in somewhat the same boat as me, @danielo515. I too started out with a single simple Proxmox server (an old Dell Optiplex 7010 in my case) which I soon upgraded to 16GB ram and a consumer SSD. Worked pretty nice, until I got my hands on a second Optiplex 7010 and decided I wanted to have the cluster features to do migration and HA.
From that point on, my investments pretty much skyrocketed since the consumer SSD's were indeed eaten alive by the ZFS + cluster + HA combination. The wearout level on the primary node's SSD (a Samsung 840 EVO) went from 2 to 24 in a matter of weeks...
So I bought second hand Samsung 863a's for all my nodes and a couple of old 10GBase-T NICs while I was at it.
Added a third node, my old i5-2500 desktop which was supposed to become a backup node + storage server. Unfortunately, the thing failed me on nearly every step I took with it, forcing me to eventually replace it with a second hand 'storage machine' (sporting 6 HGST 3TB NAS disks, a fourth-gen i5 and 16GB RAM) which had been running FreeNAS for years under the previous owner.

I had to re-install the cluster nodes several times, re-configured (and expanded) my entire network and I pretty much ran into every pitfall there is along the way, but now I'm slowly getting there:
  • 2 primary nodes run my main VM's / LXC's, the most important ones HA
  • Primary storage is now 'server grade' with the sm863a's (though redundancy is achieved only by replication jobs between the nodes)
  • The 10GBE NICs are AWESOME for fast replication (even though I have not seen the traffic get above about 400MB/sec, so there is still room for improvement)
  • The onboard NICs have been re-purposed for dedicated Corosync traffic, with the 10GBE as fallback
  • Node 3, the 'storage machine', is my next target; it will need to replace my old QNAP TS-639Pro which, after over 11(!) years of faithful service, has crossed the reliability (and performance) threshold for me. All I need to find out now is how to configure the 6x 3TB most efficiently for use as shared storage... Perhaps a TrueNAS VM with storage pass-through?
Long story short - you're not alone in your quest. The fast NUC with 32GB ram sounds like a pretty nice machine (none of my nodes have that much ram!) and the EliteDesk 800 G1 might provide all the expansion options you need. As for SSD's - look at the TBW spec carefully before giving them a demanding role in your cluster. Anything below a TBW of several PB's will probably not cut it.
If you want to explore the 10GBE route - I know you can get second hand dual-10GBE server nics fairly cheap in many places. Especially if you go for the SFP+ models. If your nodes are in close proximity of each other, you can get away with just buying cheap DAC cables to connect your nodes directly to each other, no 10GBE switch needed.
The only thing I'm unsure of is if this will work for the NUC(s), as these probably don't have PCIe slots.
For those, 5GBE USB3 dongles exist, but I have no experience with those unfortunately.
I do have an old NUC myself (Celeron model with 8GB max), still thinking of adding that to the cluster as node 4. In which case I'll be running into a lot of the same challenges, probably.
Keep us posted!
 
Wow, thank you very much for sharing your story.
Indeed it is easy to go down the rabbit hole. I thought my mechanical keyboard addiction was expensive... until I discovered home labs. Damn it, each hobby is more expensive than the previous one.

My main limitation is that I want to keep everything in my office, so the noise level (fans) can't get too high; that's why I'm using NUCs and old office hardware like micro PCs.

Thanks for the advice on wearout, I didn't know there was such a thing. I've just started using ZFS, so it makes sense that I hadn't paid much attention to it. I'm not using high-grade disks, but they seem to perform well.

10GbE is indeed very tempting, but I don't think any of my hardware (other than my NAS) would be able to take advantage of it. Keep in mind that my desired nodes are micro PCs; I bet they can't even do 2.5Gb.

Some weeks ago I discovered that my HP EliteDesk 800 G1 has a slot for an M.2 drive (SATA 3), so I bought one and installed it along with an extra 8GB of RAM. Now I'm in doubt about what to do with that disk, because the disk where the OS is installed is mechanical. Is it better to keep the OS on the mechanical disk and use the M.2 SSD just for VMs? Or is it better to install the OS on the M.2 SSD, also use it for VMs, and keep the mechanical disk for backups/images? I think it is performing well on the mechanical disk (at least in terms of booting and such), so maybe it is not worth the effort of moving?
 
If the performance of the system on the spinning disk is okay, I wouldn't change it and would use the M.2 disk for images only. Keep an eye on the wearout, though. ;)
 
Noise level really isn't a problem if you buy the right stuff. You can get quiet tower servers that use all enterprise grade server hardware but normal 120mm fans and tower coolers. I would bet that these are quieter and less annoying than a small high-pitched 40mm fan inside a thin client.
A fast (low latency) SSD is way more important for the VM storage than for the system drive, so I would use it for the VMs if the M.2 SSD can handle all the writes. Like ph0x said, you really should monitor the usage because server workloads might kill consumer SSDs in weeks/months.
 
Thank you all for the responses!
I will keep the OS on the mechanical disk and keep an eye on the wearout. Is it enough to take a look at the wearout % in the disks list, or should I run some specific test?

I will also consider a 1U server with proper 120mm fans.
 
Good luck with that: 1U means 40mm fans; for 120mm you'll need at least 3U. :)
You can monitor the wearout through the GUI, provided you set up smartmontools to run recurring self-tests AND the wearout gets reported in the GUI.
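From the shell a quick check looks roughly like this (the device path is a placeholder and the exact attribute names differ per vendor):

Bash:
# Look for the wearout-related SMART attributes (device is a placeholder;
# SATA SSDs report e.g. "Wear_Leveling_Count", NVMe drives "Percentage Used")
smartctl -a /dev/sda | grep -iE 'wear|percentage used|lifetime'
# Start a long self-test by hand
smartctl -t long /dev/sda
# For recurring tests, enable smartd and add a schedule to /etc/smartd.conf, e.g.
#   /dev/sda -a -o on -S on -s (L/../../7/03)   # long test every Sunday at 03:00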
 
