[SOLVED] 3-node Cluster with a qdevice / running with a single node online.

ajeffco

Hello,

I have an idea that I'd like feedback on from the proxmox community and devs if possible.

I have a 3-node Proxmox cluster. Two nodes are identical hardware. The third node is similar to the other two, just with fewer resources (NICs and memory). I have ZFS replication running between the two "worker" nodes, and these two nodes are also configured in an HA group. This works very well when everything is operating normally, and even when one node is down for whatever reason. I have had a few issues unrelated to Proxmox that caused some initial problems with two nodes going offline, taking down all VMs on the cluster (as expected); those have been mostly resolved. I also have two servers external to Proxmox, one running Proxmox Backup Server and one running Plex.

My idea is: if I added a qdevice to each of the two external servers, would that allow me to keep one of the main "worker" nodes online if the other two "real" nodes in the cluster go offline for whatever reason? And would this allow HA to function with a single "real worker" node and two qdevices? If this works, is there any downside to it?

Thanks for any response,

Al


I explained this pretty badly I think.

I currently have a 3-node cluster with two equal nodes, where any VMs that need HA run in an HA group. The third node is a device with lower resources (less memory and fewer network ports) that is not in the group and hosts a few VMs that do not need HA. ZFS replication is running between the two HA nodes.

I'm planning on splitting the 2 HA nodes into separate racks. So far, no matter how I arrange the nodes, and regardless of the number of nodes, if the rack with more nodes goes offline, then so do the node(s) in the rack with fewer devices. I'm trying to find a way to keep the nodes in one rack online if the other rack fails, regardless of the node count in either rack.

Would it make more sense to remove the 3rd node and add 2 qdevices, if adding 2 qdevices is even an option? The 3rd node is a leftover from running a 3-node Ceph cluster; I really don't need it to stay if I can get resiliency with 2 nodes.

Again any feedback and/or advice is appreciated.

Al
 
I'm a little confused. Do you have HA using shared storage for the VMs? Is ZFS replication only for some non-HA VMs on the HA nodes?

Anyway, for node counts and qdevice recommendations, there is a small section in our documentation: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_supported_setups

I'm planning on splitting the 2 HA nodes into separate racks.
2 HA nodes split into separate racks means 1 node per rack.
So far no matter how I arrange nodes
=> You now mean the additional "Non-HA" nodes / qdevice?
 
Hello Dominic,

Thank you. I read that section of the documentation, but it didn't help me. I might also be using the wrong terminology for some of this. I apologize for the wall of words, but I'm hoping it will explain what I have running and what I'm trying to achieve.

I have a 3-node cluster; the nodes are pve-wan-01, pve-wan-02, and pve-wan-03. pve-wan-01 and pve-wan-02 are identical hardware. pve-wan-03 has much less memory and 2 fewer network ports than pve-wan-01 and pve-wan-02. All nodes have a pair of disks in a ZFS mirrored configuration, with both Proxmox and the VMs using rpool. I have two other servers running unrelated applications that I can install the qdevice onto and use for the qdevice function only, in the context of the Proxmox cluster.

All hardware is UPS protected, running dual network switches, etc. So there's a lot of redundancy but sometimes things still do fail no matter how much protection we add, which is why I'm looking to split the nodes between two racks.

I have replication configured for 6 VMs between pve-wan-01 and pve-wan-02. I am able to live migrate VMs between pve-wan-01 and pve-wan-02 without issue. It works every time I try it and is a great thing!
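(For anyone wanting the CLI equivalent, something like this should set up a similar replication job and test a live migration; VM ID 100, job ID 100-0, and the 15-minute schedule are just placeholders, not my exact settings.)

Bash:
# Create a replication job for VM 100 from its current node to pve-wan-02, every 15 minutes
pvesr create-local-job 100-0 pve-wan-02 --schedule "*/15"

# Show configured replication jobs, last sync times, and durations
pvesr status

# Live migrate VM 100 to pve-wan-02
qm migrate 100 pve-wan-02 --online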

I have HA configured, with an HA group for nodes pve-wan-01 and pve-wan-02. Within the HA configuration, I have two VMs configured and in the group. I have pulled the power on the node these VMs were active on, and within a short amount of time they came back online on the other node.
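(The CLI equivalent of that HA setup would look roughly like the sketch below; the group name "ha-workers" and VM ID 100 are placeholders, not my exact configuration.)

Bash:
# Create an HA group limited to the two "worker" nodes
# (restricted means HA resources never get placed on a node outside the group)
ha-manager groupadd ha-workers --nodes "pve-wan-01,pve-wan-02" --restricted 1

# Add a VM as an HA resource and pin it to that group
ha-manager add vm:100 --group ha-workers --state started

# Show the current state of all HA resources
ha-manager status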

Node pve-wan-03 does not participate in ZFS replication and is not defined in the group in the HA configuration, so no VMs configured for HA should ever try to move to pve-wan-03. This seems to have worked well when I committed unnatural acts against the nodes to test ZFS replication and HA. I understand that any VMs that are not defined in the HA configuration will not move anywhere, and if the node they are active on goes offline, they are offline until some action is taken.

When I split the current setup, no matter where I physically install the three nodes between the two racks, if the rack with 2 nodes goes offline, the 3rd node in the other, physically separated rack will go offline as well. I was hoping to add a qdevice to add some "balance" and resiliency; however, I read earlier that a qdevice is not supported on a cluster with an odd node count. A 3-node setup with a qdevice can be forced, with the understanding that it might not work as expected and, in the right situation, can get into a split-brain situation.

"If you understand the drawbacks and implications you can decide yourself if you should use this technology in an odd numbered cluster setup." I am not aware of additional drawbacks and implications other than a split-brain scenario.

"The fact that all but one node plus QDevice may fail sound promising at first, but this may result in a mass recovery of HA services that would overload the single node left." I think for a large environment this is accurate; however, for my small setup I doubt this would be an issue.

"Also ceph server will stop to provide services after only ((N-1)/2) nodes are online." I'm not using Ceph in this cluster, so this does not apply.

I was hoping that maybe, instead of keeping the 3rd node (pve-wan-03), I could remove that node, taking the cluster to a two-node configuration, and then add two qdevices, physically separating them along with the nodes into the two different racks. But I cannot find whether two qdevices with two nodes is supported. I was starting to try it but ran out of time for the day.

Or maybe putting the qdevice with the single node and forcing it to join the cluster is the way to go. From the documentation: "This algorithm would allow that all nodes but one (and naturally the QDevice itself) could fail."

Thanks again for your time and help!

Al
 
So, I've tested for myself today with a 3-node setup with a single qdevice. So far this appears to do what I need it to do. All forms of testing against the cluster, ranging from pulling the power, live migration between two nodes, HA failover on power failure, etc., had the results I've been trying to get to. In all scenarios, the HA VMs either stayed online or failed over as expected.

@Dominic Thank you for pointing me to that documentation. I'd read it earlier, but the line regarding the algorithm allowing a single node and the qdevice to work just didn't sink in.

Have a great day!

Al
 
Just a question: when you talk about "ZFS replication", you mean ZFS storage replication, and you know that is not real time, so when you switch off the node that is running the VM, the VM "migrates" and you lose "n" seconds of written data, correct?
I ask because I would love to have real-time replication like drbd9 was doing, before they overcomplicated it. It was simple to configure and fast enough, a sort of "RAID1 through LAN", and did not have the split-brain problem of drbd8. Now they have added too much stuff and an "orchestration" VM, and it is also not directly supported by the Proxmox team. I'm really looking for an alternative.
 
Just a question: when you talk about "ZFS replication", you mean ZFS storage replication, and you know that is not real time, so when you switch off the node that is running the VM, the VM "migrates" and you lose "n" seconds of written data, correct?
I ask because I would love to have real-time replication like drbd9 was doing, before they overcomplicated it. It was simple to configure and fast enough, a sort of "RAID1 through LAN", and did not have the split-brain problem of drbd8. Now they have added too much stuff and an "orchestration" VM, and it is also not directly supported by the Proxmox team. I'm really looking for an alternative.
See, I knew I had the terms jacked up :)

Yes, I mean ZFS replication, configured via Datacenter -> Replication, and then picking the alternate node from where the VM is currently live. And I am aware that it's not real time. In my configuration it works fairly well because I do not have large VMs on this cluster. All nodes use 32GB disks, which under the covers are really only using around 3-4GB in each ZFS file set. So over time I have tested and currently have the replication schedule cranked down to /1, and the jobs are completing in around 20 seconds or so.
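(If it helps anyone, the same schedule change can also be made from the CLI; a rough sketch, where 100-0 is a placeholder job ID and "*/1" is the every-minute calendar spec:)

Bash:
# Tighten the schedule on an existing replication job (job ID is <vmid>-<jobnum>)
pvesr update 100-0 --schedule "*/1"

# Check last sync time and duration for all jobs
pvesr status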

I have a 3-node cluster with Ceph on 10Gb, and there's just no way to keep things alive if 2 nodes fail, which isn't a complaint, it's just the reality. I'm migrating things from the Ceph cluster to this new design of 3 nodes with a qdevice. I've been beating it to death throughout the day and it's working great as far as keeping services online.

So the final design is:

  • "HA Nodes": pve-01 <-- zfs rep --> pve-02, configured as a group. There are a few vms that need HA, and they will live here. Proxmox won't let you accidentally move HA vms to the node not in the group.
  • "Non-HA Node": pve-03. No expectation of vm's staying alive if this node goes down.
  • Qdevice: This really my Proxmox Backup Server server. It will serve double duty as qdevice.
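For the qdevice piece, the setup is roughly the standard documented procedure, sketched here with 192.168.100.9 standing in for the PBS host's address (the same IP as in my command output later in the thread):

Bash:
# On the external qdevice host (the PBS box): install the qnetd daemon
apt install corosync-qnetd

# On every cluster node: install the qdevice client
apt install corosync-qdevice

# From one cluster node: register the qdevice
# (with an odd node count this later needs --force, see below)
pvecm qdevice setup 192.168.100.9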

I can migrate VMs between nodes as necessary if all nodes are alive, except for the HA VMs, which are restricted to the nodes in the HA group.

The HA VMs are my OPNsense firewall, Bitwarden server, systems monitoring server, and SMTP relay. They stayed alive or failed over during the testing today. A minute or two of lag on replication for all of those should be fine.

It feels stupid to be talking about HA for home use in more than a homelab setup, but in the new reality, internet access has become critical for us. My wife and I both work for local hospitals. I'm in IT, so I could turn on my phone's hotspot and just keep moving. My wife works in a clinical business unit, so disruptions for her are something they won't tolerate. And to add to it, my youngest is doing virtual learning for her high school and college classes. So the internet has become kind of critical for us, at least until things return to normal.

I tried a Ceph setup using HP hardware to do this and it was an epic failure. For whatever reason I couldn't get more than about 60MB/s down and 2MB/s up on that hardware. Other than that it worked great :).

So I bought these Qotom nodes and am getting about 400MB/s down and 22MB/s up, as expected, and am able to tolerate failures down to a single node. Everything is on a UPS, but things still sometimes break; my goal was to minimize disruptions as much as possible for my wife and kids' online needs. Spectrum and Duke occasionally have problems as well. Spectrum recently had a 3-hour outage. Thank goodness it was in the middle of the night and we slept right through it.

I'm just hoping there's not some side effect I didn't think of that nukes this plan ;)

Have a good weekend!

Al
 
For clarification, 2 racks and 3 nodes with high-availability configured is not ideal. This is because of the concept of majority.

Suppose the rack with 2 of the 3 nodes fails. This means the majority of nodes is offline. The remaining single node will then fence (power off) itself because it no longer has a connection to enough nodes. Therefore, if you want to split your nodes across racks, you should have 3 racks. With 3 racks, any single rack can go down and a majority of nodes will still be online.
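To put rough numbers on it (a simplified illustration, counting one vote per node and ignoring any qdevice):

Code:
quorum = floor(expected_votes / 2) + 1

3 nodes in 2 racks:  quorum = floor(3/2) + 1 = 2
  rack with 2 nodes fails  ->  1 vote left,  1 < 2   ->  last node fences itself
3 nodes in 3 racks:  quorum = 2
  any single rack fails    ->  2 votes left, 2 >= 2  ->  cluster stays quorate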

This behavior is indeed useful. Suppose again that you have 2 racks and 3 nodes, and just the network connection between them is broken. Then again, the single node will fence itself. But also, the 2 nodes in the other rack form a majority, and HA services from the fenced node will be started on one of those 2 nodes. If the single node did not fence itself, then you would have the HA service running
  1. on the single node as well as
  2. on the now separated majority of nodes (= the 2 other nodes),
which would be bad.
 
Hello Dominic,

Sorry for the delayed response.

I understand that 2 racks with 3 nodes and HA configured is not ideal. Thank you for the detailed explanation of what will happen in that setup. Unfortunately, two racks is what I have to work with for now.

So I have been heavily testing a 3-node setup with a qdevice, which I know is not officially supported, split between two racks. The only thing I have found that did not work is that ZFS replication would not run when I failed two nodes and was left with one node and the qdevice. Which, if I'm ever in that situation, I'm OK with. In this test (and all other tests I have tried so far), VMs failed over between the nodes defined in the HA group, and services were restored / continued to work, which was the goal all along.

Thanks for your time and have a great day!
 
[cut]

So I have been heavily testing a 3-node setup with a qdevice, which I know is not officially supported, split between two racks. The only thing I have found that did not work is that ZFS replication would not run when I failed two nodes and was left with one node and the qdevice.
Just asking for a clarification:
a) Why do you say that the qdevice is not supported? It's officially documented.
b) What do you mean by "ZFS replication would not run when I failed two nodes and was left with one node and the qdevice"? Of course, if you have only one storage node, the replication does not work (where could it replicate to?). Or do you mean that after that, the replication setup broke and you had to do something special to make it work again once all nodes were up?
 
Just asking for a clarification:
a) Why do you say that the qdevice is not supported? It's officially documented.
b) What do you mean by "ZFS replication would not run when I failed two nodes and was left with one node and the qdevice"? Of course, if you have only one storage node, the replication does not work (where could it replicate to?). Or do you mean that after that, the replication setup broke and you had to do something special to make it work again once all nodes were up?

a) The qdevice is not officially supported with an odd node count. When I set up the qdevice with a 3-node cluster, I was warned of this on the CLI, and you have to force it.

Bash:
root@pve-01:~# pvecm qdevice setup 192.168.100.9
Clusters with an odd node count are not officially supported!
root@pve-01:~# pvecm qdevice setup 192.168.100.9 --force
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
.
.
.
Reloading corosync.conf...
Done
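
(If anyone wants to double-check the registration afterwards, something like this should show the qdevice vote and that the services are running; the service names are the stock corosync ones.)

Bash:
# On a cluster node: membership and vote information should now list the Qdevice
pvecm status

# On the cluster nodes: the qdevice client service
systemctl status corosync-qdevice

# On the external host (my PBS box): the qnetd daemon
systemctl status corosync-qnetd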

b) I wrote that wrong. When I shut down the qdevice server (my Proxmox Backup Server) and pve-03, leaving pve-01 and pve-02 online (these are the two nodes that replicate VMs using ZFS replication), that is when ZFS replication doesn't work. In syslog, when it tries, I see:

Code:
Jan 20 04:16:06 pve-01 pvesr[10870]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 20 04:16:07 pve-01 pvesr[10870]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 20 04:16:08 pve-01 pvesr[10870]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 20 04:16:09 pve-01 pvesr[10870]: trying to acquire cfs lock 'file-replication_cfg' ...
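
My understanding of those messages, for anyone who hits the same thing: once the cluster drops below quorum, /etc/pve becomes read-only and pvesr cannot take the cluster-wide replication lock, so the jobs just keep retrying until quorum returns. Something like this should confirm it while the cluster is in that state:

Bash:
# Check whether the cluster is quorate (look at the "Quorate" flag and vote totals)
pvecm status

# See the current state of the replication jobs
pvesr status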

Once I powered up the "failed" nodes, things just fixed themselves. The only thing I see this morning, after all the testing overnight, is
Code:
Jan 20 10:57:26 pve-01 pve-ha-lrm[1990]: watchdog update failed - Broken pipe
I see this sometimes without doing anything to the nodes; usually a reboot stops it. It doesn't appear to interfere with anything. I had planned to investigate it, just haven't had the time yet.
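(A reasonable place to start looking, since the message comes from pve-ha-lrm, is probably the HA services and the watchdog multiplexer they talk to; a rough sketch:)

Bash:
# The LRM sends its watchdog updates through the watchdog-mux service
systemctl status watchdog-mux

# The HA services themselves
systemctl status pve-ha-lrm pve-ha-crm

# Recent log context around the message
journalctl -u pve-ha-lrm --since "1 hour ago"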

I understand there will be "oddities" in this setup when something breaks, but at the end of the day, in the different failure scenarios I have tested so far, the VMs configured under HA continued working. And if things go bad enough, I have backups (Thank you Proxmox for being so good at backups AND for Proxmox Backup Server!!!).

Thank you.
 