Shutting down/rebooting one node reboots the other. HA VM doesn't transfer.

Frozenstiff

Member
Feb 10, 2020
I realize this is my first post, because I tend to avoid asking questions if I can find the answer on my own. Currently I am unable to find a solution so:

I was running one system with Proxmox and recently got some servers for cheap, so I decided to test out clusters and make my VMs HA enabled.
I currently run FreeNAS with NFS shares as the storage for my backups and VM images. Each NFS share is on its own drive pool to avoid bottlenecking a pool.

I have an R610, which I created the cluster on, and an R710, onto which I mirrored my original Proxmox install to see if it would have problems joining. It did (a 401 unauthorized ticket error after I put in the password for the server, if I remember correctly), and I had to reset the Corosync and other cluster files to even be able to access the R610's fresh install. I couldn't salvage the R710, so I did a fresh install.

I was then able to join the cluster, and I moved some backups of my VMs over. I set up HA on one of them and tested whether it would push the VM over if the node it was running on was shut down by mistake or whatever. It worked, but then the VM got stuck in a "fenced" state and wouldn't start back up when shut down; I was able to figure out how to re-enable it, and it started again. I tested it again and the VM moved back and forth between nodes when one was shut down without issue, until today.

So, I was shutting down the R610 to install some more memory, and after about 1 minute the R710 rebooted. I finished installing the memory and let them both come up. I shut down the R610 again, and the same thing happened with the R710. So I tried it the other way to see if shutting down the R710 had the same effect. The first 2 times it didn't, but now when I shut down, restart, or power-reset either machine, the other one reboots within a minute without shutting down cleanly, and the active VM doesn't seem to attempt to move over at all before the system goes down. I'm getting more parts to upgrade the systems and I'd like to be able to install them without bringing everything down during the day. It just seems like any loss of communication between the two nodes kills the other.

I've included dmesg as .txt files from each server and pictures, but please let me know if there is anything else I can include to help someone point me in the right direction.

Also, I'm not sure if it is relevant, but the "ACPI: SPCR: Unexpected SPCR Access Width" and the "FS-Cache Duplicate cookie detected" messages are new errors since I upgraded today. I wasn't able to find a fix for the SPCR one, and the FS-Cache one just seems to be from remounting the NFS share. I was previously on 6.1-5, and the only thing I changed config-wise after the update was to add the MAC addresses of each machine's primary network card so I can use the WOL feature from the web interface.

BTW, weirdly enough, when I just pull the network cable from one, the other doesn't reboot. Then I can shut down the system, and once one of the systems has shut down, I can re-plug the cable and nothing changes. It is almost like the machine being shut down is telling the other to hard reset for some reason after it gets powered off.

Thanks in advance for any help that anyone can give.

Proxmox HA Screen.png
Proxmox HA Group.png
 

Attachments

  • Node1-R710.txt (72.7 KB)
  • Node2-R610.txt (70.8 KB)
It just seems like any loss of communication between the two nodes kills the other.

Yes, sure. If a node goes down in a two-node cluster, the other one loses quorum too.
HA is not possible with two-node clusters.

See: https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_requirements

If you do not have a third node but do have a storage box (like your FreeNAS) or another Linux box running somewhere 24/7, you could also use a QDevice as a vote arbiter; it provides a third vote and can thus break a tie:
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_corosync_external_vote_support
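
The vote arithmetic behind this can be sketched in a few lines. This is only an illustration that assumes the usual majority rule (a partition is quorate when it holds more than half of all expected votes); the function names are hypothetical and this is not Proxmox or corosync code.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest number of votes that is a strict majority (assumed rule)."""
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """True if the votes still reachable form a majority."""
    return votes_present >= quorum_threshold(expected_votes)


# Two nodes with one vote each: the survivor holds 1 of 2 votes.
two_node_survivor = is_quorate(votes_present=1, expected_votes=2)

# Two nodes plus one QDevice vote: survivor and QDevice hold 2 of 3 votes.
with_qdevice = is_quorate(votes_present=2, expected_votes=3)

print("two-node survivor quorate:", two_node_survivor)
print("survivor plus QDevice quorate:", with_qdevice)
```

Under that assumption, the lone survivor of a two-node cluster is never quorate, while a third vote from an arbiter lets one node keep quorum when the other goes down.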
 
Yes, sure. If a node goes down in a two-node cluster, the other one loses quorum too.
HA is not possible with two-node clusters.

See: https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_requirements

So just not having a quorum forces the other device to reboot without properly shutting down?

Why doesn't the other system reboot when I pull the network cable first?

And if HA isn't possible, then why was the VM moving over before when I would shut down the node it was on? Just a fluke? (I did forget to mention that I had set the R610 to have 2 votes so I could adjust some settings when the R710 was down, and I had changed it back before the upgrade. So if losing quorum is what reboots the other system, it might be because the survivor is back to 1 vote of 2 instead of 2 votes of 3.)
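
To make those vote counts concrete, here is a minimal sketch comparing the weighted setup (R610 with 2 votes, R710 with 1) against the equal one. It assumes a plain majority rule over the total configured votes; the helper name is hypothetical and nothing here queries a real cluster.

```python
from typing import Dict


def survivor_is_quorate(votes: Dict[str, int], down: str) -> bool:
    """Check whether the nodes left after `down` fails still hold a majority."""
    total = sum(votes.values())
    remaining = total - votes[down]
    return remaining >= total // 2 + 1


weighted = {"R610": 2, "R710": 1}  # temporary setup described in the post
equal = {"R610": 1, "R710": 1}     # default: one vote per node

for label, votes in (("weighted", weighted), ("equal", equal)):
    for down in votes:
        state = "quorate" if survivor_is_quorate(votes, down) else "NOT quorate"
        print(f"{label}: {down} down -> survivor is {state}")
```

With the weighted votes, only the R610 stays quorate when it is the survivor; with equal votes, neither node does.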

If you do not have a third node but do have a storage box (like your FreeNAS) or another Linux box running somewhere 24/7, you could also use a QDevice as a vote arbiter; it provides a third vote and can thus break a tie:
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_corosync_external_vote_support

So, even though FreeNAS is FreeBSD it can act as a vote without joining the cluster?
 
So just not having a quorum forces the other device to reboot without properly shutting down?
Yes, that's required for HA, see fencing in the HA docs. Avoids corruption of shared resources.

Why doesn't the other system reboot when I pull the network cable first?

That should only happen if no HA service is active there, else the watchdog should trigger and reset that node.

And if HA isn't possible, then why was the VM moving over before when I would shutdown the node it was on? Just a fluke?

That depends: did you change the HA shutdown policy? Some policies move services on graceful shutdown to avoid downtime. Otherwise, if no such policy was set, it can happen when the cluster stack (corosync) stays online much longer than the HA services. Only then does the other node still have quorum while seeing that the HA stack of the node being shut down has stopped, so it recovers the service. That can only happen on a graceful shutdown, and yes, rather by chance.
 
Yes, that's required for HA, see fencing in the HA docs. Avoids corruption of shared resources.

Ah, K. Well, from a quick search it seems that FreeNAS won't function directly as a QDevice, since it still doesn't have the package, from what I've read so far. Plus your reply here:
https://forum.proxmox.com/threads/pve-cluster-5-0-34-new-features.52870/

That should only happen if no HA service is active there, else the watchdog should trigger and reset that node.

Yeah, I tried it multiple times for each server. As long as I pull the network cable from one before either shuts down, the other stays up. I don't know the backend well enough to postulate a reason, though.

That depends: did you change the HA shutdown policy? Some policies move services on graceful shutdown to avoid downtime. Otherwise, if no such policy was set, it can happen when the cluster stack (corosync) stays online much longer than the HA services. Only then does the other node still have quorum while seeing that the HA stack of the node being shut down has stopped, so it recovers the service. That can only happen on a graceful shutdown, and yes, rather by chance.

I think it might be because my servers were taking a while to shut down before I fixed them. They were hanging on processes that didn't want to quit. It seems that fixing that caused the auto-migration to stop working. I might look into adding the necessary settings to the shutdown policy, since it would be useful until I have all my nodes set up.


Anyway, thanks for all the help. I really appreciate it.
 
