Which events can trigger an automatic PVE node reboot?

Oct 11, 2020
Problem:
My cluster rebooted two bare-metal nodes without being asked to.

What I was doing:
I was updating the Proxmox/Debian packages (Proxmox 6.x and Debian 10.11). I have one cluster with 3 nodes; I will call them pve1, pve2 and pve3.

  1. I started to move the pve3 VMs to pve2 and pve1 (keeping an eye on memory usage). Then I rebooted pve3. All good.
  2. Then I moved the pve3 VMs back to pve3 and started to move the pve2 VMs. I was waiting for the live migration to finish and suddenly, with two VMs still in live migration, pve2 and pve1 rebooted!
I checked pve3 for possible iptables blocks, but all is good there (no rules blocking anything). And even if cluster traffic were being blocked, that should not trigger an automatic reboot on the other nodes; pve3 would just be considered offline. And during the pve1 and pve2 reboot, pve3 was working normally with its VMs running.

I checked journalctl but I could not find the logs from when the nodes started to reboot. The logs were running normally and then suddenly the server is booting again, on both pve1 and pve2. I was in the Proxmox web dashboard when it happened and did not see Proxmox shutting down the VMs as in a graceful reboot. After the nodes came back, I checked some VM servers and saw that services like Docker had exited abruptly.

The question:
  1. What can trigger this behavior?
  2. If the VMs are using memory ballooning (but using less than 20% of total capacity) and their configured maximum exceeds the total capacity of the bare metal, can that cause PVE node reboots? This does not make sense to me, but it is the only thing I can think of that might cause some problem (though not a PVE reboot!).
 
If I have 3 nodes and reboot just one of them, is it possible for that node to be temporarily removed from the cluster, causing an HA quorum problem?

That could explain why the other 2 servers rebooted automatically.
 
If the 2 remaining nodes still form a quorum, then no. If they run into some network issues and lose quorum and have HA guests, then yes.
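
A quick way to see whether there are any HA-managed guests at all (only nodes that run HA resources fence themselves) is, roughly:

Code:
# list the configured HA resources and the current state of the HA stack
root@pve1 ~ # ha-manager config
root@pve1 ~ # ha-manager status
# if both are empty, the watchdog-based fencing is not armed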

Check the status of the cluster with pvecm status.

How long after you shut down that node did the other 2 reboot?
 
How long after you shut down that node did the other 2 reboot?

Less than 20 minutes.

Check the status of the cluster with pvecm status.

I will check on the next reboot.

Code:
Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2

I executed it just now. Shouldn't the expected quorum be 3?
 
Quorum means majority. 2 out of 3 is the majority. And right now you have all 3 votes.
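
As a rough rule, the required quorum is floor(total votes / 2) + 1, for example:

Code:
# required quorum for 3, 4 and 5 total votes (plain shell arithmetic)
for n in 3 4 5; do echo "$n votes -> quorum $(( n / 2 + 1 ))"; done
# 3 votes -> quorum 2
# 4 votes -> quorum 3
# 5 votes -> quorum 3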

Less than 20 minutes.
Then it really has to be something else, or if it is the HA stack, it was triggered by something else. If the nodes lose contact with the quorate part of the cluster and have HA guests, they will fence themselves after about one minute unless they can reestablish the connection to the cluster.

What do the minutes before the fresh boot look like in the /var/log/syslog files? Be aware that it is possible that not everything was written to disk prior to the crash / power off, so some info might be missing.
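
Something along these lines should show the last minutes before the fence, assuming persistent journaling is enabled (otherwise fall back to /var/log/syslog):

Code:
# list the recorded boots, then read the end of the journal from the previous boot
journalctl --list-boots
journalctl -b -1 -e
# the cluster / HA related units are the most interesting ones
journalctl -b -1 -u corosync -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux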
 
The nodes reboot when there is no quorum, BUT that can happen for various reasons, including high local load, network packet or connectivity loss (often due to high network load), or any kind of iowait preventing the HA manager (or corosync) from updating its state.
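
To rule the network in or out, the corosync link state and the token/retransmit messages in the log are a good start, something like:

Code:
# show the state of the corosync links on this node
corosync-cfgtool -s
# look for token timeouts, retransmits or link flaps around the incident
journalctl -u corosync | grep -Ei 'retransmit|token|link (up|down)'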

I am not sure whether you can freeze HA manually to prevent a reboot in known risky situations; stopping the HA manager temporarily probably prevents the reboot, but aaron possibly knows the internals better.
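
Something like the following might do it for a planned, known-risky window, but please double-check against the docs before relying on it; as far as I understand, it is the LRM that holds the watchdog:

Code:
# on the node you want to keep from self-fencing during the maintenance window
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm
# ... do the risky operation ...
systemctl start pve-ha-crm
systemctl start pve-ha-lrm
# only do this while no HA recovery is expected; the guests on this node
# are not protected by HA while the services are stopped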
 
Today we are seeing this again. We shut down one node, and the others rebooted as well.
We tried to add a QDevice to help with the quorum, but it is not working, and we are seeing very weird information...

Code:
root@pve3 ~ # pvecm qdevice setup X.Y.Z.W -f
All nodes must be online! Node pve1 is offline, aborting.
root@pve3 ~ # pvecm qdevice remove
All nodes must be online! Node pve1 is offline, aborting.
root@pve3 ~ #

root@pve2 ~ # pvecm status
Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      2
Quorum:           3 Activity blocked
Flags:            Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000002          1   A,NV,NMW 192.168.1.253 (local)
0x00000003          1    A,V,NMW 192.168.1.252
0x00000000          0            Qdevice (votes 2)


root@pve3 ~ # pvecm status
Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      4
Quorum:           3
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000002          1   A,NV,NMW 192.168.1.253
0x00000003          1    A,V,NMW 192.168.1.252 (local)
0x00000000          2            Qdevice

At this moment pve3 is running all the VMs, and pve2 is blocked because there is no quorum (????). pve1 is off because we are changing its hardware.

It is proving hard to get HA working, with or without the QDevice. With 3 PVE nodes we cannot have one node off. With 3 PVE nodes + QDevice, we still cannot have 1 node off (another node ends up out of quorum).
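
What we will check next, in case the daemons themselves are the problem (commands from the corosync-qdevice / corosync-qnetd packages; X.Y.Z.W is the external quorum host as above):

Code:
# on each cluster node: is the qdevice daemon running and does it see the qnetd host?
systemctl status corosync-qdevice
corosync-qdevice-tool -s
# on the external host X.Y.Z.W: is qnetd running and which nodes are connected?
systemctl status corosync-qnetd
corosync-qnetd-tool -l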


This is not making any sense to me.
 
