Which events can trigger an automatic PVE node reboot?

Oct 11, 2020
Problem:
My cluster rebooted two bare-metal nodes without being asked to.

What I was doing:
I was updating the Proxmox/Debian packages (Proxmox 6.x and Debian 10.11). I have one cluster with 3 nodes; I will call them pve1, pve2 and pve3.

  1. I started to move the pve3 VMs to pve2 and pve1 (keeping an eye on memory usage). Then I rebooted pve3. All good.
  2. Then I moved the pve3 VMs back to pve3 and started to move the pve2 VMs. I was waiting for the live migration to finish and suddenly, with two VMs still in live migration, pve2 and pve1 rebooted!
I checked pve3 for possible iptables blocks, but all is good there (no rules blocking anything). And even if cluster traffic were being blocked, that should not trigger an automatic reboot on the other nodes; pve3 would just be considered offline. And during the pve1 and pve2 reboot, pve3 was working normally with its VMs running.

I checked journalctl but I could not find the logs from when the nodes started to reboot. The logs were running normally and then suddenly the server is booting again, on both pve1 and pve2. I was in the Proxmox web dashboard when it happened and did not see Proxmox shutting down the VMs as in a graceful reboot. After the nodes came back, I checked some VM servers and saw that services like Docker had exited abruptly.

The question:
  1. What can trigger this behavior?
  2. If the VMs are using memory ballooning (but using less than 20% of total capacity) and their configured maximum exceeds the total capacity of the bare metal, can that cause PVE node reboots? This does not make sense to me, but it is the only thing I can think of that might cause some problem (though not a PVE reboot!).
 
If I have 3 nodes and reboot just one of them, is it possible for that node to be temporarily removed from the cluster, causing an HA quorum problem?

That could explain why the other 2 servers rebooted automatically.
 
If the 2 remaining nodes still form a quorum, then no. If they run into some network issues and lose quorum and have HA guests, then yes.
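
A quick way to see whether there are any HA-managed guests at all (only nodes that run HA resources fence themselves) is, roughly:

Code:
# list the configured HA resources and the current state of the HA stack
root@pve1 ~ # ha-manager config
root@pve1 ~ # ha-manager status
# if both are empty, the watchdog-based fencing is not armed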

Check the status of the cluster with pvecm status.

How long after you shut down that node did the other 2 reboot?
 
How long after you shut down that node did the other 2 reboot?

Less than 20 minutes.

Check the status of the cluster with pvecm status.

I will check on the next reboot.

Code:
Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2

I executed it just now. Shouldn't the expected quorum be 3?
 
Quorum means majority. 2 out of 3 is the majority. And right now you have all 3 votes.
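
As a rough rule, the required quorum is floor(total votes / 2) + 1, for example:

Code:
# required quorum for 3, 4 and 5 total votes (plain shell arithmetic)
for n in 3 4 5; do echo "$n votes -> quorum $(( n / 2 + 1 ))"; done
# 3 votes -> quorum 2
# 4 votes -> quorum 3
# 5 votes -> quorum 3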

Less than 20 minutes.
Then it really has to be something else, or if it is the HA stack, it was triggered by something else. If the nodes lose contact with the quorate part of the cluster and have HA guests, they will fence themselves after about one minute unless they can reestablish the connection to the cluster.

What do the minutes before the fresh boot look like in the /var/log/syslog files? Be aware that it is possible that not everything was written to disk prior to the crash / power off, so some info might be missing.
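
Something along these lines should show the last minutes before the fence, assuming persistent journaling is enabled (otherwise fall back to /var/log/syslog):

Code:
# list the recorded boots, then read the end of the journal from the previous boot
journalctl --list-boots
journalctl -b -1 -e
# the cluster / HA related units are the most interesting ones
journalctl -b -1 -u corosync -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux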
 
The nodes reboot when there is no quorum, BUT that can happen for various reasons, including high local load, network packet or connectivity loss (often due to high network load), or any kind of iowait preventing the HA manager (or corosync) from updating its state.
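
To rule the network in or out, the corosync link state and the token/retransmit messages in the log are a good start, something like:

Code:
# show the state of the corosync links on this node
corosync-cfgtool -s
# look for token timeouts, retransmits or link flaps around the incident
journalctl -u corosync | grep -Ei 'retransmit|token|link (up|down)'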

I am not sure whether you can freeze HA manually to prevent a reboot in known risky situations; stopping the HA manager temporarily probably prevents the reboot, but aaron possibly knows the internals better.
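
Something like the following might do it for a planned, known-risky window, but please double-check against the docs before relying on it; as far as I understand, it is the LRM that holds the watchdog:

Code:
# on the node you want to keep from self-fencing during the maintenance window
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm
# ... do the risky operation ...
systemctl start pve-ha-crm
systemctl start pve-ha-lrm
# only do this while no HA recovery is expected; the guests on this node
# are not protected by HA while the services are stopped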
 
Today we are seeing this again. We shut down one node, and the others rebooted as well.
We tried to add a QDevice to help with the quorum, but it is not working, and we are seeing very weird information...

Code:
root@pve3 ~ # pvecm qdevice setup X.Y.Z.W -f
All nodes must be online! Node pve1 is offline, aborting.
root@pve3 ~ # pvecm qdevice remove
All nodes must be online! Node pve1 is offline, aborting.
root@pve3 ~ #

root@pve2 ~ # pvecm status
Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      2
Quorum:           3 Activity blocked
Flags:            Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000002          1   A,NV,NMW 192.168.1.253 (local)
0x00000003          1    A,V,NMW 192.168.1.252
0x00000000          0            Qdevice (votes 2)


root@pve3 ~ # pvecm status
Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      4
Quorum:           3
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000002          1   A,NV,NMW 192.168.1.253
0x00000003          1    A,V,NMW 192.168.1.252 (local)
0x00000000          2            Qdevice

At this moment pve3 is running all the VMs, and pve2 is blocked because there is no quorum (????). pve1 is off because we are changing its hardware.

It is proving hard to get HA working, with or without the QDevice. With 3 PVE nodes we cannot have one node off. With 3 PVE nodes + QDevice, we still cannot have 1 node off (another node ends up out of quorum).
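
What we will check next, in case the daemons themselves are the problem (commands from the corosync-qdevice / corosync-qnetd packages; X.Y.Z.W is the external quorum host as above):

Code:
# on each cluster node: is the qdevice daemon running and does it see the qnetd host?
systemctl status corosync-qdevice
corosync-qdevice-tool -s
# on the external host X.Y.Z.W: is qnetd running and which nodes are connected?
systemctl status corosync-qnetd
corosync-qnetd-tool -l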


This is not making any sense to me.
 
