[SOLVED] Proxmox 5.4-13 unexpected reboot

Sep 4, 2019
Hello,
My configuration consists of three identical Proxmox nodes with the following:
- Proxmox pve-manager/5.4-13/aee6f0ec (running kernel: 4.15.18-25-pve) on Debian Stretch
- Ceph version 12.2.12 Luminous (stable)
Plus a 6 TB NFS storage connected to the cluster over a 1 Gb Ethernet bond (active-backup mode)

We use Ceph to store the VM disks and the NFS storage to store VM dumps (vzdump) and VM backups (clones created with qmrestore from the dumps):
1/ The VM dumps are launched every Saturday by a cron job
2/ Another cron job creates VM clones from these dumps via qmrestore every Sunday

The problem is that when the qmrestore script is triggered on Sunday, 2 of the 3 servers unexpectedly reboot, while the remaining one runs the qmrestore script with no problem.

Does anyone have an idea about the cause?
 
Did you have HA enabled guests on the two nodes which rebooted?

Do you run the PVE cluster network (corosync) on the same physical network interface on which you have either Ceph or the connection to your NFS?

Effects like these are usually triggered if Corosync is run on the same physical NIC as other resource-intensive services, usually some storage.

Corosync itself does not need a lot of bandwidth but needs low latency. If there is a lot of traffic on the NIC caused by another service, the network can become congested and the latency for corosync goes up. This can result in the nodes losing the cluster connection to each other. If a node has HA enabled guests on it and loses contact with the quorate part of the cluster for 2 min, it will fence itself (restart) so that the HA guest does not end up running on two nodes. It assumes that the HA guest is started on another node.

That's why the recommendation is to have a dedicated NIC for corosync. Other links can be added as fallback.
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network

With PVE6 and thus corosync version 3, up to 8 links can be configured.
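
To make the fencing rule described above concrete, here is a minimal Python sketch of the decision. It is only a model of the explanation in this post, not how the Proxmox HA stack is actually implemented. The names `NodeState` and `will_self_fence` are made up for illustration, and the 120 second threshold is simply the "2 min" mentioned above.

```python
from dataclasses import dataclass

# Assumption: taken from the "2 min" in the explanation above,
# not read from any Proxmox or corosync setting.
FENCE_AFTER_SECONDS = 120


@dataclass
class NodeState:
    """Hypothetical, simplified view of a cluster node."""
    name: str
    has_ha_guests: bool
    seconds_without_quorum: float


def will_self_fence(node: NodeState,
                    fence_after: float = FENCE_AFTER_SECONDS) -> bool:
    """Return True if the node would reboot itself in this simplified model."""
    # A node without HA guests has no reason to fence itself.
    if not node.has_ha_guests:
        return False
    # A node with HA guests fences once it has been cut off long enough.
    return node.seconds_without_quorum >= fence_after


if __name__ == "__main__":
    nodes = [
        NodeState("node1", has_ha_guests=True, seconds_without_quorum=150),
        NodeState("node2", has_ha_guests=True, seconds_without_quorum=10),
        NodeState("node3", has_ha_guests=False, seconds_without_quorum=150),
    ]
    for node in nodes:
        print(node.name, "fences" if will_self_fence(node) else "keeps running")
```

In this model only a node that both carries HA guests and stays cut off for the whole window reboots, which is why a short latency spike is harmless but a long, saturating transfer on the shared NIC is not.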
 
Did you have HA enabled guests on the two nodes which rebooted?

Do you run the PVE cluster network (corosync) on the same physical network interface on which you have either Ceph or the connection to your NFS?
Yes to both questions

Our servers' network configurations are identical and look like the following:
- four 1 Gb Ethernet interfaces
- port1 and port2 (round robin mode) are dedicated to Ceph
- port3 and port4 (active-backup mode) are dedicated to various VLANs: LAN, internet, VPN, NFS storage and ... corosync

As a temporary solution, would it be OK if we add --bwlimit to the qmrestore command?
 
- four 1 Gb Ethernet interfaces
- port1 and port2 (round robin mode) are dedicated to ceph
That works okayish on just 2x 1GBit?

As a temporary solution, would it be OK if we add --bwlimit to the qmrestore command?
It will probably help in your situation, but no guarantees.
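
For picking a value, a rough Python sketch like the following may help. It assumes the restore runs over the active-backup bond, so only one 1 Gbit/s port carries traffic at a time, and that `--bwlimit` takes a value in KiB/s (please check `man qmrestore` on your version). The archive path, the VMID and the 25% share are purely hypothetical examples, not recommendations.

```python
from typing import List


def bwlimit_kib_per_s(link_gbit: float, fraction: float) -> int:
    """Return `fraction` of a `link_gbit` Gbit/s link, expressed in KiB/s."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    bytes_per_s = link_gbit * 1_000_000_000 / 8  # Gbit/s -> bytes/s
    return int(bytes_per_s * fraction / 1024)    # bytes/s -> KiB/s


def qmrestore_command(archive: str, vmid: int, limit_kib: int) -> List[str]:
    """Build the argument list for a throttled qmrestore call."""
    return ["qmrestore", archive, str(vmid), "--bwlimit", str(limit_kib)]


if __name__ == "__main__":
    # Assumption: leave most of the shared 1 Gbit/s link free for corosync
    # and the other VLANs by giving the restore only a quarter of it.
    limit = bwlimit_kib_per_s(link_gbit=1, fraction=0.25)

    # Hypothetical archive path and VMID, for illustration only.
    cmd = qmrestore_command(
        "/mnt/pve/nfs/dump/vzdump-qemu-100.vma.lzo", 9100, limit
    )
    print(" ".join(cmd))
```

The script only prints the command so it can be reviewed before being put into the Sunday cron job. Throttling reduces the chance of congestion but does not remove it, since corosync still shares the NIC.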

You really should move corosync to a dedicated NIC though, especially if HA is used. How to add a second link/ring to corosync is described in the documentation. The documentation fitting your older PVE 5.4 installation can be found by clicking on the "Documentation" button in the top right of the GUI.


Another tip: If you want to reply only to certain parts of a post, you can select that text and then click on the "Reply" button that appears. This will close the quote right after it and make it easier to see which part is the quoted text and which is your answer.
 
Thank you Aaron,

We started from something rather small, which is why we went for a few 1 Gb Ethernet ports, but we are planning to upgrade everything soon :)

I will send feedback here next week.
 
Sorry for the late reply, guys.
Throttling the speed down was the solution; the cluster is fine now.
NEVER MIX COROSYNC TRAFFIC WITH ANYTHING ELSE

Cheers!
 