Node crashes during VM backup

ApisD

New Member
Sep 9, 2019
Hi,

In our 3-node cluster, the VMs are backed up once a week, every Saturday night. (The backups are written to local storage.)
However, roughly one time out of four or five, one of the three nodes crashes and restarts by itself.
I checked the /var/lib/vz/dump folder, and the node always seems to crash during the same VM's backup job: I found .dat files and temp folders for VM 200.
I also checked the logs in /var/logs (attached) for signs of errors, and I saw that corosync lost its network links a few minutes before the crash. (0:13)
I have heard about a similar bug where the network load during backups could make corosync struggle, but I thought it was fixed. (We run Proxmox VE 6.0-9.) For information, the VMs run on a network SAN (iSCSI), so the backup jobs necessarily generate network traffic.

Thanks in advance for your help.
(And apologies for my bad English.)
 

Attachments

  • Jupiter (Quorum).txt
    687.4 KB · Views: 3
  • Uranus (faulty).txt
    539.7 KB · Views: 6
Hi,
do you have HA enabled? Corosync needs a stable, low-latency network connection. Yes, the network traffic from the backup is very likely to disturb corosync. When quorum is lost, the node will fence itself and reboot. Ideally you should set up a dedicated network for corosync. Otherwise you'll need to disable HA. (If you really want HA, you could try limiting the bandwidth for the backup, but there are no guarantees.)
 
Hi Fabian, thank you for your reply.
HA is indeed enabled.
The corosync and SAN networks are separated into dedicated VLANs, but physically there is only one link. (This is due to the host infrastructure.)
Two of the three nodes have a 10 Gbps network, and we have never experienced issues with them. The third node is limited to 1 Gbps, so I think we only need to limit the backup bandwidth for that one. What is the scope of /etc/vzdump.conf: does it apply to the current node or to the entire cluster? Because we would like to keep HA, I will try to limit the backup bandwidth of the faulty node.
 
It should only affect the current node. You can have a different 'vzdump.conf' for each node, as it is not part of the cluster file system.
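
As a rough illustration of picking a limit for the 1 Gbps node: the `bwlimit` option in /etc/vzdump.conf takes a value in KiB/s. The sketch below converts a share of the link speed into that unit and prints the matching config line. The 50% share and the helper names are assumptions made for this example, not values recommended anywhere in this thread, and the right share depends on how much headroom corosync needs on the shared link.

```python
# Sketch: derive a vzdump bandwidth cap (KiB/s) from a link speed and a share.
# The helper names and the 50% share are hypothetical, chosen for illustration.

def bwlimit_kib_per_s(link_gbps: float, share: float) -> int:
    """Return `share` of a `link_gbps` link, expressed in KiB/s (rounded down)."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    bytes_per_second = link_gbps * 1_000_000_000 / 8  # bits/s -> bytes/s
    return int(bytes_per_second * share / 1024)       # bytes/s -> KiB/s


def vzdump_conf_line(link_gbps: float, share: float) -> str:
    """Format the value as a `bwlimit` line for /etc/vzdump.conf."""
    return f"bwlimit: {bwlimit_kib_per_s(link_gbps, share)}"


if __name__ == "__main__":
    # Half of a 1 Gbps link, as an arbitrary starting point.
    print(vzdump_conf_line(1, 0.5))
```

The printed line would go into /etc/vzdump.conf on the 1 Gbps node only. Note that `bwlimit` caps the backup's I/O bandwidth rather than network traffic as such; here the VM disks are read from the iSCSI SAN, so the two should be closely related, but that is an assumption worth checking against the actual link load.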
 
