ha-manager VMs going into error state

Jan 16, 2018
Hi,

I often find VMs in an error state in ha-manager even though the VM itself is running cleanly. I know how to clear the error state without restarting the VM (remove the VM from HA and add it back to the HA group).
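For reference, the remove-and-re-add workaround can be sketched like this; the VM ID (100) and group name (mygroup) are placeholders, and this is just one way to do it:

```shell
# Drop the VM from HA management (this does not stop the running VM)
ha-manager remove vm:100

# Add it back to its HA group; the error state is gone afterwards
ha-manager add vm:100 --group mygroup --state started

# Verify that the resource is no longer in state 'error'
ha-manager status
```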

But this is annoying, as I regularly have to clean up error states for VMs; after all, they are supposed to be started automatically on server failure.

How can I debug why this error state comes up? We run Proxmox VE 5.1.
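For anyone else digging into this: the HA stack logs its state transitions, so a reasonable starting point is to check the LRM/CRM service logs and the syslog around the time the error appeared (a sketch, assuming the service names used in Proxmox VE 5.x):

```shell
# Current HA state of all managed resources (look for state 'error')
ha-manager status

# Logs of the local and cluster resource managers around the incident
journalctl -u pve-ha-lrm -u pve-ha-crm --since "2018-01-16" --until "2018-01-17"

# Watchdog and fencing messages usually show up in the syslog too
grep -iE 'watchdog|fence|ha-manager' /var/log/syslog
```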
 
Hi,

I found it out!

The error states appeared in conjunction with running backup jobs. Sometimes the backup jobs even triggered the IPMI watchdog and fenced the server.

The reason was probably the vzdump backups writing too fast to external NFS storage, which made the Proxmox server unresponsive.

It looks like the cure is to limit the bandwidth for vzdump in /etc/vzdump.conf.

I have now limited it to 50000.
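For reference, that setting would look like this in /etc/vzdump.conf; bwlimit is given in KiB/s, so 50000 corresponds to roughly 50 MB/s:

```
# /etc/vzdump.conf
# Limit backup I/O bandwidth (KiB/s); 50000 ≈ 50 MB/s
bwlimit: 50000
```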
 
Any news on this?
https://bugzilla.proxmox.com/show_bug.cgi?id=1794 exactly describes the behavior we are experiencing.
I have also tried limiting the dump to 10 MB/s (and later 50 MB/s), but this only affects performance, not the error.
I don't think there is any update as of now. I am still experiencing this issue. The only 'solutions' at this point seem to be using bad hardware or turning off vzdumps.
 
It really looks like an NFS issue. As long as I ran vzdump onto an NFS share, I ran into this problem. Even limiting the bandwidth didn't solve it; it just reduced how often it happened. It was even so bad that some nodes were fenced.

Now I do the backups via CIFS, and it works flawlessly, even without bandwidth limiting.

So my conclusion:

Heavy file writes to NFS (especially of very large files) eat up too many resources in the kernel (maybe network buffers), so the node becomes unresponsive until the write is complete. I see the same unresponsiveness with Linux clients writing multi-gigabyte files, and also with CentOS, so it looks like a general NFS issue.
 
Perhaps, but the root cause would still be a QMP command timing out.
 