ha-manager VMs going into error state

Jan 16, 2018
Hi,

I often find VMs in an error state in ha-manager even though the VM itself is running cleanly. I know how to clear the error state without restarting the VM (remove the VM from HA and add it back to the HA group).
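For reference, the remove-and-re-add workaround can be sketched like this; the VM ID (100) and group name (mygroup) are placeholders, and this is just one way to do it:

```shell
# Drop the VM from HA management (this does not stop the running VM)
ha-manager remove vm:100

# Add it back to its HA group; the error state is gone afterwards
ha-manager add vm:100 --group mygroup --state started

# Verify that the resource is no longer in state 'error'
ha-manager status
```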

But this is annoying, as I regularly have to clean up error states for VMs; after all, they are supposed to be started automatically on server failure.

How can I debug why this error state comes up? We run Proxmox VE 5.1.
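For anyone else digging into this: the HA stack logs its state transitions, so a reasonable starting point is to check the LRM/CRM service logs and the syslog around the time the error appeared (a sketch, assuming the service names used in Proxmox VE 5.x):

```shell
# Current HA state of all managed resources (look for state 'error')
ha-manager status

# Logs of the local and cluster resource managers around the incident
journalctl -u pve-ha-lrm -u pve-ha-crm --since "2018-01-16" --until "2018-01-17"

# Watchdog and fencing messages usually show up in the syslog too
grep -iE 'watchdog|fence|ha-manager' /var/log/syslog
```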
 
Hi,

I found it out!

The error states appeared in conjunction with running backup jobs. Sometimes the backup jobs even triggered the IPMI watchdog and fenced the server.

The reason was probably the vzdump backups writing too fast to external NFS storage, which made the Proxmox server unresponsive.

It looks like the cure is to limit the bandwidth for vzdump in /etc/vzdump.conf.

I have now limited it to 50000.
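For reference, that setting would look like this in /etc/vzdump.conf; bwlimit is given in KiB/s, so 50000 corresponds to roughly 50 MB/s:

```
# /etc/vzdump.conf
# Limit backup I/O bandwidth (KiB/s); 50000 ≈ 50 MB/s
bwlimit: 50000
```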
 
Any news on this?
https://bugzilla.proxmox.com/show_bug.cgi?id=1794 exactly describes the behavior we are experiencing.
I have also tried limiting the dump to 10 MB/s (and later 50 MB/s), but this only affects performance, not the error.
I don't think there is any update as of now. I am still experiencing this issue. The only 'solutions' at this point seem to be using bad hardware or turning off vzdumps.
 
It really looks like an NFS issue. As long as I ran vzdump onto an NFS share, I ran into this problem. Even limiting the bandwidth didn't solve it; it just reduced how often it happened. It was even so bad that some nodes were fenced.

Now I do the backups via CIFS, and it works flawlessly, even without bandwidth limiting.

So my conclusion:

Heavy file writes to NFS (especially of very large files) eat up too many resources in the kernel (maybe network buffers), so the node becomes unresponsive until the write is complete. I see the same unresponsiveness with Linux clients writing multi-gigabyte files, and also with CentOS, so it looks like a general NFS issue.
 
Perhaps, but the root cause would still be a QMP command timing out.
 