VMs going down during backup randomly

robhost

Hi,

With the latest PVE 5.1, VMs (KVM) sometimes go down during the backup process and have to be started manually. It appears to be random across our VMs and hosts. We use NFS storage and "snapshot" mode with LZO compression.

Example:

845: 2018-04-11 13:31:40 INFO: status: 74% (397318029312/536870912000), sparse 12% (68007620608), duration 5491, read/write 63/58 MB/s
845: 2018-04-11 13:32:45 INFO: status: 75% (402724487168/536870912000), sparse 12% (68617400320), duration 5556, read/write 83/73 MB/s
845: 2018-04-11 13:33:31 ERROR: VM 845 not running
845: 2018-04-11 13:33:31 INFO: aborting backup job
845: 2018-04-11 13:33:31 ERROR: VM 845 not running
845: 2018-04-11 13:33:43 ERROR: Backup of VM 845 failed - VM 845 not running


Any idea what's wrong or how to fix this?
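
For anyone comparing several failed jobs, here is a minimal Python sketch that pulls the last progress sample and the first error out of a vzdump task log like the one above. The regular expressions are only derived from the lines quoted in this post, and the function name `summarize_vzdump_log` is made up for illustration; other vzdump versions may format their lines differently.

```python
import re

# Progress lines, as seen in the excerpt above, e.g.
# "845: 2018-04-11 13:31:40 INFO: status: 74% (397318029312/536870912000), ..."
STATUS_RE = re.compile(
    r"^(?P<vmid>\d+): (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) INFO: status: "
    r"(?P<pct>\d+)% \((?P<done>\d+)/(?P<total>\d+)\).*?duration (?P<duration>\d+)"
)

# Error lines, e.g. "845: 2018-04-11 13:33:31 ERROR: VM 845 not running"
ERROR_RE = re.compile(
    r"^(?P<vmid>\d+): (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ERROR: (?P<msg>.*)$"
)


def summarize_vzdump_log(text):
    """Return (last_status, first_error) as dicts, or None where nothing matched."""
    last_status = None
    first_error = None
    for raw in text.splitlines():
        line = raw.strip()
        m = STATUS_RE.match(line)
        if m:
            # Keep overwriting so that the final progress sample wins.
            last_status = {
                "vmid": int(m.group("vmid")),
                "timestamp": m.group("ts"),
                "percent": int(m.group("pct")),
                "bytes_done": int(m.group("done")),
                "bytes_total": int(m.group("total")),
                "duration_s": int(m.group("duration")),
            }
            continue
        m = ERROR_RE.match(line)
        if m and first_error is None:
            # Only the first error is kept; later ones usually repeat it.
            first_error = {
                "vmid": int(m.group("vmid")),
                "timestamp": m.group("ts"),
                "message": m.group("msg"),
            }
    return last_status, first_error
```

Fed the excerpt above, this reports how far the job got and when the VM was first seen as not running, which makes it easier to check whether the failures cluster around a particular duration or amount of data read.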
 
Hi,

It is hard to say with this little information.
Are there network problems?
Is the NFS share hanging?
Which OS do the VMs run?
 
Hi,

There are no network problems and no NFS hangs, because other backup jobs (from other nodes) are running fine.

The VMs run Linux (CentOS 7). But the VMs are stopped and no KVM process exists anymore, so it does not look like a guest OS problem. The QEMU Guest Agent is installed in all VMs.
 
Do the logs show anything out of the ordinary, e.g. a segfaulted kvm process?
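
As a rough way to check this, here is a small Python sketch that filters log lines for hints that a kvm/qemu process crashed or was killed. The pattern list and the name `find_crash_hints` are assumptions for illustration only; it is not an exhaustive list of what the kernel can log.

```python
import re

# Substrings that commonly accompany a crashed or killed process in Linux logs.
# This list is an assumption for illustration, not a complete catalogue.
CRASH_PATTERNS = {
    "segfault": re.compile(r"segfault", re.IGNORECASE),
    "oom": re.compile(r"out of memory|oom-killer|oom_reaper", re.IGNORECASE),
}


def find_crash_hints(lines, process_names=("kvm", "qemu")):
    """Return (label, line) pairs for lines that mention one of the given
    process names together with one of the crash patterns."""
    hits = []
    for line in lines:
        lower = line.lower()
        # Skip lines that do not mention the processes we care about.
        if not any(name in lower for name in process_names):
            continue
        for label, pattern in CRASH_PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.rstrip("\n")))
                break  # one label per line is enough
    return hits
```

It can be fed the lines of `/var/log/syslog` or the output of `journalctl -k` from the affected node, narrowed to the minutes around the "VM 845 not running" timestamp.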
 
If you can reproduce this using a test VM, it might make sense to attempt to reproduce it with tracing output and/or under gdb.
 
To OP: I am having a similar vzdump backup issue with a legacy XP KVM guest using remote NFS storage. The problem started about 3 weeks ago after running fine for a long time. On a whim, yesterday I disabled LZO compression. I need several more backup runs before deciding whether LZO is the culprit. I don't know if that might help you. :)