VM crashes during Backup after Upgrade

gooni

New Member
Feb 13, 2021
Hello there,

Since upgrading to the following versions:

proxmox-ve: 6.4-1 (running kernel: 5.4.119-1-pve)
pve-manager: 6.4-8 (running version: 6.4-8/185e14db)
pve-kernel-5.4: 6.4-3
pve-kernel-helper: 6.4-3
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.9-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-6
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

The Windows VM crashes every time during backup at 41%. I have to unlock it afterwards and restart it. The backup does not continue.

Syslog shows only:

Jun 12 21:25:08 pve1-systemd[1]: Stopped User Manager for UID 0.
Jun 12 21:25:08 pve1- systemd[1]: Stopping User Runtime Directory /run/user/0...
Jun 12 21:25:08 pve1-systemd[1]: run-user-0.mount: Succeeded.
Jun 12 21:25:08 pve1- systemd[1]: user-runtime-dir@0.service: Succeeded.
Jun 12 21:25:08 pve1-systemd[1]: Stopped User Runtime Directory /run/user/0.
Jun 12 21:25:08 pve1- systemd[1]: Removed slice User Slice of UID 0.
Jun 12 21:25:09 pve1- qm[32168]: VM 102 qmp command failed - VM 102 qmp command 'change' failed - unable to connect to VM 102 qmp socket - timeout after 600 retries
Jun 12 21:25:09 pve1- pvedaemon[32166]: Failed to run vncproxy.

I also noticed that I can't move hard drives from this VM. It stops at 0%.

The disk is stored on a QNAP NAS via NFS.

I created a new VM and imported the disk, without success.

agent: 1
boot: order=virtio0;ide1;net0
cores: 4
cpu: host
ide1: local:iso/virtio-win-0.1.190.iso,media=cdrom,size=489986K
machine: pc-i440fx-5.2
memory: 8192
name: DC01
net0: virtio=7E:11:E9:7D:CE:A9,bridge=vmbr0,firewall=1
numa: 1
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=dde1e935-b8c1-4dcc-86c2-443ab6eb9d26
sockets: 1
virtio0: ProxmoxDS1:102/vm-102-disk-0.raw,cache=writeback,discard=on,size=100G
vmgenid: 9ccaa62c-a6f4-4156-bb52-079b19196ea4

Can anyone help?

Thanks.
 
Try enabling the agent in the Windows VM's services.
Or try disabling the agent and ballooning (if enabled) in the VM config.
 

Hi,

Tried all combinations of that. Same error:

Jun 13 13:25:18 pve1-auskin pvestatd[1268]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 31 retries

I was now able to move the VM disk to the local datastore, and with the disk there the backup works fine. What the hell is that?

I think it has to do with the newest version...

Thanks.
 
Hi,

I am having the same or at least a similar issue.
After recent updates the backups fail because the VM is killed in the middle of the backup.

For some reason the VM gets killed by the OOM killer, even though the system was never out of memory.

As you can see here:
[Screenshot: 1623592958365.png]

This is logged on the host:
Code:
[840482.940251] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/qemu.slice/100.scope,task=kvm,pid=28838,uid=0
[840482.940282] Out of memory: Killed process 28838 (kvm) total-vm:18338888kB, anon-rss:16688828kB, file-rss:4624kB, shmem-rss:4kB, UID:0 pgtables:33444kB oom_score_adj:0
[840483.856076] oom_reaper: reaped process 28838 (kvm), now anon-rss:0kB, file-rss:104kB, shmem-rss:4kB
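
For reading host logs like the one above, a small shell sketch can pull out the killed process and its resident memory. The function name `oom_summary` is made up for illustration; it relies only on standard `sed` and `awk` and on the line format shown in the log.

```shell
#!/bin/sh
# Summarise "Out of memory: Killed process" lines read from stdin.
# Prints one line per kill: <pid> <process name> <anon-rss in GiB>
oom_summary() {
    sed -n 's/.*Out of memory: Killed process \([0-9]*\) (\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/\1 \2 \3/p' |
    awk '{ printf "%s %s %.1f GiB\n", $1, $2, $3 / 1048576 }'
}

# Typical use on the host:
#   dmesg | oom_summary
```

Fed the "Out of memory" line above, this prints `28838 kvm 15.9 GiB`. The `global_oom` field in the first log line indicates a host-wide OOM rather than a cgroup limit on the VM.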

Updates that were last installed are:
Code:
Upgrade: proxmox-widget-toolkit:amd64 (2.5-5, 2.5-6), proxmox-backup-file-restore:amd64 (1.1.6-2, 1.1.8-1), pve-firmware:amd64 (3.2-3, 3.2-4), pve-firewall:amd64 (4.1-3, 4.1-4), proxmox-backup-client:amd64 (1.1.6-2, 1.1.8-1), pve-manager:amd64 (6.4-6, 6.4-8), proxmox-backup-restore-image:amd64 (0.2.1, 0.2.2), libpve-http-server-perl:amd64 (3.2-2, 3.2-3)

Any hints are welcome!

Regards
 
I can't reproduce this (I use ext4 only). What is your backend storage?
 
Have you limited the ZFS memory?
It's quite possible that the ZFS memory cache grows during the backup read, leaving not enough memory for your VM -> OOM.

https://pve.proxmox.com/wiki/ZFS_on_Linux#sysadmin_zfs_limit_memory_usage
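
A minimal sketch of the limit described on that wiki page, using an 8 GiB cap purely as an example value. The helper name `arc_limit_line` is hypothetical; `zfs_arc_max` and `/etc/modprobe.d/zfs.conf` are the names the wiki uses.

```shell
#!/bin/sh
# Print the modprobe option line that caps the ZFS ARC at N GiB.
arc_limit_line() {
    gib=$1
    echo "options zfs zfs_arc_max=$(( gib * 1024 * 1024 * 1024 ))"
}

# Example (8 GiB is an assumed value, pick one that fits the node):
#   arc_limit_line 8 > /etc/modprobe.d/zfs.conf   # overwrites the file
#   update-initramfs -u                            # needed if root is on ZFS
# The same byte value can be written to
# /sys/module/zfs/parameters/zfs_arc_max to apply it without a reboot.
```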
Hi,

Thanks for your reply.

I have no limit set:
Code:
# cat /sys/module/zfs/parameters/zfs_arc_max
0

The memory screenshot from above is from the node and not the VM.
To me it clearly shows the node has available memory.
The node has 64GB of RAM and only 32GB allocated to VMs, so I am not overprovisioning or anything similar.

Thanks
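
A value of 0 in `zfs_arc_max` means the built-in default is in effect, not that the ARC is capped at nothing. Per the wiki page linked above, that default is 50% of host memory, which on a 64GB node could be up to 32GB on top of the 32GB allocated to VMs. A small sketch for checking what the ARC is actually using, assuming the standard `/proc/spl/kstat/zfs/arcstats` file; the function name `arc_usage` is made up for illustration.

```shell
#!/bin/sh
# Print the ARC ceiling (c_max) and current size in GiB.
# Reads an arcstats-style file: rows of "<name> <type> <value in bytes>".
arc_usage() {
    stats=${1:-/proc/spl/kstat/zfs/arcstats}
    awk '$1 == "size" || $1 == "c_max" {
        printf "%s %.1f GiB\n", $1, $3 / 1073741824
    }' "$stats"
}

# Typical use on the host, e.g. while a backup is running:
#   arc_usage
```

If `size` climbs towards `c_max` during the backup while the VMs keep their memory, that would support the ARC explanation.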
 
Try limiting the ZFS ARC cache, as @spirit said.
It's very important.
 
Hi,

I set the limits as follows:
echo "$(( 8 * 1024*1024*1024 - 1 ))" >/sys/module/zfs/parameters/zfs_arc_min
echo "$(( 8 * 1024*1024*1024 ))" >/sys/module/zfs/parameters/zfs_arc_max

Now I get this error:
INFO: transferred 514.00 GiB in 2709 seconds (194.3 MiB/s)
zstd: /*stdout*\: Input/output error
ERROR: Backup of VM 100 failed - zstd --rsyncable --threads=1 failed - wrong exit status 1
INFO: Failed at 2021-06-20 11:59:57
INFO: Backup job finished with errors
TASK ERROR: job errors

At least the VM didn't get killed this time.
But the backup was not successful.

Any further hints?

Thanks