VM doesn't boot after backup job

odolmach3

New Member
May 13, 2023
Hello everyone,

A Windows Server 2022 VM no longer starts after the backup job (mode: stop). Other servers are not affected. A special feature of this VM is that a GPU is passed through to it. I suspect that this may be the issue.

I initially had the problem that the VM went into state "internal error" during the backup.

I found the following entries in the log:

Code:
Nov 05 14:50:56 proxmox kernel: pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:01:00.1
Nov 05 14:50:56 proxmox kernel: vfio-pci 0000:01:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Nov 05 14:50:56 proxmox kernel: vfio-pci 0000:01:00.1:   device [10de:228b] error status/mask=00100000/00000000
Nov 05 14:50:56 proxmox kernel: vfio-pci 0000:01:00.1:    [20] UnsupReq               (First)
Nov 05 14:50:56 proxmox kernel: vfio-pci 0000:01:00.1: AER:   TLP Header: 40000001 0000000f 426254e8 f7f7f7f7

I then expanded the /etc/kernel/cmdline to include the parameter pcie_aspm=off. Now at least the error above no longer occurs.
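For reference, a minimal sketch of what that change can look like when the host boots via proxmox-boot-tool (e.g. ZFS root with systemd-boot); the root= and IOMMU options shown are only placeholders for whatever is already in the file, not taken from this system:

Code:
# /etc/kernel/cmdline is a single line; append pcie_aspm=off to the existing options
root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt pcie_aspm=off

# make the change effective for the boot entries, then reboot
proxmox-boot-tool refresh

On hosts booting with GRUB, the equivalent would be editing GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and running update-grub.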

However, if I start a backup now, then after the backup of the affected VM has finished and it tries to boot up again, it doesn't work. I then find the following error in the log:

Code:
Nov 05 17:09:42 proxmox kernel: Out of memory: Killed process 1032905 (kvm) total-vm:17847964kB, anon-rss:15412572kB, file-rss:256kB, shmem-rss:0kB, UID:0 pgtables:30508kB oom_score_adj:0
Nov 05 17:09:42 proxmox systemd[1]: 100.scope: A process of this unit has been killed by the OOM killer.
Nov 05 17:09:42 proxmox systemd[1]: 100.scope: Failed with result 'oom-kill'.
Nov 05 17:09:42 proxmox systemd[1]: 100.scope: Consumed 9.382s CPU time.
Nov 05 17:09:42 proxmox kernel: vmbr0: port 4(tap100i0) entered disabled state
Nov 05 17:09:42 proxmox kernel: vmbr0: port 4(tap100i0) entered disabled state
Nov 05 17:09:42 proxmox pvedaemon[1032793]: stopping swtpm instance (pid 1032897) due to QEMU startup error
Nov 05 17:09:42 proxmox pvedaemon[1032768]: start failed: QEMU exited with code 1
Nov 05 17:09:42 proxmox pvedaemon[2860]: <root@pam> end task UPID:proxmox:000FC240:000184E5:6547BE3C:qmstart:100:root@pam: start failed: QEMU exited with code 1
Nov 05 17:09:42 proxmox pvestatd[2825]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - Connection refused


Code:
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 2.99.0-1
proxmox-backup-file-restore: 2.99.0-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.3
pve-docs: 8.0.3
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1


Any suggestions? Thanks and greetings.
 
The VM is killed because the system runs out of memory. Maybe memory has become too fragmented for such a large allocation? Try starting it with half the memory?
 
The VM has 16 GB of RAM and the node has 32 GB; what do you mean by "too fragmented"?
That there might not be enough contiguous memory free for the VM. Since it uses PCI(e) passthrough, all of the VM's memory must be pinned into actual host RAM. A reboot will probably fix this (until you shut down the VM again).
Proxmox itself needs memory, other VMs need memory, device emulation needs memory, and ZFS will use up to 50% of host memory for its ARC (unless limited). Regardless of the technicalities, the fact remains that Proxmox killed the VM while it was starting because the host ran out of usable memory (and the VM was the biggest chunk of memory and therefore got selected to be killed).
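If it helps to verify this the next time the start fails, a minimal sketch of checking free memory and the current ZFS ARC size on the host (the awk field layout assumes the usual /proc/spl/kstat/zfs/arcstats format):

Code:
# free memory on the host, in MiB
free -m
# current ZFS ARC size (the "size" row of arcstats, converted to GiB)
awk '/^size/ {printf "ARC size: %.1f GiB\n", $3/1024/1024/1024}' /proc/spl/kstat/zfs/arcstats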
 
Hm, I've shut down all VMs except that one and started a backup. After the backup the VM starts but is not reachable.

Code:
Nov 05 20:42:57 proxmox pvedaemon[694454]: INFO: Finished Backup of VM 100 (00:03:50)
Nov 05 20:42:57 proxmox pvedaemon[694454]: INFO: Backup job finished successfully
Nov 05 20:42:57 proxmox pvedaemon[2837]: <root@pam> end task UPID:proxmox:000A98B6:0002FDF0:6547EF5B:vzdump:100:root@pam: OK
Nov 05 20:43:33 proxmox pvedaemon[2837]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:43:45 proxmox pvedaemon[2837]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:44:04 proxmox pvedaemon[2837]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:44:23 proxmox pvedaemon[2836]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:44:43 proxmox pvedaemon[2836]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:45:02 proxmox pvedaemon[2837]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:45:22 proxmox pvedaemon[2836]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:45:41 proxmox pvedaemon[2837]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:46:00 proxmox pvedaemon[2838]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:46:20 proxmox pvedaemon[2836]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout

After a reboot of the node the VM works again.
 
Maybe you can share the VM configuration file (qm config VMNR)? Do you use ZFS? Can you check memory usage (free -m) before it fails to start?
Do you use explicit hugepages? I found that transparent hugepages works well enough. Maybe running backup in snapshot mode is a work-around?
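If snapshot mode is worth a try as a work-around, a minimal sketch of a one-off run for this VM (the storage name is an assumption, substitute your backup target):

Code:
# back up VM 100 in snapshot mode instead of stop mode
vzdump 100 --mode snapshot --storage local --compress zstd

The mode of the scheduled job can likewise be changed under Datacenter -> Backup.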
 
Code:
affinity: 6-15
agent: 1
bios: ovmf
boot: order=scsi0;net0;ide0
cores: 10
cpu: host
efidisk0: local-zfs:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:01:00,pcie=1,x-vga=1
machine: pc-q35-8.0
memory: 16416
meta: creation-qemu=8.0.2,ctime=1696030269
name: W2K22
net0: virtio=B6:8E:1D:E1:CB:BF,bridge=vmbr0
numa: 0
onboot: 1
ostype: win11
scsi0: local-zfs:vm-100-disk-1,cache=writeback,discard=on,size=150G
scsihw: virtio-scsi-pci
smbios1: uuid=6bf34c9a-297f-4b2e-b594-026427f67929
sockets: 1
tpmstate0: local-zfs:vm-100-disk-2,size=4M,version=v2.0
vmgenid: 8ccaf031-5c0d-4ce4-b1c7-7b76a233e306


So I made another test: if I shut down the VM from within Windows and then run the backup, the VM starts fine afterwards.
Yes, I use ZFS.
Snapshot backups work fine as well.

It looks like something "blocks" the GPU if the backup job shuts the VM down!?
 
It looks like something "blocks" the GPU!?
No, but the use of the GPU (passthrough) requires all of the VM's memory to be allocated at once, which becomes a problem.
memory: 16416
Why not use a "normal" amount of memory like 16384? (See the sketch below this post.)
Yes, I use ZFS.
Did you limit the ARC size, or do you allow ZFS to take 50% of your host memory (which interferes with your VM that also wants about 51% of it)?
Snapshot backups work fine as well.
Since it does not shut down the VM and release all its memory, which then needs to be reallocated after it has become fragmented and ZFS has taken some of it?
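If the "normal" memory size suggested above is worth a try, a minimal sketch (VM ID 100 as in this thread; run it while the VM is stopped):

Code:
# set the VM memory to 16384 MiB (or less, to leave more headroom on a 32 GiB host)
qm set 100 --memory 16384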
 
But why does this problem appear only after the shutdown from the backup job? A normal Windows shutdown/reboot, or a shutdown/reboot via the Proxmox GUI, works fine. Does this not release all the memory?
 
Maybe because the backup causes ZFS to grow the ARC? Or the backup process increases memory fragmentation? Did you limit ZFS ARC?
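In case the ARC is not limited yet, a minimal sketch of capping it; the 8 GiB value is only an example, pick whatever fits next to the 16 GiB VM:

Code:
# add this line to /etc/modprobe.d/zfs.conf to cap the ARC at 8 GiB (value in bytes)
options zfs zfs_arc_max=8589934592

# apply the same limit at runtime without rebooting
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

# if the root filesystem is on ZFS, also rebuild the initramfs so the limit applies at boot
update-initramfs -u -k all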
 
That's while the backup is running:

Code:
root@proxmox:~# free -m
               total        used        free      shared  buff/cache   available
Mem:           31851       29684         282          60        2379        2166
Swap:              0           0           0


And that's after the backup has finished, when the VM doesn't start properly:

Code:
root@proxmox:~# free -m
               total        used        free      shared  buff/cache   available
Mem:           31851       24482        3945          57        3942        7369
Swap:              0           0           0


In the Summary tab of the VM the status is "running", RAM usage is up to 90%, CPU 10%, HA state none, IP: "Guest Agent not running"... It seems up but is not reachable.
 
There is no 16384 MB free, so the VM cannot start. Please, please, PLEASE answer the question about the ZFS ARC size and whether you changed it from the default of 50% (16 GB on this host). This is such a common mistake!
Not yet, but I will test it.

Nevertheless, thank you very much for your help!
OK, so you want a VM that uses 16 GB, you allow ZFS to use 16 GB, you run other VMs, and Proxmox itself needs 2 GB. Of course this will not work with a 32 GB host!
This issue has been answered many times over in this forum: don't overcommit memory, and be aware that ZFS is not integrated into the Linux page cache and also uses (a lot of) memory.
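To make the arithmetic explicit (rounded numbers; the 2 GB for Proxmox services and QEMU overhead is an estimate):

Code:
#  VM 100, pinned for PCIe passthrough       ~16 GB
#  ZFS ARC at the default cap (50% of RAM)   ~16 GB
#  Proxmox services + QEMU/OVMF overhead      ~2 GB
#  -------------------------------------------------
#  worst case                                ~34 GB  >  32 GB installed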
 
I've read that before, but I used the command arcstat while the backup was running and thought the values were OK:

Code:
root@proxmox:~# arcstat
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
21:04:20    19     0      0     0    0     0    0     0    0  2.1G  2.2G   3.5G
At peak it was at 5.5G, but maybe I was interpreting that wrong =(
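The arcstat size/c columns only show the current size and target; the ARC can still grow towards c_max later. A minimal sketch of checking that ceiling (assuming the usual arcstats layout):

Code:
# maximum size the ARC is allowed to grow to
awk '/^c_max/ {printf "ARC c_max: %.1f GiB\n", $3/1024/1024/1024}' /proc/spl/kstat/zfs/arcstats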