VM doesn't boot after backup job

odolmach3

New Member
May 13, 2023
Hello everyone,

A Windows Server 2022 VM no longer starts after the backup job (mode: stop). Other servers are not affected. A special feature of this VM is that a GPU is passed through to it. I suspect that this may be the issue.

I initially had the problem that the VM went into state "internal error" during the backup.

I found the following entries in the log:

Code:
Nov 05 14:50:56 proxmox kernel: pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:01:00.1
Nov 05 14:50:56 proxmox kernel: vfio-pci 0000:01:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Nov 05 14:50:56 proxmox kernel: vfio-pci 0000:01:00.1:   device [10de:228b] error status/mask=00100000/00000000
Nov 05 14:50:56 proxmox kernel: vfio-pci 0000:01:00.1:    [20] UnsupReq               (First)
Nov 05 14:50:56 proxmox kernel: vfio-pci 0000:01:00.1: AER:   TLP Header: 40000001 0000000f 426254e8 f7f7f7f7

I then expanded the /etc/kernel/cmdline to include the parameter pcie_aspm=off. Now at least the error above no longer occurs.
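For reference, a minimal sketch of what that change can look like when the host boots via proxmox-boot-tool (e.g. ZFS root with systemd-boot); the root= and IOMMU options shown are only placeholders for whatever is already in the file, not taken from this system:

Code:
# /etc/kernel/cmdline is a single line; append pcie_aspm=off to the existing options
root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt pcie_aspm=off

# make the change effective for the boot entries, then reboot
proxmox-boot-tool refresh

On hosts booting with GRUB, the equivalent would be editing GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and running update-grub.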

However, if I start a backup now, then after the backup of the affected VM has finished and it tries to boot up again, it doesn't work. I then find the following error in the log:

Code:
Nov 05 17:09:42 proxmox kernel: Out of memory: Killed process 1032905 (kvm) total-vm:17847964kB, anon-rss:15412572kB, file-rss:256kB, shmem-rss:0kB, UID:0 pgtables:30508kB oom_score_adj:0
Nov 05 17:09:42 proxmox systemd[1]: 100.scope: A process of this unit has been killed by the OOM killer.
Nov 05 17:09:42 proxmox systemd[1]: 100.scope: Failed with result 'oom-kill'.
Nov 05 17:09:42 proxmox systemd[1]: 100.scope: Consumed 9.382s CPU time.
Nov 05 17:09:42 proxmox kernel: vmbr0: port 4(tap100i0) entered disabled state
Nov 05 17:09:42 proxmox kernel: vmbr0: port 4(tap100i0) entered disabled state
Nov 05 17:09:42 proxmox pvedaemon[1032793]: stopping swtpm instance (pid 1032897) due to QEMU startup error
Nov 05 17:09:42 proxmox pvedaemon[1032768]: start failed: QEMU exited with code 1
Nov 05 17:09:42 proxmox pvedaemon[2860]: <root@pam> end task UPID:proxmox:000FC240:000184E5:6547BE3C:qmstart:100:root@pam: start failed: QEMU exited with code 1
Nov 05 17:09:42 proxmox pvestatd[2825]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - Connection refused


Code:
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 2.99.0-1
proxmox-backup-file-restore: 2.99.0-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.3
pve-docs: 8.0.3
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1


Any suggestions? Thanks and greetings.
 
The VM is killed because the system runs out of memory. Maybe memory has become too fragmented for such a large allocation? Try starting it with half the memory?
 
The VM has 16 GB of RAM and the node has 32 GB; what do you mean by "too fragmented"?
That there might not be enough contiguous memory free for the VM. Since it uses PCI(e) passthrough, all of the VM's memory must be pinned into actual host RAM. A reboot will probably fix this (until you shut down the VM again).
Proxmox itself needs memory, other VMs need memory, device emulation needs memory, and ZFS will use up to 50% of host memory for its ARC (unless limited). Regardless of the technicalities, the fact remains that Proxmox killed the VM while it was starting because the host ran out of usable memory (and the VM was the biggest chunk of memory and therefore got selected to be killed).
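If it helps to verify this the next time the start fails, a minimal sketch of checking free memory and the current ZFS ARC size on the host (the awk field layout assumes the usual /proc/spl/kstat/zfs/arcstats format):

Code:
# free memory on the host, in MiB
free -m
# current ZFS ARC size (the "size" row of arcstats, converted to GiB)
awk '/^size/ {printf "ARC size: %.1f GiB\n", $3/1024/1024/1024}' /proc/spl/kstat/zfs/arcstats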
 
Hm, I've shut down all VMs except that one and started a backup. After the backup the VM starts but is not reachable.

Code:
Nov 05 20:42:57 proxmox pvedaemon[694454]: INFO: Finished Backup of VM 100 (00:03:50)
Nov 05 20:42:57 proxmox pvedaemon[694454]: INFO: Backup job finished successfully
Nov 05 20:42:57 proxmox pvedaemon[2837]: <root@pam> end task UPID:proxmox:000A98B6:0002FDF0:6547EF5B:vzdump:100:root@pam: OK
Nov 05 20:43:33 proxmox pvedaemon[2837]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:43:45 proxmox pvedaemon[2837]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:44:04 proxmox pvedaemon[2837]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:44:23 proxmox pvedaemon[2836]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:44:43 proxmox pvedaemon[2836]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:45:02 proxmox pvedaemon[2837]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:45:22 proxmox pvedaemon[2836]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:45:41 proxmox pvedaemon[2837]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:46:00 proxmox pvedaemon[2838]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 05 20:46:20 proxmox pvedaemon[2836]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout

After a reboot of the node the VM works again.
 
Maybe you can share the VM configuration file (qm config VMNR)? Do you use ZFS? Can you check memory usage (free -m) before it fails to start?
Do you use explicit hugepages? I found that transparent hugepages works well enough. Maybe running backup in snapshot mode is a work-around?
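If snapshot mode is worth a try as a work-around, a minimal sketch of a one-off run for this VM (the storage name is an assumption, substitute your backup target):

Code:
# back up VM 100 in snapshot mode instead of stop mode
vzdump 100 --mode snapshot --storage local --compress zstd

The mode of the scheduled job can likewise be changed under Datacenter -> Backup.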
 
Code:
affinity: 6-15
agent: 1
bios: ovmf
boot: order=scsi0;net0;ide0
cores: 10
cpu: host
efidisk0: local-zfs:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:01:00,pcie=1,x-vga=1
machine: pc-q35-8.0
memory: 16416
meta: creation-qemu=8.0.2,ctime=1696030269
name: W2K22
net0: virtio=B6:8E:1D:E1:CB:BF,bridge=vmbr0
numa: 0
onboot: 1
ostype: win11
scsi0: local-zfs:vm-100-disk-1,cache=writeback,discard=on,size=150G
scsihw: virtio-scsi-pci
smbios1: uuid=6bf34c9a-297f-4b2e-b594-026427f67929
sockets: 1
tpmstate0: local-zfs:vm-100-disk-2,size=4M,version=v2.0
vmgenid: 8ccaf031-5c0d-4ce4-b1c7-7b76a233e306


So I made another test: if I shut down the VM from within Windows and then run the backup, the VM starts fine afterwards.
Yes, I use ZFS.
Snapshot backups work fine as well.

It looks like something "blocks" the GPU if the backup job shuts the VM down!?
 
It looks like something "blocks" the GPU!?
No, but the use of the GPU (passthrough) requires all of the VM's memory to be allocated at once, which becomes a problem.
memory: 16416
Why not use a "normal" amount of memory like 16384? (See the sketch below this post.)
Yes, I use ZFS.
Did you limit the ARC size, or do you allow ZFS to take 50% of your host memory (which interferes with your VM that also wants about 51% of it)?
Snapshot backups work fine as well.
Since it does not shut down the VM and release all its memory, which then needs to be reallocated after it has become fragmented and ZFS has taken some of it?
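If the "normal" memory size suggested above is worth a try, a minimal sketch (VM ID 100 as in this thread; run it while the VM is stopped):

Code:
# set the VM memory to 16384 MiB (or less, to leave more headroom on a 32 GiB host)
qm set 100 --memory 16384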
 
But why does this problem appear only after the shutdown from the backup job? A normal Windows shutdown/reboot, or a shutdown/reboot via the Proxmox GUI, works fine. Does this not release all the memory?
 
Maybe because the backup causes ZFS to grow the ARC? Or the backup process increases memory fragmentation? Did you limit ZFS ARC?
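In case the ARC is not limited yet, a minimal sketch of capping it; the 8 GiB value is only an example, pick whatever fits next to the 16 GiB VM:

Code:
# add this line to /etc/modprobe.d/zfs.conf to cap the ARC at 8 GiB (value in bytes)
options zfs zfs_arc_max=8589934592

# apply the same limit at runtime without rebooting
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

# if the root filesystem is on ZFS, also rebuild the initramfs so the limit applies at boot
update-initramfs -u -k all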
 
That's while the backup is running:

Code:
root@proxmox:~# free -m
               total        used        free      shared  buff/cache   available
Mem:           31851       29684         282          60        2379        2166
Swap:              0           0           0


And that's after the backup has finished, when the VM doesn't start properly:

Code:
root@proxmox:~# free -m
               total        used        free      shared  buff/cache   available
Mem:           31851       24482        3945          57        3942        7369
Swap:              0           0           0


In the Summary tab of the VM the status is "running", RAM usage is up to 90%, CPU 10%, HA state none, IP: "Guest Agent not running"... It seems up but is not reachable.
 
There is no 16384 MB free, so the VM cannot start. Please, please, PLEASE answer the question about the ZFS ARC size and whether you changed it from the default of 50% (16 GB on this host). This is such a common mistake!
Not yet, but I will test it.

Nevertheless, thank you very much for your help!
OK, so you want a VM that uses 16 GB, you allow ZFS to use 16 GB, you run other VMs, and Proxmox itself needs 2 GB. Of course this will not work with a 32 GB host!
This issue has been answered many times over in this forum: don't overcommit memory, and be aware that ZFS is not integrated into the Linux page cache and also uses (a lot of) memory.
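To make the arithmetic explicit (rounded numbers; the 2 GB for Proxmox services and QEMU overhead is an estimate):

Code:
#  VM 100, pinned for PCIe passthrough       ~16 GB
#  ZFS ARC at the default cap (50% of RAM)   ~16 GB
#  Proxmox services + QEMU/OVMF overhead      ~2 GB
#  -------------------------------------------------
#  worst case                                ~34 GB  >  32 GB installed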
 
I've read that before, but I used the command arcstat while the backup was running and thought the values were OK:

Code:
root@proxmox:~# arcstat
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
21:04:20    19     0      0     0    0     0    0     0    0  2.1G  2.2G   3.5G
At peak it was at 5.5G, but maybe I was interpreting that wrong =(
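The arcstat size/c columns only show the current size and target; the ARC can still grow towards c_max later. A minimal sketch of checking that ceiling (assuming the usual arcstats layout):

Code:
# maximum size the ARC is allowed to grow to
awk '/^c_max/ {printf "ARC c_max: %.1f GiB\n", $3/1024/1024/1024}' /proc/spl/kstat/zfs/arcstats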