QEMU crash with vzdump

Dec 10, 2022
Hello everyone, lately some QEMU virtual machines have been unexpectedly crashing during vzdump. Below, I'm listing the errors as well as the versions of Proxmox Backup Server (PBS), Proxmox Virtual Environment (PVE), and QEMU.

So far, I haven't found anything related to this. Does anyone know more about it? Do you think raising the maximum number of open files (ulimit -n) could mitigate the issue? If so, where can I set it?

Thanks in advance.

** Error Log
Aug 02 22:31:33 pve-dc1-4 QEMU[1303812]: thread '<unnamed>' panicked at 'failed to spawn tokio runtime: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', /usr/share/cargo/registry/proxmox-a>
Aug 02 22:31:33 pve-dc1-4 QEMU[1303812]: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aug 02 22:31:33 pve-dc1-4 QEMU[1303812]: fatal runtime error: failed to initiate panic, error 5
Aug 02 22:31:33 pve-dc1-4 kernel: fwbr107i1: port 2(tap107i1) entered disabled state
Aug 02 22:31:33 pve-dc1-4 kernel: fwbr107i1: port 2(tap107i1) entered disabled state
Aug 02 22:31:34 pve-dc1-4 vzdump[879988]: VM 107 qmp command failed - VM 107 qmp command 'backup' failed - client closed connection
Aug 02 22:31:34 pve-dc1-4 vzdump[879988]: VM 107 qmp command failed - VM 107 not running
Aug 02 22:31:34 pve-dc1-4 vzdump[879988]: VM 107 qmp command failed - VM 107 not running
Aug 02 22:31:34 pve-dc1-4 vzdump[879988]: VM 107 qmp command failed - VM 107 not running
Aug 02 22:31:34 pve-dc1-4 vzdump[879988]: ERROR: Backup of VM 107 failed - VM 107 not running

** PVE
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-5.15: 7.4-3
pve-kernel-5.13: 7.1-9
pve-kernel-6.2.16-3-pve: 6.2.16-3
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 2.99.0-1
proxmox-backup-file-restore: 2.99.0-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.1
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1

** PBS
proxmox-backup: 3.0.0 (running kernel: 6.2.16-3-pve)
proxmox-backup-server: 3.0.1-1 (running version: 3.0.1)
pve-kernel-6.2: 8.0.2
pve-kernel-5.15: 7.4-3
pve-kernel-6.2.16-3-pve: 6.2.16-3
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
ifupdown2: 3.2.0-1+pmx3
libjs-extjs: 7.0.0-3
proxmox-backup-docs: 3.0.1-1
proxmox-backup-client: 3.0.1-1
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: not correctly installed
proxmox-widget-toolkit: 4.0.6
pve-xtermjs: 4.16.0-3
smartmontools: 7.3-pve1
zfsutils-linux: 2.1.12-pve1
 
Hi,
can you please share one or two example configurations of the affected VMs and tell us what kind of storage(s) you are using? You can check the limit and how many open files each of your VM processes has with e.g.
for pid in $(pidof kvm); do prlimit -p $pid | grep NOFILE; ls -1 /proc/$pid/fd/ | wc -l; done

The limit can be increased as described here: https://bugzilla.proxmox.com/show_bug.cgi?id=4507#c1

EDIT: to make this post linkable and self-contained:

The customer who ran into this issue has worked around it by increasing `DefaultLimitNOFILE` in /etc/systemd/system.conf and by increasing the limit for root in /etc/security/limits.d/.

The systemd limit applies to VMs started via the UI or API; for it to take effect, run systemctl daemon-reload and then systemctl restart pvedaemon.service pveproxy.service. The user limit applies to VMs started via qm; for it to take effect, log in to a new session.
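A minimal sketch of the two settings described above. The numbers are placeholder assumptions, not recommendations, and the file name under limits.d is hypothetical. The helper functions only print the lines to add, so they can be reviewed before the real files are edited.

```shell
#!/bin/sh
# Assumed example values; pick limits that suit your hosts.
NOFILE_SOFT=4096
NOFILE_HARD=524288

# Lines for the [Manager] section of /etc/systemd/system.conf.
# systemd accepts "soft:hard" as the value of DefaultLimitNOFILE.
systemd_nofile_lines() {
    printf '[Manager]\nDefaultLimitNOFILE=%s:%s\n' "$1" "$2"
}

# Lines for a file such as /etc/security/limits.d/90-nofile.conf
# (hypothetical name). root is named explicitly because the "*"
# wildcard in limits.conf does not apply to root.
limits_d_lines() {
    printf 'root soft nofile %s\nroot hard nofile %s\n' "$1" "$2"
}

systemd_nofile_lines "$NOFILE_SOFT" "$NOFILE_HARD"
limits_d_lines "$NOFILE_SOFT" "$NOFILE_HARD"

# After editing the real files (commands as given in the post above):
#   systemctl daemon-reload
#   systemctl restart pvedaemon.service pveproxy.service
# and open a new login session before starting VMs with qm.
```

Already running VMs keep the limit they were started with, so the new values only show up on QEMU processes started afterwards.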
 
Of course, here is everything (it doesn't look to me like the maximum number of open files is being hit :rolleyes:):

NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 1048576 files
954
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
250
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 1048576 files
892
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 1048576 files
935
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
172
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
725
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 1048576 files
993
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
257
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
259
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 1048576 files
911
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 1048576 files
927
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 1048576 files
924
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
927
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
936
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 1048576 files
917
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
919
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
927
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
922
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
970
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
384
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
1000
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 1048576 files
748
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
1006
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
940
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
924
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
972
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
939
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 524288 files
923
NOFILE max number of open files 1024 4096 files
0
NOFILE max number of open files 1024 1048576 files
909

agent: 0
boot: order=scsi0;ide2;net0
cores: 4
ide2: none,media=cdrom
memory: 4096
name: xxxxx
net0: virtio=66:EF:17:0D:23:ED,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
parent: auto-hourly-230803103403
protection: 1
scsi0: local-vm:vm-118-disk-0,format=raw,size=23G
scsi1: local-vm:vm-118-disk-1,format=raw,iothread=1,size=500G
scsihw: virtio-scsi-single
smbios1: uuid=ea3d41fa-a76e-4e71-b1b8-8b3909917176
sockets: 1
tags: xxx
vmgenid: 62f4796f-9a01-44f3-82dd-d449507b8aa2

agent: 1
boot: order=scsi0;ide2;net0
cores: 8
hotplug: disk,network,usb
ide2: none,media=cdrom
memory: 32768
name: xxxxx
net0: virtio=52:54:00:82:6b:53,bridge=vmbr0,firewall=1
net1: virtio=CA:4E:42:AF:85:F5,bridge=vmbr0,firewall=1,link_down=1
numa: 0
onboot: 1
ostype: l26
parent: auto-hourly-230803103951
protection: 1
scsi0: local-vm:vm-107-disk-1,format=raw,size=11G
scsi1: local-vm:vm-107-disk-2,format=raw,size=15G
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=851d0535-7155-4be3-8dc3-44c574c6110b
sockets: 2
vmgenid: 1ab5d409-d29c-4a24-8cc7-f80d09d4c234

boot: order=scsi0
cores: 4
ide0: none,media=cdrom
machine: pc-i440fx-6.1
memory: 10240
meta: creation-qemu=6.1.0,ctime=1640857569
name: yyyyy
net0: virtio=42:00:00:9C:60:2F,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win10
parent: auto-hourly-230803103455
scsi0: local-vm:vm-363-disk-0,size=52G
scsi1: local-vm:vm-363-disk-1,size=800G
scsihw: virtio-scsi-pci
smbios1: uuid=a4b0a686-8796-4bf4-9d1c-2f11d5151a91
sockets: 4
vmgenid: b2fe77b3-b1ed-45c1-9094-26a7bda0a36e
 
It's not yet at the maximum (1024), but you are getting very close, and new file descriptors need to be opened during backup. I'd suggest increasing the limit, but it's a bit surprising that there are so many. I will try to reproduce. Are you doing any other special operations with the VMs besides taking backups, e.g. snapshots, disk hotplug, etc.?
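To make "very close" concrete, the two numbers printed by the loop above can be turned into a count of remaining descriptors per process. This is a rough sketch: the function names and the warning threshold of 100 are assumptions made for illustration.

```shell
#!/bin/sh
# Soft "Max open files" limit of a process, read from /proc/<pid>/limits.
nofile_soft() {
    awk '/^Max open files/ { print $4 }' "/proc/$1/limits"
}

# Number of file descriptors the process currently has open.
open_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Descriptors left before the soft limit is reached.
fds_remaining() {
    echo $(( $1 - $2 ))
}

THRESHOLD=100   # assumed warning threshold

for pid in $(pidof kvm 2>/dev/null); do
    soft=$(nofile_soft "$pid")
    # Skip processes whose limit is missing or not numeric ("unlimited").
    case "$soft" in
        ''|*[!0-9]*) continue ;;
    esac
    used=$(open_fds "$pid")
    left=$(fds_remaining "$soft" "$used")
    if [ "$left" -lt "$THRESHOLD" ]; then
        echo "PID $pid: $used of $soft descriptors in use, only $left left"
    fi
done
```

For the busiest process in the output above (1006 open files against a soft limit of 1024), fds_remaining 1024 1006 gives 18.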
 
Right, I was confused and was looking at the hard limit.

Yes, we take several snapshots using "cv4pve-autosnap", and I'd say that explains it.

Kind regards, Luca
--
Luca
 
I can reproduce the issue. Every time a snapshot is taken, the number of open file descriptors increases by one. Likely, something is not cleaned up properly. I'll try to debug further.
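The growth can be observed by comparing the descriptor count of a VM's QEMU process before and after a snapshot. A hedged sketch: the snapshot name is a placeholder, and it assumes the PID file is at /var/run/qemu-server/<vmid>.pid, as on a standard Proxmox VE install.

```shell
#!/bin/sh
# Count the open file descriptors of a process.
count_fds() {
    ls "/proc/$1/fd" | wc -l | tr -d ' '
}

# Difference between two counts (after minus before).
fd_delta() {
    echo $(( $2 - $1 ))
}

# Take a throwaway snapshot of a VM and report how the FD count changed.
check_snapshot_leak() {
    vmid=$1
    snap=${2:-fdleaktest}   # placeholder snapshot name
    pid=$(cat "/var/run/qemu-server/$vmid.pid") || return 1
    before=$(count_fds "$pid")
    qm snapshot "$vmid" "$snap" || return 1
    qm delsnapshot "$vmid" "$snap" || return 1
    after=$(count_fds "$pid")
    echo "VM $vmid: $before -> $after open FDs (delta $(fd_delta "$before" "$after"))"
}

# Usage on a PVE host, as root:
#   check_snapshot_leak 107
```

On an affected version, the observation above suggests the delta stays positive even after the snapshot has been deleted again.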
 
I increased the NOFILE soft limit fourfold. I'm also attaching the lsof output of a KVM process. Additionally, I'd like to add that the "cv4pve-autosnap" snapshots are taken on ZFS storage.

I suppose that KVM processes started before the NOFILE increase are still at risk, since they inherited the old limit at startup. And even with a higher NOFILE, the crash is only delayed, not eliminated, as I imagine the file descriptors (FDs) are never released. If possible, I'd like confirmation of these hypotheses.

Thank you.
 

Attachment: lsof.txt (30.2 KB)
Unfortunately, yes. Creating a new QEMU instance resets the number of open files. This can be done via live migration, or via hibernate+resume if you have pve-qemu-kvm>=8.0.2-4, which is currently available in the no-subscription repository. Older versions had an issue with resuming when iothread=1 was set and a PBS backup had been done; that was fixed in 8.0.2-4.
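Both ways of getting a fresh QEMU process can be driven from the CLI. The sketch below only prints the qm commands it would use and runs nothing. VM ID 107 is taken from the log above; the target node name pve-dc1-5 is made up, and the helper names are hypothetical.

```shell
#!/bin/sh
# Print the qm commands that replace a VM's QEMU process with a new one.
# Nothing is executed; review the output and run it by hand.
fd_reset_plan() {
    vmid=$1
    method=$2
    target=${3:-}
    case "$method" in
        migrate)
            echo "qm migrate $vmid $target --online"
            ;;
        hibernate)
            echo "qm suspend $vmid --todisk 1"
            echo "qm resume $vmid"
            ;;
        *)
            echo "unknown method: $method" >&2
            return 1
            ;;
    esac
}

# True if the installed pve-qemu-kvm is at least 8.0.2-4, the version the
# post above names for hibernate+resume (needs dpkg, so Debian-based only).
hibernate_is_safe() {
    ver=$(dpkg-query -W -f='${Version}' pve-qemu-kvm) || return 1
    dpkg --compare-versions "$ver" ge 8.0.2-4
}

fd_reset_plan 107 migrate pve-dc1-5   # pve-dc1-5 is a made-up node name
fd_reset_plan 107 hibernate
```

For VMs on local storage, as in the configurations posted in this thread, a live migration usually also needs --with-local-disks.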

Ok, I understand. I apologize for the previous post, I hadn't read your message yet.
No need to apologize at all. Thank you again for the report!

I found a fix on the QEMU developer mailing list and tested it. Sent to our mailing list now.
 
Great!! When do you think it might be available on the Proxmox binary repository?

Thanks Luca

--
Luca
Unfortunately, I don't know. It doesn't help that it's holiday season here. But it should be in the next bump, so pve-qemu-kvm=8.0.2-5. It will first be in the test and no-subscription repositories until it's deemed ready for enterprise, i.e. if no other issue caused by the fix pops up.
 


I have a few questions to ask in order to mitigate the issue I'm experiencing in production on several Proxmox Virtual Environments (PVEs):

- If we stop taking snapshots, the number of open file descriptors won't increase, and therefore these anomalies won't occur. Is that correct?
- I increased NOFILE in limits.d and system.conf, but the new KVM processes still have the old definitions. Which parent process do I need to restart?
- Is it possible to perform a downgrade?

Thank you in advance for your assistance.
 
- If we stop taking snapshots, the number of open file descriptors won't increase, and therefore these anomalies won't occur. Is that correct?
Yes.
- I increased NOFILE in limits.d and system.conf, but the new KVM processes still have the old definitions. Which parent process do I need to restart?
Did you try systemctl daemon-reload?
- Is it possible to perform a downgrade?
Unfortunately not, because QEMU 7.2 is not supported for Proxmox VE 8.
 
@fiona

Thank you, that's perfect.

After the "daemon-reload" I restarted "pvedaemon", and now the NOFILE value is okay.

We'll wait for the release on the repository. Thanks for the support.

Best regards.
--
Luca
 
