[SOLVED] Random VM crashes when backing up VMs

Hello!

In the past few weeks we've been setting up a Proxmox cluster to host our applications.
From the start we've been experiencing some weird issues when backing up the VMs to an NFS target (a Synology NAS).

During the backup, a random number of VMs (sometimes none, sometimes several) crash at random points in their disk backups.

The storage is set up using ZFS over iSCSI to a TrueNAS SCALE server.

The VMs all run an Ubuntu Server 22.04 cloud-init image.

I've posted the relevant logs below:

Code:
INFO: Backup started at 2023-02-15 03:44:19
INFO: status = running
INFO: VM Name: kube-master-01
INFO: include disk 'scsi0' 'slc-storage-01-ssd:vm-401-disk-0' 51404M
iscsiadm: No session found.
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/RS-Brainworkz-Backup-No-Offsite/dump/vzdump-qemu-401-2023_02_15-03_44_19.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '512e4bb0-0410-4553-9f7d-89fe61aff9be'
INFO: resuming VM again
INFO:   0% (178.9 MiB of 50.2 GiB) in 3s, read: 59.6 MiB/s, write: 9.8 MiB/s
INFO:   1% (516.1 MiB of 50.2 GiB) in 9s, read: 56.2 MiB/s, write: 42.6 MiB/s
INFO:   2% (1.1 GiB of 50.2 GiB) in 19s, read: 56.3 MiB/s, write: 43.5 MiB/s
INFO:   3% (1.5 GiB of 50.2 GiB) in 28s, read: 56.3 MiB/s, write: 36.7 MiB/s
INFO:   4% (2.0 GiB of 50.2 GiB) in 37s, read: 56.2 MiB/s, write: 31.6 MiB/s
INFO:   5% (2.5 GiB of 50.2 GiB) in 46s, read: 56.2 MiB/s, write: 32.1 MiB/s
INFO:   6% (3.0 GiB of 50.2 GiB) in 55s, read: 56.4 MiB/s, write: 29.5 MiB/s
INFO:   7% (3.5 GiB of 50.2 GiB) in 1m 4s, read: 56.2 MiB/s, write: 21.3 KiB/s
INFO:   8% (4.0 GiB of 50.2 GiB) in 1m 13s, read: 56.2 MiB/s, write: 5.3 MiB/s
INFO:   9% (4.6 GiB of 50.2 GiB) in 1m 23s, read: 56.3 MiB/s, write: 18.7 MiB/s
INFO:  10% (5.1 GiB of 50.2 GiB) in 1m 32s, read: 56.3 MiB/s, write: 2.1 MiB/s
INFO:  11% (5.6 GiB of 50.2 GiB) in 1m 41s, read: 56.3 MiB/s, write: 1.7 MiB/s
INFO:  12% (6.1 GiB of 50.2 GiB) in 1m 51s, read: 50.6 MiB/s, write: 11.1 MiB/s
INFO:  13% (6.5 GiB of 50.2 GiB) in 2m, read: 56.2 MiB/s, write: 4.0 KiB/s
INFO:  14% (7.0 GiB of 50.2 GiB) in 2m 9s, read: 56.3 MiB/s, write: 0 B/s
INFO:  15% (7.5 GiB of 50.2 GiB) in 2m 18s, read: 56.2 MiB/s, write: 0 B/s
INFO:  16% (8.0 GiB of 50.2 GiB) in 2m 27s, read: 56.3 MiB/s, write: 0 B/s
INFO:  17% (8.6 GiB of 50.2 GiB) in 2m 37s, read: 56.2 MiB/s, write: 124.4 KiB/s
ERROR: VM 401 not running
INFO: aborting backup job
ERROR: VM 401 not running
INFO: resuming VM again
ERROR: Backup of VM 401 failed - VM 401 not running
INFO: Failed at 2023-02-15 03:47:09

How far into the backup the VM crashes varies too. Sometimes it happens after less than ten seconds, and sometimes it takes minutes.

I've tried looking through the syslogs of the VMs around the times they crash, but I can't find any log entries pointing to something going wrong (and there are no dmesg entries on the host either).

What steps could I take to try to diagnose these issues?

If any other information is required please let me know! :)
 
After some more investigating I've found that creating I/O load on the VM side while migrating a disk also triggers the same crash.
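
To reproduce it, I generate I/O load inside the guest while the disk migration is running. As an illustration only (not necessarily the exact command I used), something like the following sustained write creates that kind of load:

Code:
# run inside the guest while the disk migration is in progress
# (plain dd as an example of sustained direct writes; fio would work as well)
dd if=/dev/zero of=/var/tmp/ioload.bin bs=1M count=4096 oflag=direct status=progress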

I've found some logs that appear on the TrueNAS side before the VM actually crashes.

Code:
[Wed Feb 15 13:51:46 2023] [6302]: iscsi-scst: ***ERROR***: Connection 0000000032130e18 with initiator iqn.1993-08.org.debian:01:1a871eb45bab unexpectedly closed!
[Wed Feb 15 13:51:46 2023] [1868835]: scst: TM fn NEXUS_LOSS_SESS/6 (mcmd 000000008ea50236, initiator iqn.1993-08.org.debian:01:1a871eb45bab#10.20.0.2, target iqn.2005-10.slc.storage.01:proxmox)
[Wed Feb 15 13:51:46 2023] [6261]: scst: TM fn 6 (mcmd 000000008ea50236) finished, status 0
[Wed Feb 15 13:51:46 2023] [1868835]: iscsi-scst: Freeing conn 0000000032130e18 (sess=00000000921aec4f, 0x53e0000065ba380 0, initiator iqn.1993-08.org.debian:01:1a871eb45bab#10.20.0.2)
[Wed Feb 15 13:51:46 2023] [1868835]: iscsi-scst: Freeing session 00000000921aec4f (SID 53e0000065ba380)
[Wed Feb 15 13:51:46 2023] [6328]: scst: Using security group "security_group" for initiator "iqn.1993-08.org.debian:01:1a871eb45bab#10.20.0.2" (target iqn.2005-10.slc.storage.01:proxmox)
[Wed Feb 15 13:51:46 2023] [6328]: iscsi-scst: Session 000000001eb1bfda created: target 00000000a18d2540, tid 1, sid 0x53f0000d482dd80, initiator iqn.1993-08.org.debian:01:1a871eb45bab#10.20.0.2
[Wed Feb 15 13:51:46 2023] [6328]: iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 262144,
[Wed Feb 15 13:51:46 2023] [6328]: iscsi-scst:     MaxBurstLength 262144, FirstBurstLength 65536, DefaultTime2Wait 0, DefaultTime2Retain 0,
[Wed Feb 15 13:51:46 2023] [6328]: iscsi-scst:     MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
[Wed Feb 15 13:51:46 2023] [6328]: iscsi-scst:     HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048, RDMAExtensions No
[Wed Feb 15 13:51:46 2023] [6328]: iscsi-scst: Target parameters set for session 53f0000d482dd80: QueuedCommands 32, Response timeout 90, Nop-In interval 30, Nop-In timeout 30
[Wed Feb 15 13:51:46 2023] [6328]: iscsi-scst: Creating connection 00000000384c858d for sid 0x53f0000d482dd80, cid 0 (initiator iqn.1993-08.org.debian:01:1a871eb45bab#10.20.0.2)
[Wed Feb 15 13:51:46 2023] [0]: iscsi-scst: ***ERROR***: Connection 00000000477567b7 with initiator iqn.1993-08.org.debian:01:1a871eb45bab unexpectedly closed!
[Wed Feb 15 13:51:46 2023] [0]: iscsi-scst: ***ERROR***: Connection 00000000384c858d with initiator iqn.1993-08.org.debian:01:1a871eb45bab unexpectedly closed!
[Wed Feb 15 13:51:46 2023] [1869050]: scst: TM fn NEXUS_LOSS_SESS/6 (mcmd 00000000ea3bdb1c, initiator iqn.1993-08.org.debian:01:1a871eb45bab#10.20.0.2, target iqn.2005-10.slc.storage.01:proxmox)
[Wed Feb 15 13:51:46 2023] [1869051]: scst: TM fn NEXUS_LOSS_SESS/6 (mcmd 0000000046e39a43, initiator iqn.1993-08.org.debian:01:1a871eb45bab#10.20.0.2, target iqn.2005-10.slc.storage.01:proxmox)
[Wed Feb 15 13:51:50 2023] [6261]: scst: TM fn 6 (mcmd 00000000ea3bdb1c) finished, status 0
[Wed Feb 15 13:51:50 2023] [1869050]: iscsi-scst: Freeing conn 00000000477567b7 (sess=00000000f9731f77, 0x53d0000c6562980 0, initiator iqn.1993-08.org.debian:01:1a871eb45bab#10.20.0.2)
[Wed Feb 15 13:51:50 2023] [1869050]: iscsi-scst: Freeing session 00000000f9731f77 (SID 53d0000c6562980)
[Wed Feb 15 13:51:50 2023] [6261]: scst: TM fn 6 (mcmd 0000000046e39a43) finished, status 0
[Wed Feb 15 13:51:50 2023] [1869051]: iscsi-scst: Freeing conn 00000000384c858d (sess=000000001eb1bfda, 0x53f0000d482dd80 0, initiator iqn.1993-08.org.debian:01:1a871eb45bab#10.20.0.2)
[Wed Feb 15 13:51:50 2023] [1869051]: iscsi-scst: Freeing session 000000001eb1bfda (SID 53f0000d482dd80)
[Wed Feb 15 13:51:57 2023] [6328]: scst: Using security group "security_group" for initiator "iqn.1993-08.org.debian:01:1a871eb45bab#10.20.0.2" (target iqn.2005-10.slc.storage.01:proxmox)
[Wed Feb 15 13:51:58 2023] [6328]: iscsi-scst: Session 00000000fa3b5117 created: target 00000000a18d2540, tid 1, sid 0x540000075292780, initiator iqn.1993-08.org.debian:01:1a871eb45bab#10.20.0.2
[Wed Feb 15 13:51:58 2023] [6328]: iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 262144,
[Wed Feb 15 13:51:58 2023] [6328]: iscsi-scst:     MaxBurstLength 262144, FirstBurstLength 65536, DefaultTime2Wait 0, DefaultTime2Retain 0,
[Wed Feb 15 13:51:58 2023] [6328]: iscsi-scst:     MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
[Wed Feb 15 13:51:58 2023] [6328]: iscsi-scst:     HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048, RDMAExtensions No
[Wed Feb 15 13:51:58 2023] [6328]: iscsi-scst: Target parameters set for session 540000075292780: QueuedCommands 32, Response timeout 90, Nop-In interval 30, Nop-In timeout 30
[Wed Feb 15 13:51:58 2023] [6328]: iscsi-scst: Creating connection 000000001d38d768 for sid 0x540000075292780, cid 0 (initiator iqn.1993-08.org.debian:01:1a871eb45bab#10.20.0.2)
[Wed Feb 15 13:51:58 2023] [6284]: iscsi-scst: Logout received from initiator iqn.1993-08.org.debian:01:1a871eb45bab
[Wed Feb 15 13:51:58 2023] [6302]: iscsi-scst: Closing connection at initiator's iqn.1993-08.org.debian:01:1a871eb45bab request
[Wed Feb 15 13:51:58 2023] [1872490]: scst: TM fn NEXUS_LOSS_SESS/6 (mcmd 000000001b0b7b3d, initiator iqn.1993-08.org.debian:01:1a871eb45bab#10.20.0.2, target iqn.2005-10.slc.storage.01:proxmox)
[Wed Feb 15 13:51:58 2023] [6261]: scst: TM fn 6 (mcmd 000000001b0b7b3d) finished, status 0
[Wed Feb 15 13:51:58 2023] [1872490]: iscsi-scst: Freeing conn 000000001d38d768 (sess=00000000fa3b5117, 0x540000075292780 0, initiator iqn.1993-08.org.debian:01:1a871eb45bab#10.20.0.2)
[Wed Feb 15 13:51:58 2023] [1872490]: iscsi-scst: Freeing session 00000000fa3b5117 (SID 540000075292780)

The following dmesg entries appear on the Proxmox side a few seconds after the first log entries appear on the TrueNAS side:

Code:
[Wed Feb 15 13:51:54 2023] kvm[1515396]: segfault at 55d6f5c696de ip 000055d6f5c5d4f7 sp 00007f070794d290 error 7 in qemu-system-x86_64[55d6f56eb000+5a6000]
[Wed Feb 15 13:51:54 2023] Code: 83 ec 28 64 48 8b 04 25 28 00 00 00 48 89 44 24 18 31 c0 8b 05 5a c4 83 00 85 c0 0f 85 c2 00 00 00 48 8d 0d 2b 30 1e 00 31 c0 <f0> 48 0f b1 4b 30 48 85 c0 0f 85 25 01 00 00 48 89 ef e8 e2 fb a8
[Wed Feb 15 13:51:54 2023] vmbr0: port 6(tap107i0) entered disabled state
[Wed Feb 15 13:51:54 2023] vmbr0: port 6(tap107i0) entered disabled state
[Wed Feb 15 13:52:18 2023] device tap107i0 entered promiscuous mode
[Wed Feb 15 13:52:18 2023] vmbr0: port 6(tap107i0) entered blocking state
[Wed Feb 15 13:52:18 2023] vmbr0: port 6(tap107i0) entered disabled state
[Wed Feb 15 13:52:18 2023] vmbr0: port 6(tap107i0) entered blocking state
[Wed Feb 15 13:52:18 2023] vmbr0: port 6(tap107i0) entered forwarding state

My logical conclusion would be that the connection drops somewhere and that this causes the VM to crash. Where should I look to track down the culprit?
 
Hi,
please post the output of pveversion -v and qm config <ID> with the ID of an affected VM.

My logical conclusion would be that the connection drops somewhere and that this causes the VM to crash. Where should I look to track down the culprit?
I'd recommend re-checking the network configuration and stability. But it could also be that the root cause is in QEMU; in any case, it shouldn't crash.
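
For a first check of the iSCSI side, you can look at the session state and the error counters on the storage-facing interface on the host, for example (the interface name is a placeholder):

Code:
# list active iSCSI sessions and their state
iscsiadm -m session -P 3
# error/drop counters on the storage-facing NIC (replace <iface> with the actual interface)
ip -s link show <iface>
ethtool -S <iface> | grep -iE 'err|drop|crc'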

If you want to further debug the QEMU part of the issue, you can run apt install pve-qemu-kvm-dbg gdb to install the relevant debug symbols and the debugger. After starting the VM, you can attach the debugger with
Code:
gdb --ex 'set pagination off' --ex 'handle SIGUSR1 noprint nostop' --ex 'handle SIGPIPE noprint nostop' --ex 'c' -p $(cat /var/run/qemu-server/<ID>.pid)
replacing <ID> with the ID of your VM.

You should see output similar to the following:
Code:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007feada80cee6 in __ppoll (fds=0x559a628eb450, nfds=15, timeout=<optimized out>, timeout@entry=0x7ffcfb3b98c0, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
44      ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.
Signal        Stop      Print   Pass to program Description
SIGUSR1       No        No      Yes             User defined signal 1
Signal        Stop      Print   Pass to program Description
SIGPIPE       No        No      Yes             Broken pipe
Continuing.

Then, try to trigger the crash again. You should end up at a debugger prompt when it happens. There you can enter t a a bt to get backtraces for all threads. Please share the output of your GDB session here.
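
If it's easier, you can also tell GDB to write the session to a log file that you can then attach here (optional; the file name below is just an example):

Code:
(gdb) set logging file /root/gdb-vm-<ID>.log
(gdb) set logging on
(gdb) t a a bt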
 
Hello!

The output of pveversion -v is:
Code:
proxmox-ve: 7.3-1 (running kernel: 5.15.85-1-pve)
pve-manager: 7.3-6 (running version: 7.3-6/723bb6ec)
pve-kernel-helper: 7.3-4
pve-kernel-5.15: 7.3-2
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-1
lxcfs: 5.0.3-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.5
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-3
pve-ha-manager: 3.5.1
pve-i18n: 2.8-2
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1

The VM config is as follows:

Code:
agent: 1,fstrim_cloned_disks=1
boot: order=scsi0;net0
cipassword: **********
ciuser: root
cores: 2
ide2: slc-storage-01-ssd:vm-107-cloudinit,media=cdrom
ipconfig0: ip=10.2.10.4/16,gw=10.2.0.1
memory: 4096
meta: creation-qemu=7.1.0,ctime=1674207981
name: vaultwarden-ubuntu
net0: virtio=DA:9B:FC:25:09:E7,bridge=vmbr0,tag=2
ostype: l26
scsi0: slc-storage-01-ssd:vm-107-disk-1,discard=on,iothread=1,size=102604M,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=7414f0bd-b746-492b-82b3-82f81f73452e
sshkeys: ssh-rsa%20AAAAB3NzaC1yc2EAAAADAQABAAABgQDMf5xFnOxhTWx1IgM3FOS8CZqtxR%2BHSNi0DOFUjp293ISElwLt0i6DbpM%2Bxu08McyKp2u2qzzFQ1SAtdIHa%2F4akUoNJiQuzbvPffBKwVMAE%2BQFbTUByZS1RtMGVi2vH%2BDweDzzG4QnoWaZtBxHJLsigtCKPKSXP1ehc5rLvA%2BHNJDuezYu1fJFiMvl1hALoCpUOuh7VDEX3jms70zbnCy4m4uGxgSryhwpLmA0Fg3I5dPTZuaXi9oi72Z0AsALDuAq0t9iwOTHoLNooiQ1tCCApwV1JW82u21jT7VwUG0lAhfLx0sIRGqwno2HKs%2BgM45F9LHMGtJuaxBryVOpojAM4lftUaRKYvEulqwFBKsNm8hztHEhhqHZZt0CZAjas1OAcbBqDDgCUzwHxNQwA1iTnUJRpmyx8YOK3eZ%2FnE5vO0G1C7phs4cnITQy5Y%2FWNyaRJFSXHB15f%2FKzCRr1tB2Zbamcvv0%2BnLaB14LOAI4yo32Q6GDMlWNPc46s8NuM%2Bp8%3D%20robin%40Workstation-01%0A%0Assh-rsa%20AAAAB3NzaC1yc2EAAAADAQABAAABgQCtznORh8P9RC%2Bneneto4h2Bww%2B4eP2UnwtzC6C6nXrtIXPkn8KAhxObe2%2BsuBnFSaZOyqPgt%2FlHyrRBqnostRkHA6TBcwae3SkaIbS3n%2FpVV3%2FNyrnsCOwIBjA1vQaZjTQxKGh65ZW%2FowdY7p82eM369ZaBMsFsPcQwvHD5EokSUPbjS%2FT8l0%2B2AxauGQXb6peDqD%2FhjAPx%2Fvsw8N3KDxBJOPzGF1wDO%2BCltEIrMYaQ75d0zqs76IPtUUZNHFouaNaEiUWVm2RgLKC3l9EKqgyC5DYiUDTS0moeK1XYODPeEu4XY24F520yiGN5UgozZSenvhO7hzqtmgFA0Y7oDE7UZuuLPkzZyu%2Ff06jHiHXmYc1n5i1RdYQDGVrRR5%2FssaxGzPBG2N0CFbd7J6hV6z7Gkcakgko5R3BeLcC5XqeNnttAaUV8r514odr7guc5U2hzAnucHErkA9LrHvyxhyYeOLZTaPnh63spyUA%2BB31PKD9B63DEAE6L2hNQqLDuTE%3D%20mauri%40Workstation-04
tags: internal;nfs
vmgenid: 37c4a603-71e1-4a02-8882-a25aefaf6cd6

I'll run the debugger now and post the results below.
 
The output from GDB is the following:

Code:
root@slc-app-02:~# gdb --ex 'set pagination off' --ex 'handle SIGUSR1 noprint nostop' --ex 'handle SIGPIPE noprint nostop' --ex 'c' -p $(cat /var/run/qemu-server/107.pid)
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 1606770
[New LWP 1606771]
[New LWP 1606772]
[New LWP 1606796]
[New LWP 1606797]
[New LWP 1606799]
[New LWP 1607011]
[New LWP 1607115]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007faf63a54e26 in __ppoll (fds=0x55d12cdaff90, nfds=76, timeout=<optimized out>, timeout@entry=0x7ffdff528c80, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
44      ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.
Signal        Stop      Print   Pass to program Description
SIGUSR1       No        No      Yes             User defined signal 1
Signal        Stop      Print   Pass to program Description
SIGPIPE       No        No      Yes             Broken pipe
Continuing.

Thread 3 "kvm" received signal SIGABRT, Aborted.
[Switching to Thread 0x7faf53fff700 (LWP 1606772)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) t a a bt

Thread 8 (Thread 0x7faf53fff700 (LWP 1607115) "iou-wrk-1606772"):
#0  0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x0

Thread 7 (Thread 0x7faf53fff700 (LWP 1607011) "iou-wrk-1606772"):
#0  0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x0

Thread 6 (Thread 0x7fae43dff700 (LWP 1606799) "vnc_worker"):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x55d12c2b5778) at ../sysdeps/nptl/futex-internal.h:186
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55d12c2b5788, cond=0x55d12c2b5750) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x55d12c2b5750, mutex=mutex@entry=0x55d12c2b5788) at pthread_cond_wait.c:638
#3  0x000055d12afc055b in qemu_cond_wait_impl (cond=0x55d12c2b5750, mutex=0x55d12c2b5788, file=0x55d12b04b274 "../ui/vnc-jobs.c", line=248) at ../util/qemu-thread-posix.c:219
#4  0x000055d12aa8e5e3 in vnc_worker_thread_loop (queue=0x55d12c2b5750) at ../ui/vnc-jobs.c:248
#5  0x000055d12aa8f2a8 in vnc_worker_thread (arg=arg@entry=0x55d12c2b5750) at ../ui/vnc-jobs.c:361
#6  0x000055d12afbfa19 in qemu_thread_start (args=0x7fae43dfa3f0) at ../util/qemu-thread-posix.c:504
#7  0x00007faf63b42ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8  0x00007faf63a60a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7faf52ffd700 (LWP 1606797) "CPU 1/KVM"):
#0  0x00007faf63a565f7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x000055d12ae3e817 in kvm_vcpu_ioctl (cpu=cpu@entry=0x55d12c297e50, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3089
#2  0x000055d12ae3e981 in kvm_cpu_exec (cpu=cpu@entry=0x55d12c297e50) at ../accel/kvm/kvm-all.c:2904
#3  0x000055d12ae3ffed in kvm_vcpu_thread_fn (arg=arg@entry=0x55d12c297e50) at ../accel/kvm/kvm-accel-ops.c:49
#4  0x000055d12afbfa19 in qemu_thread_start (args=0x7faf52ff83f0) at ../util/qemu-thread-posix.c:504
#5  0x00007faf63b42ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007faf63a60a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7faf537fe700 (LWP 1606796) "CPU 0/KVM"):
#0  0x00007faf63a565f7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x000055d12ae3e817 in kvm_vcpu_ioctl (cpu=cpu@entry=0x55d12c2619d0, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3089
#2  0x000055d12ae3e981 in kvm_cpu_exec (cpu=cpu@entry=0x55d12c2619d0) at ../accel/kvm/kvm-all.c:2904
#3  0x000055d12ae3ffed in kvm_vcpu_thread_fn (arg=arg@entry=0x55d12c2619d0) at ../accel/kvm/kvm-accel-ops.c:49
#4  0x000055d12afbfa19 in qemu_thread_start (args=0x7faf537f93f0) at ../util/qemu-thread-posix.c:504
#5  0x00007faf63b42ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007faf63a60a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7faf53fff700 (LWP 1606772) "kvm"):
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007faf63986537 in __GI_abort () at abort.c:79
#2  0x00007faf639df768 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7faf63afd3a5 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007faf639e6a5a in malloc_printerr (str=str@entry=0x7faf63affb98 "malloc(): invalid next size (unsorted)") at malloc.c:5347
#4  0x00007faf639e9b94 in _int_malloc (av=av@entry=0x7faf4c000020, bytes=bytes@entry=256) at malloc.c:3739
#5  0x00007faf639eb299 in __GI___libc_malloc (bytes=256) at malloc.c:3066
#6  0x00007faf64494fdc in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
#7  0x00007faf644950da in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
#8  0x00007faf64499af6 in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
#9  0x00007faf6449a348 in iscsi_scsi_command_async () from /lib/x86_64-linux-gnu/libiscsi.so.7
#10 0x00007faf64494401 in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
#11 0x00007faf644940bc in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
#12 0x00007faf64497ebb in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
#13 0x00007faf6449966e in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
#14 0x00007faf644a58e7 in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
#15 0x000055d12af19054 in iscsi_process_read (arg=0x55d12c169a00) at ../block/iscsi.c:403
#16 0x000055d12afbbf78 in aio_dispatch_handler (ctx=ctx@entry=0x55d12c16a510, node=0x7faf4c03da80) at ../util/aio-posix.c:369
#17 0x000055d12afbc9c8 in aio_dispatch_ready_handlers (ready_list=0x7faf53ffa368, ctx=0x55d12c16a510) at ../util/aio-posix.c:399
#18 aio_poll (ctx=0x55d12c16a510, blocking=blocking@entry=true) at ../util/aio-posix.c:713
#19 0x000055d12ae7bd36 in iothread_run (opaque=opaque@entry=0x55d12bf9a900) at ../iothread.c:67
#20 0x000055d12afbfa19 in qemu_thread_start (args=0x7faf53ffa3f0) at ../util/qemu-thread-posix.c:504
#21 0x00007faf63b42ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#22 0x00007faf63a60a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7faf58f0f700 (LWP 1606771) "call_rcu"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x000055d12afc0bda in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /build/pve-qemu/pve-qemu-kvm-7.1.0/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x55d12b80c608 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:429
#3  0x000055d12afc916a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
#4  0x000055d12afbfa19 in qemu_thread_start (args=0x7faf58f0a3f0) at ../util/qemu-thread-posix.c:504
#5  0x00007faf63b42ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007faf63a60a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7faf590721c0 (LWP 1606770) "kvm"):
#0  0x00007faf63a54e26 in __ppoll (fds=0x55d12cdaff90, nfds=76, timeout=<optimized out>, timeout@entry=0x7ffdff528c80, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x000055d12afdf291 in ppoll (__ss=0x0, __timeout=0x7ffdff528c80, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=316490797) at ../util/qemu-timer.c:351
#3  0x000055d12afdbaa5 in os_host_main_loop_wait (timeout=316490797) at ../util/main-loop.c:315
#4  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:596
#5  0x000055d12ac2a861 in qemu_main_loop () at ../softmmu/runstate.c:734
#6  0x000055d12aa67edc in qemu_main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:38
#7  0x00007faf63987d0a in __libc_start_main (main=0x55d12aa631e0 <main>, argc=72, argv=0x7ffdff528e48, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdff528e38) at ../csu/libc-start.c:308
#8  0x000055d12aa67e0a in _start ()
(gdb)
 
I'd recommend re-checking the network configuration and stability. But it could also be that the root cause is in QEMU; in any case, it shouldn't crash.
I don't think the network is the issue, as the storage connection is just a flat network between the host and the TrueNAS; it only goes over one switch.
I have also tried disabling the link aggregation on the interfaces on both the NAS and Proxmox sides, but that hasn't fixed anything either.
 
Thread 3 (Thread 0x7faf53fff700 (LWP 1606772) "kvm"):
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007faf63986537 in __GI_abort () at abort.c:79
The abort is here.
#2 0x00007faf639df768 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7faf63afd3a5 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007faf639e6a5a in malloc_printerr (str=str@entry=0x7faf63affb98 "malloc(): invalid next size (unsorted)") at malloc.c:5347
#4 0x00007faf639e9b94 in _int_malloc (av=av@entry=0x7faf4c000020, bytes=bytes@entry=256) at malloc.c:3739
Unfortunately, this can indicate heap corruption, which can be very difficult to debug.
#5 0x00007faf639eb299 in __GI___libc_malloc (bytes=256) at malloc.c:3066
#6 0x00007faf64494fdc in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
#7 0x00007faf644950da in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
#8 0x00007faf64499af6 in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
#9 0x00007faf6449a348 in iscsi_scsi_command_async () from /lib/x86_64-linux-gnu/libiscsi.so.7
#10 0x00007faf64494401 in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
#11 0x00007faf644940bc in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
#12 0x00007faf64497ebb in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
#13 0x00007faf6449966e in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
#14 0x00007faf644a58e7 in ?? () from /lib/x86_64-linux-gnu/libiscsi.so.7
Installing the debug symbols for libiscsi would help to see these functions too. It could be that the corruption happens right before the malloc in those functions, in which case there's at least hope to find the root cause of the issue quickly.
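
If the regular libiscsi packages don't ship symbols, the Debian dbgsym packages should; a sketch, assuming the usual debian-debug archive and the <package>-dbgsym naming scheme:

Code:
# enable the Debian debug-symbol archive and install the dbgsym package for libiscsi7
echo 'deb http://deb.debian.org/debian-debug/ bullseye-debug main' > /etc/apt/sources.list.d/debug.list
apt update
apt install libiscsi7-dbgsym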
 
Installing the debug symbols for libiscsi would help to see these functions too. It could be that the corruption happens right before the malloc in those functions, in which case there's at least hope to find the root cause of the issue quickly.
I have installed libiscsi-bin and ran the debugger again. The following stack trace was printed after triggering the crash again:

Code:
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 1211702
[New LWP 1211703]
[New LWP 1211704]
[New LWP 1211941]
[New LWP 1211942]
[New LWP 1211944]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fcec3507e26 in __ppoll (fds=0x5571f07d0200, nfds=75, timeout=<optimized out>, timeout@entry=0x7ffee15a4100, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
44      ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.
Signal        Stop      Print   Pass to program Description
SIGUSR1       No        No      Yes             User defined signal 1
Signal        Stop      Print   Pass to program Description
SIGPIPE       No        No      Yes             Broken pipe
Continuing.

Thread 3 "kvm" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fceb3fff700 (LWP 1211704)]
__strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:77
77      ../sysdeps/x86_64/multiarch/strlen-evex.S: No such file or directory.
(gdb) t a a bt

Thread 6 (Thread 0x7fcda3dff700 (LWP 1211944) "vnc_worker"):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x5571f0e6c628) at ../sysdeps/nptl/futex-internal.h:186
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5571f0e6c638, cond=0x5571f0e6c600) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x5571f0e6c600, mutex=mutex@entry=0x5571f0e6c638) at pthread_cond_wait.c:638
#3  0x00005571ed15355b in qemu_cond_wait_impl (cond=0x5571f0e6c600, mutex=0x5571f0e6c638, file=0x5571ed1de274 "../ui/vnc-jobs.c", line=248) at ../util/qemu-thread-posix.c:219
#4  0x00005571ecc215e3 in vnc_worker_thread_loop (queue=0x5571f0e6c600) at ../ui/vnc-jobs.c:248
#5  0x00005571ecc222a8 in vnc_worker_thread (arg=arg@entry=0x5571f0e6c600) at ../ui/vnc-jobs.c:361
#6  0x00005571ed152a19 in qemu_thread_start (args=0x7fcda3dfa3f0) at ../util/qemu-thread-posix.c:504
#7  0x00007fcec35f5ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8  0x00007fcec3513a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7fceb2cbc700 (LWP 1211942) "CPU 1/KVM"):
#0  0x00007fcec35095f7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x00005571ecfd1817 in kvm_vcpu_ioctl (cpu=cpu@entry=0x5571efa3aae0, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3089
#2  0x00005571ecfd1981 in kvm_cpu_exec (cpu=cpu@entry=0x5571efa3aae0) at ../accel/kvm/kvm-all.c:2904
#3  0x00005571ecfd2fed in kvm_vcpu_thread_fn (arg=arg@entry=0x5571efa3aae0) at ../accel/kvm/kvm-accel-ops.c:49
#4  0x00005571ed152a19 in qemu_thread_start (args=0x7fceb2cb73f0) at ../util/qemu-thread-posix.c:504
#5  0x00007fcec35f5ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007fcec3513a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7fceb37fe700 (LWP 1211941) "CPU 0/KVM"):
#0  0x00007fcec35095f7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x00005571ecfd1817 in kvm_vcpu_ioctl (cpu=cpu@entry=0x5571efa049f0, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3089
#2  0x00005571ecfd1981 in kvm_cpu_exec (cpu=cpu@entry=0x5571efa049f0) at ../accel/kvm/kvm-all.c:2904
#3  0x00005571ecfd2fed in kvm_vcpu_thread_fn (arg=arg@entry=0x5571efa049f0) at ../accel/kvm/kvm-accel-ops.c:49
#4  0x00005571ed152a19 in qemu_thread_start (args=0x7fceb37f93f0) at ../util/qemu-thread-posix.c:504
#5  0x00007fcec35f5ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007fcec3513a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fceb3fff700 (LWP 1211704) "kvm"):
#0  __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:77
#1  0x00007fcec347ff76 in __vfprintf_internal (s=s@entry=0x7fceb3ff7b10, format=0x5571ed346450 "%s: Co-routine was already scheduled in '%s'\n", ap=0x7fceb3ffa1b0, mode_flags=2) at vfprintf-internal.c:1688
#2  0x00007fcec3480f44 in buffered_vfprintf (s=s@entry=0x7fcec35e75c0 <_IO_2_1_stderr_>, format=format@entry=0x5571ed346450 "%s: Co-routine was already scheduled in '%s'\n", args=args@entry=0x7fceb3ffa1b0, mode_flags=mode_flags@entry=2) at vfprintf-internal.c:2377
#3  0x00007fcec347e0d4 in __vfprintf_internal (s=0x7fcec35e75c0 <_IO_2_1_stderr_>, format=0x5571ed346450 "%s: Co-routine was already scheduled in '%s'\n", ap=ap@entry=0x7fceb3ffa1b0, mode_flags=mode_flags@entry=2) at vfprintf-internal.c:1346
#4  0x00007fcec3521cff in ___fprintf_chk (fp=<optimized out>, flag=flag@entry=1, format=format@entry=0x5571ed346450 "%s: Co-routine was already scheduled in '%s'\n") at fprintf_chk.c:33
#5  0x00005571ed163648 in fprintf (__fmt=0x5571ed346450 "%s: Co-routine was already scheduled in '%s'\n", __stream=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:100
#6  aio_co_schedule (ctx=0x2d4eb72cdc91933, co=0x7fceac03c2b0) at ../util/async.c:590
#7  0x00005571ed162c64 in aio_bh_call (bh=0x7fceac025c70) at ../util/async.c:150
#8  aio_bh_poll (ctx=ctx@entry=0x5571ef90ab40) at ../util/async.c:178
#9  0x00005571ed14f980 in aio_poll (ctx=0x5571ef90ab40, blocking=blocking@entry=true) at ../util/aio-posix.c:712
#10 0x00005571ed00ed36 in iothread_run (opaque=opaque@entry=0x5571ef843c00) at ../iothread.c:67
#11 0x00005571ed152a19 in qemu_thread_start (args=0x7fceb3ffa3f0) at ../util/qemu-thread-posix.c:504
#12 0x00007fcec35f5ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#13 0x00007fcec3513a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fceb89ba700 (LWP 1211703) "call_rcu"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00005571ed153bda in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /build/pve-qemu/pve-qemu-kvm-7.1.0/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x5571ed99f608 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:429
#3  0x00005571ed15c16a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
#4  0x00005571ed152a19 in qemu_thread_start (args=0x7fceb89b53f0) at ../util/qemu-thread-posix.c:504
#5  0x00007fcec35f5ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007fcec3513a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fceb8b251c0 (LWP 1211702) "kvm"):
#0  0x00007fcec3507e26 in __ppoll (fds=0x5571f07d0200, nfds=76, timeout=<optimized out>, timeout@entry=0x7ffee15a4100, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x00005571ed172291 in ppoll (__ss=0x0, __timeout=0x7ffee15a4100, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=338794369) at ../util/qemu-timer.c:351
#3  0x00005571ed16eaa5 in os_host_main_loop_wait (timeout=338794369) at ../util/main-loop.c:315
#4  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:596
#5  0x00005571ecdbd861 in qemu_main_loop () at ../softmmu/runstate.c:734
#6  0x00005571ecbfaedc in qemu_main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:38
#7  0x00007fcec343ad0a in __libc_start_main (main=0x5571ecbf61e0 <main>, argc=72, argv=0x7ffee15a42c8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffee15a42b8) at ../csu/libc-start.c:308
#8  0x00005571ecbfae0a in _start ()
(gdb)

I hope this helps to debug the issue. Let me know if I need other packages for the debug symbols, as I'm not very familiar with GDB debugging.
 
Thread 3 "kvm" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fceb3fff700 (LWP 1211704)]
__strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:77
77 ../sysdeps/x86_64/multiarch/strlen-evex.S: No such file or directory.
(gdb) t a a bt

Thread 3 (Thread 0x7fceb3fff700 (LWP 1211704) "kvm"):
#0 __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:77
#1 0x00007fcec347ff76 in __vfprintf_internal (s=s@entry=0x7fceb3ff7b10, format=0x5571ed346450 "%s: Co-routine was already scheduled in '%s'\n", ap=0x7fceb3ffa1b0, mode_flags=2) at vfprintf-internal.c:1688
#2 0x00007fcec3480f44 in buffered_vfprintf (s=s@entry=0x7fcec35e75c0 <_IO_2_1_stderr_>, format=format@entry=0x5571ed346450 "%s: Co-routine was already scheduled in '%s'\n", args=args@entry=0x7fceb3ffa1b0, mode_flags=mode_flags@entry=2) at vfprintf-internal.c:2377
#3 0x00007fcec347e0d4 in __vfprintf_internal (s=0x7fcec35e75c0 <_IO_2_1_stderr_>, format=0x5571ed346450 "%s: Co-routine was already scheduled in '%s'\n", ap=ap@entry=0x7fceb3ffa1b0, mode_flags=mode_flags@entry=2) at vfprintf-internal.c:1346
#4 0x00007fcec3521cff in ___fprintf_chk (fp=<optimized out>, flag=flag@entry=1, format=format@entry=0x5571ed346450 "%s: Co-routine was already scheduled in '%s'\n") at fprintf_chk.c:33
#5 0x00005571ed163648 in fprintf (__fmt=0x5571ed346450 "%s: Co-routine was already scheduled in '%s'\n", __stream=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:100
#6 aio_co_schedule (ctx=0x2d4eb72cdc91933, co=0x7fceac03c2b0) at ../util/async.c:590
#7 0x00005571ed162c64 in aio_bh_call (bh=0x7fceac025c70) at ../util/async.c:150
#8 aio_bh_poll (ctx=ctx@entry=0x5571ef90ab40) at ../util/async.c:178
#9 0x00005571ed14f980 in aio_poll (ctx=0x5571ef90ab40, blocking=blocking@entry=true) at ../util/aio-posix.c:712
#10 0x00005571ed00ed36 in iothread_run (opaque=opaque@entry=0x5571ef843c00) at ../iothread.c:67
#11 0x00005571ed152a19 in qemu_thread_start (args=0x7fceb3ffa3f0) at ../util/qemu-thread-posix.c:504
#12 0x00007fcec35f5ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#13 0x00007fcec3513a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Unfortunately, it's at a completely different location this time, and it's a segmentation fault instead of an abort.

I suggest you also check your RAM with e.g. MemTest86 (included on the Proxmox VE installer ISO). Memory issues at random locations can be caused by bad hardware too, not just software errors like heap corruption.
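
If taking a host offline for a full memory test is inconvenient, you can at least check whether the kernel has already logged memory or machine-check errors; this is only a first indication, not a replacement for MemTest86:

Code:
# any EDAC / machine-check messages since boot?
journalctl -k | grep -iE 'edac|mce|machine check'
# per-DIMM corrected/uncorrected error counters (requires the edac-utils package)
edac-util -v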
 
Unfortunately, it's at a completely different location this time, and it's a segmentation fault instead of an abort.

I suggest you also check your RAM with e.g. MemTest86 (included on the Proxmox VE installer ISO). Memory issues at random locations can be caused by bad hardware too, not just software errors like heap corruption.
Aha, that's weird.

The problem does happen across three separate hosts though, so I can't imagine it being a hardware issue. Unless all three of our hosts are bad (Dell R640s). Is there anywhere else I can look? I'll run a memtest on one to be sure though!
 
Aha, that's weird.

The problem does happen across three separate hosts though, so I can't imagine it being a hardware issue. Unless all three of our hosts are bad (Dell R640s). Is there anywhere else I can look? I'll run a memtest on one to be sure though!
Okay, then it's much more likely to be a software issue after all.

If the heap corruption originates from some other context, it'll be very hard to debug, unfortunately. There are tools like valgrind to catch/debug such memory issues, but with something as huge as a full-blown QEMU VM, I think the slowdown is just too much.

If you really want, you can try to trigger it a few more times, maybe there is some pattern to it. You can use just bt to get the backtrace of the current thread (you should always land in the one with the error).
 
Okay, then it's much more likely to be a software issue after all.

If the heap corruption originates from some other context, it'll be very hard to debug, unfortunately. There are tools like valgrind to catch/debug such memory issues, but with something as huge as a full-blown QEMU VM, I think the slowdown is just too much.

If you really want, you can try to trigger it a few more times, maybe there is some pattern to it. You can use just bt to get the backtrace of the current thread (you should always land in the one with the error).
Alright! I will try triggering the error a few more times on Monday and see where we can go from there.

Have a nice weekend!
 
I have just triggered the crash a few more times. I'll post the results below:

Code:
Thread 3 "kvm" received signal SIGABRT, Aborted.
[Switching to Thread 0x7f33c23ef700 (LWP 1491768)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) t a a bt

Thread 6 (Thread 0x7f32b21bf700 (LWP 1492006) "vnc_worker"):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x562eba256848) at ../sysdeps/nptl/futex-internal.h:186
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x562eba256858, cond=0x562eba256820) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x562eba256820, mutex=mutex@entry=0x562eba256858) at pthread_cond_wait.c:638
#3  0x0000562eb811955b in qemu_cond_wait_impl (cond=0x562eba256820, mutex=0x562eba256858, file=0x562eb81a4274 "../ui/vnc-jobs.c", line=248) at ../util/qemu-thread-posix.c:219
#4  0x0000562eb7be75e3 in vnc_worker_thread_loop (queue=0x562eba256820) at ../ui/vnc-jobs.c:248
#5  0x0000562eb7be82a8 in vnc_worker_thread (arg=arg@entry=0x562eba256820) at ../ui/vnc-jobs.c:361
#6  0x0000562eb8118a19 in qemu_thread_start (args=0x7f32b21ba3f0) at ../util/qemu-thread-posix.c:504
#7  0x00007f33cd92cea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8  0x00007f33cd84aa2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f33c11ad700 (LWP 1492004) "CPU 1/KVM"):
#0  0x00007f33cd8405f7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x0000562eb7f97817 in kvm_vcpu_ioctl (cpu=cpu@entry=0x562eba25bc50, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3089
#2  0x0000562eb7f97981 in kvm_cpu_exec (cpu=cpu@entry=0x562eba25bc50) at ../accel/kvm/kvm-all.c:2904
#3  0x0000562eb7f98fed in kvm_vcpu_thread_fn (arg=arg@entry=0x562eba25bc50) at ../accel/kvm/kvm-accel-ops.c:49
#4  0x0000562eb8118a19 in qemu_thread_start (args=0x7f33c11a83f0) at ../util/qemu-thread-posix.c:504
#5  0x00007f33cd92cea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f33cd84aa2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7f33c1bee700 (LWP 1492003) "CPU 0/KVM"):
#0  0x00007f33cd8405f7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x0000562eb7f97817 in kvm_vcpu_ioctl (cpu=cpu@entry=0x562eba224c00, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3089
#2  0x0000562eb7f97981 in kvm_cpu_exec (cpu=cpu@entry=0x562eba224c00) at ../accel/kvm/kvm-all.c:2904
#3  0x0000562eb7f98fed in kvm_vcpu_thread_fn (arg=arg@entry=0x562eba224c00) at ../accel/kvm/kvm-accel-ops.c:49
#4  0x0000562eb8118a19 in qemu_thread_start (args=0x7f33c1be93f0) at ../util/qemu-thread-posix.c:504
#5  0x00007f33cd92cea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f33cd84aa2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f33c23ef700 (LWP 1491768) "kvm"):
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f33cd770537 in __GI_abort () at abort.c:79
#2  0x00007f33cd7c9768 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f33cd8e73a5 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007f33cd7d0a5a in malloc_printerr (str=str@entry=0x7f33cd8e9790 "double free or corruption (out)") at malloc.c:5347
#4  0x00007f33cd7d2088 in _int_free (av=0x7f33cd91db80 <main_arena>, p=0x7f33b40327b0, have_lock=<optimized out>) at malloc.c:4314
#5  0x0000562eb80755da in iscsi_co_writev (bs=<optimized out>, sector_num=<optimized out>, nb_sectors=<optimized out>, iov=0x7f33b4003160, flags=<optimized out>) at ../block/iscsi.c:665
#6  0x0000562eb800cd75 in bdrv_driver_pwritev (bs=bs@entry=0x562eba148a40, offset=offset@entry=52565786624, bytes=bytes@entry=4096, qiov=qiov@entry=0x7f33b4003160, qiov_offset=qiov_offset@entry=0, flags=flags@entry=0) at ../block/io.c:1266
#7  0x0000562eb800d72e in bdrv_aligned_pwritev (child=child@entry=0x562eba13d660, req=req@entry=0x7f32a33f37a0, offset=52565786624, bytes=4096, align=align@entry=512, qiov=0x7f33b4003160, qiov_offset=0, flags=0) at ../block/io.c:2101
#8  0x0000562eb800e91a in bdrv_co_pwritev_part (child=0x562eba13d660, offset=<optimized out>, bytes=<optimized out>, bytes@entry=4096, qiov=<optimized out>, qiov@entry=0x7f33b4003160, qiov_offset=<optimized out>, qiov_offset@entry=0, flags=flags@entry=0) at ../block/io.c:2293
#9  0x0000562eb800ecfb in bdrv_co_pwritev (child=<optimized out>, offset=<optimized out>, bytes=bytes@entry=4096, qiov=qiov@entry=0x7f33b4003160, flags=flags@entry=0) at ../block/io.c:2210
#10 0x0000562eb8039ccd in raw_co_pwritev (bs=0x562eba141700, offset=<optimized out>, bytes=4096, qiov=<optimized out>, flags=0) at ../block/raw-format.c:269
#11 0x0000562eb800cccb in bdrv_driver_pwritev (bs=bs@entry=0x562eba141700, offset=offset@entry=52565786624, bytes=bytes@entry=4096, qiov=qiov@entry=0x7f33b4003160, qiov_offset=qiov_offset@entry=0, flags=flags@entry=0) at ../block/io.c:1233
#12 0x0000562eb800d72e in bdrv_aligned_pwritev (child=child@entry=0x562ebac25650, req=req@entry=0x7f32a33f3ae0, offset=52565786624, bytes=4096, align=align@entry=1, qiov=0x7f33b4003160, qiov_offset=0, flags=0) at ../block/io.c:2101
#13 0x0000562eb800e91a in bdrv_co_pwritev_part (child=0x562ebac25650, offset=<optimized out>, offset@entry=52565786624, bytes=<optimized out>, bytes@entry=4096, qiov=<optimized out>, qiov@entry=0x7f33b4003160, qiov_offset=<optimized out>, qiov_offset@entry=0, flags=flags@entry=0) at ../block/io.c:2293
#14 0x0000562eb800ecfb in bdrv_co_pwritev (child=<optimized out>, offset=offset@entry=52565786624, bytes=bytes@entry=4096, qiov=qiov@entry=0x7f33b4003160, flags=flags@entry=0) at ../block/io.c:2210
#15 0x0000562eb80121e1 in bdrv_mirror_top_do_write (bs=<optimized out>, method=MIRROR_METHOD_COPY, offset=52565786624, bytes=4096, qiov=0x7f33b4003160, flags=0) at ../block/mirror.c:1452
#16 0x0000562eb800cccb in bdrv_driver_pwritev (bs=bs@entry=0x562ebab3c9f0, offset=offset@entry=52565786624, bytes=bytes@entry=4096, qiov=qiov@entry=0x7f33b4003160, qiov_offset=qiov_offset@entry=0, flags=flags@entry=0) at ../block/io.c:1233
#17 0x0000562eb800d72e in bdrv_aligned_pwritev (child=child@entry=0x562eba152500, req=req@entry=0x7f32a33f3e10, offset=52565786624, bytes=4096, align=align@entry=1, qiov=0x7f33b4003160, qiov_offset=0, flags=0) at ../block/io.c:2101
#18 0x0000562eb800e91a in bdrv_co_pwritev_part (child=0x562eba152500, offset=<optimized out>, offset@entry=52565786624, bytes=<optimized out>, bytes@entry=4096, qiov=<optimized out>, qiov@entry=0x7f33b4003160, qiov_offset=<optimized out>, qiov_offset@entry=0, flags=flags@entry=0) at ../block/io.c:2293
#19 0x0000562eb7fff459 in blk_co_do_pwritev_part (blk=0x562eba13fbc0, offset=52565786624, bytes=4096, qiov=0x7f33b4003160, qiov_offset=qiov_offset@entry=0, flags=0) at ../block/block-backend.c:1388
#20 0x0000562eb7fff5ab in blk_aio_write_entry (opaque=0x7f33b402f0e0) at ../block/block-backend.c:1568
#21 0x0000562eb812a9eb in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:177
#22 0x00007f33cd79bd40 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#23 0x00007f33c23e9850 in ?? ()
#24 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f33c2cf1700 (LWP 1491767) "call_rcu"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x0000562eb8119bda in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /build/pve-qemu/pve-qemu-kvm-7.1.0/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x562eb8965608 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:429
#3  0x0000562eb812216a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
#4  0x0000562eb8118a19 in qemu_thread_start (args=0x7f33c2cec3f0) at ../util/qemu-thread-posix.c:504
#5  0x00007f33cd92cea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f33cd84aa2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f33c2e5c1c0 (LWP 1491766) "kvm"):
#0  0x00007f33cd83ee26 in __ppoll (fds=0x562eba108160, nfds=10, timeout=<optimized out>, timeout@entry=0x7fff4918b770, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x0000562eb8138291 in ppoll (__ss=0x0, __timeout=0x7fff4918b770, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=943799363) at ../util/qemu-timer.c:351
#3  0x0000562eb8134aa5 in os_host_main_loop_wait (timeout=943799363) at ../util/main-loop.c:315
#4  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:596
#5  0x0000562eb7d83861 in qemu_main_loop () at ../softmmu/runstate.c:734
#6  0x0000562eb7bc0edc in qemu_main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:38
#7  0x00007f33cd771d0a in __libc_start_main (main=0x562eb7bbc1e0 <main>, argc=69, argv=0x7fff4918b938, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff4918b928) at ../csu/libc-start.c:308
#8  0x0000562eb7bc0e0a in _start ()

Code:
Thread 3 "kvm" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fb5c5d52700 (LWP 12368)]
0x000055ba86c394f7 in aio_co_schedule (ctx=0x492c7f000003e7fa, co=0x55ba86c456ae <qemu_co_mutex_lock+46>) at ../util/async.c:586
586     ../util/async.c: No such file or directory.
(gdb) t a a bt

Thread 6 (Thread 0x7fb4b5bbf700 (LWP 12571) "vnc_worker"):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x55ba8a5e8b2c) at ../sysdeps/nptl/futex-internal.h:186
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55ba8a5e8b38, cond=0x55ba8a5e8b00) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x55ba8a5e8b00, mutex=mutex@entry=0x55ba8a5e8b38) at pthread_cond_wait.c:638
#3  0x000055ba86c2955b in qemu_cond_wait_impl (cond=0x55ba8a5e8b00, mutex=0x55ba8a5e8b38, file=0x55ba86cb4274 "../ui/vnc-jobs.c", line=248) at ../util/qemu-thread-posix.c:219
#4  0x000055ba866f75e3 in vnc_worker_thread_loop (queue=0x55ba8a5e8b00) at ../ui/vnc-jobs.c:248
#5  0x000055ba866f82a8 in vnc_worker_thread (arg=arg@entry=0x55ba8a5e8b00) at ../ui/vnc-jobs.c:361
#6  0x000055ba86c28a19 in qemu_thread_start (args=0x7fb4b5bba3f0) at ../util/qemu-thread-posix.c:504
#7  0x00007fb5d128fea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8  0x00007fb5d11ada2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7fb5c4b10700 (LWP 12569) "CPU 1/KVM"):
#0  0x00007fb5d11a35f7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x000055ba86aa7817 in kvm_vcpu_ioctl (cpu=cpu@entry=0x55ba892bcab0, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3089
#2  0x000055ba86aa7981 in kvm_cpu_exec (cpu=cpu@entry=0x55ba892bcab0) at ../accel/kvm/kvm-all.c:2904
#3  0x000055ba86aa8fed in kvm_vcpu_thread_fn (arg=arg@entry=0x55ba892bcab0) at ../accel/kvm/kvm-accel-ops.c:49
#4  0x000055ba86c28a19 in qemu_thread_start (args=0x7fb5c4b0b3f0) at ../util/qemu-thread-posix.c:504
#5  0x00007fb5d128fea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007fb5d11ada2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7fb5c5551700 (LWP 12568) "CPU 0/KVM"):
#0  0x00007fb5d11a35f7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x000055ba86aa7817 in kvm_vcpu_ioctl (cpu=cpu@entry=0x55ba89285be0, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3089
#2  0x000055ba86aa7981 in kvm_cpu_exec (cpu=cpu@entry=0x55ba89285be0) at ../accel/kvm/kvm-all.c:2904
#3  0x000055ba86aa8fed in kvm_vcpu_thread_fn (arg=arg@entry=0x55ba89285be0) at ../accel/kvm/kvm-accel-ops.c:49
#4  0x000055ba86c28a19 in qemu_thread_start (args=0x7fb5c554c3f0) at ../util/qemu-thread-posix.c:504
#5  0x00007fb5d128fea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007fb5d11ada2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fb5c5d52700 (LWP 12368) "kvm"):
#0  0x000055ba86c394f7 in aio_co_schedule (ctx=0x492c7f000003e7fa, co=0x55ba86c456ae <qemu_co_mutex_lock+46>) at ../util/async.c:586
#1  0x000055ba86c48d6d in timerlist_run_timers (timer_list=0x55ba8919dae0) at ../util/qemu-timer.c:576
#2  0x000055ba86c48e5f in timerlist_run_timers (timer_list=<optimized out>) at ../util/qemu-timer.c:509
#3  timerlistgroup_run_timers (tlg=tlg@entry=0x55ba8919d9c0) at ../util/qemu-timer.c:615
#4  0x000055ba86c259f2 in aio_poll (ctx=0x55ba8919d810, blocking=blocking@entry=true) at ../util/aio-posix.c:719
#5  0x000055ba86ae4d36 in iothread_run (opaque=opaque@entry=0x55ba88fdbc00) at ../iothread.c:67
#6  0x000055ba86c28a19 in qemu_thread_start (args=0x7fb5c5d4d3f0) at ../util/qemu-thread-posix.c:504
#7  0x00007fb5d128fea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8  0x00007fb5d11ada2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fb5c6654700 (LWP 12367) "call_rcu"):
#0  0x00007fb5d1174561 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7fb5c664f370, rem=0x7fb5c664f380) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x00007fb5d1179d43 in __GI___nanosleep (requested_time=<optimized out>, remaining=<optimized out>) at nanosleep.c:27
#2  0x00007fb5d2620b4f in g_usleep () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x000055ba86c32150 in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:253
#4  0x000055ba86c28a19 in qemu_thread_start (args=0x7fb5c664f3f0) at ../util/qemu-thread-posix.c:504
#5  0x00007fb5d128fea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007fb5d11ada2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fb5c67bf1c0 (LWP 12366) "kvm"):
#0  0x00007fb5d11a1e26 in __ppoll (fds=0x55ba89169160, nfds=10, timeout=<optimized out>, timeout@entry=0x7ffd8f054c60, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x000055ba86c48291 in ppoll (__ss=0x0, __timeout=0x7ffd8f054c60, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=252247333) at ../util/qemu-timer.c:351
#3  0x000055ba86c44aa5 in os_host_main_loop_wait (timeout=252247333) at ../util/main-loop.c:315
#4  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:596
#5  0x000055ba86893861 in qemu_main_loop () at ../softmmu/runstate.c:734
#6  0x000055ba866d0edc in qemu_main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:38
#7  0x00007fb5d10d4d0a in __libc_start_main (main=0x55ba866cc1e0 <main>, argc=66, argv=0x7ffd8f054e28, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffd8f054e18) at ../csu/libc-start.c:308
#8  0x000055ba866d0e0a in _start ()
 
Two more traces

Code:
Thread 3 "kvm" received signal SIGABRT, Aborted.
[Switching to Thread 0x7f86f9660700 (LWP 15397)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) t a a bt

Thread 6 (Thread 0x7f85e93bf700 (LWP 15616) "vnc_worker"):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x563dbf282ef8) at ../sysdeps/nptl/futex-internal.h:186
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x563dbf282f08, cond=0x563dbf282ed0) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x563dbf282ed0, mutex=mutex@entry=0x563dbf282f08) at pthread_cond_wait.c:638
#3  0x0000563dbcaf655b in qemu_cond_wait_impl (cond=0x563dbf282ed0, mutex=0x563dbf282f08, file=0x563dbcb81274 "../ui/vnc-jobs.c", line=248) at ../util/qemu-thread-posix.c:219
#4  0x0000563dbc5c45e3 in vnc_worker_thread_loop (queue=0x563dbf282ed0) at ../ui/vnc-jobs.c:248
#5  0x0000563dbc5c52a8 in vnc_worker_thread (arg=arg@entry=0x563dbf282ed0) at ../ui/vnc-jobs.c:361
#6  0x0000563dbcaf5a19 in qemu_thread_start (args=0x7f85e93ba3f0) at ../util/qemu-thread-posix.c:504
#7  0x00007f8704b9dea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8  0x00007f8704abba2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f85ebdff700 (LWP 15614) "CPU 1/KVM"):
#0  0x00007f8704ab15f7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x0000563dbc974817 in kvm_vcpu_ioctl (cpu=cpu@entry=0x563dbf24a830, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3089
#2  0x0000563dbc974981 in kvm_cpu_exec (cpu=cpu@entry=0x563dbf24a830) at ../accel/kvm/kvm-all.c:2904
#3  0x0000563dbc975fed in kvm_vcpu_thread_fn (arg=arg@entry=0x563dbf24a830) at ../accel/kvm/kvm-accel-ops.c:49
#4  0x0000563dbcaf5a19 in qemu_thread_start (args=0x7f85ebdfa3f0) at ../util/qemu-thread-posix.c:504
#5  0x00007f8704b9dea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f8704abba2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7f86f8e5f700 (LWP 15613) "CPU 0/KVM"):
#0  0x00007f8704ab15f7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x0000563dbc974817 in kvm_vcpu_ioctl (cpu=cpu@entry=0x563dbf213bf0, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3089
#2  0x0000563dbc974981 in kvm_cpu_exec (cpu=cpu@entry=0x563dbf213bf0) at ../accel/kvm/kvm-all.c:2904
#3  0x0000563dbc975fed in kvm_vcpu_thread_fn (arg=arg@entry=0x563dbf213bf0) at ../accel/kvm/kvm-accel-ops.c:49
#4  0x0000563dbcaf5a19 in qemu_thread_start (args=0x7f86f8e5a3f0) at ../util/qemu-thread-posix.c:504
#5  0x00007f8704b9dea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f8704abba2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f86f9660700 (LWP 15397) "kvm"):
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f87049e1537 in __GI_abort () at abort.c:79
#2  0x00007f8704a3a768 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f8704b583a5 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007f8704a41a5a in malloc_printerr (str=str@entry=0x7f8704b5a790 "double free or corruption (out)") at malloc.c:5347
#4  0x00007f8704a43088 in _int_free (av=0x7f8704b8eb80 <main_arena>, p=0x7f86ec05d070, have_lock=<optimized out>) at malloc.c:4314
#5  0x0000563dbca51f7d in iscsi_co_readv (bs=<optimized out>, sector_num=<optimized out>, nb_sectors=<optimized out>, iov=0x7f86ec026bd0) at ../block/iscsi.c:898
#6  0x0000563dbc9e8398 in bdrv_driver_preadv (bs=0x563dbf137a40, offset=53593186304, bytes=4096, qiov=0x7f86ec026bd0, qiov_offset=0, flags=0) at ../block/io.c:1190
#7  0x0000563dbc9ead30 in bdrv_aligned_preadv (req=req@entry=0x7f85d9fdf8c0, offset=53593186304, bytes=4096, align=<optimized out>, qiov=0x7f86ec026bd0, qiov_offset=0, flags=0, child=<optimized out>, child=<optimized out>) at ../block/io.c:1548
#8  0x0000563dbc9ec494 in bdrv_co_preadv_part (child=0x563dbf12c660, offset=<optimized out>, bytes=<optimized out>, qiov=<optimized out>, qiov_offset=<optimized out>, flags=0) at ../block/io.c:1825
#9  0x0000563dbc9e82a4 in bdrv_driver_preadv (bs=0x563dbf130700, offset=53593186304, bytes=4096, qiov=0x7f86ec026bd0, qiov_offset=0, flags=0) at ../block/io.c:1160
#10 0x0000563dbc9ead30 in bdrv_aligned_preadv (req=req@entry=0x7f85d9fdfb80, offset=53593186304, bytes=4096, align=<optimized out>, qiov=0x7f86ec026bd0, qiov_offset=0, flags=0, child=<optimized out>, child=<optimized out>) at ../block/io.c:1548
#11 0x0000563dbc9ec494 in bdrv_co_preadv_part (child=0x563dc0348ab0, offset=<optimized out>, bytes=<optimized out>, qiov=<optimized out>, qiov_offset=<optimized out>, flags=0) at ../block/io.c:1825
#12 0x0000563dbc9e82a4 in bdrv_driver_preadv (bs=0x563dbfe74970, offset=53593186304, bytes=4096, qiov=0x7f86ec026bd0, qiov_offset=0, flags=0) at ../block/io.c:1160
#13 0x0000563dbc9ead30 in bdrv_aligned_preadv (req=req@entry=0x7f85d9fdfe40, offset=53593186304, bytes=4096, align=<optimized out>, qiov=0x7f86ec026bd0, qiov_offset=0, flags=0, child=<optimized out>, child=<optimized out>) at ../block/io.c:1548
#14 0x0000563dbc9ec494 in bdrv_co_preadv_part (child=0x563dbf1414f0, offset=<optimized out>, offset@entry=53593186304, bytes=<optimized out>, bytes@entry=4096, qiov=<optimized out>, qiov@entry=0x7f86ec026bd0, qiov_offset=<optimized out>, qiov_offset@entry=0, flags=flags@entry=0) at ../block/io.c:1825
#15 0x0000563dbc9dc14d in blk_co_do_preadv_part (blk=0x563dbf12ebc0, offset=53593186304, bytes=4096, qiov=0x7f86ec026bd0, qiov_offset=qiov_offset@entry=0, flags=0) at ../block/block-backend.c:1311
#16 0x0000563dbc9dc2c6 in blk_aio_read_entry (opaque=0x7f86ec01d9d0) at ../block/block-backend.c:1556
#17 0x0000563dbcb079eb in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:177
#18 0x00007f8704a0cd40 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#19 0x00007f86f965a850 in ?? ()
#20 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f86f9f62700 (LWP 15396) "call_rcu"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x0000563dbcaf6bda in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /build/pve-qemu/pve-qemu-kvm-7.1.0/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x563dbd342608 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:429
#3  0x0000563dbcaff16a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
#4  0x0000563dbcaf5a19 in qemu_thread_start (args=0x7f86f9f5d3f0) at ../util/qemu-thread-posix.c:504
#5  0x00007f8704b9dea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f8704abba2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f86fa0cd1c0 (LWP 15395) "kvm"):
#0  0x00007f8704aafe26 in __ppoll (fds=0x563dbf0f7160, nfds=10, timeout=<optimized out>, timeout@entry=0x7ffc76f9e850, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x0000563dbcb15291 in ppoll (__ss=0x0, __timeout=0x7ffc76f9e850, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=342295742) at ../util/qemu-timer.c:351
#3  0x0000563dbcb11aa5 in os_host_main_loop_wait (timeout=342295742) at ../util/main-loop.c:315
#4  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:596
#5  0x0000563dbc760861 in qemu_main_loop () at ../softmmu/runstate.c:734
#6  0x0000563dbc59dedc in qemu_main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:38
#7  0x00007f87049e2d0a in __libc_start_main (main=0x563dbc5991e0 <main>, argc=66, argv=0x7ffc76f9ea18, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc76f9ea08) at ../csu/libc-start.c:308
#8  0x0000563dbc59de0a in _start ()


Code:
Thread 3 "kvm" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fda6d239700 (LWP 17491)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) t a a bt

Thread 6 (Thread 0x7fd95d1bf700 (LWP 17695) "vnc_worker"):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x562225883788) at ../sysdeps/nptl/futex-internal.h:186
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x562225883798, cond=0x562225883760) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x562225883760, mutex=mutex@entry=0x562225883798) at pthread_cond_wait.c:638
#3  0x000056222334355b in qemu_cond_wait_impl (cond=0x562225883760, mutex=0x562225883798, file=0x5622233ce274 "../ui/vnc-jobs.c", line=248) at ../util/qemu-thread-posix.c:219
#4  0x0000562222e115e3 in vnc_worker_thread_loop (queue=0x562225883760) at ../ui/vnc-jobs.c:248
#5  0x0000562222e122a8 in vnc_worker_thread (arg=arg@entry=0x562225883760) at ../ui/vnc-jobs.c:361
#6  0x0000562223342a19 in qemu_thread_start (args=0x7fd95d1ba3f0) at ../util/qemu-thread-posix.c:504
#7  0x00007fda78776ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8  0x00007fda78694a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7fd95fbff700 (LWP 17689) "CPU 1/KVM"):
#0  0x00007fda7868a5f7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x00005622231c1817 in kvm_vcpu_ioctl (cpu=cpu@entry=0x562224d53c50, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3089
#2  0x00005622231c1981 in kvm_cpu_exec (cpu=cpu@entry=0x562224d53c50) at ../accel/kvm/kvm-all.c:2904
#3  0x00005622231c2fed in kvm_vcpu_thread_fn (arg=arg@entry=0x562224d53c50) at ../accel/kvm/kvm-accel-ops.c:49
#4  0x0000562223342a19 in qemu_thread_start (args=0x7fd95fbfa3f0) at ../util/qemu-thread-posix.c:504
#5  0x00007fda78776ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007fda78694a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7fda6ca38700 (LWP 17688) "CPU 0/KVM"):
#0  0x00007fda7868a5f7 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x00005622231c1817 in kvm_vcpu_ioctl (cpu=cpu@entry=0x562224d1cc00, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3089
#2  0x00005622231c1981 in kvm_cpu_exec (cpu=cpu@entry=0x562224d1cc00) at ../accel/kvm/kvm-all.c:2904
#3  0x00005622231c2fed in kvm_vcpu_thread_fn (arg=arg@entry=0x562224d1cc00) at ../accel/kvm/kvm-accel-ops.c:49
#4  0x0000562223342a19 in qemu_thread_start (args=0x7fda6ca333f0) at ../util/qemu-thread-posix.c:504
#5  0x00007fda78776ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007fda78694a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fda6d239700 (LWP 17491) "kvm"):
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fda785ba537 in __GI_abort () at abort.c:79
#2  0x00007fda785ba40f in __assert_fail_base (fmt=0x7fda787326a8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x562223536256 "acb->refcnt > 0", file=0x562223536246 "../util/aiocb.c", line=51, function=<optimized out>) at assert.c:92
#3  0x00007fda785c9662 in __GI___assert_fail (assertion=assertion@entry=0x562223536256 "acb->refcnt > 0", file=file@entry=0x562223536246 "../util/aiocb.c", line=line@entry=51, function=function@entry=0x562223536268 <__PRETTY_FUNCTION__.0> "qemu_aio_unref") at assert.c:101
#4  0x0000562223352825 in qemu_aio_unref (p=<optimized out>) at ../util/aiocb.c:51
#5  0x00005622233549eb in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:177
#6  0x00007fda785e5d40 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007fda6d233850 in ?? ()
#8  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7fda6db3b700 (LWP 17490) "call_rcu"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x0000562223343bda in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /build/pve-qemu/pve-qemu-kvm-7.1.0/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x562223b8f608 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:429
#3  0x000056222334c16a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
#4  0x0000562223342a19 in qemu_thread_start (args=0x7fda6db363f0) at ../util/qemu-thread-posix.c:504
#5  0x00007fda78776ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007fda78694a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fda6dca61c0 (LWP 17489) "kvm"):
#0  0x00007fda78688e26 in __ppoll (fds=0x562224c00160, nfds=10, timeout=<optimized out>, timeout@entry=0x7ffcd67035b0, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x0000562223362291 in ppoll (__ss=0x0, __timeout=0x7ffcd67035b0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=1370211276) at ../util/qemu-timer.c:351
#3  0x000056222335eaa5 in os_host_main_loop_wait (timeout=1370211276) at ../util/main-loop.c:315
#4  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:596
#5  0x0000562222fad861 in qemu_main_loop () at ../softmmu/runstate.c:734
#6  0x0000562222deaedc in qemu_main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:38
#7  0x00007fda785bbd0a in __libc_start_main (main=0x562222de61e0 <main>, argc=66, argv=0x7ffcd6703778, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffcd6703768) at ../csu/libc-start.c:308
#8  0x0000562222deae0a in _start ()
(gdb)
 
I sent a patch, and it will most likely be included in the upcoming QEMU 7.2 package for PVE. It should land on the pvetest repository soonish (at most a week if no other issues pop up), but keep in mind that it'll be a testing version, although we do try our best when testing internally.

Unfortunately, the commit that introduced the issue is pretty old, so you can't simply downgrade to an earlier version to work around it.

The fix mentions that the issue triggers when the status is BUSY or similar, so bandwidth limits might help. If you don't want to risk running the testing version, you could try setting bandwidth limits for the backup jobs and the storage:
Code:
pvesh set /cluster/backup/backup-<ID> --bwlimit <speed in KiB/s>
pvesm set <storage> --bwlimit default=<speed in KiB/s>
To enforce a limit for the guest itself as well, you can configure it on the disk in the VM's Hardware tab.
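If you prefer the CLI, the same per-disk limit can be set with qm set by appending bandwidth options to the drive string. This is only a sketch: the volume name is taken from the backup log above and the values are just examples. Also note that qm set replaces the whole drive definition, so copy the existing line from the VM config and append the options to it:
Code:
# cap reads and writes on scsi0 of VM 401 to 50 MB/s (example values)
qm set 401 --scsi0 slc-storage-01-ssd:vm-401-disk-0,mbps_rd=50,mbps_wr=50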
 
In the meantime, until the new release is out and we can test it, we will be using NFS for the VM disks, which runs without issues, just with lower performance. I'll keep an eye on any updates.
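For reference, this is roughly how we're moving the disks over to the NFS storage (just a sketch; "nfs-storage" is a placeholder for the actual storage ID):
Code:
# move scsi0 of VM 401 to an NFS-backed storage and delete the old volume afterwards
# safer with the VM shut down, since the crash also triggers when moving a disk under I/O load
qm move-disk 401 scsi0 nfs-storage --delete 1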

Thanks again!
 
I just tested the new QEMU version (7.2) from the pvetest repository. This seems to fix the issue!
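For anyone else who wants to try it before it reaches the regular repositories, this is roughly what I did to pull in the testing package (sketch for PVE 7.x on Debian Bullseye; only do this on a test or non-critical node):
Code:
# enable the pvetest repository
echo "deb http://download.proxmox.com/debian/pve bullseye pvetest" > /etc/apt/sources.list.d/pvetest.list
apt update
apt install pve-qemu-kvm
# verify the installed version; running VMs only pick up the new binary after a stop/start or migration
pveversion -v | grep pve-qemu-kvm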

Thank you so much for the help! I'll mark this thread as solved!
 
