Network-related kernel panics in FreeBSD guest after PVE 8 upgrade

cmeWNQAm

Feb 20, 2024
I have a PVE setup with a TrueNAS (FreeBSD-based) file server as a KVM guest. It is configured with a separate virtio network interface used for an NFS share between it and the Proxmox host, as well as one for the local network. The NFS share provides storage for some other containers and serves as the destination for a daily backup job I have set up. (Boot orders are set correctly, and it's not backing up TrueNAS to itself...)

After doing an in-place upgrade from PVE 7.4 to 8.1.4 a couple of days ago using apt, I have noticed that the server blows up whenever my daily backup job starts. A couple of backups complete successfully, but then the I/O delay shown in the web UI goes up to 70-80%, and various parts of the UI go blank or show question marks. During this time, other guests seem unaffected, including containers and another FreeBSD guest (pfSense). TrueNAS, however, becomes essentially unresponsive, and the SMB shares it provides on the local network are unavailable.

If I leave it for long enough, TrueNAS kernel panics and reboots, and the rest of the backup seems to eventually finish, albeit quite slowly and with a couple failing again. During this time I was trying to kill the vzdump backup processes, though they seemed unresponsive even to kill -9s for a while.

TrueNAS /var/log/messages
Code:
Feb 20 02:34:43 truenas spin lock 0xffffffff81f85e08 (sleepq chain) held by 0xfffffe00cac9dac0 (tid 101989) too long
Feb 20 02:34:43 truenas spin lock 0xffffffff81f58300 (callout) held by 0xfffffe00dd8c6ac0 (tid 101990) too long
Feb 20 02:34:43 truenas spin lock 0xffffffff81f38fa0 (cnputs_mtx) held by 0xfffffe00107f5c80 (tid 100065) too long
Feb 20 02:34:43 truenas panic: spin lock held too long
Feb 20 02:34:43 truenas cpuid = 6
Feb 20 02:34:43 truenas time = 1708405941
Feb 20 02:34:43 truenas KDB: stack backtrace:
Feb 20 02:34:43 truenas db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe008ec48350
Feb 20 02:34:43 truenas vpanic() at vpanic+0x17f/frame 0xfffffe008ec483a0
Feb 20 02:34:43 truenas panic() at panic+0x43/frame 0xfffffe008ec48400
Feb 20 02:34:43 truenas _mtx_lock_indefinite_check() at _mtx_lock_indefinite_check+0x68/frame 0xfffffe008ec48410
Feb 20 02:34:43 truenas _mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xd5/frame 0xfffffe008ec48480
Feb 20 02:34:43 truenas cnputsn() at cnputsn+0xd8/frame 0xfffffe008ec484c0
Feb 20 02:34:43 truenas putchar() at putchar+0x14a/frame 0xfffffe008ec48550
Feb 20 02:34:43 truenas kvprintf() at kvprintf+0xf5/frame 0xfffffe008ec48670
Feb 20 02:34:43 truenas vprintf() at vprintf+0x82/frame 0xfffffe008ec48750
Feb 20 02:34:43 truenas printf() at printf+0x43/frame 0xfffffe008ec487b0
Feb 20 02:34:43 truenas _mtx_lock_indefinite_check() at _mtx_lock_indefinite_check+0x5a/frame 0xfffffe008ec487c0
Feb 20 02:34:43 truenas _mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xd5/frame 0xfffffe008ec48830
Feb 20 02:34:43 truenas callout_lock() at callout_lock+0xb0/frame 0xfffffe008ec48860
Feb 20 02:34:43 truenas _callout_stop_safe() at _callout_stop_safe+0xc9/frame 0xfffffe008ec488c0
Feb 20 02:34:43 truenas sleepq_remove_thread() at sleepq_remove_thread+0xb7/frame 0xfffffe008ec488e0
Feb 20 02:34:43 truenas sleepq_resume_thread() at sleepq_resume_thread+0x49/frame 0xfffffe008ec48920
Feb 20 02:34:43 truenas sleepq_broadcast() at sleepq_broadcast+0x74/frame 0xfffffe008ec48950
Feb 20 02:34:43 truenas cv_broadcastpri() at cv_broadcastpri+0x41/frame 0xfffffe008ec48970
Feb 20 02:34:43 truenas doselwakeup() at doselwakeup+0xa9/frame 0xfffffe008ec489c0
Feb 20 02:34:43 truenas sowakeup() at sowakeup+0x1e/frame 0xfffffe008ec489f0
Feb 20 02:34:43 truenas udp_append() at udp_append+0x254/frame 0xfffffe008ec48a70
Feb 20 02:34:43 truenas udp_input() at udp_input+0x733/frame 0xfffffe008ec48b50
Feb 20 02:34:43 truenas ip_input() at ip_input+0x11f/frame 0xfffffe008ec48be0
Feb 20 02:34:43 truenas netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe008ec48c30
Feb 20 02:34:43 truenas ether_demux() at ether_demux+0x138/frame 0xfffffe008ec48c60
Feb 20 02:34:43 truenas ether_nh_input() at ether_nh_input+0x355/frame 0xfffffe008ec48cc0
Feb 20 02:34:43 truenas netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe008ec48d10
Feb 20 02:34:43 truenas ether_input() at ether_input+0x69/frame 0xfffffe008ec48d70
Feb 20 02:34:43 truenas vtnet_rxq_eof() at vtnet_rxq_eof+0x73e/frame 0xfffffe008ec48e30
Feb 20 02:34:43 truenas vtnet_rx_vq_process() at vtnet_rx_vq_process+0x67/frame 0xfffffe008ec48e60
Feb 20 02:34:43 truenas ithread_loop() at ithread_loop+0x25a/frame 0xfffffe008ec48ef0
Feb 20 02:34:43 truenas fork_exit() at fork_exit+0x7e/frame 0xfffffe008ec48f30
Feb 20 02:34:43 truenas fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe008ec48f30
Feb 20 02:34:43 truenas --- trap 0x80b7b3a0, rip = 0xffffffff80aa32ef, rsp = 0, rbp = 0x32ec000 ---
Feb 20 02:34:43 truenas mi_startup() at mi_startup+0xdf/frame 0x32ec000
Feb 20 02:34:43 truenas KDB: enter: panic
Feb 20 02:34:43 truenas ---<<BOOT>>---
Feb 20 02:34:43 truenas Copyright (c) 1992-2021 The FreeBSD Project.
Feb 20 02:34:43 truenas Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Feb 20 02:34:43 truenas     The Regents of the University of California. All rights reserved.
Feb 20 02:34:43 truenas FreeBSD is a registered trademark of The FreeBSD Foundation.
Feb 20 02:34:43 truenas FreeBSD 13.1-RELEASE-p9 n245429-296d095698e TRUENAS amd64
normal boot from here...
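
To make the backtrace above easier to read, here is a minimal sketch that strips it down to the function names, innermost frame first. The function name `backtrace_functions` is made up for illustration, and the sketch assumes the frame lines look exactly like the ones pasted here ("name() at name+0x.../frame 0x...").

```shell
#!/bin/sh
# Print the call chain (innermost frame first) from a FreeBSD KDB backtrace
# as it appears in /var/log/messages. Reads log text on stdin.
# "backtrace_functions" is a hypothetical helper name, not a FreeBSD tool.
backtrace_functions() {
    # Keep only lines of the form "... name() at name+0xOFF/frame 0xADDR"
    # and print just the function name.
    sed -n 's/^.* \([A-Za-z_][A-Za-z0-9_]*\)() at .*$/\1/p'
}

# Usage (path as named in the post):
#   backtrace_functions < /var/log/messages
```

Read bottom-up, the chain in this log starts in the virtio network receive path (vtnet_rx_vq_process), passes through udp_input, and ends spinning on the callout lock before the panic.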

Proxmox /var/log/syslog at the same time (192.168.101.2 is the TrueNAS server's NFS interface IP address)
Attached is a more complete version covering the period from when the backup job started until NFS failed.
Code:
2024-02-20T02:33:35.940721-08:00 kernel: [82911.936038] I/O error, dev loop1, sector 12648880 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 2
2024-02-20T02:33:35.940737-08:00 kernel: [82911.936076] nfs: server 192.168.101.2 not responding, timed out
2024-02-20T02:33:35.940739-08:00 kernel: [82911.936089] I/O error, dev loop0, sector 16592952 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
2024-02-20T02:33:35.950991-08:00 kernel: [82911.940751] I/O error, dev loop0, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 2
2024-02-20T02:33:35.951006-08:00 kernel: [82911.946997] EXT4-fs warning (device loop1): htree_dirblock_to_tree:1082: inode #393240: lblock 0: comm qmgr: error -5 reading directory block
2024-02-20T02:33:35.960610-08:00 systemd[1]: session-141.scope: Deactivated successfully.
2024-02-20T02:33:35.960691-08:00 systemd[1]: session-141.scope: Consumed 2.542s CPU time.
2024-02-20T02:33:39.586794-08:00 pvestatd[1511]: storage 'TrueNAS' is not online
2024-02-20T02:33:39.637345-08:00 pvestatd[1511]: status update time (19331.242 seconds)
2024-02-20T02:33:45.729615-08:00 pvestatd[1511]: storage 'TrueNAS' is not online
2024-02-20T02:33:45.913988-08:00 pvestatd[1511]: status update time (6.276 seconds)
2024-02-20T02:33:54.945730-08:00 pvestatd[1511]: storage 'TrueNAS' is not online
2024-02-20T02:33:54.987696-08:00 pvestatd[1511]: status update time (5.073 seconds)
2024-02-20T02:34:04.161739-08:00 pvestatd[1511]: storage 'TrueNAS' is not online
2024-02-20T02:34:13.377724-08:00 pvestatd[1511]: storage 'TrueNAS' is not online
2024-02-20T02:34:25.665816-08:00 pvestatd[1511]: storage 'TrueNAS' is not online
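
To line the host-side events up against the guest panic, it can help to filter the syslog down to just the NFS client timeouts and the pvestatd "not online" reports. This is only a sketch: the helper name `nfs_outage_events` is hypothetical, and the patterns are taken from the exact message wording in the excerpt above.

```shell
#!/bin/sh
# Filter syslog text on stdin down to the two kinds of lines that mark the
# NFS storage going away: kernel NFS client timeouts and pvestatd reports.
# "nfs_outage_events" is a hypothetical helper name.
nfs_outage_events() {
    grep -E "nfs: server [0-9.]+ not responding|storage '[^']+' is not online"
}

# Usage (log path as named in the post):
#   nfs_outage_events < /var/log/syslog | head
```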

Proxmox version
Code:
# pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-10
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
pve-kernel-5.15.136-1-pve: 5.15.136-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.1
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.5-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1

My interpretation is that once there is significant traffic over NFS, something sometimes breaks within the network stack on Proxmox/Linux/KVM, causing FreeBSD to panic somewhere in its networking code. I don't really know where to start troubleshooting this, since the Linux syslog gives no indication of what might be going wrong other than the NFS errors, which I assume are a symptom of the original problem.
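
One possible starting point, offered only as a guess: the FreeBSD backtrace passes through udp_input, so it may be worth checking on the Proxmox host which NFS version and transport the mount is actually using. The UDP traffic in the trace could just as well be unrelated to NFS. The sketch below reads the standard Linux /proc/mounts; the helper name `nfs_transport` is made up for illustration.

```shell
#!/bin/sh
# Print server export, mount point, NFS version and transport protocol for
# every NFS mount listed in a mounts file (default: /proc/mounts).
# "nfs_transport" is a hypothetical helper name.
nfs_transport() {
    awk '$3 ~ /^nfs4?$/ {
        vers = "?"; proto = "?"
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) {
            # Anchored matches, so "mountproto=" is not mistaken for "proto=".
            if (opts[i] ~ /^vers=/)  vers  = substr(opts[i], 6)
            if (opts[i] ~ /^proto=/) proto = substr(opts[i], 7)
        }
        print $1, $2, "vers=" vers, "proto=" proto
    }' "${1:-/proc/mounts}"
}

# Usage on the Proxmox host:
#   nfs_transport
```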

I've seen some people talking about downgrading the kernel version for various reasons. Might that fix this?
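
For reference, a rough sketch of how booting an older kernel could be tried. The pveversion output above shows that 5.15 kernels from PVE 7 are still installed, and proxmox-boot-tool has a `kernel pin` subcommand. Whether an older kernel avoids the panic is untested here, and this assumes the host's boot setup is one that proxmox-boot-tool manages. The helper name `installed_kernels` is hypothetical.

```shell
#!/bin/sh
# List the full kernel versions found in "pveversion -v" output (stdin),
# e.g. "5.15.136-1-pve". Meta-packages such as "pve-kernel-5.15" are skipped.
# "installed_kernels" is a hypothetical helper name.
installed_kernels() {
    sed -n -E 's/^(pve|proxmox)-kernel-([0-9]+\.[0-9]+\.[0-9]+-[0-9]+-pve).*$/\2/p'
}

# Usage, as root on the Proxmox host (not run by this snippet):
#   pveversion -v | installed_kernels
#   proxmox-boot-tool kernel list
#   proxmox-boot-tool kernel pin 5.15.136-1-pve
#   reboot
#   proxmox-boot-tool kernel unpin    # to go back to the default kernel
```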
 
Hello, do you use TrueNAS 13? I also have some problems with NFS and PVE 8, see https://forum.proxmox.com/threads/vms-hung-after-backup.137286/page-3#post-637026
Yep, the latest version of TrueNAS Core. Looks like you're running into something pretty similar?

I haven't found a solution yet other than turning off my backup jobs :rolleyes:

With those disabled it's been fine for a little over a week now, so it definitely seems to be related to NFS. I do have a couple of other VMs with their boot disks stored on the same NFS share, and those have not been disabled. To me this looks like some kind of deadlock or race condition that is triggered when a lot of data is being transferred over the network bridge.
 
I see the new `iothread` patch from Fiona that was discussed in that other thread has been released in the meantime, so I will try that. I was getting problems with IO on the host as well, so I am not convinced it will fix it, but we'll see (maybe the IO being borked in TrueNAS led to IO being borked on the host through NFS). I'll re-enable the backup jobs and keep an eye on it to see if it blows up again when they run tonight.
 
I just upgraded my system to pve-qemu-kvm 8.1.5-3, which has the iothread patch discussed in the other thread:

Code:
# pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.13-1-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-10
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
proxmox-kernel-6.5: 6.5.13-1
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
pve-kernel-5.15.136-1-pve: 5.15.136-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.1
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-4
pve-firewall: 5.0.3
pve-firmware: 3.9-2
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.5-3
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve2

I just manually ran the backup job, and the exact same problem happened. One VM and one container backed up to the NFS share, and now it's stuck trying to back up the second container. The TrueNAS VM appears to be freezing and is becoming nonfunctional and the Proxmox web UI is deteriorating as I'm typing this.

I have tried suspending the TrueNAS VM (which was a suggested workaround to the iothread issue), which doesn't work:
Code:
# qm suspend 102
VM 102 qmp command 'stop' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries

I have also tried the suggested GDB command, which simply hangs forever and does nothing.
Code:
# gdb --batch --ex 't a a bt' -p $(cat /var/run/qemu-server/102.pid)

iotop shows 0B/s disk IO, apart from a few kB/s spikes presumably to the local storage.
 
This time recovery was particularly brutal. With the entire system grinding to a halt, I had to kill -9 all the backup jobs. Even when running reboot over SSH, the containers failed to stop, and after waiting for 10 minutes, the VMs didn't stop either. kill -9ing the QEMU and KVM processes didn't work. reboot -f didn't work. I had to use the sysrq method to force reboot the system.
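
A likely explanation for kill -9 having no effect, stated as an assumption rather than something confirmed from this system: processes blocked in uninterruptible sleep (state D, for example waiting on a hung NFS mount) do not handle signals until the kernel call they are stuck in returns. The sketch below filters `ps` output down to those processes. The helper name `d_state` is hypothetical.

```shell
#!/bin/sh
# Keep the header line plus every process whose STAT column starts with "D"
# (uninterruptible sleep). Expects "ps -eo pid,stat,wchan:32,comm" style
# input on stdin, where STAT is the second column.
# "d_state" is a hypothetical helper name.
d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage on the Proxmox host:
#   ps -eo pid,stat,wchan:32,comm | d_state
```

The WCHAN column shows the kernel function each process is waiting in, which can hint at whether the stuck vzdump or QEMU processes are waiting on NFS.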
 
As I didn't upgrade the pool, I was able to step back to TrueNAS 12.0-U8.1, and the problem is almost «fixed»: I run the backup jobs, but most of the VMs are not usable while the backup is running. The backup itself proceeds at a reasonable speed (1 to 2 minutes for 1 Gb), and everything recovers at the end of the backup.
Since the beginning I have been trying to determine whether the culprit is TrueNAS or PVE, and the more I search (in the fog), the more I tell myself that it is a combination of the two!
One thing is sure: it is related to NFS, and probably to the Linux (on the PVE host) and FreeBSD (on the TrueNAS host) versions, and perhaps the QEMU version?
 
I had no problems for months with TrueNAS 13, but it started failing immediately upon upgrade to PVE 8, so I'm inclined to blame it on PVE. It seems odd that a bug in a VM could bring the host system to its knees. But if reverting TrueNAS helps, then it's certainly plausible that something TrueNAS 13 is doing triggers a bug in Linux.

Also, it sounds like when your setup fails, it kind of temporarily slows down, but then recovers? In my case both systems' IO basically stops working and as far as I can tell it's unrecoverable except with a forced reboot.
 
