5.15.64-1-pve kernel breaks scheduler? unable to ssh, start emu, or even reboot

wrobelda · Nov 7, 2022

A regular update today rendered my node completely broken. I could not ssh, although I could telnet to port 22 just fine, i.e. the main sshd process was running fine but the spawned subprocesses were stalling. I could not reboot, since all the processes were stalled. The VMs also wouldn't start, although the containers did. I could reach the web interface over HTTP and open a console from it.

Looking at the dmesg/syslog, processes simply hung indefinitely:

Code:

[  242.862135] INFO: task kworker/8:1:165 blocked for more than 120 seconds.
[  242.862156]       Tainted: P           O      5.15.64-1-pve #1
[  242.862169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.862185] task:kworker/8:1     state:D stack:    0 pid:  165 ppid:     2 flags:0x00004000
[  242.862205] Workqueue: ipv6_addrconf addrconf_dad_work
[  242.862220] Call Trace:
[  242.862227]  <TASK>
[  242.862233]  __schedule+0x34e/0x1740
[  242.862244]  ? update_load_avg+0x82/0x640
[  242.862256]  schedule+0x69/0x110
[  242.862264]  schedule_preempt_disabled+0xe/0x20
[  242.862275]  __mutex_lock.constprop.0+0x255/0x480
[  242.862287]  __mutex_lock_slowpath+0x13/0x20
[  242.862609]  mutex_lock+0x38/0x50
[  242.862932]  rtnl_lock+0x15/0x20
[  242.863248]  addrconf_dad_work+0x39/0x4d0
[  242.863557]  process_one_work+0x228/0x3d0
[  242.863861]  worker_thread+0x53/0x420
[  242.864160]  ? process_one_work+0x3d0/0x3d0
[  242.864457]  kthread+0x127/0x150
[  242.864756]  ? set_kthread_struct+0x50/0x50
[  242.865063]  ret_from_fork+0x1f/0x30
[  242.865361]  </TASK>
[  242.865657] INFO: task kworker/9:3:984 blocked for more than 120 seconds.
[  242.865954]       Tainted: P           O      5.15.64-1-pve #1
[  242.866268] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.866549] task:kworker/9:3     state:D stack:    0 pid:  984 ppid:     2 flags:0x00004000
[  242.866831] Workqueue: events switchdev_deferred_process_work
[  242.867119] Call Trace:
[  242.867401]  <TASK>
[  242.867681]  __schedule+0x34e/0x1740
[  242.867969]  ? dequeue_entity+0xd8/0x490
[  242.868253]  schedule+0x69/0x110
[  242.868525]  schedule_preempt_disabled+0xe/0x20
[  242.868792]  __mutex_lock.constprop.0+0x255/0x480
[  242.869056]  ? add_timer_on+0x115/0x180
[  242.869312]  __mutex_lock_slowpath+0x13/0x20
[  242.869564]  mutex_lock+0x38/0x50
[  242.869839]  rtnl_lock+0x15/0x20
[  242.870105]  switchdev_deferred_process_work+0xe/0x20
[  242.870339]  process_one_work+0x228/0x3d0
[  242.870567]  worker_thread+0x53/0x420
[  242.870792]  ? process_one_work+0x3d0/0x3d0
[  242.871021]  kthread+0x127/0x150
[  242.871248]  ? set_kthread_struct+0x50/0x50
[  242.871478]  ret_from_fork+0x1f/0x30
[  242.871707]  </TASK>
[  242.871940] INFO: task task UPID:proxm:1587 blocked for more than 120 seconds.
[  242.872180]       Tainted: P           O      5.15.64-1-pve #1
[  242.872423] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.872673] task:task UPID:proxm state:D stack:    0 pid: 1587 ppid:  1586 flags:0x00004000
[  242.872937] Call Trace:
[  242.873195]  <TASK>
[  242.873450]  __schedule+0x34e/0x1740
[  242.873706]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[  242.873970]  schedule+0x69/0x110
[  242.874260]  schedule_preempt_disabled+0xe/0x20
[  242.874524]  __mutex_lock.constprop.0+0x255/0x480
[  242.874789]  __mutex_lock_slowpath+0x13/0x20
[  242.875054]  mutex_lock+0x38/0x50
[  242.875310]  rtnl_lock+0x15/0x20
[  242.875559]  unregister_netdev+0x13/0x30
[  242.875804]  igbvf_remove+0x50/0x100 [igbvf]
[  242.876042]  pci_device_remove+0x3b/0xb0
[  242.876273]  __device_release_driver+0x1a8/0x2a0
[  242.876503]  device_release_driver+0x29/0x40
[  242.876737]  pci_stop_bus_device+0x74/0xa0
[  242.876976]  pci_stop_and_remove_bus_device+0x13/0x30
[  242.877213]  pci_iov_remove_virtfn+0xc5/0x130
[  242.877452]  sriov_disable+0x3a/0xf0
[  242.877687]  pci_disable_sriov+0x26/0x30
[  242.877956]  igb_disable_sriov+0x64/0x110 [igb]
[  242.878223]  igb_remove+0xca/0x210 [igb]
[  242.878456]  pci_device_remove+0x3b/0xb0
[  242.878686]  __device_release_driver+0x1a8/0x2a0
[  242.878919]  device_driver_detach+0x56/0xe0
[  242.879151]  unbind_store+0x12a/0x140
[  242.879384]  drv_attr_store+0x21/0x40
[  242.879614]  sysfs_kf_write+0x3c/0x50
[  242.879844]  kernfs_fop_write_iter+0x13c/0x1d0
[  242.880072]  new_sync_write+0x111/0x1b0
[  242.880300]  vfs_write+0x1d9/0x270
[  242.880529]  ksys_write+0x67/0xf0
[  242.880757]  __x64_sys_write+0x1a/0x20
[  242.880980]  do_syscall_64+0x59/0xc0
[  242.881197]  ? __x64_sys_newfstat+0x16/0x20
[  242.881406]  ? do_syscall_64+0x69/0xc0
[  242.881608]  ? do_syscall_64+0x69/0xc0
[  242.881833]  ? exc_page_fault+0x89/0x170
[  242.882057]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[  242.882256] RIP: 0033:0x7fdec53defb3
[  242.882451] RSP: 002b:00007ffd1e77f198 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  242.882657] RAX: ffffffffffffffda RBX: 000055b45d966c30 RCX: 00007fdec53defb3
[  242.882866] RDX: 000000000000000c RSI: 000055b45d966c30 RDI: 000000000000000e
[  242.883075] RBP: 000000000000000c R08: 0000000000000000 R09: 000055b4561343b0
[  242.883287] R10: 000055b45d955b38 R11: 0000000000000246 R12: 000055b45d965910
[  242.883499] R13: 000055b4570b72a0 R14: 000000000000000e R15: 000055b45d965910
[  242.883710]  </TASK>
[  242.883916] INFO: task sshd:1751 blocked for more than 120 seconds.
[  242.884132]       Tainted: P           O      5.15.64-1-pve #1
[  242.884353] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.884581] task:sshd            state:D stack:    0 pid: 1751 ppid:  1348 flags:0x00000000
[  242.884820] Call Trace:
[  242.885055]  <TASK>
[  242.885282]  __schedule+0x34e/0x1740
[  242.885511]  ? kmem_cache_free+0x24d/0x290
[  242.885743]  ? spl_kmem_cache_free+0x145/0x200 [spl]
[  242.886051]  schedule+0x69/0x110
[  242.886286]  schedule_preempt_disabled+0xe/0x20
[  242.886521]  __mutex_lock.constprop.0+0x255/0x480
[  242.886759]  ? rtnl_create_link+0x330/0x330
[  242.886994]  __mutex_lock_slowpath+0x13/0x20
[  242.887232]  mutex_lock+0x38/0x50
[  242.887465]  __netlink_dump_start+0xc7/0x300
[  242.887704]  ? rtnl_create_link+0x330/0x330
[  242.887942]  rtnetlink_rcv_msg+0x2b8/0x410
[  242.888179]  ? kernel_init_free_pages.part.0+0x4a/0x70
[  242.888422]  ? rtnl_create_link+0x330/0x330
[  242.888663]  ? rtnl_calcit.isra.0+0x130/0x130
[  242.888904]  netlink_rcv_skb+0x53/0x100
[  242.889146]  rtnetlink_rcv+0x15/0x20
[  242.889391]  netlink_unicast+0x224/0x340
[  242.889633]  netlink_sendmsg+0x23e/0x4a0
[  242.889875]  sock_sendmsg+0x66/0x70
[  242.890181]  __sys_sendto+0x113/0x190
[  242.890424]  ? syscall_exit_to_user_mode+0x27/0x50
[  242.890668]  ? __x64_sys_bind+0x1a/0x30
[  242.890913]  ? do_syscall_64+0x69/0xc0
[  242.891150]  ? exit_to_user_mode_prepare+0x37/0x1b0
[  242.891384]  __x64_sys_sendto+0x29/0x40
[  242.891615]  do_syscall_64+0x59/0xc0
[  242.891850]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[  242.892088] RIP: 0033:0x7f98367dafa6
[  242.892322] RSP: 002b:00007ffc18ce7808 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  242.892567] RAX: ffffffffffffffda RBX: 00007ffc18ce8900 RCX: 00007f98367dafa6
[  242.892812] RDX: 0000000000000014 RSI: 00007ffc18ce8900 RDI: 0000000000000004
[  242.893055] RBP: 00007ffc18ce8950 R08: 00007ffc18ce88a4 R09: 000000000000000c
[  242.893290] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc18ce88a4
[  242.893520] R13: 00000000000006d7 R14: 0000000000000004 R15: 00007ffc18ce8b60
[  242.893761]  </TASK>
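
For anyone triaging a similar log, it can help to see at a glance which tasks are blocked and what each one was calling when it went to sleep on the mutex. A minimal sketch, assuming the dmesg output is piped in on stdin; the function name `summarize_hung_tasks` is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: for every hung task the kernel reports, print the
# function that called mutex_lock(), i.e. the trace line right after it.
# Usage (assumption): dmesg | summarize_hung_tasks
summarize_hung_tasks() {
  awk '
    /INFO: task .* blocked for more than/ {
      task = $0
      sub(/^.*INFO: task /, "", task)            # drop timestamp and prefix
      sub(/ blocked for more than.*$/, "", task) # drop the trailing message
      want = 0
      next
    }
    / mutex_lock\+/ { want = 1; next }           # plain mutex_lock frame only
    want && !/\] +\? / {                         # skip unreliable "?" frames
      fn = $0
      sub(/^.*\] +/, "", fn)                     # drop "[ timestamp] "
      sub(/\+0x.*$/, "", fn)                     # drop "+0xoff/0xlen"
      print task ": " fn
      want = 0
    }
  '
}
```

Fed the trace above, this should report `rtnl_lock` for both kworkers and for the `task UPID:proxm` process, and `__netlink_dump_start` for sshd.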

Pinning the previous kernel, 5.15.60-2-pve, brought everything back to normal.
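
One way to do the pinning, sketched under the assumption that the installed `proxmox-boot-tool` already provides the `kernel pin` subcommand. The wrapper name `pin_kernel` and the `APPLY` switch are made up for this example; by default it only prints the command instead of running it:

```shell
#!/usr/bin/env bash
# Sketch: pin a known-good kernel so it keeps being selected at boot.
# Assumes a proxmox-boot-tool version that supports "kernel pin".
# Prints the command by default; set APPLY=1 to really run it (as root).
pin_kernel() {
  local version="$1"
  if [ -z "$version" ]; then
    echo "usage: pin_kernel <kernel-version>" >&2
    return 2
  fi
  if [ "${APPLY:-0}" = "1" ]; then
    proxmox-boot-tool kernel pin "$version"
  else
    echo "proxmox-boot-tool kernel pin $version"
  fi
}

pin_kernel 5.15.60-2-pve
```

`proxmox-boot-tool kernel list` shows the installed and pinned kernels, and `proxmox-boot-tool kernel unpin` removes the pin again once a fixed kernel is available.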

P.S. Additionally, this happened:

Code:

USAGE: qm stop <vmid> [OPTIONS]

  Stop virtual machine. The qemu process will exit immediately. Thisis akin
  to pulling the power plug of a running computer and may damage the VM
  data

(...)

  -skiplock  <boolean>

             Ignore locks - only root is allowed to use this option.

(...)

root@proxmox:~# qm stop 112 -skiplock
trying to acquire lock...
can't lock file '/var/lock/qemu-server/lock-112.conf' - got timeout
root@proxmox:~# qm stop 112 -skiplock true
trying to acquire lock...
can't lock file '/var/lock/qemu-server/lock-112.conf' - received interrupt
root@proxmox:~# qm stop 112 -skiplock 1
trying to acquire lock...
can't lock file '/var/lock/qemu-server/lock-112.conf' - got timeout

How is one supposed to use the skiplock option?

wrobelda · Nov 8, 2022

Anyone? This is a serious issue, isn't it?

mira · Nov 10, 2022

Have you tried it again with kernel 5.15.64-1-pve to see if it is reproducible?

wrobelda · Nov 10, 2022

Yes, just did, same behavior.

mira · Nov 10, 2022

Do you have the latest BIOS installed?
Can you provide some more information about your system? CPU, Mainboard, any RAID controllers?
Which NICs are you using?

What's the output of pveversion -v?

If you boot into the latest Ubuntu 22.04 live CD, do you see similar issues?
The PVE kernel 5.15 is based on the one in Ubuntu 22.04 with a few patches on top, so if your host is also affected when using Ubuntu, this would indicate that some change in the Ubuntu kernel is the culprit, which would help us narrow it down.

wrobelda · Nov 16, 2022

mira said:
Do you have the latest BIOS installed?

Yes

mira said:
Can you provide some more information about your system? CPU, Mainboard, any RAID controllers?

Lenovo p340 Tiny system
Intel(R) Core(TM) i5-10500T CPU @ 2.30GHz
No RAID, just 1 x NVMe disk

mira said:
Which NICs are you using?

1 x on-board Intel I219-LM
4 x Intel I350 (via I350-T4 PCIe)

mira said:
What's the output of pveversion -v?

Code:

root@proxmox:~# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.60-2-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-5.15: 7.2-13
pve-kernel-helper: 7.2-13
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-4
libpve-http-server-perl: 4.1-4
libpve-storage-perl: 7.2-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-3
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-6
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1

mira said:
If you boot into the latest Ubuntu 22.04 live CD, do you see similar issues?

It's a headless system at this point, would be a hassle to try that. I could try with Intel AMT, but can't commit to any timeframes.

mira said:
The PVE kernel 5.15 is based on the one in Ubuntu 22.04 with a few patches on top, so if your host is also affected when using Ubuntu, this would indicate that some change in the Ubuntu kernel is the culprit, which would help us narrow it down.

That makes sense. If anything, were there any intermediate releases on Ubuntu side between 5.15.60-2-pve and 5.15.64-1-pve? I could potentially git-bisect the kernel to find the culprit. Haven't done this before, but there appears to be a succinct write-up on that: https://wiki.ubuntu.com/Kernel/KernelBisection

EDIT: can you confirm that 5.15.60-2-pve is d893aeed506b40cdae043b2cfad8c25535286176 and 5.15.64-1-pve is 671ee57b9dfc6d1baf207dab57de63968df26107? I base this on https://github.com/proxmox/pve-kernel/commits/pve-kernel-5.15. If so, your kernel repo won't be of much help when it comes to bisecting.

EDIT: just tested 5.15.74-1-pve, same issues

mira · Nov 16, 2022

Yes, those are the commits in our repository. The kernel itself can be found in the submodule `ubuntu-jammy`, and it should match the corresponding commit of the pve-kernel repo.

There are commits that reference the tags in the ubuntu-jammy submodule, for example `update sources to Ubuntu-5.15.0-50.56`, so the tag `Ubuntu-5.15.0-50.56` should be the right one for 5.15.60-2-pve and the tag `Ubuntu-5.15.0-51.57` for 5.15.64-1-pve.
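
As a rough sketch of how that mapping could be used, the snippet below turns two PVE kernel versions into the git command that lists the Ubuntu commits between them. Only the two versions named in this post are mapped; the function names are made up, and the printed command is meant to be run inside the kernel submodule checkout:

```shell
#!/usr/bin/env bash
# Mapping taken from this thread; other versions are deliberately not guessed.
ubuntu_tag_for() {
  case "$1" in
    5.15.60-2-pve) echo "Ubuntu-5.15.0-50.56" ;;
    5.15.64-1-pve) echo "Ubuntu-5.15.0-51.57" ;;
    *) echo "unknown PVE kernel: $1" >&2; return 1 ;;
  esac
}

# Print (not run) the git command listing commits between two PVE kernels.
commit_range_cmd() {
  local good bad
  good="$(ubuntu_tag_for "$1")" || return 1
  bad="$(ubuntu_tag_for "$2")" || return 1
  echo "git log --oneline ${good}..${bad}"
}

commit_range_cmd 5.15.60-2-pve 5.15.64-1-pve
```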

wrobelda · Nov 16, 2022

Thanks, I should be able to take it from here. The annoying part will be the patches you provide, which will inevitably break while progressing from 5.15.0-50.56 to 5.15.0-51.57; the latter updates them in the "update patches for Ubuntu-5.15.0-54.60" commit. I will let you know about my findings.

mira · Nov 16, 2022

If you don't require backups/restores and containers during the testing period, you could simply not apply any of those patches.
If you can't reproduce it without those, you can still try again with them applied. If you reproduce the issue without those, we know that it's one of the changes in the Ubuntu kernel.

Thanks for doing this!
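
For planning purposes: every bisection step means one kernel build plus one reboot, so it may be worth estimating the number of steps up front. A rough sketch, assuming the range is halved on each step; the helper name `bisect_steps` is made up:

```shell
#!/usr/bin/env bash
# Estimate the number of builds a bisection needs: ceil(log2(n)) for n commits.
bisect_steps() {
  local n="$1" steps=0
  while [ $((1 << steps)) -lt "$n" ]; do
    steps=$((steps + 1))
  done
  echo "$steps"
}

# 500 is a placeholder, not the real number of commits between the tags.
# Inside the kernel submodule the real count could come from:
#   git rev-list --count Ubuntu-5.15.0-50.56..Ubuntu-5.15.0-51.57
bisect_steps 500   # prints 9
```

The bisection itself would then be started in the submodule with `git bisect start Ubuntu-5.15.0-51.57 Ubuntu-5.15.0-50.56` (bad tag first, then good), marking each tested build with `git bisect good` or `git bisect bad`.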

wrobelda · Nov 16, 2022

BTW. Your zfs submodule is inaccessible on GitHub:

Code:

test@proxmox:~/pve-kernel$ git submodule update
Cloning into '/home/test/pve-kernel/submodules/ubuntu-kinetic'...

Cloning into '/home/test/pve-kernel/submodules/zfsonlinux'...
Username for 'https://github.com': Password for 'https://github.com':

test@proxmox:~/pve-kernel$ git submodule
+ab2e786e8b1e6690c98424277abe512970850bd6 submodules/ubuntu-kinetic (Ubuntu-5.15.0-27.28)
-186fde725e63731d0c32913b5fe13dd562f15940 submodules/zfsonlinux

Looks like you didn't push it to GitHub, as the submodule links to a 404 there: https://github.com/proxmox/zfsonlinux/tree/186fde725e63731d0c32913b5fe13dd562f15940

Everything is OK using your own https://git.proxmox.com, though.

EDIT: the https://github.com/proxmox/mirror_ubuntu-jammy-kernel/ is inaccessible, too.

wrobelda · Nov 16, 2022

mira said:
If you don't require backups/restores and containers during the testing period, you could simply not apply any of those patches.
If you can't reproduce it without those, you can still try again with them applied. If you reproduce the issue without those, we know that it's one of the changes in the Ubuntu kernel.

Thanks for the heads up. That simplifies things greatly.

mira said:
Thanks for doing this!

Happy to help!

wrobelda · Nov 16, 2022

Immediately ran into this:

Code:

dpkg-source: info: building zfs-linux using existing ./zfs-linux_2.1.6.orig.tar.gz
dpkg-source: info: using patch list from debian/patches/series
can't find file to patch at input line 16
Perhaps you used the wrong -p or --strip option?

Your script doesn't initialize the zfsonlinux submodule's own submodule, and the README doesn't mention that either. This solved it:

Code:

test@proxmox:~/pve-kernel/submodules/zfsonlinux$ git submodule init
Submodule 'zfs/upstream' (git://git.proxmox.com/git/mirror_zfs) registered for path 'upstream'
test@proxmox:~/pve-kernel/submodules/zfsonlinux$ git submodule update
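
To spot this kind of problem before starting a build, the output of `git submodule status --recursive` can be scanned for entries with a leading `-`, which git uses to mark submodules that are not initialized. A small sketch with a made-up helper name:

```shell
#!/usr/bin/env bash
# Print the paths of submodules that git reports as uninitialized
# (status lines starting with "-"). Reads the status output on stdin.
# Usage (assumption): git submodule status --recursive | uninitialized_submodules
uninitialized_submodules() {
  awk '/^-/ { print $2 }'
}
```

Running `git submodule update --init --recursive` from the top of the repository initializes nested submodules in one go, which should make the manual step above unnecessary.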

wrobelda · Nov 18, 2022

Any idea why this happens?

Code:

diff -up -N fwlist-previous.sorted fwlist-5.15.60-1-pve.sorted > fwlist.diff
make[2]: *** [debian/rules:278: fwcheck] Error 1
make[2]: Leaving directory '/home/test/pve-kernel/build'
make[1]: *** [debian/rules:110: binary] Error 2
make[1]: Leaving directory '/home/test/pve-kernel/build'

fwlist.diff seems fine to me:

Code:

test@proxmox:~/pve-kernel$ cat ./build/fwlist.diff
--- fwlist-previous.sorted    2022-11-17 17:56:26.213661130 +0100
+++ fwlist-5.15.60-1-pve.sorted    2022-11-17 17:56:26.217661240 +0100
@@ -1767,7 +1767,6 @@ rtl_nic/rtl8402-1.fw kernel/drivers/net/
 rtl_nic/rtl8411-1.fw kernel/drivers/net/ethernet/realtek/r8169.ko
 rtl_nic/rtl8411-2.fw kernel/drivers/net/ethernet/realtek/r8169.ko
 rtlwifi/rtl8188efw.bin kernel/drivers/net/wireless/realtek/rtlwifi/rtl8188ee/rtl8188ee.ko
-rtlwifi/rtl8188eufw.bin kernel/drivers/staging/r8188eu/r8188eu.ko
 rtlwifi/rtl8192cfw.bin kernel/drivers/net/wireless/realtek/rtlwifi/rtl8192ce/rtl8192ce.ko
 rtlwifi/rtl8192cfwU_B.bin kernel/drivers/net/wireless/realtek/rtlwifi/rtl8192ce/rtl8192ce.ko
 rtlwifi/rtl8192cfwU.bin kernel/drivers/net/wireless/realtek/rtlwifi/rtl8192ce/rtl8192ce.ko
test@proxmox:~/pve-kernel$ nano debian/rules
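
A possible explanation, judging only from the recipe line in the log above (the contents of `debian/rules` are not shown here): `diff` exits with status 1 whenever its inputs differ, and make treats any non-zero status as a failed step. So a non-empty fwlist.diff would be enough to stop the build, even if the diff itself looks harmless. A small sketch of that behaviour, with a made-up helper name `diff_status`:

```shell
#!/usr/bin/env bash
# Report diff's exit status: 0 = identical, 1 = different, 2 = trouble.
diff_status() {
  local status=0
  diff -u -N "$1" "$2" > /dev/null || status=$?
  echo "$status"
}

# Throwaway files standing in for the two sorted firmware lists.
tmp="$(mktemp -d)"
printf 'a.fw\nb.fw\n' > "$tmp/previous.sorted"
printf 'a.fw\n'       > "$tmp/current.sorted"

diff_status "$tmp/previous.sorted" "$tmp/current.sorted"   # prints 1
rm -rf "$tmp"
```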

EDIT: I commented the code out for now, and moving further I get:

Code:

input file 'debian/pve-headers-5.15.60-1-pve/usr/src/linux-headers-5.15.60-1-pve/Module.symvers' does not exist

That's because build/debian/pve-headers-5.15.60-1-pve/usr/src/ is non-existent.

It's getting impossible to build. The discrepancy between the kernel version expected by the scripts in your repo and the Ubuntu kernel submodule, which I manually checked out to a different tag, is too significant. Your bespoke scripting may work well for you when preparing PVE releases, but it proves very problematic in a basic git-bisect scenario like this.

wrobelda · Nov 18, 2022

OK, I give up. I managed to build everything, only having to disable abicheck, and am now left with:

Code:

debian/rules fwcheck
make[2]: Entering directory '/home/test/pve-kernel/build'
make[2]: warning: jobserver unavailable: using -j1.  Add '+' to parent make rule.
checking fwlist for changes since last built firmware package..
if this check fails, add fwlist-5.15.60-2-pve to the pve-firmware repository and upload a new firmware package together with the 5.15.60-2-pve kernel
sort fwlist-previous | uniq > fwlist-previous.sorted
sort fwlist-5.15.60-2-pve | uniq > fwlist-5.15.60-2-pve.sorted
diff -up -N fwlist-previous.sorted fwlist-5.15.60-2-pve.sorted > fwlist.diff
rm fwlist.diff fwlist-previous.sorted fwlist-5.15.60-2-pve.sorted
done, no need to rebuild pve-firmware
make[2]: Leaving directory '/home/test/pve-kernel/build'
dh_strip -Npve-headers-5.15.60-2-pve -Npve-kernel-libc-dev
dh_makeshlibs
dh_shlibdeps
dh_installdeb
dh_gencontrol
dpkg-gencontrol: warning: Depends field of package linux-tools-5.15: substitution variable ${shlibs:Depends} used, but is not defined
dh_md5sums
dh_builddeb
dpkg-deb: building package 'linux-tools-5.15' in '../linux-tools-5.15_5.15.60-2_amd64.deb'.
dpkg-deb: building package 'pve-headers-5.15.60-2-pve' in '../pve-headers-5.15.60-2-pve_5.15.60-2_amd64.deb'.
dpkg-deb: building package 'pve-kernel-5.15.60-2-pve' in '../pve-kernel-5.15.60-2-pve_5.15.60-2_amd64.deb'.
dpkg-deb: building package 'pve-kernel-libc-dev' in '../pve-kernel-libc-dev_5.15.60-2_amd64.deb'.
make[1]: Leaving directory '/home/test/pve-kernel/build'

However, the .deb packages are somehow crippled at ~10-15 kB in size:

Code:

-rw-r--r-- 1 test test   15032 Nov 18 17:02 pve-headers-5.15.60-2-pve_5.15.60-2_amd64.deb
-rw-r--r-- 1 test test   12944 Nov 18 17:02 pve-kernel_5.15.60-2_amd64.buildinfo
-rw-r--r-- 1 test test    2296 Nov 18 17:02 pve-kernel_5.15.60-2_amd64.changes
-rw-r--r-- 1 test test   15152 Nov 18 17:02 pve-kernel-5.15.60-2-pve_5.15.60-2_amd64.deb
-rw-r--r-- 1 test test   14428 Nov 18 17:02 pve-kernel-libc-dev_5.15.60-2_amd64.deb
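
A quick way to catch this right after a build is to flag packages that are far too small to hold a kernel. A sketch with a made-up helper name; the 1 MiB default threshold is an arbitrary assumption:

```shell
#!/usr/bin/env bash
# List .deb files directly inside a directory that are smaller than a
# byte threshold. The 1 MiB default is an arbitrary assumption.
small_debs() {
  local dir="${1:-.}" max_bytes="${2:-1048576}"
  find "$dir" -maxdepth 1 -type f -name '*.deb' -size "-${max_bytes}c"
}

small_debs .
```

`dpkg-deb -c <package>.deb` then lists what actually ended up inside a flagged package.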

wrobelda · Dec 7, 2022

FYI, tried with 5.19.17 and it has the same issue.

yolkfull · Jun 6, 2023

I also hit the same issue with 5.15.0. Any suggestions would be highly appreciated.

Code:

[448192.537507] Call Trace:
[448192.566588]  <TASK>
[448192.594854]  __schedule+0x2d8/0x890
[448192.624646]  ? id_idle_cpu+0x1a6/0x1e0
[448192.654768]  schedule+0x4e/0xb0
[448192.684164]  schedule_preempt_disabled+0xe/0x10
[448192.715003]  __mutex_lock.isra.0+0x208/0x470
[448192.745450]  ? netlink_lookup+0x13e/0x1c0
[448192.775592]  ? rtnl_bridge_notify+0xf0/0xf0
[448192.805495]  __mutex_lock_slowpath+0x13/0x20
[448192.834834]  mutex_lock+0x32/0x40
[448192.862551]  __netlink_dump_start+0x7c/0x290
[448192.890819]  ? rtnl_bridge_notify+0xf0/0xf0
[448192.918162]  rtnetlink_rcv_msg+0x2b4/0x410
[448192.944849]  ? rtnl_bridge_notify+0xf0/0xf0
[448192.971118]  ? rtnl_calcit.isra.0+0x130/0x130
[448192.996883]  netlink_rcv_skb+0x55/0x100
[448193.021454]  rtnetlink_rcv+0x15/0x20
[448193.044997]  netlink_unicast+0x1a8/0x250
[448193.068399]  netlink_sendmsg+0x23f/0x4a0
[448193.090918]  sock_sendmsg+0x65/0x70
[448193.112867]  __sys_sendto+0x113/0x190
[448193.134621]  ? fput+0x13/0x20
[448193.154873]  ? __sys_bind+0x8b/0x110
[448193.175317]  ? __x64_sys_socket+0x1a/0x20
[448193.195824]  __x64_sys_sendto+0x29/0x30
[448193.215370]  do_syscall_64+0x5c/0xc0
[448193.234032]  ? syscall_exit_to_user_mode+0x27/0x50
[448193.253683]  ? __x64_sys_bind+0x1a/0x20
[448193.271844]  ? do_syscall_64+0x69/0xc0
[448193.290028]  ? irqentry_exit_to_user_mode+0x9/0x20
[448193.309723]  ? irqentry_exit+0x19/0x30
[448193.328277]  ? exc_page_fault+0x89/0x160
[448193.346943]  ? asm_exc_page_fault+0x8/0x30
[448193.365982]  entry_SYSCALL_64_after_hwframe+0x44/0xae

wrobelda · Jun 6, 2023

For the record, I no longer have that issue. I changed my configuration slightly, though, from Intel i350 to Mellanox ConnectX3. I haven't changed it back to confirm it was indeed the Intel NIC causing this, but I should be able to do that within the next few days.

yolkfull · Jun 6, 2023

Hi wrobelda, do you mean you changed the driver from i350 to mlx3? Let me try that as well. Thanks for the info.

Regards,

wrobelda · Jun 6, 2023

yolkfull said:
Hi wrobelda, do you mean you changed the driver from i350 to mlx3? Let me try that as well. Thanks for the info.

Regards,

No, I changed the hardware in my system, which naturally also implies a different set of kernel modules (drivers) is used. If you also use i350, then it could actually be the culprit.
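
To check which driver each of your NICs is bound to, sysfs can be read directly. A minimal sketch, assuming the usual `/sys/class/net/<iface>/device/driver` symlink layout; the function name `nic_drivers` and the optional root argument (handy for testing against a fake tree) are made up:

```shell
#!/usr/bin/env bash
# Print "<interface> <driver>" for every network interface that is backed
# by a device driver. The optional argument overrides the sysfs root.
nic_drivers() {
  local root="${1:-/sys}" dev link
  for dev in "$root"/class/net/*; do
    link="$dev/device/driver"
    [ -L "$link" ] || continue   # virtual interfaces such as lo have no driver link
    echo "$(basename "$dev") $(basename "$(readlink "$link")")"
  done
}

nic_drivers
```

The trace in the first post shows the `igb` and `igbvf` modules, so those are the driver names to look for in the output.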

yolkfull · Jun 7, 2023

wrobelda said:
No, I changed the hardware in my system, which naturally also implies a different set of kernel modules (drivers) is used. If you also use i350, then it could actually be the culprit.

Wow, got it, really appreciate it.
