Kernel BUG: CPU soft lockup (VM/host freezes)

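Hmm, I've read my syslog and I think it's the filesystem in my case. Look at the call trace: the kvm task is stuck in rwsem_down_write_slowpath, reached through xfs_ilock in the XFS write path, so it definitely looks filesystem-related. So do I need to change my filesystem, or what? I'll try it out.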
Code:
Jun 01 07:41:27 lucy kernel: watchdog: BUG: soft lockup - CPU#5 stuck for 2094s! [kvm:2438]
Jun 01 07:41:27 lucy kernel: Modules linked in: tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw ip6table_filter ip6_tables nf_tables iptable_raw xt_multiport iptable_filter xt_MASQUERADE xt_nat xt_tcpudp iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter bonding tls softdog nfnetlink_log nfnetlink xfs ast drm_vram_helper drm_ttm_helper intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd ttm kvm_amd drm_kms_helper cec kvm rc_core irqbypass fb_sys_fops syscopyarea k10temp crct10dif_pclmul sysfillrect ccp sysimgblt ghash_clmulni_intel aesni_intel mac_hid crypto_simd cryptd wmi_bmof pcspkr rapl zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc drm ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c simplefb crc32_pclmul xhci_pci xhci_pci_renesas i2c_piix4 igb
Jun 01 07:41:27 lucy kernel:  i2c_algo_bit dca ahci xhci_hcd libahci wmi gpio_amdpt gpio_generic
Jun 01 07:41:27 lucy kernel: CPU: 5 PID: 2438 Comm: kvm Tainted: P           O L    5.15.35-1-pve #1
Jun 01 07:41:27 lucy kernel: Hardware name: Hetzner /B565D4-V1L, BIOS L0.23 02/23/2022
Jun 01 07:41:27 lucy kernel: RIP: 0010:rwsem_down_write_slowpath+0x1d2/0x4d0
Jun 01 07:41:27 lucy kernel: Code: 03 00 00 48 83 c4 60 4c 89 e8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 c6 45 c0 01 4c 89 e7 c6 07 00 0f 1f 40 00 fb 66 0f 1f 44 00 00 <45> 85 f6 74 1e 48 8b 03 a9 00 00 02 00 75 07 48 8b 03 a8 04 74 0d
Jun 01 07:41:27 lucy kernel: RSP: 0018:ffffb40cc7c83848 EFLAGS: 00000283
Jun 01 07:41:27 lucy kernel: RAX: 0000000000000006 RBX: ffff8f4e55b96300 RCX: ffffb40cc7ad3c38
Jun 01 07:41:27 lucy kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8f4e4b51f10c
Jun 01 07:41:27 lucy kernel: RBP: ffffb40cc7c838d0 R08: 0000000000000000 R09: ffff8f4e55b96300
Jun 01 07:41:27 lucy kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8f4e4b51f10c
Jun 01 07:41:27 lucy kernel: R13: ffff8f4e4b51f0f8 R14: 0000000000000000 R15: ffffb40cc7c83868
Jun 01 07:41:27 lucy kernel: FS:  00007f215cc0f1c0(0000) GS:ffff8f5d3eb40000(0000) knlGS:0000000000000000
Jun 01 07:41:27 lucy kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 01 07:41:27 lucy kernel: CR2: 00007f30f1121000 CR3: 0000000119316000 CR4: 0000000000350ee0
Jun 01 07:41:27 lucy kernel: Call Trace:
Jun 01 07:41:27 lucy kernel:  <TASK>
Jun 01 07:41:27 lucy kernel:  down_write+0x43/0x50
Jun 01 07:41:27 lucy kernel:  xfs_ilock+0x70/0xf0 [xfs]
Jun 01 07:41:27 lucy kernel:  xfs_vn_update_time+0xc9/0x1d0 [xfs]
Jun 01 07:41:27 lucy kernel:  file_update_time+0xea/0x140
Jun 01 07:41:27 lucy kernel:  file_modified+0x27/0x30
Jun 01 07:41:27 lucy kernel:  xfs_file_write_checks+0x244/0x2c0 [xfs]
Jun 01 07:41:27 lucy kernel:  xfs_file_dio_write_aligned+0x67/0x130 [xfs]
Jun 01 07:41:27 lucy kernel:  xfs_file_write_iter+0x10d/0x1b0 [xfs]
Jun 01 07:41:27 lucy kernel:  ? security_file_permission+0x2f/0x60
Jun 01 07:41:27 lucy kernel:  io_write+0xfe/0x320
Jun 01 07:41:27 lucy kernel:  io_issue_sqe+0x3e9/0x1fb0
Jun 01 07:41:27 lucy kernel:  ? __pollwait+0xd0/0xd0
Jun 01 07:41:27 lucy kernel:  ? __pollwait+0xd0/0xd0
Jun 01 07:41:27 lucy kernel:  __io_queue_sqe+0x35/0x310
Jun 01 07:41:27 lucy kernel:  ? fget+0x2a/0x30
Jun 01 07:41:27 lucy kernel:  io_submit_sqes+0xfb5/0x1b50
Jun 01 07:41:27 lucy kernel:  ? __pollwait+0xd0/0xd0
Jun 01 07:41:27 lucy kernel:  ? __fget_files+0x86/0xc0
Jun 01 07:41:27 lucy kernel:  __do_sys_io_uring_enter+0x520/0x9a0
Jun 01 07:41:27 lucy kernel:  ? __do_sys_io_uring_enter+0x520/0x9a0
Jun 01 07:41:27 lucy kernel:  __x64_sys_io_uring_enter+0x29/0x30
Jun 01 07:41:27 lucy kernel:  do_syscall_64+0x5c/0xc0
Jun 01 07:41:27 lucy kernel:  ? exit_to_user_mode_prepare+0x37/0x1b0
Jun 01 07:41:27 lucy kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jun 01 07:41:27 lucy kernel:  ? __x64_sys_read+0x1a/0x20
Jun 01 07:41:27 lucy kernel:  ? do_syscall_64+0x69/0xc0
Jun 01 07:41:27 lucy kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jun 01 07:41:27 lucy kernel:  ? do_syscall_64+0x69/0xc0
Jun 01 07:41:27 lucy kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jun 01 07:41:27 lucy kernel:  ? __x64_sys_write+0x1a/0x20
Jun 01 07:41:27 lucy kernel:  ? do_syscall_64+0x69/0xc0
Jun 01 07:41:27 lucy kernel:  ? do_syscall_64+0x69/0xc0
Jun 01 07:41:27 lucy kernel:  ? do_syscall_64+0x69/0xc0
Jun 01 07:41:27 lucy kernel:  ? do_syscall_64+0x69/0xc0
Jun 01 07:41:27 lucy kernel:  ? asm_common_interrupt+0x8/0x40
Jun 01 07:41:27 lucy kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Jun 01 07:41:27 lucy kernel: RIP: 0033:0x7f21675e29b9
Jun 01 07:41:27 lucy kernel: Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a7 54 0c 00 f7 d8 64 89 01 48
Jun 01 07:41:27 lucy kernel: RSP: 002b:00007ffccdab09f8 EFLAGS: 00000212 ORIG_RAX: 00000000000001aa
Jun 01 07:41:27 lucy kernel: RAX: ffffffffffffffda RBX: 00007f1b3b0b0640 RCX: 00007f21675e29b9
Jun 01 07:41:27 lucy kernel: RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000011
Jun 01 07:41:27 lucy kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008
Jun 01 07:41:27 lucy kernel: R10: 0000000000000000 R11: 0000000000000212 R12: 0000559b09eb4e68
Jun 01 07:41:27 lucy kernel: R13: 0000559b09eb4f20 R14: 0000559b09eb4e60 R15: 0000000000000001
Jun 01 07:41:27 lucy kernel:  </TASK>
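
For anyone comparing their own logs against the trace above, here is a minimal Python sketch that pulls the soft lockup header and counts which kernel modules show up in the call trace. The function name `summarize_lockups` and the regular expressions are my own assumptions based on the log format in this post, not an official tool. Only frames from loadable modules carry a `[module]` tag, and a module appearing in the trace only shows where the task was waiting, not necessarily the root cause.

```python
import re
from collections import Counter

# Header line, e.g. "BUG: soft lockup - CPU#5 stuck for 2094s! [kvm:2438]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)
# Call-trace frame that names a module, e.g. "xfs_ilock+0x70/0xf0 [xfs]"
FRAME_RE = re.compile(
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(?P<module>\w+)\]"
)

def summarize_lockups(lines):
    """Return (lockup headers, Counter of modules seen inside call traces)."""
    lockups = []
    modules = Counter()
    in_trace = False
    for line in lines:
        hit = LOCKUP_RE.search(line)
        if hit:
            lockups.append({
                "cpu": int(hit["cpu"]),
                "seconds": int(hit["secs"]),
                "comm": hit["comm"],
                "pid": int(hit["pid"]),
            })
        elif "Call Trace:" in line:
            in_trace = True
        elif "</TASK>" in line:
            in_trace = False
        elif in_trace:
            frame = FRAME_RE.search(line)
            if frame:
                modules[frame["module"]] += 1
    return lockups, modules

# A few lines copied from the log above, used as sample input.
sample = [
    "Jun 01 07:41:27 lucy kernel: watchdog: BUG: soft lockup - CPU#5 stuck for 2094s! [kvm:2438]",
    "Jun 01 07:41:27 lucy kernel: Call Trace:",
    "Jun 01 07:41:27 lucy kernel:  <TASK>",
    "Jun 01 07:41:27 lucy kernel:  down_write+0x43/0x50",
    "Jun 01 07:41:27 lucy kernel:  xfs_ilock+0x70/0xf0 [xfs]",
    "Jun 01 07:41:27 lucy kernel:  xfs_vn_update_time+0xc9/0x1d0 [xfs]",
    "Jun 01 07:41:27 lucy kernel:  </TASK>",
]
lockups, modules = summarize_lockups(sample)
print(lockups)
print(modules.most_common())
```

To run it on a real log, pass the lines of a saved syslog or journal export instead of `sample`.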

What does your syslog look like when you hit the bug?
@tribumx Please take a look here https://forum.proxmox.com/threads/live-migration-auf-7-2-4-cpu-100-freeze.109815/page-2#post-476150
 
Same issue here. I have downgraded to 5.13.19-6-pve.

I also have two nested VMs (PVE/PBS) running with the CPU type set to host. Should I downgrade the kernels of these VMs too?
 
I just updated my cluster to 5.19 and have moved 6 or so VMs around. I haven't had one lock up yet. Let's hope this fixed it.
 
I have upgraded to pve-kernel-5.19.7-1-pve.
One thing to be aware of:

Live-migrating a guest from a node on pve-kernel-5.15 to a node on 5.19 still hangs,
so guests need to be shut down before migrating.
Once all nodes are upgraded to 5.19,
live migration works.
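
A rough way to spot the mixed-kernel situation described above before starting a live migration is to compare the `uname -r` output of each node. The sketch below is only an illustration: the node names and the helper names `kernel_series` and `group_nodes_by_series` are made up, and collecting the release strings from the nodes is left out.

```python
import re

def kernel_series(release):
    """Return (major, minor) from a `uname -r` style string such as '5.19.7-1-pve'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))

def group_nodes_by_series(releases):
    """Group node names by kernel series; more than one group means a mixed cluster."""
    groups = {}
    for node, release in releases.items():
        groups.setdefault(kernel_series(release), []).append(node)
    return groups

# Hypothetical node names; the release strings are the kernels mentioned in this thread.
cluster = {
    "node1": "5.15.35-1-pve",
    "node2": "5.19.7-1-pve",
    "node3": "5.19.7-1-pve",
}
groups = group_nodes_by_series(cluster)
if len(groups) > 1:
    print("Mixed kernel series, consider shutting guests down before migrating:")
for series, nodes in sorted(groups.items()):
    print(series, nodes)
```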
 
Same issue here. Can anyone report whether the upgrade to 5.19 fixed it?
I have a Ryzen system that randomly crashes every few days.
 
The host crashes, or at least becomes unresponsive via UI/SSH.
Only a hard reset helps.
 
I have upgraded to pve-kernel-5.19.7-1-pve.
One thing to be aware of:

Live-migrating a guest from a node on pve-kernel-5.15 to a node on 5.19 still hangs,
so guests need to be shut down before migrating.
Once all nodes are upgraded to 5.19,
live migration works.
So no more soft lockups with the newest kernel?
 
Hello everyone,

I had the same issue. I thought it was a faulty installation, so I ended up reinstalling Proxmox 7.2. It ran smoothly for two weeks, then I decided to upgrade to 5.15-74-1.
After a day, my Proxmox host froze because of a CPU soft lockup.
I could not SSH in or log in to any of the VMs. Only a hard reboot worked.
Then I upgraded to 5.19. Same issue: after a couple of hours I had another soft lockup.

I have an AMD Ryzen 5 3600.
I am using ZFS as the filesystem.
 
Check the BIOS setting for Power Supply Idle Control; maybe your PSU shuts off if the load is too low.
 
I upgraded to 5.19.
I changed settings in my BIOS:
  • Power Supply Idle Control --> Typical Current Idle
  • Global C-state control --> Disabled
So far, so good: 5 days of uptime, no soft lockup.
I am not selling the bear's skin before I've caught it, but @Vengance you seem to be right :)

Thanks a lot
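
If you want to see from the running host which CPU idle states the kernel currently exposes, for example before and after changing the BIOS settings above, something like this Python sketch reads them from the Linux cpuidle sysfs directory. The function name `list_cstates` is my own, and what the kernel reports may not map one-to-one onto the BIOS option names, so treat it as a cross-check only.

```python
from pathlib import Path

def list_cstates(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return [(state_dir, name, disabled)] for each idle state exposed for one CPU."""
    base = Path(cpu_dir)
    states = []
    if not base.is_dir():
        # No cpuidle directory: the kernel exposes no idle states here.
        return states
    # Sort numerically so state10 would come after state9.
    for state in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        disabled = (state / "disable").read_text().strip() != "0"
        states.append((state.name, name, disabled))
    return states

for state_dir, name, disabled in list_cstates():
    print(state_dir, name, "disabled" if disabled else "enabled")
```

An empty result, or only a shallow state listed, would suggest deeper C-states are not available to the OS.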
 
Experienced a CPU soft lockup earlier today on a Xeon E series.

Gut feeling says it's perhaps a spike in I/O that causes it. I say this because I started a relatively large upload to a VM via SSH and the issue happened shortly after; however, this could be a total coincidence.

Code:
Linux proxm 5.15.30-2-pve #1 SMP PVE 5.15.30-3 (Fri, 22 Apr 2022 18:08:27 +0200) x86_64 GNU/Linux

Strangely I don't see any staff responses on this thread yet.
 
