Soft CPU lockups inside VM, fixed after hypervisor reboot

tuxis

Ede, NL
www.tuxis.nl
Hi,

Last Friday we encountered something 'funny'. We had installed a VM on the hypervisor earlier that week and everything was fine. On Friday, we noticed that the VM thought the date was somewhere around 16 December 2217. At first I suspected a weird NTP daemon somewhere, combined with systemd-timesyncd, so I tried installing chrony. That did not work, because the VM was terribly slow and kept crashing on I/O.
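(For anyone hitting something similar: a quick sanity check of the guest clock might look like this. This is just a generic sketch for a systemd-based Linux guest, not something specific to our setup.)

Code:
# inside the guest: wall-clock time, NTP sync state and the active kernel clocksource
date
timedatectl status
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
dmesg | grep -i clocksource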

The following messages were shown in `dmesg`:
Code:
Dec 16 22:47:24 da kernel: [6147262037.250097] watchdog: BUG: soft lockup - CPU#2 stuck for 1429948009s! [md5sum:1152634]
Dec 16 22:47:24 da kernel: [6147262037.252114] Modules linked in: tcp_diag udp_diag raw_diag inet_diag unix_diag binfmt_misc xt_set ip_set_hash_net ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_nat_ftp nf_conntrack_ftp ip_set ip6table_filter iptable_nat ip6table_mangle ip6table_raw ip6table_nat ip6_tables iptable_mangle iptable_raw iptable_filter xt_nat nft_chain_nat xt_REDIRECT nf_nat xt_owner xt_connlimit nf_conncount xt_recent nft_limit xt_limit xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_multiport nf_log_ipv4 nf_log_common xt_LOG nft_counter xt_tcpudp nft_compat nf_tables libcrc32c nfnetlink dm_mod isofs bochs_drm drm_vram_helper drm_ttm_helper ttm drm_kms_helper cec joydev sg virtio_console virtio_balloon evdev serio_raw pcspkr qemu_fw_cfg button drm fuse configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic hid_generic usbhid hid sd_mod t10_pi crc_t10dif crct10dif_generic sr_mod cdrom virtio_net crct10dif_common net_failover ata_generic virtio_scsi
Dec 16 22:47:24 da kernel: [6147262037.252163]  failover ata_piix libata uhci_hcd psmouse ehci_hcd usbcore scsi_mod virtio_pci virtio_ring i2c_piix4 usb_common virtio floppy
Dec 16 22:47:24 da kernel: [6147262037.252174] CPU: 2 PID: 1152634 Comm: md5sum Tainted: G             L    5.10.0-21-amd64 #1 Debian 5.10.162-1
Dec 16 22:47:24 da kernel: [6147262037.252175] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Dec 16 22:47:24 da kernel: [6147262037.252181] RIP: 0033:0x55dd1f96e178
Dec 16 22:47:25 da kernel: [6147262037.252183] Code: 7c 24 d8 c1 c0 05 01 f0 8d bc 17 53 14 44 02 89 f2 31 c2 21 ca 31 f2 01 fa 41 8d bc 09 81 e6 a1 d8 89 c1 c1 c2 09 01 c2 31 d1 <21> f1 31 c1 01 f9 8b 7c 24 c8 44 8b 54 24 d4 c1 c1 0e 01 d1 8d bc
Dec 16 22:47:25 da kernel: [6147262037.252184] RSP: 002b:00007ffff9aded28 EFLAGS: 00000206
Dec 16 22:47:25 da kernel: [6147262037.252185] RAX: 000000005f402a47 RBX: 000055dd206f8d50 RCX: 00000000213e9344
Dec 16 22:47:25 da kernel: [6147262037.252185] RDX: 000000007e7eb903 RSI: 0000000022d2b3f9 RDI: 0000000090adae23
Dec 16 22:47:25 da kernel: [6147262037.252186] RBP: 0000000007bc07bc R08: 0000000000000004 R09: 0000000000200000
Dec 16 22:47:25 da kernel: [6147262037.252187] R10: 0000000000000000 R11: 00000000bb3ade75 R12: 00000000000004ab
Dec 16 22:47:25 da kernel: [6147262037.252187] R13: 0000000000000000 R14: 00000000012766a2 R15: 00000000fbff6405
Dec 16 22:47:25 da kernel: [6147262037.252188] FS:  00007f8f19a5b580 GS:  0000000000000000
Dec 16 22:47:37 da kernel: [6147262049.567688] watchdog: BUG: soft lockup - CPU#1 stuck for 1429948009s! [mandb:1163276]
Dec 16 22:47:37 da kernel: [6147262049.569956] Modules linked in: tcp_diag udp_diag raw_diag inet_diag unix_diag binfmt_misc xt_set ip_set_hash_net ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_nat_ftp nf_conntrack_ftp ip_set ip6table_filter iptable_nat ip6table_mangle ip6table_raw ip6table_nat ip6_tables iptable_mangle iptable_raw iptable_filter xt_nat nft_chain_nat xt_REDIRECT nf_nat xt_owner xt_connlimit nf_conncount xt_recent nft_limit xt_limit xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_multiport nf_log_ipv4 nf_log_common xt_LOG nft_counter xt_tcpudp nft_compat nf_tables libcrc32c nfnetlink dm_mod isofs bochs_drm drm_vram_helper drm_ttm_helper ttm drm_kms_helper cec joydev sg virtio_console virtio_balloon evdev serio_raw pcspkr qemu_fw_cfg button drm fuse configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic hid_generic usbhid hid sd_mod t10_pi crc_t10dif crct10dif_generic sr_mod cdrom virtio_net crct10dif_common net_failover ata_generic virtio_scsi
Dec 16 22:47:37 da kernel: [6147262049.570109]  failover ata_piix libata uhci_hcd psmouse ehci_hcd usbcore scsi_mod virtio_pci virtio_ring i2c_piix4 usb_common virtio floppy
Dec 16 22:47:37 da kernel: [6147262049.570142] CPU: 1 PID: 1163276 Comm: mandb Tainted: G             L    5.10.0-21-amd64 #1 Debian 5.10.162-1
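As an aside, "stuck for 1429948009s" is roughly 45 years, which looks much more like a clock jump than like a CPU that was genuinely stuck that long. If you want to poke at the soft lockup watchdog itself, the usual knobs can be read like this (standard kernel sysctls, shown purely as an illustration):

Code:
# guest-side: soft lockup watchdog settings (defaults unless somebody tuned them)
sysctl kernel.soft_watchdog kernel.watchdog_thresh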

So after chrony was more or less installed, I reset the VM, hoping that the reboot would set the time correctly. It didn't; the VM simply wouldn't boot without crashing. Booting from a rescue CD instead of the hard disk did not help either.

Setting the number of cores to 1 helped a bit, but not completely.
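(For reference, reducing the vCPU count is done from the host; 100 is just a placeholder vmid here, and the change only applies after a full stop/start.)

Code:
# on the PVE host: drop the guest to a single core (100 = placeholder vmid)
qm set 100 --cores 1
qm stop 100 && qm start 100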

Starting the VM on another node in the cluster did help. The VM booted normally, time was OK, and everything worked. So, trying to boot the VM again on the original node should help as well, right? Wrong. On node01, the VM didn't boot.
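(For completeness, moving a stopped VM to another node is a single command on the host; node02 is just an example target, and with shared storage such as Ceph no disk data has to move.)

Code:
# on the PVE host: offline-migrate the stopped guest to another cluster node
qm migrate 100 node02
# then, on node02:
qm start 100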

So last night I rebooted node01 to see if that helped. And it did: after the reboot, all VMs on the node work like a charm, just as they had before!

So apparently there is something the matter with the hypervisor. It just doesn't show anywhere: the OSDs are running without issue and there is nothing logged or out of the ordinary on the host. There were other VMs running on the node, some of which were misbehaving, while others were working just fine (as far as I know).
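If someone wants to dig on the host side, generic clock/TSC checks on a KVM host could look like the following. I'm not claiming these would have caught it; it's just where I'd look first:

Code:
# on the host: TSC / clocksource complaints and the active clocksource
dmesg -T | grep -iE 'tsc|clocksource'
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
grep -m1 -o 'constant_tsc' /proc/cpuinfo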


Info about the host:
Code:
CPU: AMD EPYC 7302 16-Core Processor
pveversion: pve-manager/7.3-4/d69b70d4 (running kernel: 5.15.83-1-pve)
Kernel before reboot: 5.15.53-1-pve
Uptime before reboot: ~ 159 days

If anyone has any ideas, I'd be very interested, as I've never seen this before.
 
tuxis said:
time in the VM was somewhere 16 December 2217
I've encountered a similar phenomenon when I live-migrated VMs between EPYC and Xeon hosts approx. half a year ago. (Of course that was not a well-thought-out idea.)

I am not saying that the cause is the same for you. What I want to mention is that I ended up with several files touched with that future date, which in turn broke mechanisms like logrotate; call it the "disk-full-because-of-/var/log" syndrome.

"Disk full" is always a real problem and the affected VMs killed internal processes because of this earlier or later..., so you might want to check this early enough.