One VM hangs, super high load avg (200+)

helpmyserverruns

New Member
Oct 7, 2024
3
0
1
I'm experiencing sporadic but reoccurring issues with one of my VMs.
One VM has a super high load average. top works sometimes, htop completely hangs. SSH sometimes works.
top shows basically no CPU usage.
This setup worked without issue for the last 3 months. This started about a week ago.
I did not change anything in the setup during the last 4 weeks. Just regular kernel updates etc.

I have attached a NanoKVM however I can't see how thats affecting a single VM only.

This is the 2nd time in the last week.
Proxmox automatic backup is scheduled at 2:30am so it should not be the culprit.

VM Settings:
CPU: 8 Core (host)
Mem: 32GB (of 64GB in host)
1728331586314.png


I have quite a lot of docker containers (~40) running in the VM so troubleshooting individual software is gonna be challenging.
To name a few:
- Authentik
- Harbor
- Swag
- Arr Stack + Plex + Jellyfin
- Grafana
- Netbox
- Unifi Controller
- Yourls
samba is running natively.

CPU: 12600k
MB: Z690-P
Proxmox VE 8.2.7
Passthrough PCI devices:
- ConnectX4 Lx Virtual Function
- LSI SAS 9207-8i
- 2x Intel P4511 4TB U.2 SSD
Code:
root@px1:~# uname -a
Linux px1 6.8.12-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) x86_64 GNU/Linux

Zabbix can still get metrics (up to a point):
View attachment 75911
1728331784790.png
This might be interesting:
1728331850791.png
1728331885754.png

Okay while writing it seems to have completely locked up... No more metrics, no more SSH.
Proxmox Web Console also not working.
Last few log lines from VM:
1728331689787.png

CPU usage in web ui is not showing anything strange:
View attachment 75912

No dmesg messages to speak of around the time it happened ( on the proxmox host):
Code:
Oct 07 21:15:55 px1 sudo[855938]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855940]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855932]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855934]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855942]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:15:55 px1 kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:15:55 px1 kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:15:55 px1 kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:15:55 px1 kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:15:55 px1 kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:15:55 px1 sudo[855944]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855931]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855936]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855933]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855935]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855941]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855943]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855945]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855937]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855939]: pam_unix(sudo:session): session closed for user root
Oct 07 21:17:01 px1 CRON[856491]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 07 21:17:01 px1 CRON[856492]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 07 21:17:01 px1 CRON[856491]: pam_unix(cron:session): session closed for user root
Oct 07 21:22:31 px1 smartd[20846]: Device: /dev/disk/by-id/ata-CT500MX500SSD1_2208E60D8197 [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 68 to 67
Oct 07 21:24:28 px1 pvedaemon[1996]: <root@pam> successful auth for user 'root@pam'
Oct 07 21:24:31 px1 pvedaemon[1995]: <root@pam> starting task UPID:px1:000D1F17:00A30BA0:6704356F:vncproxy:102:root@pam:
Oct 07 21:24:31 px1 pvedaemon[859927]: starting vnc proxy UPID:px1:000D1F17:00A30BA0:6704356F:vncproxy:102:root@pam:
Oct 07 21:38:40 px1 pvedaemon[1995]: <root@pam> end task UPID:px1:000D1F17:00A30BA0:6704356F:vncproxy:102:root@pam: OK
Oct 07 21:39:25 px1 pvedaemon[1997]: <root@pam> successful auth for user 'root@pam'

Other occurence zabbix screenshot:
1728331152251.png
 
DMESG from VM crash after reboot:
Code:
Oct 07 15:32:12 helios kernel: br-4507533d0bbc: port 7(veth586bf23) entered blocking state
Oct 07 15:32:12 helios kernel: br-4507533d0bbc: port 7(veth586bf23) entered forwarding state
Oct 07 16:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 16:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 16:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 16:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 16:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 16:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 17:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 17:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 17:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 17:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 17:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 17:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 18:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 18:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 18:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 18:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 18:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 18:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 19:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 19:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 19:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 19:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 19:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 19:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 19:52:51 helios kernel: mlx5_core 0000:01:00.0 enp1s0: Link up
Oct 07 20:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 20:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 20:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 20:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 20:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 20:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:11:14 helios kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:18:00 helios kernel: BUG: unable to handle page fault for address: ffff989d24878bc8
 
And now that I know how to retrieve it from 4 boots ago:
Code:
Oct 05 22:56:49.419461 helios kernel: BUG: unable to handle page fault for address: ffffcd66c54c5d78
Oct 05 22:56:49.419980 helios kernel: #PF: supervisor read access in kernel mode
Oct 05 22:56:49.422175 helios kernel: #PF: error_code(0x0000) - not-present page
Oct 05 22:56:49.422709 helios kernel: PGD 0 P4D 0
Oct 05 22:56:49.423146 helios kernel: Oops: 0000 [#3] PREEMPT SMP NOPTI
Oct 05 22:56:49.423194 helios kernel: CPU: 7 PID: 2174458 Comm: umount Tainted: P    B D    OE      6.1.0-23-amd64 #1  Debian 6.1.99-1
Oct 05 22:56:49.423218 helios kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 4.2023.08-4 02/15/2024
Oct 05 22:56:49.423244 helios kernel: RIP: 0010:sa_idx_tab_rele+0xe/0xf0 [zfs]
Oct 05 22:56:49.423265 helios kernel: Code: c1 48 c7 c7 f0 82 60 c1 e8 35 ab 59 ff e9 d1 fe ff ff e8 55 7c f8 e1 0f 1f 44 00 00 0f 1f 44 00 00 48 85 f6 74 5e 41 54 55 53 <48> 8b af 78 05 00 00 48 89 f3 48 89 ef e8 80 4e f9 e1 65 48 8b 04
Oct 05 22:56:49.423295 helios kernel: RSP: 0018:ffffa79a4ce17b58 EFLAGS: 00010286
Oct 05 22:56:49.423333 helios kernel: RAX: ffff8d67a5ae4c98 RBX: ffff8d67a5ae4c98 RCX: ffff8d67049108a0
Oct 05 22:56:49.423356 helios kernel: RDX: 0000000000000000 RSI: ffff8d68c6485ec0 RDI: ffffcd66c54c5800
Oct 05 22:56:49.423375 helios kernel: RBP: ffff8d6d93f14300 R08: ffffa79a4ce17a30 R09: ffff8d66c4e4b438
Oct 05 22:56:49.423401 helios kernel: R10: 00000000000379c0 R11: 0000000000000000 R12: ffff8d67a5ae4d30
Oct 05 22:56:49.423428 helios kernel: R13: ffff8d67015ca000 R14: ffff8d67966f0008 R15: ffff8d688ccf0f68
Oct 05 22:56:49.423458 helios kernel: FS:  00007f8771a82840(0000) GS:ffff8d6e1fdc0000(0000) knlGS:0000000000000000
Oct 05 22:56:49.423478 helios kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 05 22:56:49.423506 helios kernel: CR2: ffffcd66c54c5d78 CR3: 0000000147daa001 CR4: 0000000000770ee0
Oct 05 22:56:49.423527 helios kernel: PKRU: 55555554
Oct 05 22:56:49.423541 helios kernel: Call Trace:
Oct 05 22:56:49.423556 helios kernel:  <TASK>
Oct 05 22:56:49.423573 helios kernel:  ? __die_body.cold+0x1a/0x1f
Oct 05 22:56:49.423596 helios kernel:  ? page_fault_oops+0xd2/0x2b0
Oct 05 22:56:49.423614 helios kernel:  ? search_module_extables+0x15/0x60
Oct 05 22:56:49.423643 helios kernel:  ? search_bpf_extables+0x5b/0x80
Oct 05 22:56:49.423661 helios kernel:  ? exc_page_fault+0xca/0x170
Oct 05 22:56:49.423681 helios kernel:  ? asm_exc_page_fault+0x22/0x30
Oct 05 22:56:49.423696 helios kernel:  ? sa_idx_tab_rele+0xe/0xf0 [zfs]
Oct 05 22:56:49.423716 helios kernel:  sa_handle_destroy+0x55/0xd0 [zfs]
Oct 05 22:56:49.423744 helios kernel:  zfs_zinactive+0x8f/0xf0 [zfs]
Oct 05 22:56:49.423770 helios kernel:  zfs_inactive+0x95/0x230 [zfs]
Oct 05 22:56:49.423799 helios kernel:  zpl_evict_inode+0x3e/0x60 [zfs]
Oct 05 22:56:49.423836 helios kernel:  evict+0xcd/0x1d0
Oct 05 22:56:49.423859 helios kernel:  dispose_list+0x48/0x70
Oct 05 22:56:49.423878 helios kernel:  evict_inodes+0x1a0/0x1b0
Oct 05 22:56:49.423918 helios kernel:  generic_shutdown_super+0x37/0x130
Oct 05 22:56:49.423956 helios kernel:  kill_anon_super+0x14/0x30
Oct 05 22:56:49.423976 helios kernel:  deactivate_locked_super+0x2c/0xa0
Oct 05 22:56:49.423994 helios kernel:  cleanup_mnt+0xbd/0x150
Oct 05 22:56:49.424019 helios kernel:  task_work_run+0x56/0x90
Oct 05 22:56:49.424044 helios kernel:  exit_to_user_mode_prepare+0x1dd/0x1e0
Oct 05 22:56:49.424064 helios kernel:  syscall_exit_to_user_mode+0x17/0x40
Oct 05 22:56:49.424101 helios kernel:  do_syscall_64+0x61/0xb0
Oct 05 22:56:49.424115 helios kernel:  ? exit_to_user_mode_prepare+0x40/0x1e0
Oct 05 22:56:49.424156 helios kernel:  ? syscall_exit_to_user_mode+0x1e/0x40
Oct 05 22:56:49.424180 helios kernel:  ? do_syscall_64+0x61/0xb0
Oct 05 22:56:49.424201 helios kernel:  ? exit_to_user_mode_prepare+0x40/0x1e0
Oct 05 22:56:49.424215 helios kernel:  ? syscall_exit_to_user_mode+0x1e/0x40
Oct 05 22:56:49.424232 helios kernel:  ? do_syscall_64+0x61/0xb0
Oct 05 22:56:49.424254 helios kernel:  ? do_user_addr_fault+0x1b0/0x550
Oct 05 22:56:49.424284 helios kernel:  ? clear_bhb_loop+0x15/0x70
Oct 05 22:56:49.424307 helios kernel:  ? clear_bhb_loop+0x15/0x70
Oct 05 22:56:49.424324 helios kernel:  ? clear_bhb_loop+0x15/0x70
Oct 05 22:56:49.424348 helios kernel:  ? clear_bhb_loop+0x15/0x70
Oct 05 22:56:49.424479 helios kernel:  ? clear_bhb_loop+0x15/0x70
Oct 05 22:56:49.424506 helios kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Oct 05 22:56:49.424532 helios kernel: RIP: 0033:0x7f8771caea77
Oct 05 22:56:49.424553 helios kernel: Code: 8f 93 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 59 93 0c 00 f7 d8 64 89 02 b8
Oct 05 22:56:49.424573 helios kernel: RSP: 002b:00007ffd28f00258 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
Oct 05 22:56:49.424592 helios kernel: RAX: 0000000000000000 RBX: 00005644db340b78 RCX: 00007f8771caea77
Oct 05 22:56:49.424611 helios kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005644db341950
Oct 05 22:56:49.424645 helios kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000073
Oct 05 22:56:49.424672 helios kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8771de9264
Oct 05 22:56:49.424701 helios kernel: R13: 00005644db341950 R14: 00005644db341910 R15: 00005644db340a60
Oct 05 22:56:49.424728 helios kernel:  </TASK>
Oct 05 22:56:49.424767 helios kernel: Modules linked in: tcp_diag inet_diag tun bluetooth jitterentropy_rng drbg ansi_cprng ecdh_generic rfkill ecc xt_nat xt_tcpudp veth xt_conntrack nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter bridge stp llc nft_chain_nat xt>Oct 05 22:56:49.424852 helios kernel:  soundcore button evdev joydev serio_raw sg fuse loop efi_pstore configfs efivarfs qemu_fw_cfg ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic mlx5_ib ib_uverbs ib_core hid_generic usbhid hid dm_mod sd_mod nvme mlx5_core vir>Oct 05 22:56:49.424889 helios kernel: CR2: ffffcd66c54c5d78
Oct 05 22:56:49.424916 helios kernel: ---[ end trace 0000000000000000 ]---
Oct 05 22:56:49.424934 helios kernel: RIP: 0010:zap_huge_pmd+0x1c0/0x430
Oct 05 22:56:49.424965 helios kernel: Code: 21 c5 48 c1 e8 3a 8d 50 e4 49 c1 e5 06 4c 03 2d 76 b3 0d 01 83 fa 01 0f 86 5d 02 00 00 48 83 f8 1b 0f 84 53 02 00 00 45 31 e4 <49> 8b 45 08 a8 01 0f 85 59 02 00 00 0f 1f 44 00 00 4c 89 e8 f6 40
Oct 05 22:56:49.425030 helios kernel: RSP: 0018:ffffa79a69037668 EFLAGS: 00010246
Oct 05 22:56:49.425047 helios kernel: RAX: 0000000000000000 RBX: ffffa79a69037898 RCX: 0000040000000000
Oct 05 22:56:49.425060 helios kernel: RDX: 00000000ffffffe4 RSI: 0000000000000000 RDI: 0000040000000000
Oct 05 22:56:49.425071 helios kernel: RBP: ffff8d67a5ae7970 R08: 000fffffffffffff R09: 00007f3665e00000
Oct 05 22:56:49.425083 helios kernel: R10: 00007f3684645fff R11: 0000000000000000 R12: 0000000000000000
Oct 05 22:56:49.425088 helios kernel: R13: ffffe65e7fffffc0 R14: ffffe5de8796b9e8 R15: 00007f3664800000
Oct 05 22:56:49.425096 helios kernel: FS:  00007f8771a82840(0000) GS:ffff8d6e1fdc0000(0000) knlGS:0000000000000000
Oct 05 22:56:49.425105 helios kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 05 22:56:49.425111 helios kernel: CR2: ffffcd66c54c5d78 CR3: 0000000147daa001 CR4: 0000000000770ee0
Oct 05 22:56:49.425116 helios kernel: PKRU: 55555554
Oct 05 22:56:49.425124 helios kernel: note: umount[2174458] exited with irqs disabled
Oct 05 22:56:49.583101 helios systemd-shutdown[1]: Syncing filesystems and block devices.
Oct 05 22:57:29.585013 helios systemd-shutdown[1]: Syncing filesystems and block devices - timed out, issuing SIGKILL to PID 2174478.
Oct 05 22:57:29.585385 helios systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Oct 05 22:57:29.585415 helios systemd-journald[336]: Received SIGTERM from PID 1 (systemd-shutdow).