One VM hangs, super high load avg (200+)

helpmyserverruns

New Member
Oct 7, 2024
2
0
1
I'm experiencing sporadic but reoccurring issues with one of my VMs.
One VM has a super high load average. top works sometimes, htop completely hangs. SSH sometimes works.
top shows basically no CPU usage.
This setup worked without issue for the last 3 months. This started about a week ago.
I did not change anything in the setup during the last 4 weeks. Just regular kernel updates etc.

I have attached a NanoKVM however I can't see how thats affecting a single VM only.

This is the 2nd time in the last week.
Proxmox automatic backup is scheduled at 2:30am so it should not be the culprit.

VM Settings:
CPU: 8 Core (host)
Mem: 32GB (of 64GB in host)
1728331586314.png


I have quite a lot of docker containers (~40) running in the VM so troubleshooting individual software is gonna be challenging.
To name a few:
- Authentik
- Harbor
- Swag
- Arr Stack + Plex + Jellyfin
- Grafana
- Netbox
- Unifi Controller
- Yourls
samba is running natively.

CPU: 12600k
MB: Z690-P
Proxmox VE 8.2.7
Passthrough PCI devices:
- ConnectX4 Lx Virtual Function
- LSI SAS 9207-8i
- 2x Intel P4511 4TB U.2 SSD
Code:
root@px1:~# uname -a
Linux px1 6.8.12-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) x86_64 GNU/Linux

Zabbix can still get metrics (up to a point):
View attachment 75911
1728331784790.png
This might be interesting:
1728331850791.png
1728331885754.png

Okay while writing it seems to have completely locked up... No more metrics, no more SSH.
Proxmox Web Console also not working.
Last few log lines from VM:
1728331689787.png

CPU usage in web ui is not showing anything strange:
View attachment 75912

No dmesg messages to speak of around the time it happened ( on the proxmox host):
Code:
Oct 07 21:15:55 px1 sudo[855938]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855940]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855932]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855934]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855942]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:15:55 px1 kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:15:55 px1 kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:15:55 px1 kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:15:55 px1 kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:15:55 px1 kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 07 21:15:55 px1 sudo[855944]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855931]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855936]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855933]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855935]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855941]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855943]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855945]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855937]: pam_unix(sudo:session): session closed for user root
Oct 07 21:15:55 px1 sudo[855939]: pam_unix(sudo:session): session closed for user root
Oct 07 21:17:01 px1 CRON[856491]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 07 21:17:01 px1 CRON[856492]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 07 21:17:01 px1 CRON[856491]: pam_unix(cron:session): session closed for user root
Oct 07 21:22:31 px1 smartd[20846]: Device: /dev/disk/by-id/ata-CT500MX500SSD1_2208E60D8197 [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 68 to 67
Oct 07 21:24:28 px1 pvedaemon[1996]: <root@pam> successful auth for user 'root@pam'
Oct 07 21:24:31 px1 pvedaemon[1995]: <root@pam> starting task UPID:px1:000D1F17:00A30BA0:6704356F:vncproxy:102:root@pam:
Oct 07 21:24:31 px1 pvedaemon[859927]: starting vnc proxy UPID:px1:000D1F17:00A30BA0:6704356F:vncproxy:102:root@pam:
Oct 07 21:38:40 px1 pvedaemon[1995]: <root@pam> end task UPID:px1:000D1F17:00A30BA0:6704356F:vncproxy:102:root@pam: OK
Oct 07 21:39:25 px1 pvedaemon[1997]: <root@pam> successful auth for user 'root@pam'

Other occurence zabbix screenshot:
1728331152251.png
 
And now that I know how to retrieve it from 4 boots ago:
Code:
Oct 05 22:56:49.419461 helios kernel: BUG: unable to handle page fault for address: ffffcd66c54c5d78
Oct 05 22:56:49.419980 helios kernel: #PF: supervisor read access in kernel mode
Oct 05 22:56:49.422175 helios kernel: #PF: error_code(0x0000) - not-present page
Oct 05 22:56:49.422709 helios kernel: PGD 0 P4D 0
Oct 05 22:56:49.423146 helios kernel: Oops: 0000 [#3] PREEMPT SMP NOPTI
Oct 05 22:56:49.423194 helios kernel: CPU: 7 PID: 2174458 Comm: umount Tainted: P    B D    OE      6.1.0-23-amd64 #1  Debian 6.1.99-1
Oct 05 22:56:49.423218 helios kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 4.2023.08-4 02/15/2024
Oct 05 22:56:49.423244 helios kernel: RIP: 0010:sa_idx_tab_rele+0xe/0xf0 [zfs]
Oct 05 22:56:49.423265 helios kernel: Code: c1 48 c7 c7 f0 82 60 c1 e8 35 ab 59 ff e9 d1 fe ff ff e8 55 7c f8 e1 0f 1f 44 00 00 0f 1f 44 00 00 48 85 f6 74 5e 41 54 55 53 <48> 8b af 78 05 00 00 48 89 f3 48 89 ef e8 80 4e f9 e1 65 48 8b 04
Oct 05 22:56:49.423295 helios kernel: RSP: 0018:ffffa79a4ce17b58 EFLAGS: 00010286
Oct 05 22:56:49.423333 helios kernel: RAX: ffff8d67a5ae4c98 RBX: ffff8d67a5ae4c98 RCX: ffff8d67049108a0
Oct 05 22:56:49.423356 helios kernel: RDX: 0000000000000000 RSI: ffff8d68c6485ec0 RDI: ffffcd66c54c5800
Oct 05 22:56:49.423375 helios kernel: RBP: ffff8d6d93f14300 R08: ffffa79a4ce17a30 R09: ffff8d66c4e4b438
Oct 05 22:56:49.423401 helios kernel: R10: 00000000000379c0 R11: 0000000000000000 R12: ffff8d67a5ae4d30
Oct 05 22:56:49.423428 helios kernel: R13: ffff8d67015ca000 R14: ffff8d67966f0008 R15: ffff8d688ccf0f68
Oct 05 22:56:49.423458 helios kernel: FS:  00007f8771a82840(0000) GS:ffff8d6e1fdc0000(0000) knlGS:0000000000000000
Oct 05 22:56:49.423478 helios kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 05 22:56:49.423506 helios kernel: CR2: ffffcd66c54c5d78 CR3: 0000000147daa001 CR4: 0000000000770ee0
Oct 05 22:56:49.423527 helios kernel: PKRU: 55555554
Oct 05 22:56:49.423541 helios kernel: Call Trace:
Oct 05 22:56:49.423556 helios kernel:  <TASK>
Oct 05 22:56:49.423573 helios kernel:  ? __die_body.cold+0x1a/0x1f
Oct 05 22:56:49.423596 helios kernel:  ? page_fault_oops+0xd2/0x2b0
Oct 05 22:56:49.423614 helios kernel:  ? search_module_extables+0x15/0x60
Oct 05 22:56:49.423643 helios kernel:  ? search_bpf_extables+0x5b/0x80
Oct 05 22:56:49.423661 helios kernel:  ? exc_page_fault+0xca/0x170
Oct 05 22:56:49.423681 helios kernel:  ? asm_exc_page_fault+0x22/0x30
Oct 05 22:56:49.423696 helios kernel:  ? sa_idx_tab_rele+0xe/0xf0 [zfs]
Oct 05 22:56:49.423716 helios kernel:  sa_handle_destroy+0x55/0xd0 [zfs]
Oct 05 22:56:49.423744 helios kernel:  zfs_zinactive+0x8f/0xf0 [zfs]
Oct 05 22:56:49.423770 helios kernel:  zfs_inactive+0x95/0x230 [zfs]
Oct 05 22:56:49.423799 helios kernel:  zpl_evict_inode+0x3e/0x60 [zfs]
Oct 05 22:56:49.423836 helios kernel:  evict+0xcd/0x1d0
Oct 05 22:56:49.423859 helios kernel:  dispose_list+0x48/0x70
Oct 05 22:56:49.423878 helios kernel:  evict_inodes+0x1a0/0x1b0
Oct 05 22:56:49.423918 helios kernel:  generic_shutdown_super+0x37/0x130
Oct 05 22:56:49.423956 helios kernel:  kill_anon_super+0x14/0x30
Oct 05 22:56:49.423976 helios kernel:  deactivate_locked_super+0x2c/0xa0
Oct 05 22:56:49.423994 helios kernel:  cleanup_mnt+0xbd/0x150
Oct 05 22:56:49.424019 helios kernel:  task_work_run+0x56/0x90
Oct 05 22:56:49.424044 helios kernel:  exit_to_user_mode_prepare+0x1dd/0x1e0
Oct 05 22:56:49.424064 helios kernel:  syscall_exit_to_user_mode+0x17/0x40
Oct 05 22:56:49.424101 helios kernel:  do_syscall_64+0x61/0xb0
Oct 05 22:56:49.424115 helios kernel:  ? exit_to_user_mode_prepare+0x40/0x1e0
Oct 05 22:56:49.424156 helios kernel:  ? syscall_exit_to_user_mode+0x1e/0x40
Oct 05 22:56:49.424180 helios kernel:  ? do_syscall_64+0x61/0xb0
Oct 05 22:56:49.424201 helios kernel:  ? exit_to_user_mode_prepare+0x40/0x1e0
Oct 05 22:56:49.424215 helios kernel:  ? syscall_exit_to_user_mode+0x1e/0x40
Oct 05 22:56:49.424232 helios kernel:  ? do_syscall_64+0x61/0xb0
Oct 05 22:56:49.424254 helios kernel:  ? do_user_addr_fault+0x1b0/0x550
Oct 05 22:56:49.424284 helios kernel:  ? clear_bhb_loop+0x15/0x70
Oct 05 22:56:49.424307 helios kernel:  ? clear_bhb_loop+0x15/0x70
Oct 05 22:56:49.424324 helios kernel:  ? clear_bhb_loop+0x15/0x70
Oct 05 22:56:49.424348 helios kernel:  ? clear_bhb_loop+0x15/0x70
Oct 05 22:56:49.424479 helios kernel:  ? clear_bhb_loop+0x15/0x70
Oct 05 22:56:49.424506 helios kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Oct 05 22:56:49.424532 helios kernel: RIP: 0033:0x7f8771caea77
Oct 05 22:56:49.424553 helios kernel: Code: 8f 93 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 59 93 0c 00 f7 d8 64 89 02 b8
Oct 05 22:56:49.424573 helios kernel: RSP: 002b:00007ffd28f00258 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
Oct 05 22:56:49.424592 helios kernel: RAX: 0000000000000000 RBX: 00005644db340b78 RCX: 00007f8771caea77
Oct 05 22:56:49.424611 helios kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005644db341950
Oct 05 22:56:49.424645 helios kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000073
Oct 05 22:56:49.424672 helios kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8771de9264
Oct 05 22:56:49.424701 helios kernel: R13: 00005644db341950 R14: 00005644db341910 R15: 00005644db340a60
Oct 05 22:56:49.424728 helios kernel:  </TASK>
Oct 05 22:56:49.424767 helios kernel: Modules linked in: tcp_diag inet_diag tun bluetooth jitterentropy_rng drbg ansi_cprng ecdh_generic rfkill ecc xt_nat xt_tcpudp veth xt_conntrack nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter bridge stp llc nft_chain_nat xt>Oct 05 22:56:49.424852 helios kernel:  soundcore button evdev joydev serio_raw sg fuse loop efi_pstore configfs efivarfs qemu_fw_cfg ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic mlx5_ib ib_uverbs ib_core hid_generic usbhid hid dm_mod sd_mod nvme mlx5_core vir>Oct 05 22:56:49.424889 helios kernel: CR2: ffffcd66c54c5d78
Oct 05 22:56:49.424916 helios kernel: ---[ end trace 0000000000000000 ]---
Oct 05 22:56:49.424934 helios kernel: RIP: 0010:zap_huge_pmd+0x1c0/0x430
Oct 05 22:56:49.424965 helios kernel: Code: 21 c5 48 c1 e8 3a 8d 50 e4 49 c1 e5 06 4c 03 2d 76 b3 0d 01 83 fa 01 0f 86 5d 02 00 00 48 83 f8 1b 0f 84 53 02 00 00 45 31 e4 <49> 8b 45 08 a8 01 0f 85 59 02 00 00 0f 1f 44 00 00 4c 89 e8 f6 40
Oct 05 22:56:49.425030 helios kernel: RSP: 0018:ffffa79a69037668 EFLAGS: 00010246
Oct 05 22:56:49.425047 helios kernel: RAX: 0000000000000000 RBX: ffffa79a69037898 RCX: 0000040000000000
Oct 05 22:56:49.425060 helios kernel: RDX: 00000000ffffffe4 RSI: 0000000000000000 RDI: 0000040000000000
Oct 05 22:56:49.425071 helios kernel: RBP: ffff8d67a5ae7970 R08: 000fffffffffffff R09: 00007f3665e00000
Oct 05 22:56:49.425083 helios kernel: R10: 00007f3684645fff R11: 0000000000000000 R12: 0000000000000000
Oct 05 22:56:49.425088 helios kernel: R13: ffffe65e7fffffc0 R14: ffffe5de8796b9e8 R15: 00007f3664800000
Oct 05 22:56:49.425096 helios kernel: FS:  00007f8771a82840(0000) GS:ffff8d6e1fdc0000(0000) knlGS:0000000000000000
Oct 05 22:56:49.425105 helios kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 05 22:56:49.425111 helios kernel: CR2: ffffcd66c54c5d78 CR3: 0000000147daa001 CR4: 0000000000770ee0
Oct 05 22:56:49.425116 helios kernel: PKRU: 55555554
Oct 05 22:56:49.425124 helios kernel: note: umount[2174458] exited with irqs disabled
Oct 05 22:56:49.583101 helios systemd-shutdown[1]: Syncing filesystems and block devices.
Oct 05 22:57:29.585013 helios systemd-shutdown[1]: Syncing filesystems and block devices - timed out, issuing SIGKILL to PID 2174478.
Oct 05 22:57:29.585385 helios systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Oct 05 22:57:29.585415 helios systemd-journald[336]: Received SIGTERM from PID 1 (systemd-shutdow).
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!