Proxmox host very slow, VM cannot run

Mar 29, 2019
I have a Proxmox server that frequently locks up and stops responding. Running gzip on it is very slow, the Proxmox GUI does not load, and the VM won't start.
This is a no-subscription Proxmox install; I also have a server with a subscription that runs fine on older hardware.
The server is configured with ZFS RAID1.
Is there a command I can run to check for hardware issues or kernel bugs?



Code:
# qm start 105

start failed: command '/usr/bin/kvm -id 105 -name 'Onlyoffice,debug-threads=on' -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/105.qmp,server=on,wait=off' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/105.pid -daemonize -smbios 'type=1,uuid=8be265ec-f0ed-426a-8411-b97445d98a34' -smp '4,sockets=1,cores=4,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc 'unix:/var/run/qemu-server/105.vnc,password=on' -cpu kvm64,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep -m 8192 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'vmgenid,guid=65cd6037-e05c-4a23-b218-fe5b39c1dc83' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:3e89ea50538' -drive 'file=/var/lib/vz/template/iso/ubuntu-20.04.4-live-server-amd64.iso,if=none,id=drive-ide2,media=cdrom,aio=io_uring' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=101' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/zvol/rpool/data/vm-105-disk-0,if=none,id=drive-scsi0,cache=writeback,discard=on,format=raw,aio=io_uring,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap105i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=5A:39:78:08:41:50,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=102' -machine 'type=pc+pve0'' failed: got timeout


# pveversion

pve-manager/7.2-11/b76d3178 (running kernel: 5.15.60-2-pve)

Code:
Oct 12 20:33:47 proxmox03 kernel: [13150.454656] perf: interrupt took too long (49138 > 47071), lowering kernel.perf_event_max_sample_rate to 4000
Oct 12 21:15:28 proxmox03 kernel: [15651.445240] perf: interrupt took too long (63063 > 61422), lowering kernel.perf_event_max_sample_rate to 3000
Oct 12 22:32:40 proxmox03 kernel: [20283.221147] device tap105i0 entered promiscuous mode
Oct 12 22:32:41 proxmox03 kernel: [20283.973004] vmbr0: port 2(tap105i0) entered blocking state
Oct 12 22:32:41 proxmox03 kernel: [20284.016597] vmbr0: port 2(tap105i0) entered disabled state
Oct 12 22:32:41 proxmox03 kernel: [20284.061796] vmbr0: port 2(tap105i0) entered blocking state
Oct 12 22:32:41 proxmox03 kernel: [20284.094847] vmbr0: port 2(tap105i0) entered forwarding state
Oct 12 22:32:55 proxmox03 kernel: [20298.799229] vmbr0: port 2(tap105i0) entered disabled state
Oct 12 22:32:56 proxmox03 kernel: [20299.668787]  zd96: p1 p2 p3
Oct 12 22:34:26 proxmox03 kernel: [20389.403732] perf: interrupt took too long (78927 > 78828), lowering kernel.perf_event_max_sample_rate to 2500
Code:
[  180.143609] ================================================================================
[  180.143841] UBSAN: array-index-out-of-bounds in drivers/scsi/megaraid/megaraid_sas_fp.c:103:32
[  180.143934] index 1 is out of range for type 'MR_LD_SPAN_MAP [1]'
[  180.144024] CPU: 0 PID: 587 Comm: kworker/0:3 Not tainted 5.15.60-2-pve #1
[  180.144160] Hardware name: GIGABYTE R182-M80-00/MR92-FS1-00, BIOS F09 10/28/2021
[  180.144346] Workqueue: events work_for_cpu_fn
[  180.144575] Call Trace:
[  180.144666]  <TASK>
[  180.144805]  dump_stack_lvl+0x4a/0x63
[  180.144988]  dump_stack+0x10/0x16
[  180.145080]  ubsan_epilogue+0x9/0x49
[  180.145173]  __ubsan_handle_out_of_bounds.cold+0x44/0x49
[  180.145728]  mr_update_load_balance_params+0xbe/0xd0 [megaraid_sas]
[  180.146140]  MR_ValidateMapInfo+0x1f0/0xe50 [megaraid_sas]
[  180.147298]  ? __bpf_trace_tick_stop+0x20/0x20
[  180.147807]  ? wait_and_poll+0x59/0xc0 [megaraid_sas]
[  180.148591]  ? megasas_issue_polled+0x5d/0x70 [megaraid_sas]
[  180.149057]  megasas_init_adapter_fusion+0xb11/0xc90 [megaraid_sas]
[  180.149611]  megasas_probe_one.cold+0xbfa/0x195d [megaraid_sas]
[  180.150077]  ? finish_task_switch.isra.0+0x7e/0x2b0
[  180.150397]  local_pci_probe+0x48/0x90
[  180.150534]  work_for_cpu_fn+0x17/0x30
[  180.150671]  process_one_work+0x228/0x3d0
[  180.150901]  worker_thread+0x223/0x420
[  180.150993]  ? process_one_work+0x3d0/0x3d0
[  180.151085]  kthread+0x127/0x150
[  180.151269]  ? set_kthread_struct+0x50/0x50
[  180.151364]  ret_from_fork+0x1f/0x30
[  180.151546]  </TASK>
 
[  180.143841] UBSAN: array-index-out-of-bounds in drivers/scsi/megaraid/megaraid_sas_fp.c:103:32
[  180.143934] index 1 is out of range for type 'MR_LD_SPAN_MAP [1]'
[  180.144024] CPU: 0 PID: 587 Comm: kworker/0:3 Not tainted 5.15.60-2-pve #1
[  180.144160] Hardware name: GIGABYTE R182-M80-00/MR92-FS1-00, BIOS F09 10/28/2021
This looks like it could be https://bugzilla.kernel.org/show_bug.cgi?id=215943, for which a fix has been applied for the upcoming 6.1 kernel. We can look into how feasible it would be to backport it to the current stable 5.15 kernel.
 
The backport applied cleanly and compiled. If you want to test it, download it from
http://download.proxmox.com/temp/pve-kernel-5.15-megaraid-ubsan/

For example
Bash:
wget http://download.proxmox.com/temp/pve-kernel-5.15-megaraid-ubsan/pve-kernel-5.15.60-2-megaraid-fix-pve_5.15.60-3_amd64.deb

# verify checksum
sha256sum pve-kernel-5.15.60-2-megaraid-fix-pve_5.15.60-3_amd64.deb
5f064a9a881e1f0c20127a03789092ab7616dfe1abbda3474afd9a7edfcb073c  pve-kernel-5.15.60-2-megaraid-fix-pve_5.15.60-3_amd64.deb

apt install ./pve-kernel-5.15.60-2-megaraid-fix-pve_5.15.60-3_amd64.deb

# reboot

It would be good to know whether the warnings are gone afterwards and, ideally, whether the "very slow" symptom is fixed too.
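To check whether the warning is gone after rebooting into the test kernel, something like the following could help. It is only a sketch: `count_megaraid_ubsan` is a made-up helper name (not a Proxmox tool), and the pattern assumes the warning keeps the same "UBSAN: ... megaraid" wording as in the log above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count UBSAN reports that mention the megaraid driver
# in a kernel log read from stdin. Prints 0 when there are none.
count_megaraid_ubsan() {
    # grep -c prints the count but exits non-zero on zero matches,
    # so "|| true" keeps the function from reporting failure in that case.
    grep -c 'UBSAN: .*megaraid' || true
}

# Usage on the host after the reboot (as root):
#   uname -r                      # confirm which kernel is actually running
#   dmesg | count_megaraid_ubsan  # 0 means no such warning in this boot's log
```

A count of 0 only says the UBSAN report did not show up in the current kernel log; it does not by itself prove that the slowness is fixed.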

The server is configured with ZFS RAID1.
Oh, and are you sure you followed our recommendation and are running any HW RAID controller in pass-through/HBA mode?
IOW, no HW RAID below ZFS?
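One rough way to start checking this is to see whether the host exposes a RAID-class PCI controller at all and then compare that with the devices the pool is built on. The sketch below assumes `lspci` (from pciutils) is installed; `list_raid_controllers` is a hypothetical helper name. A match only shows that such a controller is present, not which mode it is in or whether the ZFS disks sit behind it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print lspci-style lines (read from stdin) whose
# device class is "RAID bus controller". Prints nothing if there are none.
list_raid_controllers() {
    grep -i 'RAID bus controller' || true
}

# Usage on the host:
#   lspci | list_raid_controllers   # any RAID-class controllers present?
#   zpool status                    # which devices make up the pool
#   lsblk -d -o NAME,MODEL,SIZE     # disk model strings as the kernel sees them
```

If the disks show up with the controller's virtual-drive model string instead of the real drive model, that would suggest they are not passed through, but that reading is an assumption and depends on the controller.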
 
I found that the LED indicator on one of the power supplies is amber. Maybe the faulty power supply caused the CPU to throttle down. I unplugged the faulty power supply, and Proxmox now runs fine.
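For anyone who wants to check the throttling guess on their own host, here is a small sketch. It assumes an x86 host where /proc/cpuinfo has "cpu MHz" lines; `cpu_mhz_summary` is a made-up helper name. The `ipmitool` lines in the comments assume ipmitool is installed and the board has a BMC.

```bash
#!/usr/bin/env bash
# Hypothetical helper: summarise the "cpu MHz" lines of /proc/cpuinfo-style
# input on stdin as the number of logical CPUs plus the lowest and highest clock.
cpu_mhz_summary() {
    awk -F': *' '
        /^cpu MHz/ {
            v = $2 + 0
            if (n == 0 || v < min) min = v
            if (n == 0 || v > max) max = v
            n++
        }
        END {
            if (n) printf "cpus=%d min=%.0f max=%.0f\n", n, min, max
            else   print "no cpu MHz lines found"
        }'
}

# Usage on the host:
#   cpu_mhz_summary < /proc/cpuinfo      # run it while the host is under load
#   ipmitool sdr type "Power Supply"     # PSU sensor status as seen by the BMC
#   ipmitool sel elist                   # BMC event log, may show PSU events
```

If every CPU stays near its lowest clock even under load, that would fit the throttling theory, but it is only a hint and not proof that the power supply is the cause.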
 
