Proxmox 8.3.2 Random Lockups / Need to Reboot

jhavens12

May 15, 2024
Hi All,

I'm pulling my hair out with this one and would greatly appreciate some help. I've looked all over the forums and found many people with similar issues, but nothing has stopped my two hosts from randomly locking up and becoming unresponsive. The hosts show an "unknown" status and the VMs display only their ID numbers and grey question marks. I have to restart the host to regain access and get my VMs back up and running.

I started with two identical hosts (TSP31 and TSP32): Lenovo ThinkStation P3 Towers, each with an i9-13900K, 128GB RAM, 2x NVMe drives (1x 2TB Samsung 990 as boot drive, 1x 4TB Samsung 990 as ZFS VM storage), and 1x X550-T2 NIC. The two hosts are in a cluster so I can migrate VMs back and forth to perform maintenance. One host can handle all VMs without issue. I also have a Synology DS918+, which is used for VM backups and also has some fileshares that the VMs access.

What Happens:

The hosts run fine for a week or two without issues, and the interface is responsive. After about a week I get alerts that random services within the VMs are unresponsive, though they sometimes come back online after a few minutes. When I check the GUI, sometimes I can get in and sometimes I cannot get in at all. The status usually shows the VMs with grey question marks and without their names, and I'm unable to do anything with the VMs, such as restarting them, but I can often SSH into about half of them, seemingly at random. I can SSH into the host and check on or restart services, but I have never gotten the grey question marks to go away without a full reboot. Sometimes I've noticed that the backup job is running and doesn't show a status. I tried removing the backup job and mount point, but the issue persists.
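
Since SSH to the host still works during the lockup, a snapshot of the host's state could be captured before rebooting. The following is only a sketch under assumptions: `filter_blocked` and `snapshot_hang` are hypothetical helper names, and the guess behind it is that tasks stuck in uninterruptible sleep (state D) or kernel messages mentioning NVMe timeouts would point at storage. `snapshot_hang` is only defined here, not run.

```shell
#!/bin/sh
# Filter "STATE PID COMMAND" lines (as printed by `ps -eo state=,pid=,comm=`)
# down to tasks in uninterruptible sleep (state D), which usually means they
# are blocked waiting on I/O. Prints "PID COMMAND" for each match.
filter_blocked() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Hypothetical snapshot routine to run over SSH while the GUI shows grey
# question marks. The unit names are the standard Proxmox VE services.
snapshot_hang() {
    echo "== tasks blocked on I/O =="
    ps -eo state=,pid=,comm= | filter_blocked
    echo "== service status =="
    systemctl --no-pager status pvestatd pve-cluster corosync
    echo "== kernel messages from this boot =="
    journalctl --no-pager -k -b | grep -i -E 'nvme|blocked for more than|timeout'
}
```

Running `snapshot_hang > /root/hang-$(date +%s).txt 2>&1` during a lockup would keep the output for comparison across incidents.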

What I have tried:

- Installing QDevice (sudo apt install corosync-qnetd corosync-qdevice), which has helped keep quorum but does not solve the issue
- Upgrading firmware on all 4 samsung 990 drives
- Replacing boot 990 drive with Corsair M drive. Fresh install and migrated config with same result
- Upgrading kernel to 6.11 on both hosts with same result
- Disabling all metric servers and external shares (backup share)
- Changing GRUB to include option 'nvme_core.default_ps_max_latency_us=0' which made the freeze seemingly happen faster (at least the first time so far)
- Checking that TRIM is enabled and checking the SMART values to see if the drives are worn out or full
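
To confirm that the GRUB option from the list above actually reached the running kernel, a check along these lines may help. This is a minimal sketch: `has_kernel_param` is a hypothetical helper name, and it assumes the option appears verbatim as one token in `/proc/cmdline`.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the exact parameter appears as a token in a
# kernel command line file (defaults to /proc/cmdline on a running host).
has_kernel_param() {
    param=$1
    cmdline_file=${2:-/proc/cmdline}
    # Split the command line into one token per line, then match the whole line.
    tr ' ' '\n' < "$cmdline_file" | grep -Fxq -- "$param"
}

if has_kernel_param 'nvme_core.default_ps_max_latency_us=0'; then
    echo "nvme_core option is active on the running kernel"
else
    echo "nvme_core option not found; was the boot config regenerated and the host rebooted?"
fi
```

The second argument exists only so the helper can be pointed at a saved copy of the command line instead of the live one.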

I'm really getting frustrated with this. I know the drives are consumer drives, but should I really expect Proxmox to not even be usable on them? Any help is appreciated. Thanks

Information:

root@TSP31:~# pveversion
pve-manager/8.3.2/3e76eec21c4a14a7 (running kernel: 6.11.0-2-pve)


root@TSP31:~# pvecm status
Cluster information
-------------------
Name: TSP3s
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Jan 2 09:16:26 2025
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000002
Ring ID: 1.1b0
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 10.0.8.110
0x00000002          1    A,V,NMW 10.0.8.120 (local)
0x00000000          1            Qdevice
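
The "Expected votes: 3" and "Quorum: 2" values in this output follow from simple majority arithmetic. A small sketch, assuming the usual majority rule of floor(n / 2) + 1 votes (`quorum_for` is a hypothetical helper name):

```shell
#!/bin/sh
# Majority quorum for a given number of expected votes: floor(n / 2) + 1.
quorum_for() {
    echo $(( $1 / 2 + 1 ))
}

quorum_for 2   # two nodes, no QDevice
quorum_for 3   # two nodes plus the QDevice vote
```

Both calls print 2. With only two votes, losing either node drops the cluster below quorum, while with the QDevice's third vote one node can go down and the survivor still holds 2 of 3.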

root@TSP31:~# nvme smart-log /dev/nvme1n1
Smart Log for NVME device:nvme1n1 namespace-id:ffffffff
critical_warning : 0
temperature : 47°C (320 Kelvin)
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 12%
endurance group critical warning summary: 0
Data Units Read : 163,456,463 (83.69 TB)
Data Units Written : 86,328,800 (44.20 TB)
host_read_commands : 3,232,395,959
host_write_commands : 1,869,446,170
controller_busy_time : 7,337
power_cycles : 42
power_on_hours : 3,528
unsafe_shutdowns : 24
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 47°C (320 Kelvin)
Temperature Sensor 2 : 49°C (322 Kelvin)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0

root@TSP31:~# nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 49°C (322 Kelvin)
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 5%
endurance group critical warning summary: 0
Data Units Read : 144,870,895 (74.17 TB)
Data Units Written : 97,045,185 (49.69 TB)
host_read_commands : 957,727,646
host_write_commands : 928,126,623
controller_busy_time : 4,733
power_cycles : 30
power_on_hours : 3,303
unsafe_shutdowns : 17
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 49°C (322 Kelvin)
Temperature Sensor 2 : 59°C (332 Kelvin)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
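
As a sanity check on the write totals above, the raw "Data Units" counters can be converted by hand. This is a minimal sketch, assuming the NVMe convention that one data unit is 1,000 blocks of 512 bytes (512,000 bytes); `data_units_to_tb` is a hypothetical helper name.

```shell
#!/bin/sh
# Convert an NVMe "Data Units" count (commas allowed) to decimal terabytes.
# One data unit = 1000 * 512 bytes = 512,000 bytes.
data_units_to_tb() {
    printf '%s\n' "$1" | tr -d ',' | awk '{ printf "%.2f\n", $1 * 512000 / 1e12 }'
}

data_units_to_tb 86,328,800    # nvme1n1 Data Units Written
data_units_to_tb 97,045,185    # nvme0n1 Data Units Written
```

This prints 44.20 and 49.69, matching the TB figures that nvme-cli shows in parentheses.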