Full system hang every few days

j1a2o

Member
Feb 14, 2021
Nothing is responsive, not even the power button on my machine. I have to pull the plug in order to restart the machine. Everything was working fine until about a week ago.

Nothing shows up in /var/log/syslog.

pve-manager/6.3-6/2184247e (running kernel: 5.4.103-1-pve)
ZFS mirror

Machine is a Ryzen 4650G, MSI B450i motherboard

Everything was stable until I did the Proxmox update that went from ZFS 0.8.5 to ZFS 2.0. I'm highly suspicious of that update.

Anyone know what I should do next?
 
Did you also check "/var/log/syslog.1"?

If even the power button isn't working anymore, it sounds more like a hardware problem.
 
Did you also check "/var/log/syslog.1"?

If even the power button isn't working anymore, it sounds more like a hardware problem.
I actually meant that syslog didn't show anything meaningful. The log entries between when it froze and when I rebooted it were these:

Code:
Mar 17 16:39:00 pve systemd[1]: pvesr.service: Succeeded.
Mar 17 16:39:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 17 16:40:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 17 16:40:00 pve systemd[1]: pvesr.service: Succeeded.
Mar 17 16:40:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 17 16:41:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 17 16:41:00 pve systemd[1]: pvesr.service: Succeeded.
Mar 17 16:41:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 17 16:42:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 17 16:42:00 pve systemd[1]: pvesr.service: Succeeded.
Mar 17 16:42:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 17 16:43:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 17 16:43:00 pve systemd[1]: pvesr.service: Succeeded.
Mar 17 16:43:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 17 16:44:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 17 16:44:00 pve systemd[1]: pvesr.service: Succeeded.
Mar 17 16:44:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 17 16:45:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 17 16:45:00 pve systemd[1]: pvesr.service: Succeeded.
Mar 17 16:45:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 17 16:45:01 pve CRON[11441]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Mar 17 16:46:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 17 16:46:00 pve systemd[1]: pvesr.service: Succeeded.
Mar 17 16:46:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 17 16:58:53 pve systemd[1]: Started Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
Mar 17 16:58:53 pve kernel: [    0.000000] Linux version 5.4.103-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.103-1 (Sun, 07 Mar 2021 15:55:09 +0100) ()
Mar 17 16:58:53 pve kernel: [    0.000000] Command line: initrd=\EFI\proxmox\5.4.103-1-pve\initrd.img-5.4.103-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs amdgpu.exp_hw_support=1
Mar 17 16:58:53 pve kernel: [    0.000000] KERNEL supported cpus:
Mar 17 16:58:53 pve kernel: [    0.000000]   Intel GenuineIntel
Mar 17 16:58:53 pve kernel: [    0.000000]   AMD AuthenticAMD
Mar 17 16:58:53 pve kernel: [    0.000000]   Hygon HygonGenuine
Mar 17 16:58:53 pve kernel: [    0.000000]   Centaur CentaurHauls
Mar 17 16:58:53 pve kernel: [    0.000000]   zhaoxin   Shanghai
Mar 17 16:58:53 pve kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Mar 17 16:58:53 pve kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Mar 17 16:58:53 pve kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Mar 17 16:58:53 pve kernel: [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Mar 17 16:58:53 pve kernel: [    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
Mar 17 16:58:53 pve kernel: [    0.000000] BIOS-provided physical RAM map:

If it's a hardware problem, the timing would be quite a coincidence, since the hangs started right after the Proxmox update from about 1-2 weeks ago...
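For anyone who wants to pin down the exact freeze windows, here is a rough Python sketch that scans syslog-style lines for long silences, like the jump from 16:46:00 to 16:58:53 in the excerpt above. The names `parse_ts` and `find_gaps`, the 5-minute threshold, and the hard-coded year are my own assumptions (syslog timestamps carry no year), and it only finds gaps; it can't tell a hang from a clean shutdown.

```python
import re
from datetime import datetime, timedelta

# Classic syslog prefix, e.g. "Mar 17 16:46:00 pve systemd[1]: ..."
TS_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s")


def parse_ts(line, year):
    """Return the line's timestamp, or None if it has no syslog prefix."""
    m = TS_RE.match(line)
    if m is None:
        return None
    # Syslog timestamps carry no year, so the caller has to supply one.
    # %b expects English month abbreviations (C/English locale assumed).
    return datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")


def find_gaps(lines, year=2021, min_gap=timedelta(minutes=5)):
    """Return (last_before, first_after) pairs for every silence >= min_gap."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue  # skip lines without a recognizable timestamp
        if prev is not None and ts - prev >= min_gap:
            gaps.append((prev, ts))
        prev = ts
    return gaps


# A few lines copied from the excerpt above.
sample = [
    "Mar 17 16:46:00 pve systemd[1]: Starting Proxmox VE replication runner...",
    "Mar 17 16:46:00 pve systemd[1]: pvesr.service: Succeeded.",
    "Mar 17 16:46:00 pve systemd[1]: Started Proxmox VE replication runner.",
    "Mar 17 16:58:53 pve systemd[1]: Started Monitoring of LVM2 mirrors, "
    "snapshots etc. using dmeventd or progress polling.",
]
for before, after in find_gaps(sample):
    print(f"silent for {after - before}: {before} -> {after}")
```

Since pvesr logs once a minute on this box, any silence longer than a few minutes is suspicious, which is why a fixed threshold is probably good enough here. To run it over a real log, pass the lines of the file to `find_gaps` instead of `sample`.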
 
I also have Telegraf logging to InfluxDB, and I don't see any signs of resource issues right before it hangs. Across the 3 hangs so far, CPU load was between 40-60%, CPU temperatures were between 50-60 °C, and memory utilization was about 80%.