Proxmox "freezes" from time to time

pavljiks111

Previously I was running an i7-3770 and it worked 24/7, 365 days a year, with no uptime issues unless I decided to update or reboot it. As I use this host for my lab, its maximum of 32 GB of memory was not enough for me, so I upgraded the hardware to a Ryzen 5 PRO 4650G with 128 GB of ECC RAM (4x32 GB Kingston). It should be even more stable... but it is not.

The system can work for a week or even more (the longest streak was 20 days) but then suddenly stops. On average it runs for 5 days :).

1726481992347.png

How it looks when it stops/freezes:
The system HDD LED stops blinking (no I/O activity).
The server still responds to ping.
Every service/VM stops responding.
I cannot log in via SSH.
On the local console the keyboard still works (the Caps Lock light toggles). I can type a username but I never get to the password prompt. I can switch between terminals.

journalctl -b -1 -xe
Code:
Sep 16 05:24:56 ryzen-vtn-proxmox pvestatd[2099]: status update time (13.866 seconds)
Sep 16 05:25:01 ryzen-vtn-proxmox CRON[3375803]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 16 05:25:01 ryzen-vtn-proxmox CRON[3375804]: (root) CMD (for i in `lsblk | grep disk |grep -v "230" | awk {'print $1>
Sep 16 05:25:01 ryzen-vtn-proxmox CRON[3375803]: pam_unix(cron:session): session closed for user root
Sep 16 05:25:05 ryzen-vtn-proxmox pvestatd[2099]: status update time (8.859 seconds)
Sep 16 05:25:13 ryzen-vtn-proxmox pvestatd[2099]: status update time (7.792 seconds)
Sep 16 05:25:24 ryzen-vtn-proxmox dockerd[1967]: time="2024-09-16T05:25:24.656926330+03:00" level=error msg="[resolver] >
Sep 16 05:25:24 ryzen-vtn-proxmox dockerd[1967]: time="2024-09-16T05:25:24.656929286+03:00" level=error msg="[resolver] >
Sep 16 05:25:29 ryzen-vtn-proxmox pvestatd[2099]: status update time (12.891 seconds)
Sep 16 05:25:36 ryzen-vtn-proxmox pvestatd[2099]: status update time (6.647 seconds)
Sep 16 05:25:55 ryzen-vtn-proxmox pvestatd[2099]: status update time (15.784 seconds)
Sep 16 05:26:07 ryzen-vtn-proxmox pvestatd[2099]: status update time (12.479 seconds)
Sep 16 05:26:24 ryzen-vtn-proxmox pvestatd[2099]: status update time (17.012 seconds)
Sep 16 05:26:32 ryzen-vtn-proxmox pvestatd[2099]: status update time (8.215 seconds)
Sep 16 05:26:50 ryzen-vtn-proxmox pvestatd[2099]: status update time (15.544 seconds)
Sep 16 05:27:01 ryzen-vtn-proxmox pvestatd[2099]: status update time (10.782 seconds)
Sep 16 05:27:23 ryzen-vtn-proxmox pvestatd[2099]: status update time (11.602 seconds)
Sep 16 05:27:31 ryzen-vtn-proxmox pvestatd[2099]: status update time (8.365 seconds)
Sep 16 05:27:50 ryzen-vtn-proxmox pvestatd[2099]: status update time (7.394 seconds)
Sep 16 05:28:01 ryzen-vtn-proxmox pvestatd[2099]: status update time (7.872 seconds)
lines 1059-1107/1107 (END)

And nothing more. Maybe I should look somewhere else?



What I have tried:
Even though this is ECC RAM, I tested it with memtest86: no issues.
I swapped motherboards (both have the latest BIOS and both failed the same way):
  • ASUS PRIME B450M-A II AMD B450
  • ASUS PRO B550M-C/CSM AMD B550
I swapped the CPU with an iGPU for an older one with a discrete graphics card:
  • Ryzen 5 PRO 4650G -> Ryzen 5 2600 + GeForce GT 710
I turned off all possible energy-saving states in the BIOS.

HDD temperatures are below 50 °C. I also installed the latest updates for Proxmox, but the freezes are still the same. More about the config below.

1726481678429.png

1726481770837.png

1726481798354.png
 
Check dmesg, /var/log/messages, the other logs, and the console at the time of the freeze. Why is there a dockerd error in your logs? Check the SMART status for errors as well.

Here is a long list of things to try: https://gist.github.com/dlqqq/876d74d030f80dc899fc58a244b72df0 (install microcode package and disable cstates in boot time)
dmesg - only shows messages since the last boot. I cannot check the previous dmesg and cannot get to a console when it "freezes".
/var/log/messages - is missing. As I understand it, nowadays `journalctl` takes its place.
Code:
Sep 16 05:25:29 ryzen-vtn-proxmox pvestatd[2099]: status update time (12.891 seconds)
Sep 16 05:25:36 ryzen-vtn-proxmox pvestatd[2099]: status update time (6.647 seconds)
Sep 16 05:25:55 ryzen-vtn-proxmox pvestatd[2099]: status update time (15.784 seconds)
Sep 16 05:26:07 ryzen-vtn-proxmox pvestatd[2099]: status update time (12.479 seconds)
Sep 16 05:26:24 ryzen-vtn-proxmox pvestatd[2099]: status update time (17.012 seconds)
Sep 16 05:26:32 ryzen-vtn-proxmox pvestatd[2099]: status update time (8.215 seconds)
Sep 16 05:26:50 ryzen-vtn-proxmox pvestatd[2099]: status update time (15.544 seconds)
Sep 16 05:27:01 ryzen-vtn-proxmox pvestatd[2099]: status update time (10.782 seconds)
Sep 16 05:27:23 ryzen-vtn-proxmox pvestatd[2099]: status update time (11.602 seconds)
Sep 16 05:27:31 ryzen-vtn-proxmox pvestatd[2099]: status update time (8.365 seconds)
Sep 16 05:27:50 ryzen-vtn-proxmox pvestatd[2099]: status update time (7.394 seconds)
Sep 16 05:28:01 ryzen-vtn-proxmox pvestatd[2099]: status update time (7.872 seconds)
-- Boot 882994b2f8ba4efba63ac00eb879cb08 --
Sep 16 11:40:27 ryzen-vtn-proxmox kernel: Linux version 6.8.12-1-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) ()
Sep 16 11:40:27 ryzen-vtn-proxmox kernel: Command line: BOOT_IMAGE=/vmlinuz-6.8.12-1-pve root=ZFS=/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
Sep 16 11:40:27 ryzen-vtn-proxmox kernel: KERNEL supported cpus:
Sep 16 11:40:27 ryzen-vtn-proxmox kernel:   Intel GenuineIntel
Sep 16 11:40:27 ryzen-vtn-proxmox kernel:   AMD AuthenticAMD
Sep 16 11:40:27 ryzen-vtn-proxmox kernel:   Hygon HygonGenuine
Sep 16 11:40:27 ryzen-vtn-proxmox kernel:   Centaur CentaurHauls
Sep 16 11:40:27 ryzen-vtn-proxmox kernel:   zhaoxin   Shanghai
Sep 16 11:40:27 ryzen-vtn-proxmox kernel: BIOS-provided physical RAM map:
Sep 16 11:40:27 ryzen-vtn-proxmox kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009b3ff] usable
Sep 16 11:40:27 ryzen-vtn-proxmox kernel: BIOS-e820: [mem 0x000000000009b400-0x000000000009ffff] reserved
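Kernel messages from the previous boot can also be read from the journal, as long as the journal is persistent (the /var/log/journal directory exists). A minimal sketch, not a definitive recipe: `prev_boot_kernel_log` and `slow_updates` are hypothetical helper names, and the 10-second default threshold is an arbitrary assumption.

```shell
# Kernel ring buffer of the previous boot (what dmesg would have shown),
# limited to priority "warning" and worse.
prev_boot_kernel_log() {
    journalctl -k -b -1 -p warning --no-pager
}

# Hypothetical helper: read journal lines on stdin and print only the
# pvestatd lines whose status update took longer than N seconds (default 10).
slow_updates() {
    awk -v limit="${1:-10}" '
        /status update time \(/ {
            t = $0
            sub(/.*status update time \(/, "", t)   # drop everything before the number
            sub(/ seconds\).*/, "", t)              # drop everything after it
            if (t + 0 > limit + 0) print
        }'
}
```

Possible usage: `journalctl -b -1 -u pvestatd --no-pager | slow_updates 10` shows when the status updates started to slow down before the freeze, and `prev_boot_kernel_log` shows whether the kernel logged anything at that time.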

SMART is normal. The disks are in ZFS mirror pools. No issues with them.
Regarding Docker: I am running 3 Docker containers alongside Proxmox and I don't think that is the cause.


So the only hope is https://gist.github.com/dlqqq/876d74d030f80dc899fc58a244b72df0
Sad that all the power-efficiency effort goes down the drain.

Strange that mine is not a "total freeze". As I said, the console lets me type a username and the host responds to ICMP packets.

And powertop also reports that I am already using at most C2 states:
1726492038594.png
 
Sorry, I meant /var/log/syslog which definitely still exists, messages is for RHEL-based systems. Dmesg gets written to /var/log/kern.log thus you should have /var/log/kern.log and kern.log.1 for the 'old' one.

Yes, AMD is shit at building functional processors; this problem occurs on Windows as well. It's somewhat fixed with firmware hacks, but that means your motherboard/BIOS has to be updated, which not all vendors do, and you have to install the AMD microcode package in Linux, which I'm sure Proxmox provides, being Debian-based (apt install amd64-microcode). If you have the bug and your BIOS/CPU/OS combo doesn't have the proper firmware fixes, you need to disable all C-state switching. The switch is where the problem lies, so you can only have C1. You have to disable it both in your BIOS and in the kernel.
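Limiting C-states on the kernel side is usually done with a boot parameter. A minimal sketch, assuming `processor.max_cstate=1` is the limit you want (the gist linked earlier lists alternatives) and that the host boots either via GRUB or via systemd-boot managed by `proxmox-boot-tool`. The `add_kernel_param` helper is a hypothetical name used only for illustration.

```shell
# Hypothetical helper: append a kernel parameter to a command-line string
# unless it is already present, and print the result.
add_kernel_param() {
    cmdline=$1
    param=$2
    case " $cmdline " in
        *" $param "*) printf '%s\n' "$cmdline" ;;
        *)            printf '%s\n' "$cmdline $param" ;;
    esac
}

# systemd-boot hosts keep the command line in /etc/kernel/cmdline:
#   new=$(add_kernel_param "$(cat /etc/kernel/cmdline)" processor.max_cstate=1)
#   printf '%s\n' "$new" > /etc/kernel/cmdline && proxmox-boot-tool refresh
#
# GRUB hosts: add the parameter to GRUB_CMDLINE_LINUX_DEFAULT in
# /etc/default/grub by hand, then run update-grub.
#
# After a reboot, confirm that the parameter is active:
#   grep -o 'processor.max_cstate=[0-9]*' /proc/cmdline
```

The helper only builds the string, so nothing changes until you write the file and refresh the bootloader yourself.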
 
Thanks for the effort @guruevi, but (at least on Proxmox 8.2.4, aka Debian 12.7) /var/log doesn't contain such files:

Code:
ryzen-vtn-proxmox:/var/log# ls -altr
total 345
drwxr-xr-x   2 root     root                 2 May  5  2023 lxc
drwxr-xr-x   2 root     root                 2 May  7  2023 corosync
drwxr-xr-x   2 root     root                 2 May 24  2023 glusterfs
drwxr-x---   2 root     adm                  2 Oct 10  2023 samba
drwxrws--T   2 ceph     ceph                 2 Jan  9  2024 ceph
drwxr-xr-x  11 root     root                13 Feb  5  2024 ..
-rw-r--r--   1 root     root                 0 Feb  5  2024 faillog
drwxr-xr-x   4 root     root                 4 Mar 10  2024 proxmox-backup
drwxr-xr-x   3 root     root                 3 Mar 10  2024 runit
lrwxrwxrwx   1 root     root                39 Mar 10  2024 README -> ../../usr/share/doc/systemd/README.logs
drwx------   2 root     root                 2 Mar 10  2024 private
drwxr-xr-x   3 root     root                 3 Mar 10  2024 pve
drwxr-sr-x+  3 root     systemd-journal      3 Mar 10  2024 journal
drwxr-x---   2 _chrony  _chrony              2 Mar 10  2024 chrony
-rw-r--r--   1 root     root              3106 Mar 30 10:25 alternatives.log.6.gz
-rw-r--r--   1 root     root             45503 Mar 30 16:46 dpkg.log.6.gz
-rw-r--r--   1 root     root              1641 Apr 14 12:24 dpkg.log.5.gz
-rw-r--r--   1 root     root               176 Apr 29 16:25 alternatives.log.5.gz
-rw-r--r--   1 root     root              5324 May 23 22:36 dpkg.log.4.gz
-rw-r--r--   1 root     root               571 May 29 14:50 alternatives.log.4.gz
-rw-r--r--   1 root     root               174 Jun 19 06:32 alternatives.log.3.gz
-rw-r--r--   1 root     root               472 Jun 25 19:42 dpkg.log.3.gz
-rw-r--r--   1 root     root              5354 Jul  9 09:58 dpkg.log.2.gz
-rw-r--r--   1 root     root               388 Jul 25 22:26 alternatives.log.2.gz
-rw-rw----   1 root     utmp                 0 Aug  1 00:00 btmp.1
-rw-r--r--   1 root     root             22237 Aug 21 15:24 dpkg.log.1
-rw-r--r--   1 root     root              2475 Aug 30 08:48 alternatives.log.1
-rw-rw----   1 root     utmp               384 Sep  7 15:16 btmp
drwxr-xr-x   2 root     root                17 Sep  9 10:48 apt
-rw-r--r--   1 root     root              4858 Sep  9 10:49 fontconfig.log
-rw-r--r--   1 root     root             75427 Sep  9 10:50 dpkg.log
-rw-r-----   1 root     adm              12321 Sep 10 00:00 pve-firewall.log.7.gz
-rw-r-----   1 root     adm                124 Sep 11 00:00 pve-firewall.log.6.gz
drwxr-xr-x   2 root     root                32 Sep 11 05:06 vzdump
-rw-r-----   1 root     adm                125 Sep 12 00:00 pve-firewall.log.5.gz
-rw-r-----   1 root     adm                124 Sep 13 00:00 pve-firewall.log.4.gz
-rw-r-----   1 root     adm                124 Sep 14 00:00 pve-firewall.log.3.gz
drwx------   2 www-data www-data            10 Sep 15 00:00 pveproxy
-rw-r-----   1 root     adm                124 Sep 15 00:00 pve-firewall.log.2.gz
drwxr-xr-x   2 zabbix   zabbix              10 Sep 15 00:00 zabbix-agent
-rw-r-----   1 root     adm                179 Sep 16 00:00 pve-firewall.log.1
drwxr-xr-x  18 root     root                49 Sep 16 00:00 .
-rw-r--r--   1 root     root            256227 Sep 16 01:26 pveam.log
drwxr-xr-x  45 root     root                45 Sep 16 11:40 ifupdown2
-rw-r--r--   1 root     root              8520 Sep 16 11:40 apcupsd.events
-rw-r--r--   1 root     root              2862 Sep 16 11:40 alternatives.log
-rw-r-----   1 root     adm                289 Sep 16 11:40 pve-firewall.log
-rw-rw-r--   1 root     utmp            313344 Sep 16 17:08 wtmp
-rw-rw-r--   1 root     utmp               292 Sep 16 17:08 lastlog

And with the default Proxmox repositories, the microcode package is not installable:
Code:
apt install amd64-microcode
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Package amd64-microcode is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source


E: Package 'amd64-microcode' has no installation candidate
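On Debian 12 the `amd64-microcode` package lives in the `non-free-firmware` component, which a default install may not have enabled; that would explain the "no installation candidate" error. A hedged sketch, assuming the Debian entries are in /etc/apt/sources.list and point at a debian.org mirror. The `enable_component` helper is a hypothetical name, so review its output before writing anything back.

```shell
# Hypothetical helper: read sources.list lines on stdin and append a component
# (default: non-free-firmware) to every active "deb" line for a debian.org
# mirror that does not list it yet. Other lines are printed unchanged.
enable_component() {
    awk -v comp="${1:-non-free-firmware}" '
        /^deb[ \t]/ && /debian\.org/ && index(" " $0 " ", " " comp " ") == 0 {
            $0 = $0 " " comp
        }
        { print }'
}

# Preview the change first:
#   enable_component < /etc/apt/sources.list
# If it looks right, apply it and install the package:
#   cp /etc/apt/sources.list /etc/apt/sources.list.bak
#   enable_component < /etc/apt/sources.list.bak > /etc/apt/sources.list
#   apt update && apt install amd64-microcode
```

Proxmox repository lines are left alone on purpose, since that component only exists in the Debian archives.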

I will have to dig into that more. This article looks more related
It is strange in general, because I have other Ryzen 5/9 setups that have run flawlessly with Proxmox for months, even with default BIOS versions and BIOS settings. But this homelab with ECC etc. grr...
 
To me it looks like some kind of complete I/O failure (not a total freeze).

1) Is there any way to dump kernel panic logs to a remote host in case of such a failure?
2) Is there some kind of watchdog to automatically reboot the server when it becomes unresponsive?
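Both questions have kernel-level options that may be worth trying. A minimal sketch under stated assumptions: netconsole streams kernel messages over UDP to another machine, and the hung-task sysctls turn a long I/O stall into a panic followed by an automatic reboot. All IP addresses, ports, the MAC address, the interface name and the timeout values below are placeholders, and `netconsole_param` and `hang_reboot_sysctl` are hypothetical helper names. Whether the hung-task detector fires for this particular freeze is not guaranteed.

```shell
# Hypothetical helper: build the netconsole module parameter.
# Format: <src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
netconsole_param() {
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Hypothetical helper: print sysctl settings that panic on tasks blocked
# for more than 120 s and reboot 30 s after any panic.
hang_reboot_sysctl() {
    cat <<'EOF'
kernel.hung_task_timeout_secs = 120
kernel.hung_task_panic = 1
kernel.panic = 30
EOF
}

# On the Proxmox host (placeholder addresses; use the interface that
# carries the host IP):
#   modprobe netconsole "$(netconsole_param 6665 192.168.1.10 vmbr0 6666 192.168.1.20 aa:bb:cc:dd:ee:ff)"
#   hang_reboot_sysctl > /etc/sysctl.d/90-hang-reboot.conf && sysctl --system
#
# On the receiving machine, listen on UDP port 6666, for example with
# netcat (the exact flags differ between netcat flavours).
```

netconsole only forwards what the kernel itself logs, so an empty capture would suggest the stall happens without any kernel warning.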
 
