Random freeze of Proxmox

Nicox

Feb 20, 2022

Hello,

I need your help with an issue on my Proxmox server.
Here is my configuration:

CPU: AMD Ryzen 9 3900X
Motherboard: Gigabyte B450M DS3H
RAM: 4x 32 GB
GPU: GeForce GT 710 (only used for console output)
Disk: SSD (no RAID or anything similar)

Proxmox: 8.3.3
Kernel: Linux 6.8.12-8-pve

I have had this issue for a long time, probably since last summer. Roughly once a month, my Proxmox host freezes completely: the web UI is unreachable and SSH connections fail. Only ping still answers. Most of the VMs become unavailable, while one or two keep working normally.

Here is what I have already tried:

  • Changed the kernel version
  • Updated the BIOS
  • Replaced the SSD
  • Changed BIOS settings:
    • Disabled "Global C-State Control"
    • Disabled power optimization (I don't remember the exact name, but I disabled the features that reduce power at low load, among others)
  • Updated Proxmox

I am running out of ideas. When the freeze (or crash) happens, there seem to be no particular log entries before my manual reboot that could help me find out what is happening. Here, the freeze happened at about 05:00 and I rebooted manually at about 09:00:

Code:
Apr 10 02:17:01 sgc CRON[1450196]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Apr 10 02:17:01 sgc CRON[1450197]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Apr 10 02:17:01 sgc CRON[1450196]: pam_unix(cron:session): session closed for user root
Apr 10 02:30:54 sgc smartd[1119]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 55 to 56
Apr 10 03:00:54 sgc smartd[1119]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 56 to 55
Apr 10 03:10:01 sgc CRON[1460175]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Apr 10 03:10:01 sgc CRON[1460176]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Apr 10 03:10:01 sgc CRON[1460175]: pam_unix(cron:session): session closed for user root
Apr 10 03:17:01 sgc CRON[1461484]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Apr 10 03:17:01 sgc CRON[1461485]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Apr 10 03:17:01 sgc CRON[1461484]: pam_unix(cron:session): session closed for user root
Apr 10 03:30:54 sgc smartd[1119]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 53 to 54
Apr 10 04:00:54 sgc smartd[1119]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 54 to 53
Apr 10 04:17:01 sgc CRON[1472691]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Apr 10 04:17:01 sgc CRON[1472692]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Apr 10 04:17:01 sgc CRON[1472691]: pam_unix(cron:session): session closed for user root
Apr 10 04:30:54 sgc smartd[1119]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 55 to 56
Apr 10 05:00:55 sgc smartd[1119]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 53 to 54
-- Boot 66f0503b5c4843f98ef7cdba37695d8f --
Apr 10 09:05:20 sgc kernel: Linux version 6.8.12-8-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-8 (2025-01-24T12:32Z) ()
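To pin down freeze times from logs like the one above, a small script can report the last entry written before each boot marker and the first entry after it. This is a minimal Python sketch, assuming the default `journalctl` output format shown above saved to a text file, with the year passed separately because that format omits it. The function name `boot_gaps` and the script name are hypothetical.

```python
import re
import sys
from datetime import datetime

# journalctl's default "short" format starts lines with e.g. "Apr 10 05:00:55" (no year).
TS_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) ")
BOOT_MARKER = "-- Boot "

def boot_gaps(lines, year):
    """Return (last entry before a boot marker, first entry after it) for each boot.

    Assumes all entries fall in the given year (a Dec -> Jan rollover is not handled).
    """
    gaps = []
    last_ts = None
    pending = None  # last timestamp seen before a boot marker, waiting for the next entry
    for line in lines:
        if line.startswith(BOOT_MARKER):
            pending = last_ts
            continue
        match = TS_RE.match(line)
        if not match:
            continue  # skip lines without a timestamp prefix
        ts = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        if pending is not None:
            gaps.append((pending, ts))
            pending = None
        last_ts = ts
    return gaps

if __name__ == "__main__":
    path, year = sys.argv[1], int(sys.argv[2])
    with open(path, encoding="utf-8", errors="replace") as fh:
        for before, after in boot_gaps(fh, year):
            print(f"last entry {before} | next boot {after} | gap {after - before}")
```

For example, save the journal with `journalctl --since "2025-04-09" > journal.txt` and run `python3 boot_gaps.py journal.txt 2025`. The freeze happened at or after the "last entry" time; if the hang prevented the journal from being written to disk, the real freeze time may be later.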

Do you know how I can troubleshoot this?
It is really strange that one or two VMs keep working while the Proxmox host does not even respond to the web UI or SSH.
It seems (I'm not sure) that this issue started when I upgraded from Proxmox 7 to 8 last summer. This setup had been running fine for years before that.

Thank you in advance
 
Any ideas? Some VMs used the "host" CPU type, so I switched all of them to the default CPU type, except one VM that stays on host (I don't know why, but it doesn't work with any other CPU type). I just had another freeze today.
 
Yesterday I:

  1. Upgraded to Proxmox 8.4
  2. Changed all VMs except one from host to kvm64
It seems one of these changes made things much worse: the host now freezes about once a day, instead of once a month before.
I don't really know what to do now. The problem seems linked to the VMs' CPU type, so I'll switch all kvm64 VMs to the host CPU type to check.
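Since the freezes seem linked to the VMs' CPU type, it helps to list which type each VM actually uses. This is a minimal Python sketch, assuming it runs on the Proxmox node where VM configs are stored as `/etc/pve/qemu-server/<vmid>.conf`. Treating a missing `cpu:` line as kvm64 is an assumption based on the documented default; verify it against the qm.conf documentation for your version.

```python
from pathlib import Path

# Assumption: VM configs are stored as /etc/pve/qemu-server/<vmid>.conf on this node.
CONF_DIR = Path("/etc/pve/qemu-server")
# Assumption: with no "cpu:" line the VM falls back to kvm64 (check the qm.conf docs).
DEFAULT_CPU = "kvm64 (assumed default)"

def cpu_type(conf_text):
    """Return the CPU type from the live section of a VM config (not its snapshots)."""
    for line in conf_text.splitlines():
        if line.startswith("["):
            break  # sections such as "[snapshot-name]" follow the live config
        if line.startswith("cpu:"):
            value = line.split(":", 1)[1].strip()
            # Value looks like "host", "host,flags=+aes" or "cputype=kvm64,flags=+aes".
            for part in value.split(","):
                key, sep, val = part.partition("=")
                if not sep:
                    return key.strip()
                if key.strip() == "cputype":
                    return val.strip()
    return DEFAULT_CPU

def main():
    for conf in sorted(CONF_DIR.glob("*.conf")):
        print(f"VM {conf.stem}: {cpu_type(conf.read_text())}")

if __name__ == "__main__":
    main()
```

Run it as root on the node; each output line shows a VM ID and its CPU type, which makes it easy to confirm that a change like "all VMs to host" was actually applied everywhere.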
 
Another freeze today. I have run out of troubleshooting ideas. Does anyone have an idea how to troubleshoot this issue?
 
I recommend testing this machine with Memtest86+. Before running the test, remove all disks so that you can focus on the motherboard, CPU, and memory. Run at least one full pass to make sure all major components are stable. If it fails, consider removing two of the 32 GB DIMMs and retesting, or replacing the power supply. Then add components back one at a time to find the one causing this.
 
I successfully switched all VMs to the host CPU type.
After this change the system ran for three weeks, but I still got a freeze today.

I would like to run Memtest86+, but that means my services would be down for too long.
Replacing one component at a time would cost me a lot of money just to find out which one is failing.
 
Hi!

Do the following:
- Disable SMT in the BIOS.
- Disable all power-saving options in the BIOS (you may have done this already?).
- Set the power profile to "High Performance" in the BIOS (check all of the settings and change them from "AUTO" to an explicit "DISABLE" or "ENABLE").

EDIT: Add kernel options that disable power saving to "/etc/default/grub", then run update-grub and reboot:
Code:
GRUB_CMDLINE_LINUX="... pci=realloc=off pcie_port_pm=off pcie_aspm.policy=performance nox2apic"
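Once the host has been rebooted with the new options, one way to confirm they are actually active is to compare them with `/proc/cmdline`, which holds the command line the running kernel was booted with. This is a minimal Python sketch assuming the four options from the line above; the `...` placeholder for pre-existing options is deliberately not checked, and the function name is hypothetical.

```python
from pathlib import Path

# Options suggested above; the "..." placeholder for existing options is not checked.
WANTED = ["pci=realloc=off", "pcie_port_pm=off", "pcie_aspm.policy=performance", "nox2apic"]

def missing_params(cmdline, wanted=WANTED):
    """Return the wanted options that do not appear as whole tokens in the kernel command line."""
    tokens = set(cmdline.split())
    return [opt for opt in wanted if opt not in tokens]

if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    missing = missing_params(Path("/proc/cmdline").read_text())
    print("all options active" if not missing else f"missing: {' '.join(missing)}")
```

If options are reported missing, the edited file was most likely not picked up by the bootloader the host actually uses.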
 
It seems unusual that this happens at exactly 5 AM. Are any tasks (such as backups) scheduled at that time? Perhaps an external event?
Keep a record of the times when the freezes (inaccessible UI) occur.

I am curious about the host CPU type, since you say it made your setup more stable.

This thread has more details about choosing the host CPU type and its implications; it may be worth exploring.

You don't mention what resources your VMs are using; that is another thing to consider. Is the load or CPU usage higher when the crash happens?
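To record the freeze times precisely, one option is to probe the host from another machine. Since ping still answers during a freeze while SSH and the web UI do not, this hypothetical Python sketch tries TCP connections to port 22 (SSH) and 8006 (the Proxmox web UI) and logs a timestamp whenever reachability changes. The host address and polling interval are placeholders; ICMP ping is left out because it usually needs raw-socket privileges.

```python
import socket
import time
from datetime import datetime

def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, host unreachable, ...
        return False

def watch(host, ports=(22, 8006), interval=60, rounds=None, log=print):
    """Poll the ports and log a timestamped line whenever the reachability state changes.

    rounds=None polls forever; a number limits the number of polls.
    """
    previous = None
    done = 0
    while rounds is None or done < rounds:
        state = {port: is_reachable(host, port) for port in ports}
        if state != previous:
            stamp = datetime.now().isoformat(timespec="seconds")
            log(f"{stamp} {host} {state}")
            previous = state
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)

if __name__ == "__main__":
    # Hypothetical address: replace with the Proxmox host's IP.
    watch("192.0.2.10")
```

Left running on a separate machine, the log shows when SSH and the web UI stopped answering, which can then be compared with scheduled jobs and with the CPU load at that time.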
 
Hello,

All those options are already disabled in the BIOS, but I will double-check.

I found two interesting things:
- The CPU type seems to help a lot, but the host type (the best one so far) doesn't completely fix the issue.
- It often happens early in the morning (today it was at 5:30 AM).

I run backups at night, but all of them had finished by 4:30, so they don't coincide with the freezes.

As for the load, since it is the middle of the night, it is really low when the freeze happens.

I'm not sure, but this seems related to the CPU (given the effect of the CPU type).
I don't know how to proceed from here.
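To check whether the freezes really cluster early in the morning, and how far they fall from the end of the nightly backups, the recorded freeze times can be tallied. This is a minimal Python sketch; the function names are hypothetical, the input list must come from your own notes or the journal, and 04:30 is only the backup end time mentioned above. Datetimes are naive local times, so DST changes are ignored.

```python
from collections import Counter
from datetime import datetime, time

def hour_histogram(freeze_times):
    """Count freezes per hour of day (0-23)."""
    return Counter(t.hour for t in freeze_times)

def minutes_after_daily_event(freeze_time, event_time):
    """Minutes elapsed since the most recent daily occurrence of event_time."""
    event = datetime.combine(freeze_time.date(), event_time)
    minutes = (freeze_time - event).total_seconds() / 60
    # Negative means the event had not happened yet that day: measure from the previous day.
    return minutes if minutes >= 0 else minutes + 24 * 60

if __name__ == "__main__":
    # Hypothetical input: replace with the freeze times you actually recorded.
    freezes = [datetime(2025, 4, 10, 5, 0)]
    backups_done = time(4, 30)  # backup end time mentioned in the thread
    print(sorted(hour_histogram(freezes).items()))
    for f in freezes:
        print(f, f"{minutes_after_daily_event(f, backups_done):.0f} min after backups finished")
```

A consistent offset from the backup end, or a tight cluster in one hour, would point towards a scheduled job or an external event rather than a purely random hardware fault.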