Proxmox Host Crashing Daily - Debugging Tips

sterlinm

Jun 13, 2024
Hi folks!

I've got a Proxmox host that is freezing pretty much on a daily basis, and I'm looking for advice on how to get started debugging it. I've tried looking at the logs and don't see an obvious answer, but I'm not sure I know what to look for.

I've got the host connected to a KVM. When it freezes, the host and the VMs on it disappear from my network. On the KVM I can still see the login prompt ("Welcome to the Proxmox Virtual Environment"), but it's unresponsive and I can't log in, even if I attach a physical keyboard. I also ran memtest to check for a RAM issue, and that passed without complaint.

I'm happy to provide more information about the machine, logs, etc. if that would help.

Some of the questions I have in mind are:
  • What logs should I be looking at?
  • What sorts of issues could cause the machine to freeze but not just completely crash?
  • What would be the recommended debugging steps for isolating the problem?
Any advice would be appreciated, thanks!

Update: Providing some additional information based on @esi_y 's comments. journalctl output is attached, and here are some more details on the hardware.

Screenshot 2024-08-17 at 1.51.54 PM.png

Bash (lscpu output):
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          43 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   32
  On-line CPU(s) list:    0-31
Vendor ID:                AuthenticAMD
  BIOS Vendor ID:         Advanced Micro Devices, Inc.
  Model name:             AMD Ryzen 9 3950X 16-Core Processor
    BIOS Model name:      AMD Ryzen 9 3950X 16-Core Processor             Unknown CPU @ 3.5GHz
    BIOS CPU family:      107
    CPU family:           23
    Model:                113
    Thread(s) per core:   2
    Core(s) per socket:   16
    Socket(s):            1
    Stepping:             0
    Frequency boost:      enabled
    CPU(s) scaling MHz:   75%
    CPU max MHz:          4761.2300
    CPU min MHz:          2200.0000
    BogoMIPS:             6999.57
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2
                          ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid
                          extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt
                          aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse
                          3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb
                          cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed
                          adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total
                          cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale
                          vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl
                          umip rdpid overflow_recov succor smca sev sev_es
Virtualization features:
  Virtualization:         AMD-V
Caches (sum of all):
  L1d:                    512 KiB (16 instances)
  L1i:                    512 KiB (16 instances)
  L2:                     8 MiB (16 instances)
  L3:                     64 MiB (4 instances)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-31
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
  Spec rstack overflow:   Mitigation; Safe RET
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected

Bash (lsblk output):
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sdi                            8:128  0 238.5G  0 disk
├─sdi1                         8:129  0  1007K  0 part
├─sdi2                         8:130  0     1G  0 part /boot/efi
└─sdi3                         8:131  0 237.5G  0 part
  ├─pve-swap                 252:0    0     8G  0 lvm  [SWAP]
  ├─pve-root                 252:1    0  69.4G  0 lvm  /
  ├─pve-data_tmeta           252:2    0   1.4G  0 lvm
  │ └─pve-data-tpool         252:4    0 141.2G  0 lvm
  │   ├─pve-data             252:5    0 141.2G  1 lvm
  │   ├─pve-vm--100--disk--0 252:6    0     4M  0 lvm
  │   ├─pve-vm--100--disk--1 252:7    0    20G  0 lvm
  │   └─pve-vm--101--disk--0 252:8    0    32G  0 lvm
  └─pve-data_tdata           252:3    0 141.2G  0 lvm
    └─pve-data-tpool         252:4    0 141.2G  0 lvm
      ├─pve-data             252:5    0 141.2G  1 lvm
      ├─pve-vm--100--disk--0 252:6    0     4M  0 lvm
      ├─pve-vm--100--disk--1 252:7    0    20G  0 lvm
      └─pve-vm--101--disk--0 252:8    0    32G  0 lvm
sr0                           11:0    1  1024M  0 rom

Bash (lspci output):
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:08.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 7
02:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] 500 Series Chipset USB 3.1 XHCI Controller
02:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 500 Series Chipset SATA Controller
02:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] 500 Series Chipset Switch Upstream Port
03:06.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43ea
03:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43ea
03:09.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43ea
04:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)
29:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750 Ti] (rev a2)
29:00.1 Audio device: NVIDIA Corporation GM107 High Definition Audio Controller [GeForce 940MX] (rev a1)
2a:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
2b:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3224 PCI-Express Fusion-MPT SAS-3 (rev 01)
2c:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
2d:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
2d:00.1 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP
2d:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
2d:00.4 Audio device: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller
31:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
 

Attachments

  • attach.log
I'm happy to provide more information about the machine, logs, etc. if that would help.

Some of the questions I have in mind are:
  • What logs should I be looking at?

It would be good to start with the entire log of a boot that ended in a freeze, i.e. journalctl -b -1 > attach.log
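Once such a log has been saved (right after the reboot, so that -b -1 refers to the boot that froze, and assuming the journal is stored persistently), a small script can help with the "what should I look for" part. The following is only a minimal sketch: the signature list is a hypothetical starting point and not an official checklist, and the function names are made up for illustration. A log with no matches does not prove the hardware is fine, since a hard freeze can stop the system before anything is written.

```python
import re

# Hypothetical starting list of kernel-log signatures; extend it as needed.
SIGNATURES = {
    "machine check": re.compile(r"mce: \[Hardware Error\]"),
    "unhandled IRQ": re.compile(r"irq \d+: nobody cared"),
    "soft lockup": re.compile(r"soft lockup"),
    "hung task": re.compile(r"blocked for more than \d+ seconds"),
    "out of memory": re.compile(r"Out of memory|oom-kill"),
}


def scan_log(lines):
    """Return {label: [matching lines]} for every signature that occurs."""
    hits = {}
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.setdefault(label, []).append(line.rstrip("\n"))
    return hits


def last_lines(lines, n=20):
    """Return the final n lines, i.e. what was logged just before the freeze."""
    return [line.rstrip("\n") for line in lines[-n:]]


# Example use on a saved log (file name taken from the command above):
#   with open("attach.log", errors="replace") as fh:
#       lines = fh.readlines()
#   print(scan_log(lines))
#   print("\n".join(last_lines(lines)))
```

The last few lines before the freeze are often as informative as any pattern match, which is why the sketch returns them separately.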

  • What sorts of issues could cause the machine to freeze but not just completely crash?

Hardware (some passthrough situations) or a kernel / driver bug. Different things are more likely for different hardware (C-states come to mind), so it might help if you provide more details on your hardware.

  • What would be the recommended debugging steps for isolating the problem?

Obligatory question: was this hardware running fine before the last update / configuration / software (OS / kernel) / hardware change?
 
It would be good to start with the entire log of a boot that ended in a freeze, i.e. journalctl -b -1 > attach.log

Thanks so much! I've attached the logs and updated the original post with some additional information on the hardware.

Hardware (some passthrough situations) or a kernel / driver bug. Different things are more likely for different hardware (C-states come to mind), so it might help if you provide more details on your hardware.

Details on the hardware are above in the updated post (but if there are things I neglected to include I'm happy to provide more).

I'm only running 2 VMs on the machine right now, and I'm passing through a "Serial Attached SCSI controller: Broadcom / LSI SAS3224 PCI-Express Fusion-MPT SAS-3 (rev 01)" to one of those VMs. The SAS controller has a bunch of SATA drives connected to it.
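Since a passed-through controller is involved, one thing that may be worth checking is which IOMMU group the device sits in and whether it shares that group with anything else. The sketch below reads the standard sysfs layout under /sys/kernel/iommu_groups. The function names are hypothetical, and the address 0000:2b:00.0 in the usage comment assumes the SAS controller from the lspci listing lives in PCI domain 0000.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map IOMMU group number -> sorted list of PCI addresses in that group."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups  # IOMMU disabled or not exposed by this kernel
    for group_dir in base.iterdir():
        devices = group_dir / "devices"
        if group_dir.name.isdigit() and devices.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in devices.iterdir())
    return groups


def group_of(pci_address, groups):
    """Return (group number, other devices in the same group), or None."""
    for number, members in groups.items():
        if pci_address in members:
            return number, [m for m in members if m != pci_address]
    return None


# Example use on the host (address assumed from the lspci output):
#   print(group_of("0000:2b:00.0", iommu_groups()))
```

If the second element of the result is not empty, the controller shares its group with other devices, which is one of the passthrough situations that can cause trouble.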

Obligatory question: was this hardware running fine before the last update / configuration / software (OS / kernel) / hardware change?

I got the machine used, and this is the first thing I've installed on it. I guess one obvious first step might be to turn off the VM with the SAS controller passed through for a few days and see if the host crashes or not.

Thanks again!
 
I have the same problem.

Memtest ran without problems, and stress tests on the CPU and disk also showed nothing. We replaced the PSU, and the freezes continue. The freeze is random and has nothing to do with server load.
Proxmox was recently installed on this server. There are only 3 VMs, all running Windows 2022. It's a Ryzen 7 5800X server.

The kernel does not log any errors before freezing. dmesg is also clean.

Kernel: 6.8.12-1

proxmox-ve: 8.2.0 (running kernel: 6.8.12-1-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-1
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.8-3-pve-signed: 6.8.8-3
amd64-microcode: 3.20230808.1.1~deb12u1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
intel-microcode: 3.20240514.1~deb12u1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.2
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.3
pve-edk2-firmware: not correctly installed
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.13-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 9.0.2-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.4
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1
 
I think I missed the notification on this one, sorry.

Thanks so much! I've attached the logs and updated the original post with some additional information on the hardware.

Unfortunately I don't have much experience with Ryzens myself, but whenever I was chasing this kind of out-of-the-blue crash with nothing in the logs and good memory, I would:

- disable HT and C-states (in BIOS/EFI)
- try an older kernel
- try plain Debian (if you get a crash within 24 hours as you do, it is usually something one can afford to do)
- try removing hardware
- try a different PSU
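Before and after changing the C-state setting in the BIOS/EFI, it can help to check what the running kernel actually exposes. The sketch below only reads the cpuidle entries in sysfs and changes nothing. The function name is hypothetical, and it assumes the usual layout /sys/devices/system/cpu/cpuN/cpuidle/stateM with `name` and `disable` files.

```python
from pathlib import Path


def idle_states(cpu=0, root="/sys/devices/system/cpu"):
    """Return [(state dir, state name, disabled?)] for one CPU, in state order."""
    base = Path(root) / f"cpu{cpu}" / "cpuidle"
    states = []
    if not base.is_dir():
        return states  # cpuidle not available for this CPU
    # Sort numerically so that state10 comes after state9.
    for state_dir in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state_dir / "name").read_text().strip()
        disabled = (state_dir / "disable").read_text().strip() != "0"
        states.append((state_dir.name, name, disabled))
    return states


# Example use on the host:
#   for entry in idle_states(0):
#       print(entry)
```

If deep states such as C2 or C3 still show up as enabled after the BIOS change, the setting may not have taken effect. Writing 1 to a state's `disable` file as root turns that state off until the next reboot, which can serve as a quick experiment.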

I'm only running 2 VMs on the machine right now, and I'm passing through a "Serial Attached SCSI controller: Broadcom / LSI SAS3224 PCI-Express Fusion-MPT SAS-3 (rev 01)" to one of those VMs. The SAS controller has a bunch of SATA drives connected to it.

- try not passing through anything

I got the machine used, and this is the first thing I've installed on it.

The issue with PVE is that it uses lots of features and its kernels are quite fresh. Then again, plain Debian with no load might not actually be more stable; the issue might simply not show up.

I guess one obvious first step might be to turn off the VM with the SAS controller passed through for a few days and see if the host crashes or not.

So yeah, it happens to be a few days later; hopefully I get a notification about how it went.

The last thing that comes to mind ... how does it crash / freeze? Do you have a console connected? Sometimes you might find something on the console that did not get flushed to disk.
 
I have the same problem.

These are rarely "the same". ;)

Memtest ran without problems, and stress tests on the CPU and disk also showed nothing. We replaced the PSU, and the freezes continue. The freeze is random and has nothing to do with server load.
Proxmox was recently installed on this server. There are only 3 VMs, all running Windows 2022. It's a Ryzen 7 5800X server.

I can only suggest the same as above, plus maybe sharing the log; if you say you see nothing, I'll take your word for it. But it's a bold statement. :)

The kernel does not log any errors before freezing. dmesg is also clean.

Kernel: 6.8.12-1

The very first thing for me in all of these cases would be to run an older (i.e. stable) kernel to rule that out ...
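To see which older kernels are already installed and could be booted instead, the package list posted earlier in the thread can be parsed. This is a rough sketch that assumes input in the same format as that listing (as printed by `pveversion -v`); the function name and regular expressions are made up for illustration.

```python
import re

# Matches lines such as "proxmox-kernel-6.8.8-3-pve-signed: 6.8.8-3".
KERNEL_PKG = re.compile(r"^proxmox-kernel-(\d+\.\d+\.\d+-\d+)-pve(?:-signed)?:\s")
# Matches "(running kernel: 6.8.12-1-pve)".
RUNNING = re.compile(r"running kernel: (\d+\.\d+\.\d+-\d+)-pve")


def version_key(version):
    """Turn '6.8.12-1' into (6, 8, 12, 1) so versions compare numerically."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


def kernel_candidates(pveversion_text):
    """Return (running version, installed versions older than it, newest first)."""
    running = None
    installed = set()
    for line in pveversion_text.splitlines():
        match = RUNNING.search(line)
        if match:
            running = match.group(1)
        match = KERNEL_PKG.match(line.strip())
        if match:
            installed.add(match.group(1))
    if running is None:
        older = installed  # cannot compare, so report everything found
    else:
        older = {v for v in installed if version_key(v) < version_key(running)}
    return running, sorted(older, key=version_key, reverse=True)
```

For the listing in this thread the only older installed kernel is 6.8.8-3. To my knowledge, a specific kernel can then be selected for subsequent boots with `proxmox-boot-tool kernel pin`, but check the Proxmox documentation for your version before relying on that.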
 
These are rarely "the same". ;)

I can only suggest the same as above, plus maybe sharing the log; if you say you see nothing, I'll take your word for it. But it's a bold statement. :)

The very first thing for me in all of these cases would be to run an older (i.e. stable) kernel to rule that out ...

Just so I don't say I see nothing, sometimes I see:

Aug 22 12:21:29 ns5005826 kernel: mce: [Hardware Error]: Machine check events logged
Aug 22 12:21:29 ns5005826 kernel: mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000001000108
Aug 22 12:21:29 ns5005826 kernel: mce: [Hardware Error]: TSC 0 ADDR 7fa2374d54e3 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Aug 22 12:21:29 ns5005826 kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1724340077 SOCKET 0 APIC 8 microcode a20102b

This happens when the server simply turns off and on again by itself. When it freezes, nothing is logged at all.
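The status word in the "Bank 5" line above can be partly decoded by hand. The sketch below looks only at the top flag bits of the machine-check status register, which are architecturally defined on x86. It makes no attempt to interpret the bank number or the lower, model-specific error code, and the function name is hypothetical.

```python
# Architectural flag bits at the top of an x86 machine-check status register.
STATUS_BITS = {
    63: "VAL (register contents valid)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def decode_status(hex_status):
    """Return descriptions of the flag bits set in a hex status string."""
    status = int(hex_status, 16)
    return [
        label
        for bit, label in sorted(STATUS_BITS.items(), reverse=True)
        if (status >> bit) & 1
    ]


# Status value copied from the log excerpt above.
for flag in decode_status("bea0000001000108"):
    print(flag)
```

For the value from the log, the UC and PCC bits are set, meaning an uncorrected error with corrupt processor context. That is at least consistent with a spontaneous reset, though it does not by itself say which component is at fault.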

I just switched to a slightly older kernel that works on my other identical servers: kernel 6.8.8-3. Now I'm monitoring whether it will happen again.
 
Just so I don't say I see nothing, sometimes I see:

Aug 22 12:21:29 ns5005826 kernel: mce: [Hardware Error]: Machine check events logged
Aug 22 12:21:29 ns5005826 kernel: mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000001000108
Aug 22 12:21:29 ns5005826 kernel: mce: [Hardware Error]: TSC 0 ADDR 7fa2374d54e3 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Aug 22 12:21:29 ns5005826 kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1724340077 SOCKET 0 APIC 8 microcode a20102b

Have a look here, for a start ...

https://forum.proxmox.com/threads/proxmox-ve-8-1-4-troubleshooting.142041/

This happens when the server simply turns off and on again by itself. When it freezes, nothing is logged at all.

I remember I once had such an issue on a machine that only logged e.g. "irq XX: nobody cared" at boot and then crashed hours later. And that was the cause ...

I just switched to a slightly older kernel that works on my other identical servers: kernel 6.8.8-3. Now I'm monitoring whether it will happen again.

That's a good start, to look for differences when you have identical hardware.
 
Hey guys! I also face nearly daily freezes, unfortunately without any memtest errors.

I tried the newest kernels as well as older ones (see https://www.thomas-krenn.com/de/wiki/Known_Issues_Proxmox_VE_8.2).

I updated the microcode; nothing changed.

I also had the hosting company replace the whole system. Same issue.

Since I have several servers, I migrated the VMs to another host. After that, the issue was gone on the affected system but began on the other one. So it is VM-related.

Unfortunately, giving all VMs another CPU type like qemu64 or kvm64 doesn't help. I used host before. After giving all VMs these types, the issue seemed to be gone, but after a while it came back.

I also tried many different kernel boot parameters, with no change.

I use Proxmox with Ryzen 9 7950X.
 
I used host before. After giving all VMs these types, the issue seemed to be gone, but after a while it came back.

That would be very bizarre if it was indeed the cause, but ... never say never. I would just need some evidence of causation.
 
That would be very bizarre if it was indeed the cause, but ... never say never. I would just need some evidence of causation.
Evidence is hard to find, but if I run no VMs on the host, the system has no issues. It only happens with VMs on it and, presumably, some particular workloads / instructions. I have not found out which one causes the issue.
 
