[SOLVED] Proxmox Host crashing and Freezing daily

mRZ

Member
Jul 13, 2021
Hello,

Note: I am not that good at reading logs and knowing which file logs what, so just let me know if you need something.

What is the Problem:
My host has been crashing or freezing daily for a long time now. I would say it started around the time I upgraded to PVE 7 (shortly after release); I also installed a GPU around that time for passthrough to my Plex VM. I have tried a lot of things, but the only thing I "achieved" is that the host no longer crashes every time. Instead, it now sometimes freezes, or at first only some random VMs or CTs freeze.

Pictures of some screenshotted crashes/freezes:
snapshot 2.jpg
snapshot 1.jpg
Proxmox Container Donw and Host up.jpg
Nach Feature=Nesting1.jpg
snapshot1.jpg

This was no crash/freeze but seemed weird so I screenshotted it:
snapshot.jpg

Before reinstalling Proxmox on the host, I got these GRUB messages (now I use UEFI):
Bios Bug.png

What is my Setup:

  • AMD Ryzen 5 PRO 4650G
  • 64GB RAM, non-ECC
  • Gigabyte Aorus AMD X570 PRO Motherboard (newest BIOS - F36)
  • NVIDIA Quadro P400
  • All VMs and Container run on M.2 (2x Samsung 970 Evo 1TB - ZFS Mirror) SSDs
  • Some have a data pool attached (2x Seagate Ironwolf Pro 6TB - ZFS Mirror + Log & Zil (2 partitions on same drive) on Seagate FireCuda M.2)
  • Host is running on 2x Seagate IronWolf 510 SATA SSD - ZFS Mirror
  • PVE Packages:
    proxmox-ve: 7.4-1 (running kernel: 5.15.104-1-pve)
    pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
    pve-kernel-5.15: 7.4-1
    pve-kernel-5.15.104-1-pve: 5.15.104-2
    pve-kernel-5.15.102-1-pve: 5.15.102-1
    ceph-fuse: 15.2.17-pve1
    corosync: 3.1.7-pve1
    criu: 3.15-1+pve-1
    glusterfs-client: 9.2-1
    ifupdown2: 3.1.0-1+pmx3
    ksm-control-daemon: 1.4-1
    libjs-extjs: 7.0.0-1
    libknet1: 1.24-pve2
    libproxmox-acme-perl: 1.4.4
    libproxmox-backup-qemu0: 1.3.1-1
    libproxmox-rs-perl: 0.2.1
    libpve-access-control: 7.4-2
    libpve-apiclient-perl: 3.2-1
    libpve-common-perl: 7.3-4
    libpve-guest-common-perl: 4.2-4
    libpve-http-server-perl: 4.2-3
    libpve-rs-perl: 0.7.5
    libpve-storage-perl: 7.4-2
    libspice-server1: 0.14.3-2.1
    lvm2: 2.03.11-2.1
    lxc-pve: 5.0.2-2
    lxcfs: 5.0.3-pve1
    novnc-pve: 1.4.0-1
    proxmox-backup-client: 2.4.1-1
    proxmox-backup-file-restore: 2.4.1-1
    proxmox-kernel-helper: 7.4-1
    proxmox-mail-forward: 0.1.1-1
    proxmox-mini-journalreader: 1.3-1
    proxmox-widget-toolkit: 3.6.5
    pve-cluster: 7.3-3
    pve-container: 4.4-3
    pve-docs: 7.4-2
    pve-edk2-firmware: 3.20230228-2
    pve-firewall: 4.3-1
    pve-firmware: 3.6-4
    pve-ha-manager: 3.6.0
    pve-i18n: 2.12-1
    pve-qemu-kvm: 7.2.0-8
    pve-xtermjs: 4.16.0-1
    qemu-server: 7.4-3
    smartmontools: 7.2-pve3
    spiceterm: 3.2-2
    swtpm: 0.8.0~bpo11+3
    vncterm: 1.7-1
    zfsutils-linux: 2.1.9-pve1
  • Currently, I have 4 VMs and 2 CTs

What I think the problem could have something to do with and what I have done so far:
  • My GPU passthrough works perfectly, but at some point I thought the crashes had to do with it because of the BIOS error, so I reconfigured everything. I also tried deactivating the passthrough.
  • BIOS - Updated to the newest BIOS version (2 times now)
  • Due to this post "kernel-panic-whole-server-crashes-about-every-day" I:
    • Installed microcode
    • Set all my storage Async IO from "io_uring" to "native"
    • Tried optional Linux Kernel (6.x)
  • Due to the APPARMOR DENIED messages (screenshots), I gave my privileged Nextcloud container the features nesting, nfs, and cifs. Now I get the STATUS messages from AppArmor (profile_replace, error=-13, apparmor_parser) (screenshots).
  • Maybe a motherboard problem? But then why was it running just fine before I upgraded PVE and installed a GPU?

I hope someone has an idea. Thank you so much for helping!
 
You are getting machine check errors.

It's either the motherboard (e.g. a hardware fault, a BIOS issue or misconfiguration, or a physical problem such as warping), the CPU, or the memory.

If this is a server or a factory-built machine, open a ticket. If it's home-built, disassemble everything, reassemble, and test from a minimal config upwards.
 
Okay, that's strange, since all was working fine previously under PVE 6.

The server is home built.
I did memtest86 overnight and the RAM seems to be fine.

Where exactly can you see the machine check errors? Are you referring to the 6th picture? Because there was no crash or freeze and I saw this message only once.

Is there a way to get more information first before reassembling? I have read about mcelog and rasdaemon.

What exactly do you mean by "reassemble, and test from minimal config upwards"? I can't test my server without a motherboard or CPU, right?
 
> What exactly do you mean by "reassemble, and test from minimal config upwards" since I can't test my server without motherboard or CPU right?

Right.

Remove everything from the motherboard. Then reinstall the CPU (with new thermal paste) and install the RAM one stick at a time. Once that's verified, start adding your PCI cards and USB devices.

My money is on the GPU.
 
Okay, I guess I will go the other way around first and remove device by device to keep availability of the services as long as possible.

Could you please answer my question from before and tell me where you can see the machine check errors, so I can learn what they look like?

Could you explain why your bet is on the GPU when machine check errors come from CPU, MB or RAM as you explained?

Thanks for the help!
 
> Could you please answer my question from before and tell me where you can see the machine check errors, so I can learn how they look like?
1683040180203.png
If your motherboard logs faults you should be able to see them there.
> Could you explain why your bet is on the GPU when machine check errors come from CPU, MB or RAM as you explained?

Machine check errors are not an absolute indication of crashes. They are usually corrected in hardware, so they may not be directly related to a crash and instead serve as an early warning. GPU passthroughs, on the other hand, are often a cause of pain.
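
For anyone who wants to check their own kernel log for such entries before pulling hardware, here is a minimal Python sketch. It assumes the usual Linux kernel wording for machine check reports ("mce:" and "[Hardware Error]"); the pattern is not exhaustive, and the sample log lines are made up for illustration, not taken from the machine in this thread.

```python
import re

# Substrings the Linux kernel commonly uses when it reports machine checks.
# Assumption: this short list is illustrative and not exhaustive.
MCE_PATTERN = re.compile(r"mce:|machine check|\[Hardware Error\]", re.IGNORECASE)


def find_mce_lines(log_text):
    """Return the log lines that look like machine check reports."""
    return [line for line in log_text.splitlines() if MCE_PATTERN.search(line)]


# Illustrative sample only; NOT output from the poster's host.
SAMPLE_LOG = """\
[    0.512345] mce: [Hardware Error]: Machine check events logged
[    0.512350] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
[    1.000000] usb 1-1: new high-speed USB device number 2 using xhci_hcd
"""

matches = find_mce_lines(SAMPLE_LOG)
for line in matches:
    print(line)
```

On a real host, the text could come from saved `journalctl -k` output instead of the sample string. Tools such as rasdaemon decode these events in more detail; this sketch only filters for them.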
 
I have now disabled everything related to PCIe passthrough (vfio, IOMMU, etc.), but the GPU is still installed. Proxmox then got stuck with a new error:

[13542.001045] traps: pvestatd[3443] trap invalid opcode ip:7f3e427a606e sp:7ffe674fea38 error:0 in libcrypto.so.1.1[7f3e427a3000+1a8000]

any ideas specifically about that?
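
As a side note on reading that line: the part in brackets after the library name is the mapping's base address and size, so subtracting the base from `ip` gives the offset inside the library where the trap happened. Below is a small Python sketch of that arithmetic. The regular expression and the `decode_trap` helper are my own illustration, not a Proxmox or kernel tool, and the line is used with what look like transcription slips corrected (`libcrypto`, `error:0`).

```python
import re

# Pattern for the kernel's "traps:" line as it appears in this thread.
# Assumption: other kernel versions may format the line differently.
TRAP_RE = re.compile(
    r"traps: (?P<proc>\S+)\[(?P<pid>\d+)\] trap (?P<trap>.+?) "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<err>[0-9a-f]+)"
    r" in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_trap(line):
    """Parse a 'traps:' line and compute the faulting offset in the mapping."""
    m = TRAP_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    size = int(m["size"], 16)
    return {
        "process": m["proc"],
        "pid": int(m["pid"]),
        "trap": m["trap"],
        "library": m["lib"],
        "offset": ip - base,  # position of the instruction inside the mapping
        "inside_mapping": base <= ip < base + size,
    }


line = (
    "[13542.001045] traps: pvestatd[3443] trap invalid opcode "
    "ip:7f3e427a606e sp:7ffe674fea38 error:0 "
    "in libcrypto.so.1.1[7f3e427a3000+1a8000]"
)
info = decode_trap(line)
print(f"{info['process']} hit '{info['trap']}' at {info['library']}+{info['offset']:#x}")
# prints: pvestatd hit 'invalid opcode' at libcrypto.so.1.1+0x306e
```

Comparing this offset across several crashes may help: if the same offset shows up every time, that points more towards one specific instruction, while offsets that jump around would fit better with flaky hardware. That is a rule of thumb, not a diagnosis.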
 
I can confirm that the GPU is not the problem: I have completely removed the GPU along with its settings, and the "Kernel Panic Errors" still persist.

I decided to buy a Supermicro H12SSL-i + AMD EPYC 7232P with the associated RAM Modules.

I will report if this fixes my problems.
 
