Kernel Panic, whole server crashes about every day

Making some progress: the one remaining VM runs a task at 5:00am each morning, and that is the exact time the kernel panics. The VM is CentOS 7 but with a 5.13 kernel. The task does a series of data backups. I have just run each of them in order, and the host panics when doing an rsync between the two 3TB drives inside the guest. The two drives are mounted by the host as LVM datastores; the guest has a volume /dev/vdb on one and /dev/vdc on the other, both formatted as XFS.

This would suggest to me that under a high(ish) IO load on the host running kernel 5.11, something isn't happy.
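For context, the backup step that triggers it is essentially just an rsync of one data volume onto the other, something along these lines (the mount points here are only illustrative, not my actual paths):
Code:
# inside the guest: /dev/vdb mounted at /mnt/data, /dev/vdc at /mnt/backup (paths assumed)
rsync -aHAX --delete /mnt/data/ /mnt/backup/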
 

I agree this seems to be IO related too, and I now think there may be an issue with virtio. I have changed the disk cache from 'No cache' to 'Write back' and turned ballooning off, and I am now pushing the VM with huge reads/writes; touch wood, it seems stable. Do you want to try write back to see if this makes a difference for you?
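For reference, the same change can also be made from the CLI with qm rather than the GUI (the VM ID and volume name below are just examples, adjust for your own setup):
Code:
qm set 100 --scsi0 local-lvm:vm-100-disk-0,cache=writeback   # switch the disk to write back
qm set 100 --balloon 0                                       # disable memory ballooning
qm config 100 | grep -E 'scsi0|balloon'                      # check the resulting config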
 
I'm facing the same problem on my Proxmox VE 7.0-9 machine, powered by an AMD 3950, 64 GB of DDR4 RAM (2 x 32 GB sticks). I'm using ZFS on some of my hard drives; I don't know if that changes anything. I performed a memtest to make sure that the problem isn't my RAM, and yes, I'm pretty sure my hardware isn't causing this problem.
This was happening even on PVE 6.x; nothing has changed since then. It seems to happen every few days, but it doesn't look very deterministic, so I have no clue.
It actually started happening after installing k3s on 4 VMs (maybe one or all of them are causing the kernel panics... I'm not sure because they were created together).
Right now, I'm considering shutting them down to see if the situation improves, but I haven't tried that so far.
 

Do you want to try changing the disk cache from 'No cache' to 'Write back' and turning ballooning off (if used) to see if that makes a difference? I am currently doing this and so far it seems to be working great. Previously it was crashing very quickly.
 
I think I'm going to try that next. I'll keep you updated on my situation.
 
Right, I skipped the rsync last night and the host did NOT panic. I have since stopped the VM, changed the disks from 'Default (No cache)' to 'Write back' and restarted. Ran the rsync and... NO PANIC!! Yeah.

Lots of errors reported by rsync, and XFS is in a bit of a bad state, probably due to all the crashes. But it's looking good so far.
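(For anyone in a similar state: the usual way to clean that up is to unmount the affected filesystem inside the guest and run xfs_repair on it; the mount point and device name here are only examples.)
Code:
umount /mnt/backup    # unmount the damaged XFS filesystem first
xfs_repair /dev/vdc   # repair it, then remount and re-check the data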
 
Great news. I too have been pushing it and have seen no issues yet.

@t.lamprecht (proxmox staff), is this something that can be looked at, to get fixed?
 
Seemingly fixed it for me as well changing cache too write back. Been running some I/O tests with FIO, could get it to crash almost every run with no cache. Still no crashes with write back enabled on the vms. I posted some of my server info in this thread if anyone wants to look at similarities in system and setup. I'm on Intel btw, not AMD as many others have reported here.
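A typical fio invocation for this kind of heavy random read/write load looks roughly like this (the parameters are only illustrative, tune size and runtime to your storage):
Code:
fio --name=randrw --filename=/root/fio-test --size=8G --rw=randrw --bs=4k \
    --ioengine=libaio --iodepth=32 --numjobs=4 --direct=1 \
    --runtime=300 --time_based --group_reporting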

I ran kdump-tools to see if I could catch the kernel panic, but that resulted in the following problem when dumping the crash log.
Code:
Jul 14 03:17:17 prox kdump-tools[762]: Starting kdump-tools:
Jul 14 03:17:17 prox kdump-tools[769]: running makedumpfile -F -c -d 31 /proc/vmcore | compress > /var/crash/202107140317/dump-incomplete.
Jul 14 03:17:17 prox kdump-tools[787]: The kernel version is not supported.
Jul 14 03:17:17 prox kdump-tools[787]: The makedumpfile operation may be incomplete.
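For anyone else wanting to try catching the panic this way, the basic kdump setup on a Debian-based host like PVE looks roughly like this (the crashkernel reservation size is just an example):
Code:
apt install kdump-tools
# reserve memory for the crash kernel, e.g. in /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet crashkernel=384M-:512M"
update-grub && reboot
kdump-config show    # should report that kdump is ready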
 
Since it seems to coincide with IO load, could you try disabling io_uring? With "cache=writeback" that is already the case, so it would seem to be a candidate. That also would be something that clearly changed in PVE 7.0.

To test, edit your VM config in /etc/pve/qemu-server/<vmid>.conf and add ,aio=native to the end of your disk lines (e.g. scsi0, sata0, etc.). You can verify by making sure qm showcmd 100 --pretty | grep io_uring doesn't show anything.
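For example, a disk line would end up looking like this (the VM ID and volume name are purely illustrative):
Code:
# /etc/pve/qemu-server/100.conf
scsi0: local-lvm:vm-100-disk-0,size=32G,aio=native

# verify that io_uring is no longer used for this VM
qm showcmd 100 --pretty | grep io_uring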
 
I put cache back to Default (no cache) and added ,aio=native to my disks, but it's still crashing with aio set to native, unfortunately.
What's the recommended method to catch the kernel dumps?
 
Rebuilt the host's disks yesterday, which involved a lot of IO emptying them from inside the VM; pretty stable so far.
Is someone able to explain why 'Write back' doesn't trigger the panic and why no-cache does but only on AMD?
I had hoped to combine the 2x 3TB disks as an md RAID 1 set on the host (previously the 'disks' were RAIDed inside the VM), but alas there is no mdadm in the Proxmox distro, so I have decided to give ZFS a go. The two disks are running as a two-disk mirrored pool, so fingers crossed. I am now in the process of reloading the data from the various places I had to squirrel it away.
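For anyone curious, a two-disk mirrored pool of that sort is created roughly like this (the pool name and device paths are placeholders):
Code:
zpool create -o ashift=12 tank mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2
zpool status tank    # confirm both disks show as ONLINE in the mirror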
 
Is someone able to explain why 'Write back' doesn't trigger the panic and why no-cache does but only on AMD?
On my upgraded and still stable test system (Intel i3) my VMs use write back, but on my upgraded main system (Intel i7) the VMs mainly use no cache and will crash on the latest kernel. I was able to choose 5.4 and get back up and running without crashes (so far).

As for what the actual fix is, I am eager to hear as well.
 
With the cache option at default (no cache), adding aio=native fixed the crashes for me. So far it has been stable. It used to crash for me when cloning a manifest during the checkout process, but that has not happened with aio=native.

I've passed through my Crucial 1TB SSD and have installed Ubuntu on the passed-through disk instead of virtual disks. The cache option 'writeback' doesn't work on this particular VM, whereas it works on my Windows and Untangle VMs, which use virtual disks.

My build is a Ryzen 1700X with the latest BIOS on an Asus Prime X370-Pro.
 
Guys, I think I messed up my VMs' configuration. I discovered that some of them were running on my ISO volume (which isn't even ZFS)... how could I not notice?! I feel so stupid o_O.
Now I've moved them, but I think I've made too many mistakes, and I'm seeing those kernel panics happening because of ext4 I/O errors on some VMs. I don't actually know whether a VM crash could affect the entire PVE host, but I guess it does.
I'll keep you updated on whether the fixes I've made have solved my problem.
 
Guys, I managed to get some insights from the most recent crash.
This morning I noticed that pvestatd was dead even though my VMs were not. This time, I managed to gather some syslogs from the PVE management interface (yeah, not even SSH works), and after scrolling through the logs, I noticed some errors that I'm attaching in syslog-partial.log.


Update: I still haven't rebooted the server. What I'm noticing is that I cannot connect through SSH, and I can't reboot the PVE instance (perhaps because the status of the VMs is unknown to Proxmox):

screenshot-vm-status.PNG

SSH:
ssh-hangs.png


Update 2: Rebooted, and everything went back to normal (until it crashes again). I'm attaching lshw (full hardware info) and pveversion --verbose to the post.
 

Attachments

  • syslog-partial.log (16.4 KB)
  • lshw-full-hardware-info.txt (43.7 KB)
  • pveversion.txt (1.4 KB)
Not sure it's only AMD related...

I've got the same sort of kernel panic on a Dell R710 with an Intel Xeon.

My FreeNAS VM can't be started without the host (Proxmox V7) eventually hanging and then crashing. Everything was working with PVE 6.
 
In the syslog we have segfaults in core libraries, general protection faults (accesses of memory addresses outside the allowed virtual memory regions) and a failure to handle a page fault.

So IMO it's one of:
* bad HW (e.g. memory) - sometimes that can also be triggered (more often) by a different kernel version
* something causing havoc in kernel space, most probably specific to some of the HW in this thread

As we see quite a few AMD 3xxx CPUs involved here, it'd be good if you could ensure that the latest BIOS updates are installed, or install the amd64-microcode package from the non-free Debian repository component.
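For those who haven't done that before, installing the microcode package boils down to roughly the following (the sources.list line assumes Debian Bullseye, which PVE 7 is based on):
Code:
# add the non-free component, e.g. in /etc/apt/sources.list:
#   deb http://deb.debian.org/debian bullseye main contrib non-free
apt update
apt install amd64-microcode
# reboot, then confirm the updated microcode was loaded
dmesg | grep microcode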
 
I've been following this thread... I won't be upgrading to v7. Is there active Proxmox testing going on for this?
Sadly we cannot yet reproduce this behavior here in our testlabs or production setups, and we try hard to always buy different HW for workstations, testlabs and the like, to cover more area.
 

OK, I am glad you are trying though; it perhaps helps confirm that this is more hardware related. I have a production 6.4 host that is flawless, and I'm concerned about upgrading to 7.0. I don't see any rush at the moment.
 
