Kernel Panic, whole server crashes about every day

Making some progress: the one remaining VM runs a task at 5:00am each morning, and that is the exact time the kernel panics. The VM is CentOS 7 but with a 5.13 kernel. The task does a bunch of data backups. I have just run each of them in order, and the host panics when doing an rsync between the two 3TB drives inside the guest. The two drives are mounted by the host as LVM datastores; the guest has a volume /dev/vdb on one and /dev/vdc on the other, both formatted as XFS.
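If it helps anyone reproduce this, the step that triggers the panic is essentially just a big copy between the two XFS volumes, something along these lines (the mount points are placeholders, not my exact paths):
Code:
# inside the guest: mirror the first 3TB volume onto the second
# /data1 and /data2 stand in for the mounts of /dev/vdb and /dev/vdc
rsync -aHAX --delete /data1/ /data2/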

This would suggest to me that something isn't happy under a high(ish) IO load on the host when running 5.11.
 

I agree this also seems to be IO related, and I now think there may be an issue with virtio. I have changed the disk cache from 'No cache' to 'Write back' and taken ballooning off, and I am now pushing the VM with huge reads/writes; touch wood, it seems stable. Do you want to try write back to see if this makes a difference for you?
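For reference, the change boils down to two entries in the VM config (the disk name, storage and size below are made up; the same settings can also be changed in the GUI):
Code:
# /etc/pve/qemu-server/<vmid>.conf - illustrative entries only
# cache=writeback on the data disk (keep your existing storage/size values)
virtio1: local-lvm:vm-100-disk-1,cache=writeback,size=3000G
# balloon: 0 disables the memory balloon device
balloon: 0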
 
I'm facing the same problem on my Proxmox VE 7.0-9 machine, powered by an AMD 3950 with 64 GB of DDR4 RAM (2 x 32GB sticks). I'm using ZFS on some of my hard drives; I don't know if that changes anything. I ran a memtest to make sure the problem isn't my RAM, and yes, I'm pretty sure my hardware isn't causing this.
This was happening even on PVE 6.x; nothing has changed since then. It seems to happen every few days, but it doesn't look very deterministic, so I have no clue.
It actually started happening after installing k3s on 4 VMs (maybe one or all of them are causing the kernel panics... I'm not sure because they were created together).
Right now I'm considering shutting them down to see if the situation improves, but I haven't tried that so far.
 

Do you want to try changing the disk cache from 'No cache' to 'Write back' and taking ballooning off (if used) to see if that makes a difference? I am currently doing this and so far it seems to be working great. Previously it was crashing very quickly.
 
I think I'm going to try that next. I'll keep you updated about my situation
 
Right, I did not run the rsync last night and the host did NOT panic. I have stopped the VM, changed the disks from Default (No cache) to 'Write back' and restarted. Ran the rsync and, NO PANIC!! Yeah.

Lots of errors reported by rsync; XFS is in a bit of a bad state, probably due to all the crashes. But looking good so far.
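In case anyone else ends up with a mangled filesystem after these crashes, my understanding is the usual approach is to unmount the volume inside the guest and run xfs_repair on it (the device and mount point below are examples):
Code:
# inside the guest, with the filesystem unmounted
umount /data2
xfs_repair /dev/vdc
# if xfs_repair complains about a dirty log, mount and unmount once to replay it;
# xfs_repair -L throws the log away and should only be a last resort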
 
Great news. I too have been pushing it and no issues seen yet.

@t.lamprecht (Proxmox staff), is this something that can be looked at and fixed?
 
Changing cache to write back seemingly fixed it for me as well. I've been running some I/O tests with FIO and could get it to crash almost every run with no cache; still no crashes with write back enabled on the VMs. I posted some of my server info in this thread if anyone wants to look for similarities in system and setup. I'm on Intel btw, not AMD as many others have reported here.
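For anyone wanting to reproduce it, a generic mixed random read/write FIO job along these lines generates the kind of load involved (path, size and runtime are placeholders, not my exact job):
Code:
# generic heavy mixed random IO inside a guest
fio --name=randrw --filename=/mnt/test/fio.dat --size=8G \
    --rw=randrw --bs=4k --iodepth=32 --numjobs=4 \
    --ioengine=libaio --direct=1 --runtime=300 --time_based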

I ran kdump-tools to see if I could catch the kernel panic, but that resulted in the following problem when dumping the crash log.
Code:
Jul 14 03:17:17 prox kdump-tools[762]: Starting kdump-tools:
Jul 14 03:17:17 prox kdump-tools[769]: running makedumpfile -F -c -d 31 /proc/vmcore | compress > /var/crash/202107140317/dump-incomplete.
Jul 14 03:17:17 prox kdump-tools[787]: The kernel version is not supported.
Jul 14 03:17:17 prox kdump-tools[787]: The makedumpfile operation may be incomplete.
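For anyone else trying to catch a dump, the setup is just the stock Debian kdump-tools flow, roughly as follows (the crashkernel reservation size is an arbitrary example, and this assumes a GRUB-booted host):
Code:
apt install kdump-tools
# reserve memory for the crash kernel by adding something like
#   crashkernel=384M-:256M
# to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
update-grub
reboot
# after the reboot, check that the crash kernel is loaded
kdump-config status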
 
Since it seems to coincide with IO load, could you try disabling io_uring? With "cache=writeback" that is already the case, so it would seem to be a candidate. That also would be something that clearly changed in PVE 7.0.

To test, edit your VM config in /etc/pve/qemu-server/<vmid>.conf and add ,aio=native to the end of your disks (e.g. scsi0, sata0, etc...). You can verify by making sure qm showcmd 100 --pretty | grep io_uring doesn't show anything.
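A minimal before/after example of such a disk line (the storage, disk name and size here are made up):
Code:
# before
scsi0: local-lvm:vm-100-disk-0,size=32G
# after - io_uring disabled in favour of native AIO
scsi0: local-lvm:vm-100-disk-0,size=32G,aio=native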
 
I put the cache back to Default (No cache) and added ,aio=native to my disks, but it is still crashing with aio set to native, unfortunately.
What's the recommended method to catch the kernel dumps?
 
Rebuilt the host's disks yesterday, which involved a lot of IO emptying them from inside the VM; pretty stable so far.
Is someone able to explain why 'Write back' doesn't trigger the panic and why no cache does, but only on AMD?
I had hoped to combine the 2x 3TB disks as an md RAID1 set on the host (previously the 'disks' were RAIDed in the VM), but alas there is no mdadm in the Proxmox distro, so I have decided to give ZFS a go. The two disks are now running as a two-disk mirrored pool, so fingers crossed. I am in the process of reloading the data from the various places I had to squirrel it away.
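For anyone wondering, creating a mirror like this on the host and registering it with Proxmox boils down to something like the following (the pool and device names are examples; in practice I'd use /dev/disk/by-id paths):
Code:
# create a two-disk mirrored pool on the host
zpool create backuppool mirror /dev/sdb /dev/sdc
# register it with Proxmox as a ZFS storage
pvesm add zfspool backuppool --pool backuppool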
 
Is someone able to explain why 'Write back' doesn't trigger the panic and why no cache does, but only on AMD?
On my upgraded and still stable test system (Intel i3) my VMs use write back, but on my upgraded main system (Intel i7) the VMs mainly use no cache and it will crash on the latest kernel. I was able to boot the 5.4 kernel and get back up and running without crashes (so far).

As for what the actual fix is, I am eager to hear as well.
 
Since it seems to coincide with IO load, could you try disabling io_uring? With "cache=writeback" that is already the case, so it would seem to be a candidate. That also would be something that clearly changed in PVE 7.0.

To test, edit your VM config in /etc/pve/qemu-server/<vmid>.conf and add ,aio=native to the end of your disks (e.g. scsi0, sata0, etc...). You can verify by making sure qm showcmd 100 --pretty | grep io_uring doesn't show anything.
With the cache option as default (no cache), adding aio=native fixed the crashes for me. So far it has been stable. It used to crash when I cloned a manifest during the checkout process, but that has not happened with aio=native.

I've passed through my Crucial 1TB SSD and installed Ubuntu on the passed-through disk instead of virtual disks. The 'writeback' cache option doesn't work on this particular VM, whereas it works on my Windows and Untangle VMs, which use virtual disks.

My build is a Ryzen 1700X with the latest BIOS on an Asus Prime X370-Pro.
 
Guys, I think I messed up my VMs' configuration. I discovered that some of them were running on my ISO volume (which isn't even ZFS)... how could I not notice?! I feel so stupid o_O.
Now I've moved them, but I think I've made too many mistakes, and I'm seeing those kernel panics happen because of ext4 I/O errors on some VMs. I don't actually know whether a VM crash can affect the entire PVE host, but I guess it does.
I'll keep you updated on whether the fixes I've made have solved my problem.
 
Guys, I managed to get some insights from the most recent crash.
This morning I noticed that pvestatd was dead even though my VMs were not. This time I managed to gather some syslogs from the PVE management interface (yeah, not even SSH works), and after scrolling through the logs I noticed some errors, which I'm attaching in syslog-partial.log.



Update: I still haven't rebooted the server. What I'm noticing is that I cannot connect through SSH and I can't reboot the PVE instance (perhaps because the status of the VMs is unknown to Proxmox):

[screenshot: screenshot-vm-status.PNG]

SSH:
[screenshot: ssh-hangs.png]


Update 2: Rebooted, and everything went back to normal (until it crashes again). I'm attaching lshw (full hardware info) and pveversion --verbose output to the post.
 


Not sure it's only AMD related...

I've got the same sort of kernel panic on a Dell R710 with an Intel Xeon.

My FreeNAS VM can't be started without eventually getting the host (Proxmox V7) hanging and then crashing. Everything was working with PVE 6.
 
In the syslog we have segfaults in core libraries, general protection faults (accesses of memory addresses outside the allowed virtual memory regions) and a failure to handle a page fault.

So IMO it's one of:
* bad HW (e.g. memory) - sometimes that can also be triggered (more often) by a different kernel version
* something causing havoc in kernel space, very probably specific to some of the HW here in this thread

As we see quite a few AMD 3xxx CPUs involved here, it'd be good if you could ensure that the latest BIOS updates are installed, or install the amd64-microcode package from the non-free Debian repository component.
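For those unsure how to do the latter, it boils down to enabling the non-free component and installing the package, roughly as follows (the repository line shown assumes Debian Bullseye, which PVE 7 is based on):
Code:
# /etc/apt/sources.list - add "non-free" to the Debian repository line, e.g.:
#   deb http://deb.debian.org/debian bullseye main contrib non-free
apt update
apt install amd64-microcode
# reboot afterwards so the updated microcode is loaded early during boot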
 
I've been following this thread... I won't be upgrading to v7. Is there any active Proxmox testing going on for this?
Sadly we cannot yet reproduce this behavior here in our testlabs or production setups, and we try hard to always buy different HW for workstations, testlabs and the like, to cover more area.
 

OK, I am glad you are trying though; it perhaps helps confirm that it is more hardware related. I have a production 6.4 setup that is flawless, and I am concerned about upgrading to 7.0. I don't see a rush at the moment.
 
