Kernel Panic, whole server crashes about every day

aaronk6387

Member
Jul 2, 2021
Proxmox installed and was working properly. I started to add VMs, and about once a day it has a kernel panic and I have to hard-reboot it. It seems to be an issue when the VMs start to do any actual processing. It has a Ryzen 5 3600 CPU; I am not running server-grade hardware.
I am hoping there is a setting in the BIOS or something I can change in the OS. I have another server running Proxmox and it's stable as a rock, but it's 8 years old.
Let me know if any additional information can or needs to be gathered.
(photo of the kernel panic attached; the top of the trace is cut off)
 
A few questions:
What mainboard is in use?
Any additional HW plugged in, or software installed?
Do you have hardware pass-through configured for some VMs?

Which kernel do you use? 5.4, or the opt-in 5.11-based kernel for PVE 6.4 ( https://forum.proxmox.com/threads/kernel-5-11.86225/ )?

Did you try to rule out whether a specific VM causes this, so it could be narrowed down?

It could also help to see the start of that kernel panic, as it's sadly cut off. In the 5.4 kernel the console scrollback still works, so there you should hopefully be able to scroll up with SHIFT+PAGEUP.
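(For anyone wanting to capture more than the on-screen trace: a minimal sketch of enabling a persistent journal, using standard systemd commands rather than anything PVE-specific; note that a hard panic may still not be flushed to disk.)

Code:
# Make the journal persistent so kernel messages from before a crash survive a reboot
mkdir -p /var/log/journal
systemctl restart systemd-journald
# After the next reboot, inspect the previous boot's kernel messages
journalctl -k -b -1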
 
I had the same issue with a Ryzen 5 5600X last night: kernel panic. The board is an ASRock B550 Taichi, 64 GB RAM. I also have a separate Intel i350 quad-port NIC in a PCIe slot, but otherwise a standard setup. Was working great on 6.4.
 
Just got a 2nd kernel panic on Proxmox 7. After running for around 4 hours.

SHIFT+PG_UP not working. Completely frozen.

Was working fine in 6.4.

(photo of the panic attached)
 
Looks a bit different from the original poster's trace, but hard to say whether there are two causes.
Just got a 2nd kernel panic on Proxmox 7. After running for around 4 hours.
Which guests are running? Can you post the VM configs so that we can try an exact reproducer - maybe it's some CPU flag that's set.

SHIFT+PG_UP not working. Completely frozen.
In the 5.11 kernel that feature is sadly not available any more; it got patched out of the kernel as there was no maintainer and it was in a bit of a bad state, IIRC.

Was working fine in 6.4.
Were you using the opt-in 5.11 kernel under 6.4, or the default 5.4 one?
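(As a quick aside, the running kernel and the installed PVE kernel packages can be checked with the standard commands below - this is generic, not output from this thread.)

Code:
uname -r                        # currently running kernel, e.g. 5.4.x-pve or 5.11.x-pve
pveversion -v | grep -i kernel  # pve-kernel packages that are installed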
 
I had the same problem with 7.0 beta, and I had to go back to 6.4, see this thread:

https://forum.proxmox.com/threads/downgrade-7-0-to-6-4.91852/#post-400323
You never shared any logs that would make it possible to debug this, and here we already have two potentially different issues in one thread - one crash is not necessarily caused by the same issue as another.

The HW info in the other thread also shows that you use DDR3 memory, so your CPU is probably not a Ryzen 5xxx series either, I'd figure? Also, a 2 x 8 GiB + 2 x 4 GiB memory module configuration seems a bit odd and can be a cause of trouble on its own.
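(As an aside, the installed module sizes and speeds can be confirmed from the host with dmidecode - a generic sketch, not something posted in this thread.)

Code:
dmidecode -t memory | grep -E 'Size|Speed|Locator'   # list each DIMM slot, its size and speed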
 
No, I didn't share any logs because I've been too busy and the quickest solution was to go back to 6.4. Anyway, it's a Fujitsu Primergy TX100 S3 with a Xeon CPU, it has been working fine with everything except PVE 7, and I don't understand why the memory modules would cause any problems.
 

Hi Thomas, I am unsure of the easiest way to provide this info, so I hope this is okay.

I built this box using the proxmox-ve_6.4-1.iso and did standard updates (apt update/upgrade), nothing more. For the upgrade I just changed the apt sources file and ran apt update / apt dist-upgrade. I did run pve6to7 --full without issue.
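(For completeness, the upgrade steps described above look roughly like the following - the exact repository lines are my assumption, here using the no-subscription repo.)

Code:
# /etc/apt/sources.list entries switched from buster to bullseye (assumed no-subscription repo)
deb http://ftp.debian.org/debian bullseye main contrib
deb http://security.debian.org/debian-security bullseye-security main contrib
deb http://download.proxmox.com/debian/pve bullseye pve-no-subscription

# then the checker and the upgrade itself
pve6to7 --full
apt update && apt dist-upgrade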

The machine this is running on is a brand-new build:

- ASRock B550 Taichi motherboard
- CPU = AMD Ryzen 5 5600X
- built-in 2.5 GbE NIC - I use vmbr0 only for the host
- Intel i350 quad-port NIC in a PCIe slot - I use vmbr1 against this for all guests
- 64 GB RAM


I only have the following guests running:

100 = VM running Docker with some containers.
102 = VM running Ubuntu Desktop - running the Chia GUI.
105 = CT running Ubuntu Server - running the Chia CLI.
201 = CT running Pi-hole, IP 192.168.1.50 (used as the DNS server for the other guests above).


The only change I have made recently, since this 2nd crash, is to refresh the MAC addresses of all guests by deleting the "old" MAC from the GUI config for each guest and letting the system auto-generate a new one. I did this because 102 and 105 both lost the ability to ping out.
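(The CLI equivalent of that GUI step would look something like the line below - my sketch, not what was actually run; omitting the MAC makes PVE auto-generate a new one.)

Code:
qm set 102 --net0 virtio,bridge=vmbr1,firewall=1   # no MAC given, so a new one is generated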

I am very knowledgeable about Linux - I have used it since 1993 - and have checked the logs, but nothing obvious.


Code:
root@pve:/etc/pve# qm config 100
balloon: 2048
boot: order=scsi0;ide2;net0
cores: 6
ide2: none,media=cdrom
memory: 8192
name: DockerMain
net0: virtio=02:B6:78:17:24:C9,bridge=vmbr1,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: VM1:vm-100-disk-0,size=100G
scsihw: virtio-scsi-pci
smbios1: uuid=9c78b4e8-b1a1-48cc-8f2a-787959971e5c
sockets: 1
vmgenid: e542ac38-8338-4785-9614-baa2c04ffe75


root@pve:/etc/pve# qm config 102
bios: ovmf
boot: order=scsi0;ide2
cores: 6
efidisk0: VM2:vm-102-disk-1,size=4M
ide2: none,media=cdrom
memory: 16384
name: Chia
net0: virtio=06:91:7C:0D:1C:35,bridge=vmbr1,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: VM2:vm-102-disk-0,size=14500G
scsihw: virtio-scsi-pci
smbios1: uuid=bcf3793f-075f-4365-8669-482d91a0b77d
sockets: 1
vmgenid: 58625eb6-7cc1-4c8a-8ef6-9f71ca5db5a6


root@pve:/etc/pve# cat nodes/pve/lxc/105.conf
arch: amd64
cores: 6
hostname: Chia-CLI
memory: 16384
net0: name=eth0,bridge=vmbr1,firewall=1,hwaddr=3A:20:D9:7C:87:5C,ip=dhcp,type=veth
ostype: ubuntu
rootfs: VM1:vm-105-disk-0,size=8T
swap: 512
unprivileged: 1


root@pve:~# cat /etc/pve/nodes/pve/lxc/201.conf
arch: amd64
cores: 2
hostname: PiHole
memory: 1024
net0: name=eth0,bridge=vmbr1,firewall=1,gw=192.168.1.1,hwaddr=16:76:F8:82:E1:49,ip=192.168.1.50/24,type=veth
onboot: 1
ostype: ubuntu
rootfs: VM1:vm-201-disk-0,size=8G
swap: 512
unprivileged: 1
 
Last edited:
From the bits I can read in the parts of the traces shown, I'd suspect it's something the VMs cause. One thing that got my attention is that your VMs are using the default CPU type "kvm64" (as no CPU property is set in the config) - can you try setting that to host (in the VM's Hardware panel, edit CPU)? More a hunch than anything else.

You should also still have the 5.4 kernel installed, so you could choose that at boot to see whether this is a regression in the 5.11 kernel in combination with your HW/usage/configuration - but please try the host CPU change for the VMs first.
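(The same change can be made from the CLI; a sketch using the VM IDs from the configs above, equivalent to the GUI edit described.)

Code:
qm set 100 --cpu host   # expose the host's CPU flags to the guest instead of kvm64
qm set 102 --cpu host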
 
Just an update: I changed the CPU type to "host" for the VMs, and the VMs seemed to run fine. The only container running was my Pi-hole (container 201); I did not start container 105 (which has a large 8 TB rootfs). No crashes overnight.

At 11:30 today I then started container 105, which runs Chia using only the CLI. Another crash was seen after an hour, so I suspect something about this specific container. I will create a new container with a similar setup to see if that works.

I have also updated sysctl to enable core dump writes, so if there is another kernel panic I am hoping to have more info to debug.
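(The poster didn't say which settings were changed; as a rough sketch, the panic-related sysctls involved might look like the following - the values and file name are my assumption, and capturing an actual kernel core would additionally need a kdump/crash setup.)

Code:
# /etc/sysctl.d/99-panic-debug.conf  (hypothetical file name)
kernel.panic = 0                              # don't auto-reboot, keep the trace on the console
kernel.panic_on_oops = 1                      # escalate an oops to a panic instead of limping on
kernel.core_pattern = /var/crash/core.%e.%p   # where userspace core dumps get written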
 
I am in the same boat. Over the weekend I upgraded a development server (but still important to me) that has an AMD processor from 6.4 to 7.0. It is crashing every night with a kernel panic: "Fatal exception on interrupt".

I have booted the last 6.4 kernel, 5.4.124-1, but now none of the VMs will boot. Everything worked just fine on 6.4. This is the only AMD server I have; its processor is an AMD FX(tm)-8320 Eight-Core Processor.

The guests on this server are a single Windows Server 2019, two CentOS 7 guests running 5.x kernels, and a Rocky Linux 8.
 
Update: I deleted the VMs that had been created on the original 6.4 and recreated them in 7.0, but still the same problem - a crash, more than once per day. I tried various combinations, such as only running a select few VMs, but the moment a VM had any extra demand for memory it brought the entire system down.

On boot-up, the GRUB menu still shows the old Proxmox kernels; when I select any of these the machine boots fine, but networking does not work at all - I could not ping anything even though no changes were made to the network settings, and those settings all show as configured correctly. ip a shows an address and systemd-resolve --status shows DNS pointing correctly, but there is simply no access. It seems the 7 upgrade has broken networking on the older kernels, making those GRUB choices useless as-is.

I've seen others reporting the same issue, also running AMD CPUs - which may be a common factor here with the latest Proxmox 7.0. Strange that this was working perfectly on 6.4.
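(Not diagnosed in this thread, but the usual first checks in that situation are below; one common cause after a major upgrade is the physical NIC getting a new name so that the bridge-ports entry in /etc/network/interfaces no longer matches.)

Code:
ip -br addr                                      # interface names and addresses, at a glance
ip route                                         # is a default route present?
grep -A4 'iface vmbr0' /etc/network/interfaces   # bridge-ports must name an existing NIC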
 


What I have noticed is this: I allocate, say, 16 GB of memory to a container in the Proxmox settings, and within the container I run Ubuntu Server 20.04. If I only use 4 GB inside the container it runs fine, but the moment I push an internal job to use 8 GB (still only 50% of the allocation shown in Proxmox) it dies quickly. So perhaps a memory issue. I have Corsair Vengeance RGB PRO DDR4-3600, 4x 16 GB, C18.
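(To compare the limit set on the PVE side with what the container actually sees, something like the following would do - a generic sketch, using container 105 as an example ID.)

Code:
pct config 105 | grep -E 'memory|swap'   # limits configured on the host
pct exec 105 -- free -m                  # memory as seen from inside the container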

 
The AMD host was dead again this morning with only one guest running overnight, so for me I think that might rule out a guest doing something weird (maybe). I took a photo of the console, but it is pretty much the same as the ones further back in this thread. My host is NOT production-grade hardware: only 16 GB of RAM, 2x 3 TB Western Digital HDDs, 1x 1 TB Seagate HDD and a 512 GB HP SSD. I have a similar host but with an Intel CPU, which I upgraded to 7 before the AMD one; it is running fine and is now hosting a couple of the VMs from the AMD host that I copied across yesterday.

This issue seems to be related to the AMD CPU, but I am no hardware expert so I am not sure where to go next to help resolve it.
 
I had the same issue last week; however, like Vactis, mine was with kvm_intel, not kvm_amd, and specifically with kernel 5.11.x (Proxmox 7). I spent the better part of the other night researching it and unfortunately did not bookmark where I found the info, but someone in an Ubuntu kernel thread suggested there were changes in 5.11 that cause issues on older BIOS/CPU systems (instruction sets and microcode were mentioned) from before winter 2019, which mine was. I updated the BIOS on my system to a late-2020 release (which is the current one) and so far it has been up almost 48 hours without issue, but only more time will tell.
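(For anyone wanting to check where they stand before a BIOS update, the currently loaded microcode revision can be read with standard tools - generic commands, not output from this thread.)

Code:
grep -m1 microcode /proc/cpuinfo   # microcode revision reported by the CPU
dmesg | grep -i microcode          # whether an early microcode update was applied at boot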
 
I am going to go back through the PVE logs and see if anything jumps out in terms of an activity that might coincide with the panics; this morning's was at 04:58. Whilst I am not a kernel guru, I have been using Linux for many years, and now that I have effectively emptied this host I am happy to help by using it to try to collect more information towards a resolution. The only problem with that is that it is only panicking once per day for me.

I am not using ZFS; I have LVM, ext4 and NFS storage (2x NFSv3 and 1x NFSv4). All the other mounts appear to be the standard Linux system ones.
 

I too am not using ZFS, but LVM. I have one large VM with 14500 GB (14 TB) allocated to do some Chia mining; the VM has 16 GB of memory. If I plot using the default 4 GB of RAM it works fine for a long time, but if I run it with 8 GB it crashes very quickly, often within minutes. So it seems related to memory...

I have rebuilt this VM (using Ubuntu Server 20.04) and have the same issue. Having searched the web, there are mentions of an OOM error by a few people, but I'm not seeing anything logged. It was working fine before moving to Proxmox 7, so it's definitely a regression somewhere.
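(A quick way to confirm or rule out the OOM killer, on the host and inside the guest - standard commands, my suggestion rather than something from the thread.)

Code:
journalctl -k | grep -iE 'out of memory|oom'   # host-side kernel log
dmesg | grep -iE 'oom|killed process'          # run inside the guest as well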

The other common factor, it seems, is that this is affecting many people with AMD CPUs. There is a new beta AMD BIOS update (AMD AM4 AGESA Combo V2 PI 1.2.0.3 Patch B) I might try if I cannot find another workaround for this issue.
 
