Kernel Panic, whole server crashes about every day

galeido · Aug 3, 2021

It seems that our machines running version 5.11.22-6 crash after 6-12 hours, even without running any virtual machines. The -5 version has worked normally so far.

tikismoke · Aug 3, 2021

galeido said:
It seems that our machines running version 5.11.22-6 crash after 6-12 hours, even without running any virtual machines. The -5 version has worked normally so far.

Same for me: upgrading the kernel makes it crash randomly, but it was stable before (with the native option). If this kernel patch was done for this, it seems to work on my R710 running some containers and a VM for TrueNAS.

fiona · Aug 4, 2021

galeido said:
It seems that our machines running version 5.11.22-6 crash after 6-12 hours, even without running any virtual machines. The -5 version has worked normally so far.

Was there any other kind of load on the machines that crashed? Did you manage to get a crash trace via netconsole this time?

Hi,

tikismoke said:
Same for me: upgrading the kernel makes it crash randomly, but it was stable before (with the native option). If this kernel patch was done for this, it seems to work on my R710 running some containers and a VM for TrueNAS.

which kernel were you using before? Are you still using the aio=native option? Please also try to obtain a crash trace, see here.

tikismoke · Aug 4, 2021

Fabian_E said:
which kernel were you using before? Are you still using the aio=native option? Please also try to obtain a crash trace, see here.

5.11.22-5 seems stable for me.
Yes, I didn't change the VM config, and it crashed again this morning on 5.11.22-6.

galeido · Aug 5, 2021

Fabian_E said:
Was there any other kind of load on the machines that crashed? Did you manage to get a crash trace via netconsole this time?

Unfortunately, I don't get any crash trace over netconsole, although we have tried to capture relevant information. Anyway, -5 seems to be stable, while -6 crashes every 1 to 3 hours without any traffic, just idling.

Longest stable period was 6 hours.

fiona · Aug 5, 2021

galeido said:
Unfortunately, I don't get any crash trace over netconsole, although we have tried to capture relevant information. Anyway, -5 seems to be stable, while -6 crashes every 1 to 3 hours without any traffic, just idling.

Longest stable period was 6 hours.

Just to make sure: did you verify that netconsole was set up properly, i.e. do you get the other syslog messages?
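One way to check this is to load netconsole by hand and push a test message through it. The sketch below only builds the module parameter. It assumes the `netconsole=` format from the kernel documentation with the default ports 6665 (source) and 6666 (target); the function name and every address in the usage note are placeholders, not values from this thread.

```shell
# Build the netconsole module parameter.
# $1: local IP, $2: local interface, $3: receiver IP, $4: receiver MAC
# Assumption: default ports 6665 (source) and 6666 (target) are used.
netconsole_param() {
    printf 'netconsole=6665@%s/%s,6666@%s/%s\n' "$1" "$2" "$3" "$4"
}
```

For example, run `modprobe netconsole "$(netconsole_param 192.0.2.10 eth0 192.0.2.20 aa:bb:cc:dd:ee:ff)"` on the crashing host, start a UDP listener on port 6666 on the receiver, and then run `echo netconsole-test > /dev/kmsg` on the host. If that test line does not arrive, a crash trace will not arrive either. Depending on the console log level, `dmesg -n 8` may be needed first.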

Could you do the following on the idling machine experiencing the crashes:

Code:

cd /sys/kernel/tracing
cat available_filter_functions | grep io_uring > set_graph_function
echo function_graph > current_tracer

The second command might take a bit of time to complete.

Then wait (at least a few minutes) and do

Code:

tail -n 2000 trace > /tmp/io_uring_trace.txt

Please share the resulting file /tmp/io_uring_trace.txt.
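If the machine only stays up for a short time, the steps above can be wrapped into one function so that nothing has to be typed while waiting. This is a minimal sketch: the function name and the overridable arguments are made up for illustration, and switching back to the `nop` tracer at the end is an addition that is not part of the steps above.

```shell
# Sketch: capture an io_uring function-graph trace, then switch tracing off.
# $1: tracefs directory, $2: output file, $3: seconds to wait (all optional).
capture_io_uring_trace() {
    tracing_dir=${1:-/sys/kernel/tracing}
    out_file=${2:-/tmp/io_uring_trace.txt}
    wait_secs=${3:-300}

    # Restrict the graph tracer to functions with io_uring in their name.
    grep io_uring "$tracing_dir/available_filter_functions" \
        > "$tracing_dir/set_graph_function" || return 1
    echo function_graph > "$tracing_dir/current_tracer" || return 1

    sleep "$wait_secs"

    # Keep only the most recent 2000 lines of the trace.
    tail -n 2000 "$tracing_dir/trace" > "$out_file"

    # Assumption: return to the default "nop" tracer when done.
    echo nop > "$tracing_dir/current_tracer"
}
```

Called without arguments as root, it waits five minutes and writes /tmp/io_uring_trace.txt.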

Lutris · Aug 7, 2021

Intel i7-7700K, 5 VMs. I had lots of trouble initially after updating to 7.x and could get it to crash almost consistently while running a heavy I/O load with fio.

Updated to 5.11.22-6 yesterday, and it has been running without any trouble so far. I can't get it to crash anymore with fio either. So far so good on Intel.

Code:

root@prox:~# uptime
 16:57:40 up 1 day,  1:14,  1 user,  load average: 0.27, 0.44, 0.66
root@prox:~# uname -a
Linux prox 5.11.22-3-pve #1 SMP PVE 5.11.22-6 (Wed, 28 Jul 2021 10:51:12 +0200) x86_64 GNU/Linux
root@prox:~#
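Note that the two version strings in that output differ: `5.11.22-3-pve` is the kernel release name, while the `5.11.22-6` after `PVE` is, as far as I understand, the package version that "-5" and "-6" in this thread refer to. A small sketch for pulling that second value out of `uname -v`; the function name is made up, and the parsing assumes the `PVE <version>` layout shown above.

```shell
# Extract the package version that follows "PVE" in a `uname -v` string.
# Reads the string from $1, or from `uname -v` when no argument is given.
pve_kernel_pkg_version() {
    version_string=${1:-$(uname -v)}
    printf '%s\n' "$version_string" |
        sed -n 's/.*PVE \([0-9][0-9.]*-[0-9][0-9]*\).*/\1/p'
}
```

It prints nothing when the string does not contain a `PVE <version>` part.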

boopzz · Aug 8, 2021

I've also been hitting this issue the past few days. I've been plotting in a VM onto an NVMe drive, and whenever I've tried to copy off the NVMe onto spinning rust, whether that's an NFS target or the drive reattached to a single VM or LXC, I've been getting kernel panics after between 500MB and 12GB of traffic transferred. I have now installed the non-free amd64-microcode package and added the aio=native option to all drives, and straight away I haven't had an issue. I had the issue on 6.4, upgraded to 7 yesterday, and still had it.

Ryzen 3900X
MSI B450 Tomahawk
32GB DDR4 RAM
2x Samsung 870 EVO 1TB (ZFS root mirror 1)
1x Sabrent 2TB NVMe
Mix of 16TB and 8TB disks, either WD Red, IronWolf or Exos
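For anyone wanting to try the same workaround: `aio=native` is a per-disk option in the VM configuration. The sketch below only prints the `qm set` command for one disk line taken from `qm config`, so it can be reviewed before running anything. The function name, VM ID 100 and the `local-lvm` volume in the usage note are hypothetical.

```shell
# Print (not run) the qm command that would set aio=native on one disk.
# $1: VM ID, $2: one disk line as printed by `qm config <vmid>`.
aio_native_cmd() {
    vmid=$1
    key=${2%%:*}      # disk slot, e.g. scsi0
    value=${2#*: }    # everything after "scsi0: "
    # Drop any existing aio=... option before appending the new one.
    value=$(printf '%s\n' "$value" | sed -e 's/,aio=[a-z_]*//')
    printf 'qm set %s --%s %s,aio=native\n' "$vmid" "$key" "$value"
}
```

For example, `aio_native_cmd 100 'scsi0: local-lvm:vm-100-disk-0,size=32G'` prints a `qm set 100 --scsi0 ...` line with `aio=native` appended. CD-ROM entries should be skipped when feeding it lines from `qm config`.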

flames · Aug 9, 2021

Have you disabled the C6 state in your BIOS/UEFI?
In my case this helped with a Ryzen 5800X on a Gigabyte X570 board with Kingston ECC unbuffered RAM. For me, the setting is: Power idle control: Typical current idle.
It could be named differently on your specific mainboard.

entilza · Aug 9, 2021

This thread continues to show a variety of results with the newer kernel, meaning that there is / was an issue with the current kernel. However, I am still not sold on Proxmox 7.0... 6.4 is rock solid, so I am going to continue to wait. I don't feel the entire professional/enterprise community has rushed to upgrade to 7.0 when 6.0 was basically just released at the start of the year... Plus, odd numbers have a history of bad luck in computers.

flames · Aug 9, 2021

My nodes with Ryzen 5800X, 5900X and 5950X crashed all the time (at least once per day, with or without load, even without a single VM/CT on them) with PVE 6.4 on both kernels 5.4 and 5.11, and with PVE 7.0. After disabling the C6 state, all of them have now been running rock solid for weeks.
The troubleshooting was hard. Since it's "desktop" hardware, I first thought the non-ECC RAM was the issue, so I bought ECC unbuffered RAM: still crashes. Then research led to something involving the kernel and AMD hardware, so I updated from 5.4 to 5.11: still crashes. PVE 7.0 had just come out, so I updated: still crashes. Then I sat there and tested tons of BIOS settings, which took a lot of time because the crashes were sporadic. Then I found a post from last year on a Linux forum where the C6 state was mentioned with Buster + Ryzen 3600... tried it out, and bam!
Try it out; I would be very interested in your results.

flames · Aug 9, 2021

Check kern.log to see whether you have something like this:

Code:

mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 1: bc800800060c0859
mce: [Hardware Error]: TSC 0 ADDR 41576d480 MISC d012000000000000 IPID 100b000000000
mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1628433454 SOCKET 0 APIC f microcode a201009

These could also be interesting:
https://bugzilla.kernel.org/show_bug.cgi?id=212087
https://github.com/r4m0n/ZenStates-Linux
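The log check mentioned above can be scripted so it is easy to repeat after each crash. A minimal sketch, assuming the kernel log lives at /var/log/kern.log and that the messages look like the sample above; the function name is made up.

```shell
# Count machine-check "Hardware Error" lines in a kernel log file.
# $1: log file to scan (defaults to /var/log/kern.log).
count_mce_lines() {
    log_file=${1:-/var/log/kern.log}
    # grep -c prints the number of matching lines (0 when there are none).
    grep -c 'mce: \[Hardware Error\]' "$log_file"
}
```

A non-zero count after a crash suggests looking at the hardware side (such as the C6 setting) before anything else.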

galeido · Aug 14, 2021

Has the problem persisted after the upgrade to -6? We rolled the lab back to version 6.4 and are waiting for the problems to clear up. Unfortunately, the original host is no longer available for analysis. It crashed/froze right at boot.

tikismoke · Aug 14, 2021

galeido said:
Has the problem persisted after the upgrade to -6? We rolled the lab back to version 6.4 and are waiting for the problems to clear up. Unfortunately, the original host is no longer available for analysis. It crashed/froze right at boot.

For me, on my Dell R710 (Intel CPU), it's still present. It crashes after between 26 and at most 43 hours.

TM876 · Aug 21, 2021

flames said:
My nodes with Ryzen 5800X, 5900X and 5950X crashed all the time (at least once per day, with or without load, even without a single VM/CT on them) with PVE 6.4 on both kernels 5.4 and 5.11, and with PVE 7.0. After disabling the C6 state, all of them have now been running rock solid for weeks.
The troubleshooting was hard. Since it's "desktop" hardware, I first thought the non-ECC RAM was the issue, so I bought ECC unbuffered RAM: still crashes. Then research led to something involving the kernel and AMD hardware, so I updated from 5.4 to 5.11: still crashes. PVE 7.0 had just come out, so I updated: still crashes. Then I sat there and tested tons of BIOS settings, which took a lot of time because the crashes were sporadic. Then I found a post from last year on a Linux forum where the C6 state was mentioned with Buster + Ryzen 3600... tried it out, and bam!
Try it out; I would be very interested in your results.

You legend! I don't want to get ahead of myself, but so far uptime is a day with C-states disabled in the BIOS. Now the only issue I'm having with kernel 5.11 is LVM not attaching/activating properly at boot, whereas it was fine with 5.4. My thread on it is here if anyone can help.

flames · Aug 21, 2021

Hey, not my achievement, I just took my time to search the web.
If you can afford more crashes for the sake of testing, please set your BIOS to defaults first and then _only_ set the following options (or the equivalents for your BIOS):
SVM = enable (virtualization, a.k.a. VT-x in the Intel world)
IOMMU = enable (default = auto on most X570)
Power idle control = Typical current idle (C-state 6 disabled on some X570)

Do not change anything else (no need to disable C-states completely, or "AMD Cool'n'Quiet" or anything like that; also no need to set B2 stepping). Just leave everything else at default.
Is it stable? I would appreciate your info. Thanks.

galeido · Aug 21, 2021

Fabian_E said:
Just to make sure: did you verify that netconsole was set up properly, i.e. do you get the other syslog messages?

Could you do the following on the idling machine experiencing the crashes:

Code:

cd /sys/kernel/tracing
cat available_filter_functions | grep io_uring > set_graph_function
echo function_graph > current_tracer

The second command might take a bit of time to complete.

Then wait (at least a few minutes) and do

Code:

tail -n 2000 trace > /tmp/io_uring_trace.txt

Please share the resulting file /tmp/io_uring_trace.txt.

Hi @Fabian_E ,

I tried yet again to collect data from the crashing node. The node keeps crashing after a few minutes, and I can't get any relevant output from the logs.

phreeky82 · Aug 24, 2021

I'm a Proxmox newb and got everything up and running a couple of weeks ago. I sat there and played around a little but didn't bother keeping any VMs running for more than an hour or two, then decided to kick off some proper builds. The next day the server was unresponsive; I couldn't get any response even from the console. I did a hard reboot and it did it again the next day, and the next, and then I found this thread.

I set all of my disks to aio=native and have now been running about 6 VMs 24/7 for a week, rock solid, until today, when I kicked off backups for 3 of the VMs in parallel and it suddenly locked up hard again.

I first suspected hardware, even though it had been running bare metal fine, but then the aio=native thing and this thread were telling me something else.

It's an older Dell server (R510) with a single Xeon processor, so ECC memory and all SAS disks with no error codes. I was about to move it all across to a spare almost identical server (just more RAM), but after today I'm now considering whether I should go for an older version of Proxmox...

edit: omg *facepalm* I think I hit the node shutdown button instead of for a VM! So I take the above back, EXCEPT the bit where aio=native did wonders.

chrcoluk · Aug 24, 2021

For those on non-ECC RAM: if you want to be sure RAM isn't the problem, you can 'apt install stressapptest'.

Then, e.g. with 32 GB of RAM, test 28 GB of it for an hour: 'stressapptest -s 3600 -M 28000 -W'.
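The 28000 in that example leaves a few GB free for the host. A sketch for deriving a comparable size on other machines from /proc/meminfo; the 85% default and the function name are my own assumptions, not a stressapptest recommendation.

```shell
# Print a test size in MB as a percentage of total RAM.
# $1: meminfo file (defaults to /proc/meminfo), $2: percent (defaults to 85).
stress_mb() {
    meminfo=${1:-/proc/meminfo}
    percent=${2:-85}
    # MemTotal is reported in kB; convert to MB and apply the percentage.
    awk -v p="$percent" '/^MemTotal:/ { printf "%d\n", $2 / 1024 * p / 100 }' "$meminfo"
}
```

It could then be used as 'stressapptest -s 3600 -M "$(stress_mb)" -W'. Lower the percentage if VMs are still running on the host.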

t.lamprecht · Aug 25, 2021

Can you please open a new thread for that? The current one is rather about io_uring and a bug that we actually managed to reproduce after a lot of testing and also fixed together with the upstream kernel devs.

pvesr itself does not use io_uring at all currently, so this seems rather unrelated to this thread and is possibly an issue with ZFS on your system.

It would also be great if you could add some details about the system (CPU/motherboard) and the ZFS setup in the new thread, thanks!
