Kernel Panic, whole server crashes about every day

galeido

New Member
Aug 2, 2021
9
0
1
35
It seems that our machines running version 5.11.22-6 crash after 6-12 hours, even without running any virtual machines. Version -5 has worked normally so far.
 

tikismoke

Member
Nov 3, 2018
11
1
8
41
galeido said:
It seems that our machines running version 5.11.22-6 crash after 6-12 hours, even without running any virtual machines. Version -5 has worked normally so far.
Same for me: upgrading the kernel makes it crash randomly, but it was stable before (with the native option). If this kernel patch was done for this seems to work on my R710 running some containers and a VM for TrueNAS.
 

Fabian_E

Proxmox Staff Member
Staff member
Aug 1, 2019
1,337
206
68
galeido said:
It seems that our machines running version 5.11.22-6 crash after 6-12 hours, even without running any virtual machines. Version -5 has worked normally so far.
Was there any other kind of load on the machines that crashed? Did you manage to get a crash trace via netconsole this time?

Hi,
tikismoke said:
Same for me: upgrading the kernel makes it crash randomly, but it was stable before (with the native option). If this kernel patch was done for this seems to work on my R710 running some containers and a VM for TrueNAS.
Which kernel were you using before? Are you still using the aio=native option? Please also try to obtain a crash trace, see here.
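
For anyone setting this up from scratch, a minimal sketch of loading netconsole might look like the following. The IP addresses, the interface name (eno1) and the MAC address are placeholders, not values from this thread, and netconsole_param is a hypothetical helper. The parameter layout follows the kernel's netconsole documentation. The script only prints the modprobe command so it can be reviewed before running it as root.

```shell
#!/bin/sh
# Build a netconsole module parameter string.
# Layout: <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# 6665 and 6666 are the default source and target ports.
netconsole_param() {
    src_ip=$1; dev=$2; tgt_ip=$3; tgt_mac=$4
    printf '%s@%s/%s,%s@%s/%s\n' 6665 "$src_ip" "$dev" 6666 "$tgt_ip" "$tgt_mac"
}

# Placeholder values (assumptions): replace with the crashing host's IP and
# NIC, and the IP and MAC address of the machine that collects the messages.
echo "modprobe netconsole netconsole=$(netconsole_param 192.0.2.10 eno1 192.0.2.20 00:11:22:33:44:55)"
```

On the receiving machine a UDP listener on port 6666 is needed (for example netcat; the exact flags differ between netcat variants). If ordinary kernel messages do not arrive there, a crash trace will not arrive either.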
 

galeido

New Member
Aug 2, 2021
9
0
1
35
Fabian_E said:
Was there any other kind of load on the machines that crashed? Did you manage to get a crash trace via netconsole this time?

Unfortunately, I don't get any crash trace over netconsole, although we have tried to get relevant information. Anyway, -5 seems to be stable, while -6 crashes every 1 to 3 hours without any traffic, just idling.

Longest stable period was 6 hours.
 

Fabian_E

Proxmox Staff Member
Staff member
Aug 1, 2019
1,337
206
68
galeido said:
Unfortunately, I don't get any crash trace over netconsole, although we have tried to get relevant information. Anyway, -5 seems to be stable, while -6 crashes every 1 to 3 hours without any traffic, just idling.

Longest stable period was 6 hours.
Just to make sure: did you verify that netconsole was set up properly, i.e. do you get the other syslog messages?

Could you do the following on the idling machine experiencing the crashes:
Code:
cd /sys/kernel/tracing
cat available_filter_functions | grep io_uring > set_graph_function
echo function_graph > current_tracer
The second command might take a bit of time to complete.

Then wait (at least a few minutes) and do
Code:
tail -n 2000 trace > /tmp/io_uring_trace.txt
Please share the resulting file /tmp/io_uring_trace.txt.
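
Once the trace has been captured, the tracer can be switched off again. A minimal sketch, assuming the same tracefs path as in the commands above; reset_tracer is a hypothetical helper name, not part of any tool:

```shell
#!/bin/sh
# Undo the function_graph tracing set up above.
# The first argument is the tracefs mount point (defaults to the path used above).
reset_tracer() {
    dir=${1:-/sys/kernel/tracing}
    echo nop > "$dir/current_tracer"      # switch back to the no-op tracer
    echo > "$dir/set_graph_function"      # clear the io_uring function filter
}
```

Run it as root on the affected host by calling reset_tracer without arguments, after /tmp/io_uring_trace.txt has been saved.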
 

SilentFez

New Member
Apr 17, 2020
9
1
3
Intel i7-7700K, 5 VMs. Had lots of trouble initially after updating to 7.x. Could get it to crash almost consistently while running a heavy I/O load with fio.

Updated to 5.11.22-6 yesterday, been running without any trouble so far. Can't get it to crash anymore with FIO either. So far so good on Intel.

Code:
root@prox:~# uptime
 16:57:40 up 1 day,  1:14,  1 user,  load average: 0.27, 0.44, 0.66
root@prox:~# uname -a
Linux prox 5.11.22-3-pve #1 SMP PVE 5.11.22-6 (Wed, 28 Jul 2021 10:51:12 +0200) x86_64 GNU/Linux
root@prox:~#
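
Note that the uname output above carries two version strings: the kernel release (5.11.22-3-pve) and what appears to be the package version (5.11.22-6), which is the "-5"/"-6" number discussed in this thread. A small sketch for pulling out the latter, assuming the `uname -v` string has the "PVE <version>" form shown above; pve_kernel_pkg_version is a hypothetical helper name:

```shell
#!/bin/sh
# Extract the PVE kernel package version from a `uname -v` style string.
# Prints nothing if the string does not contain "PVE <version>".
pve_kernel_pkg_version() {
    printf '%s\n' "$1" | sed -n 's/.*PVE \([0-9][0-9.]*-[0-9][0-9]*\).*/\1/p'
}

# Report the package version of the currently running kernel.
pve_kernel_pkg_version "$(uname -v)"
```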
 

boopzz

Active Member
Dec 6, 2016
43
3
28
UK
I've also been hitting this issue the past few days. I've been plotting using a VM onto an NVMe drive. Whenever I've tried to copy off the NVMe onto spinning rust, whether that's an NFS target or reattaching the drive to a single VM or LXC, I've been getting kernel panics after between 500MB and 12GB of traffic transferred. I have now installed the non-free amd64-microcode package and added the aio=native option to all drives, and straight away haven't had an issue. I had the issue on 6.4, upgraded to 7 yesterday, and still had it.

Ryzen 3900X
MSI B450 Tomahawk
32GB DDR4 RAM
2x Samsung 870 EVO 1TB (ZFS root mirror 1)
1x Sabrent 2TB NVMe
Mix of 16TB and 8TB disks, either WD Reds, IronWolf or Exos
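
For reference, the aio=native option mentioned above can be added to a drive with `qm set`. The sketch below is hedged: the VM ID, storage name and disk name are made up for illustration, aio_native_cmd is a hypothetical helper, and the script only prints the command rather than running it. It takes one drive line as shown by `qm config <vmid>` and appends the option.

```shell
#!/bin/sh
# Turn one drive line from `qm config <vmid>` into a `qm set` command that
# appends aio=native. The command is only printed, never executed here.
aio_native_cmd() {
    vmid=$1
    line=$2                       # e.g. "scsi0: local-lvm:vm-100-disk-0,size=32G"
    key=${line%%:*}               # drive key before the first colon (scsi0)
    val=${line#*: }               # drive spec after "key: "
    case ",$val," in
        *,aio=*) echo "# $key already sets aio, left unchanged" ;;
        *)       echo "qm set $vmid --$key $val,aio=native" ;;
    esac
}

# Illustrative values only (assumptions, not taken from this thread).
aio_native_cmd 100 "scsi0: local-lvm:vm-100-disk-0,size=32G"
```

The change normally only takes effect after the VM has been fully stopped and started again.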
 

flames

Member
Feb 8, 2018
63
9
13
41
Have you disabled the C6 state in your BIOS/UEFI?
In my case this helped with Ryzen 5800X on a Gigabyte X570 board and Kingston ECC unbuffered RAM. For me it is the setting: Power idle control: Typical current idle.
Could be different on your specific mainboard.
 
Jan 6, 2021
82
12
8
46
This thread continues to show a variety of results with the newer kernel, meaning that there is (or was) an issue with the current kernel. However, I am still not sold on Proxmox 7.0... 6.4 is rock solid, so I am going to continue to wait. I don't feel the entire professional/enterprise community has rushed to upgrade to 7.0 when 6.0 was basically just released at the start of the year... Plus odd numbers have a history of bad luck in computers :)
 

flames

Member
Feb 8, 2018
63
9
13
41
My nodes with Ryzen 5800X, 5900X and 5950X crashed all the time (at least once per day, with or without load, even without a single VM/CT on them) with PVE 6.4 on both kernels 5.4 and 5.11, and with PVE 7.0. After disabling the C6 state, all of them have been running rock solid for weeks.
The troubleshooting was hard. Since it's "desktop" hardware, I first thought the non-ECC RAM was the issue, so I bought ECC unbuffered RAM: still crashes. Then research pointed to something with the kernel and AMD hardware, so I updated from 5.4 to 5.11: still crashes. PVE 7.0 had just come out, so I updated: still crashes. Then I sat there and tested tons of BIOS settings, which took a lot of time because the crashes were sporadic. Then I found a post from last year on a Linux forum where the C6 state was mentioned with Buster + Ryzen 3600 ... tried it out, and bam!
Try it out, i would be very interested in your results.
 

flames

Member
Feb 8, 2018
63
9
13
41
Check kern.log to see if you have something like this:

Code:
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 1: bc800800060c0859
mce: [Hardware Error]: TSC 0 ADDR 41576d480 MISC d012000000000000 IPID 100b000000000
mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1628433454 SOCKET 0 APIC f microcode a201009

These could also be interesting:
https://bugzilla.kernel.org/show_bug.cgi?id=212087
https://github.com/r4m0n/ZenStates-Linux
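
The kern.log check above can be sketched as a small script. This assumes the log lives at /var/log/kern.log (on some systems kernel messages are only in the journal) and that machine-check lines carry the "mce: [Hardware Error]" prefix shown in the excerpt; mce_count is a hypothetical helper name.

```shell
#!/bin/sh
# Count machine-check lines in a kernel log file.
mce_count() {
    grep -c 'mce: \[Hardware Error\]' "$1"
}

# Assumed default location; adjust if your system logs elsewhere.
log=/var/log/kern.log
if [ -r "$log" ]; then
    echo "$(mce_count "$log") machine-check lines in $log"
fi
```

A non-zero count after a crash points more towards a hardware or firmware level problem than towards a pure software bug.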
 

galeido

New Member
Aug 2, 2021
9
0
1
35
Has the problem persisted after the upgrade to -6? We rolled the lab back to version 6.4 and are waiting for the problems to clear up. Unfortunately, the original host is no longer online for analysis: it crashed/froze straight at boot.
 

tikismoke

Member
Nov 3, 2018
11
1
8
41
galeido said:
Has the problem persisted after the upgrade to -6? We rolled the lab back to version 6.4 and are waiting for the problems to clear up. Unfortunately, the original host is no longer online for analysis: it crashed/froze straight at boot.
For me, on my Dell R710 (Intel CPU), it's still present. It crashes after between 26h and at most 43h.
 

TM876

New Member
Jul 8, 2021
3
1
3
flames said:
My nodes with Ryzen 5800X, 5900X and 5950X crashed all the time (at least once per day, with or without load, even without a single VM/CT on them) with PVE 6.4 on both kernels 5.4 and 5.11, and with PVE 7.0. After disabling the C6 state, all of them have been running rock solid for weeks.
The troubleshooting was hard. Since it's "desktop" hardware, I first thought the non-ECC RAM was the issue, so I bought ECC unbuffered RAM: still crashes. Then research pointed to something with the kernel and AMD hardware, so I updated from 5.4 to 5.11: still crashes. PVE 7.0 had just come out, so I updated: still crashes. Then I sat there and tested tons of BIOS settings, which took a lot of time because the crashes were sporadic. Then I found a post from last year on a Linux forum where the C6 state was mentioned with Buster + Ryzen 3600 ... tried it out, and bam!
Try it out, I would be very interested in your results.
You legend, I don't want to get ahead of myself but so far uptime is a day with c states disabled in BIOS. Now the only issue I'm having with kernel 5.11 is lvm not attaching/activating properly at boot when it was fine with 5.4. My thread on it here if anyone can help.
 

flames

Member
Feb 8, 2018
63
9
13
41
Hey, not my achievement, I just took my time to search the web.
If you can afford another crash for the sake of testing, please set your BIOS to defaults first and then _only_ set the following options (or the equivalents for your BIOS):
SVM = enable (virtualization, the equivalent of VT-x in the Intel world)
IOMMU = enable (default = auto on most X570 boards)
Power idle control = Typical current idle (C-state 6 disabled on some X570 boards)

Do not change anything else. There is no need to disable C-states completely, or "AMD Cool&Quiet" or similar, and no need to set B2 stepping either. Just leave everything else at default.
Is it stable? I would appreciate your info. Thanks.
 

galeido

New Member
Aug 2, 2021
9
0
1
35
Hi @Fabian_E ,

I tried to collect data yet again from the crashing node. The node keeps crashing after a few minutes, and I can't get any relevant output from the logs.
 

phreeky82

New Member
Aug 24, 2021
1
0
1
39
I'm a Proxmox newb, got everything up and running a couple of weeks ago. Sat there and played around a little but didn't bother keeping any VMs running for more than an hour or two, then decided to kick off some proper builds. The next day the server was unresponsive; I couldn't get any response even from the console. Hard reboot and it did it again the next day, and the next, and then I found this thread.

I set all of my disks to aio=native and have now been running about 6 VMs 24/7 for a week, rock solid, until today, when I kicked off backups for 3 of the VMs in parallel and it suddenly locked up hard again.

I first suspected hardware, even though it had been running bare metal fine, but then the aio=native thing and this thread were telling me something else.

It's an older Dell server (R510) with a single Xeon processor, so ECC memory and all SAS disks with no error codes. I was about to move it all across to a spare almost identical server (just more RAM), but after today I'm now considering whether I should go for an older version of Proxmox...

edit: omg *facepalm* I think I hit the node shutdown button instead of for a VM! So I take the above back, EXCEPT the bit where aio=native did wonders.
 

chrcoluk

Member
Oct 7, 2018
115
16
23
42
For those on non-ECC RAM: if you want to be sure RAM isn't the problem, you can 'apt install stressapptest'.

Then, e.g. with 32 GB of RAM, test 28 GB of it for an hour: 'stressapptest -s 3600 -M 28000 -W'
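
To adapt the size to a machine with a different amount of RAM, something like the following sketch could be used. It reads MemTotal from /proc/meminfo and suggests a percentage of it in MB; the 85% default is an assumption (the example above tests roughly 87% of 32 GB), and stress_mb is a hypothetical helper name. The script only prints the stressapptest command.

```shell
#!/bin/sh
# Suggest a stressapptest size: a percentage of MemTotal, in MB.
# $1: meminfo file (defaults to /proc/meminfo), $2: percentage (defaults to 85).
stress_mb() {
    meminfo=${1:-/proc/meminfo}
    pct=${2:-85}
    awk -v pct="$pct" '/^MemTotal:/ { printf "%d\n", $2 / 1024 * pct / 100 }' "$meminfo"
}

# Print the command for a one-hour run; review it before running.
echo "stressapptest -s 3600 -M $(stress_mb) -W"
```

Leave enough memory free for the host itself and for any VMs that are still running, otherwise the test may trigger the OOM killer instead of finding bad RAM.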
 

t.lamprecht

Proxmox Staff Member
Staff member
Jul 28, 2015
4,711
1,259
164
South Tyrol/Italy
shop.proxmox.com
Can you please open a new thread for that? The current one is rather for io_uring and a bug we actually managed to reproduce after a lot of testing and also fixed together with upstream kernel devs.

pvesr itself does not use io_uring at all currently, so this seems rather unrelated to this thread and is possibly an issue with ZFS on your system.

Would be also great if you could add some details about the system (CPU/motherboard) and the ZFS setup in the new thread, thanks!
 
