It seems that our machines running version 5.11.22-6 crash after 6-12 hours, even without running virtual machines. Version -5 has worked normally so far.
Same for me: upgrading the kernel makes it crash randomly, but it was stable before (with the native option). If this kernel patch was done for this, it seems to work on my r710 running some containers and a VM for TrueNAS.
Was there any other kind of load on the machines that crashed? Did you manage to get a crash trace via netconsole this time?
Which kernel were you using before? Are you still using the aio=native option? Please also try to obtain a crash trace, see here.
5.11.22-5 seems stable for me.
Unfortunately, I don't get any crash trace over netconsole, although we have tried to get relevant information. Anyway, -5 seems to be stable, while -6 crashes every 1 to 3 hours without any traffic, just idling. The longest stable period was 6 hours.
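When no trace arrives over netconsole, it can help to first confirm that ordinary kernel messages reach the receiver at all. A minimal sketch, assuming netconsole is already loaded and configured: write a marker line into the kernel log via /dev/kmsg and check whether it shows up on the receiving machine. The function name `send_kmsg_marker` and the tag are made up for illustration; whether the message is forwarded also depends on the console log level on the host.

```shell
#!/bin/sh
# send_kmsg_marker [kmsg_path] [tag]
# Hypothetical helper: writes one error-level marker line to the kernel log.
# Writing to the default /dev/kmsg requires root.
send_kmsg_marker() {
    kmsg="${1:-/dev/kmsg}"
    tag="${2:-netconsole-test}"
    # "<3>" is the syslog priority prefix for an error-level message
    printf '<3>%s: marker from %s\n' "$tag" "$(uname -n)" > "$kmsg"
}
```

Calling `send_kmsg_marker` as root on the crashing host should make the marker appear in `dmesg` locally. If it never shows up on the netconsole receiver, the setup itself is the likely problem rather than the crash being silent.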
root@prox:~# uptime
16:57:40 up 1 day, 1:14, 1 user, load average: 0.27, 0.44, 0.66
root@prox:~# uname -a
Linux prox 5.11.22-3-pve #1 SMP PVE 5.11.22-6 (Wed, 28 Jul 2021 10:51:12 +0200) x86_64 GNU/Linux
root@prox:~#
amd64-microcode and also added the aio=native option to all drives, and straight away haven't had an issue. I had the issue on 6.4, upgraded to 7 yesterday, and still had it.
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 1: bc800800060c0859
mce: [Hardware Error]: TSC 0 ADDR 41576d480 MISC d012000000000000 IPID 100b000000000
mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1628433454 SOCKET 0 APIC f microcode a201009
Has the problem persisted with -6 after the upgrade? We rolled the lab back to version 6.4 and are waiting for the problems to clear up. Unfortunately, the original host is no longer online for analysis. It crashed/froze straight into the boot.
For me, on my Dell r710 (Intel CPU), it's still present. It crashes after between 26h and at most 43h.
My nodes with Ryzen 5800x, 5900x and 5950x crashed all the time (at least once per day, with or without load, even without a single VM/CT on them) with PVE 6.4 on both kernels 5.4 and 5.11, and with PVE 7.0. After disabling the C6 state, all of them have been running rock solid for weeks.
The troubleshooting was hard. Since it's "desktop" hardware, I first thought the non-ECC RAM was the issue, so I bought unbuffered ECC: still crashes. Then research pointed to something with the kernel plus AMD hardware, so I updated from 5.4 to 5.11: still crashes. PVE 7.0 had just come out, so I updated: still crashes. Then I sat there and tested tons of BIOS settings, which took a lot of time because the crashes were sporadic. Then I found a post from last year on a Linux forum where the C6 state was mentioned with Buster + Ryzen 3600 ... tried it out, and bam!
Try it out, I would be very interested in your results.
You legend, I don't want to get ahead of myself, but so far uptime is a day with C-states disabled in the BIOS. Now the only issue I'm having with kernel 5.11 is LVM not attaching/activating properly at boot, when it was fine with 5.4. My thread on it is here if anyone can help.
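For anyone trying the C-state change, here is a rough sketch for checking from the running system which idle states the kernel exposes and whether they are actually being entered. It reads the standard cpuidle files in sysfs. The function name `list_idle_states` is made up for illustration, and the state names depend on the cpuidle driver, so a state disabled in the BIOS may simply be missing or may not be labelled "C6" literally.

```shell
#!/bin/sh
# list_idle_states [cpu_dir]
# Hypothetical helper: prints one line per cpuidle state of a single CPU.
# cpu_dir defaults to cpu0; pass another cpuN directory to inspect other cores.
list_idle_states() {
    cpu_dir="${1:-/sys/devices/system/cpu/cpu0}"
    for state in "$cpu_dir"/cpuidle/state*; do
        # Skip the unexpanded glob if the CPU exposes no cpuidle states
        [ -r "$state/name" ] || continue
        printf '%s name=%s usage=%s disabled=%s\n' \
            "${state##*/}" \
            "$(cat "$state/name")" \
            "$(cat "$state/usage")" \
            "$(cat "$state/disable")"
    done
}
```

Running `list_idle_states` twice a few minutes apart and comparing the `usage` counters shows which states the CPU actually enters while idling.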
Hi @Fabian_E,
Just to make sure: did you verify that netconsole was set up properly, i.e. do you get the other syslog messages?
Could you do the following on the idling machine experiencing the crashes:
Code:
cd /sys/kernel/tracing
cat available_filter_functions | grep io_uring > set_graph_function
echo function_graph > current_tracer
The second command might take a bit of time to complete.
Then wait (at least a few minutes) and do
Code:
tail -n 2000 trace > /tmp/io_uring_trace.txt
Please share the resulting file /tmp/io_uring_trace.txt.
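The tracing steps above can be wrapped into one function so they run unattended on the idling machine. This is only a sketch of the same commands: the function name `collect_io_uring_trace` and its three optional arguments (tracing directory, wait time in seconds, output file) are made up for illustration, and the 300-second default is just one reading of "at least a few minutes".

```shell
#!/bin/sh
# collect_io_uring_trace [tracing_dir] [wait_secs] [out_file]
# Hypothetical wrapper around the commands from the post; must run as root
# when pointed at the real tracefs. The body runs in a subshell so the
# caller's working directory is left alone.
collect_io_uring_trace() (
    tracing_dir="${1:-/sys/kernel/tracing}"
    wait_secs="${2:-300}"
    out_file="${3:-/tmp/io_uring_trace.txt}"

    cd "$tracing_dir" || exit 1
    # Restrict the graph tracer to functions with io_uring in their name
    # (this step can take a while on a real system)
    grep io_uring available_filter_functions > set_graph_function || exit 1
    echo function_graph > current_tracer || exit 1
    # Let the trace buffer fill while the machine idles
    sleep "$wait_secs"
    # Keep only the most recent 2000 lines for sharing
    tail -n 2000 trace > "$out_file"
)
```

Calling `collect_io_uring_trace` with no arguments uses the paths from the post and leaves the result in /tmp/io_uring_trace.txt. Like the original commands, it leaves the function_graph tracer enabled afterwards.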