Kernel Panic, whole server crashes about every day

Kabbone · Jul 27, 2021

I ran into the same problems. After upgrading from 6.4 to 7 last weekend, I experienced multiple hangs during high IO.
Disabling io_uring by setting "aio=native" has solved the issue for me for now.
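
For anyone applying the same workaround, here is a minimal sketch of how the drive option string could be rewritten so that it carries aio=native exactly once. The storage name, disk name and VM ID below are made up for illustration, and set_aio is a hypothetical helper, not part of any Proxmox tooling. The resulting string would then be applied with something like "qm set <vmid> --scsi0 <value>", or by editing the VM config by hand.

```python
def set_aio(drive_value: str, mode: str = "native") -> str:
    """Return the drive option string with aio=<mode> set exactly once.

    Any existing aio=... option is dropped first, so switching back later
    (e.g. mode="io_uring") works the same way.
    """
    parts = [p for p in drive_value.split(",") if p and not p.startswith("aio=")]
    parts.append(f"aio={mode}")
    return ",".join(parts)


# Hypothetical drive line, as it might appear after "scsi0:" in a VM config.
example = "local-lvm:vm-100-disk-0,cache=none,size=32G"
print(set_aio(example))
```

The VM has to be stopped and started again (not just rebooted from inside the guest) before a changed drive option takes effect.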

Kernel (with 6.4 and 7): 5.11
CPU: Intel i5-6500T (no intel-microcode pkg installed)
RAM: 16GB DDR4
BIOS: newest available (01/2020 Fujitsu Esprimo D756)
VM: Root-fs on Proxmox Host LVM ext4
2 block devices passed through (DM-Crypt BTRFS RAID-1)
Application: Proxmox Backup on an NFS share (located on the BTRFS RAID-1 mentioned above) hosted in a VM

EDIT: all drives use the virtio driver

June · Jul 27, 2021

For the moment, aio=native seems to fix the crash.

wiresandenergy · Jul 28, 2021

AIO-Native and amd64-microcode seem to be working here as well. I was experiencing this exact issue.

Specs:
AMD Ryzen 5950X
128GB DDR4 3200
Nvidia GeForce 1080ti
Asus B550-F Motherboard
LSI 9221-8i HBA

Just so I'm clear, aio-native is a temporary fix until the underlying issue is resolved?

gouthamravee · Jul 29, 2021

Hi folks, I've been having similar crashes on two Proxmox servers.
One is running an Intel Core i5-3570K and the other an AMD Ryzen 1800X. I have two other Proxmox servers, also on version 7, with a Ryzen 2200G and an Intel Core i3-3220T, that do not crash. The only similarity between the two crashing servers is that both have large multi-terabyte arrays (not ZFS arrays). I monitor both using Netdata and always see a huge IO spike on the disks when they crash.

I've set up kernel crash logs on both, but neither has provided any information.

This week things have been okay on one server, but the other one had a crash recently. I'm going to use this weekend to set up the remote crash logs per https://pve.proxmox.com/wiki/Kernel_Crash_Trace_Log

Wanted to post this so the proxmox team is aware it might not be limited to AMD systems.
The heavy disk use VMs on both servers have the HDDs passed through to them using virtio, no cache.

I will post more info as soon as I collect it.

Southsko · Jul 29, 2021

gouthamravee said:
Wanted to post this so the proxmox team is aware it might not be limited to AMD systems.
The heavy disk use VMs on both servers have the HDDs passed through to them using virtio, no cache.

Dude! I had the same problem. When I disabled virtio passthrough to a VM, my system was fine. I ended up using NFS instead.

This happened on an upgraded 6 system and a brand new 7 install.

t.lamprecht · Jul 29, 2021

FYI, there's a newer kernel available as package pve-kernel-5.11.22-3-pve, version 5.11.22-6, which solves an issue with some unexpected EAGAINs that the io_uring kernel code got from some subsystems' softirq code paths.

Anyhow, please try to upgrade to that kernel and reboot into it; the package is available in the pve-no-subscription repository at the time of writing.
It's a bit hard to tell in general, as there's quite a mix of kernel oopses posted in this thread, but at least some of the issues reported here should be gone.
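
Note that the package name (5.11.22-3-pve) and the package version (5.11.22-6) are two different things, so it is easy to believe you are on the fixed kernel when you are not. As a rough sketch, the following parses a "uname -a" line and checks the PVE package version it reports. The sample line is the one posted further down in this thread; the regular expression and the helper names are assumptions made for illustration and only cover that output format.

```python
import re

# Matches e.g.:
# Linux proxmox 5.11.22-3-pve #1 SMP PVE 5.11.22-6 (Wed, ...) x86_64 GNU/Linux
UNAME_RE = re.compile(r"^Linux \S+ (?P<release>\S+) .*?\bPVE (?P<pkg>\S+) ")


def parse_pve_uname(line):
    """Return (kernel release, PVE package version), or None if not matched."""
    m = UNAME_RE.search(line)
    if m is None:
        return None
    return m.group("release"), m.group("pkg")


def version_tuple(version):
    """Turn '5.11.22-6' into (5, 11, 22, 6) for a simple numeric comparison."""
    return tuple(int(n) for n in re.findall(r"\d+", version))


sample = ("Linux proxmox 5.11.22-3-pve #1 SMP PVE 5.11.22-6 "
          "(Wed, 28 Jul 2021 10:51:12 +0200) x86_64 GNU/Linux")
release, pkg = parse_pve_uname(sample)
# True only once the host has actually been rebooted into the new build.
print(release, pkg, version_tuple(pkg) >= version_tuple("5.11.22-6"))
```

If the reported package version is still older after the upgrade, the host has most likely not been rebooted into the new kernel yet.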

luckyluk83 · Jul 29, 2021

I've had a kernel panic every day for the past week or so.
The host hangs at random, or while making a backup and then copying some files.

My vms are:
OMV managing the storage
a few Ubuntu servers
Win10Pro
Pop!_OS
Manjaro

My config is
e5-2698 v3
128GB ECC DDR4 2133
2x120GB SSD
6X250GB SSD
2x16TB HDD
10x4TB SAS HDD on LSI 2008

Today I upgraded to the newest kernel, as advised in t.lamprecht's post.

gouthamravee · Jul 29, 2021

t.lamprecht said:
FYI, there's a newer kernel available as package pve-kernel-5.11.22-3-pve, version 5.11.22-6, which solves an issue with some unexpected EAGAINs that the io_uring kernel code got from some subsystems' softirq code paths.

Anyhow, please try to upgrade to that kernel and reboot into it; the package is available in the pve-no-subscription repository at the time of writing.
It's a bit hard to tell in general, as there's quite a mix of kernel oopses posted in this thread, but at least some of the issues reported here should be gone.

Thank you!

I tried to set up remote kernel debugging last night but was not successful.
Will try again this weekend, but for now I've updated all my servers to the latest kernel and rebooted.

As a side note, one of the servers has also been intermittently losing its network connection, though it only seems to affect SSH connections. I can still access the VMs, but Proxmox itself drops any SSH connections and usually appears as offline on the Proxmox dashboard.

I initially thought it was the Ethernet cable, but replacing it with a known-good cable didn't fix the issue.
That didn't happen for a while, but after updating and rebooting it just happened again.

luckyluk83 · Jul 29, 2021

I've done the update and the server didn't even last 2 hours.
I'll use 6.4 for now and see how it goes without updating the kernel.

timproxmox · Jul 30, 2021

Maybe not related, but we rolled back to 6.4 after having multipath issues with PVE 7. This was not clear to begin with, as the symptoms were that the VMs went into a panic and stopped responding, requiring reboots and volume repairs.

t.lamprecht · Jul 30, 2021

timproxmox said:
This was not clear to begin with, as the symptoms were that the VMs went into a panic and stopped responding, requiring reboots and volume repairs.

The issue here affects only the host kernel. Of course, if the host crashes the VMs won't be happy, but a pure VM crash with no host crash would mean that you have a different issue.

luckyluk83 said:
I've done the update and the server didn't even last 2 hours.

In what sense? You also only talked about things hanging. Did you look at the host and check whether it still worked? Did other VMs still work? Were there any error messages or panics in the syslog/dmesg? Any actual info could at least help us look into it.

luckyluk83 · Jul 30, 2021

t.lamprecht, I think I have the solution for the Proxmox host hanging with the last 2 kernel versions.
For a few months I've had a Turbo Unlock BIOS on my Rampage V X99 motherboard for the E5-2698 v3. Everything worked fine until the second-to-last kernel. Then yesterday I upgraded Proxmox to the newest kernel as you advised, but I still had problems, even sooner than before. So I went back to the original BIOS, and the system has been working fine for more than 12 hours now with 10 VMs running, occupying 70% of CPU and 60GB of RAM. Disks are busy, as is the network. Maybe there was something in the last kernel that didn't like the missing microcode in the Turbo Unlock BIOS?

gouthamravee · Jul 30, 2021

t.lamprecht · Jul 30, 2021

luckyluk83 said:
For a few months I've had a Turbo Unlock BIOS on my Rampage V X99 motherboard for the E5-2698 v3. Everything worked fine until the second-to-last kernel. Then yesterday I upgraded Proxmox to the newest kernel as you advised, but I still had problems, even sooner than before. So I went back to the original BIOS, and the system has been working fine for more than 12 hours now with 10 VMs running, occupying 70% of CPU and 60GB of RAM.

Yeah, some BIOS firmware features can definitely interfere with system stability, and a newer kernel can also make such interference surface as new issues.

luckyluk83 said:
Maybe there was something in the last kernel that didn't like the missing microcode in the Turbo Unlock BIOS?

The very last kernel is just a single patch in SCSI related io_uring code, fixing the issue this thread is/should be about:
https://git.proxmox.com/?p=pve-kernel.git;a=commitdiff;h=437b51a73b3fbfe4e5b708316c685060214a21cc

It's so minimal, and it would actually fix things like that, so I really would be surprised if it was the cause of the regression you're seeing.

A bit before that, some more stable updates got pulled in. Those could include a patch that regresses on your system, but I skimmed through the list and nothing obviously clock/stepping/scheduler/power related stuck out.
For now I'd recommend keeping the BIOS closer to the defaults, with that turbo setting disabled.

cocoboig · Jul 31, 2021

After the upgrade to Proxmox 7, after a day, more or less, my system begins to malfunction and the syslog shows:

Code:

[15595.468370] perf: interrupt took too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
[25015.394686] perf: interrupt took too long (3132 > 3130), lowering kernel.perf_event_max_sample_rate to 63750
[58117.333752] perf: interrupt took too long (3919 > 3915), lowering kernel.perf_event_max_sample_rate to 51000
[88329.013489] INFO: task pvesr:503547 blocked for more than 120 seconds.
[88329.013522]       Tainted: P           O      5.11.22-3-pve #1
[88329.013541] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[88329.013565] task:pvesr           state:D stack:    0 pid:503547 ppid:     1 flags:0x00000000
[88329.013570] Call Trace:
[88329.013574]  __schedule+0x2ca/0x880
[88329.013582]  schedule+0x4f/0xc0
[88329.013584]  rwsem_down_write_slowpath+0x212/0x590
[88329.013591]  down_write+0x43/0x50
[88329.013594]  filename_create+0x7e/0x160
[88329.013600]  do_mkdirat+0x58/0x140
[88329.013604]  __x64_sys_mkdir+0x1b/0x20
[88329.013607]  do_syscall_64+0x38/0x90
[88329.013611]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[88329.013615] RIP: 0033:0x7f52d25b1b07
[88329.013618] RSP: 002b:00007ffe695daaf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[88329.013621] RAX: ffffffffffffffda RBX: 000055dd275d12a0 RCX: 00007f52d25b1b07
[88329.013622] RDX: 000055dd257b7a05 RSI: 00000000000001ff RDI: 000055dd2b63a400
[88329.013624] RBP: 0000000000000000 R08: 000055dd2ba7f228 R09: 0000000000000000
[88329.013625] R10: 0000000000000008 R11: 0000000000000246 R12: 000055dd2b63a400
[88329.013626] R13: 000055dd288ca5f8 R14: 000055dd2b862e58 R15: 00000000000001ff
[88449.842943] INFO: task pvesr:503547 blocked for more than 241 seconds.
[88449.842973]       Tainted: P           O      5.11.22-3-pve #1
[88449.842992] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[88449.843015] task:pvesr           state:D stack:    0 pid:503547 ppid:     1 flags:0x00000000
[88449.843019] Call Trace:
[88449.843023]  __schedule+0x2ca/0x880
[88449.843031]  schedule+0x4f/0xc0
[88449.843033]  rwsem_down_write_slowpath+0x212/0x590
[88449.843040]  down_write+0x43/0x50
[88449.843043]  filename_create+0x7e/0x160
[88449.843049]  do_mkdirat+0x58/0x140
[88449.843052]  __x64_sys_mkdir+0x1b/0x20
[88449.843056]  do_syscall_64+0x38/0x90
[88449.843059]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[88449.843064] RIP: 0033:0x7f52d25b1b07
[88449.843066] RSP: 002b:00007ffe695daaf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[88449.843069] RAX: ffffffffffffffda RBX: 000055dd275d12a0 RCX: 00007f52d25b1b07
[88449.843071] RDX: 000055dd257b7a05 RSI: 00000000000001ff RDI: 000055dd2b63a400
[88449.843072] RBP: 0000000000000000 R08: 000055dd2ba7f228 R09: 0000000000000000
[88449.843073] R10: 0000000000000008 R11: 0000000000000246 R12: 000055dd2b63a400
[88449.843074] R13: 000055dd288ca5f8 R14: 000055dd2b862e58 R15: 00000000000001ff
[88570.672693] INFO: task pvesr:503547 blocked for more than 362 seconds.
[88570.672722]       Tainted: P           O      5.11.22-3-pve #1
[88570.672741] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[88570.672765] task:pvesr           state:D stack:    0 pid:503547 ppid:     1 flags:0x00000000
[88570.672769] Call Trace:
[88570.672773]  __schedule+0x2ca/0x880
[88570.672780]  schedule+0x4f/0xc0
[88570.672782]  rwsem_down_write_slowpath+0x212/0x590
[88570.672789]  down_write+0x43/0x50
[88570.672792]  filename_create+0x7e/0x160
[88570.672797]  do_mkdirat+0x58/0x140
[88570.672801]  __x64_sys_mkdir+0x1b/0x20
[88570.672804]  do_syscall_64+0x38/0x90
[88570.672807]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[88570.672812] RIP: 0033:0x7f52d25b1b07
[88570.672814] RSP: 002b:00007ffe695daaf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[88570.672817] RAX: ffffffffffffffda RBX: 000055dd275d12a0 RCX: 00007f52d25b1b07
[88570.672818] RDX: 000055dd257b7a05 RSI: 00000000000001ff RDI: 000055dd2b63a400
[88570.672820] RBP: 0000000000000000 R08: 000055dd2ba7f228 R09: 0000000000000000
[88570.672821] R10: 0000000000000008 R11: 0000000000000246 R12: 000055dd2b63a400
[88570.672822] R13: 000055dd288ca5f8 R14: 000055dd2b862e58 R15: 00000000000001ff
[88691.502613] INFO: task pvesr:503547 blocked for more than 483 seconds.
[88691.502646]       Tainted: P           O      5.11.22-3-pve #1
[88691.502665] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[88691.502689] task:pvesr           state:D stack:    0 pid:503547 ppid:     1 flags:0x00000000
[88691.502694] Call Trace:
[88691.502698]  __schedule+0x2ca/0x880
[88691.502705]  schedule+0x4f/0xc0
[88691.502708]  rwsem_down_write_slowpath+0x212/0x590
[88691.502714]  down_write+0x43/0x50
[88691.502717]  filename_create+0x7e/0x160
[88691.502723]  do_mkdirat+0x58/0x140
[88691.502727]  __x64_sys_mkdir+0x1b/0x20
[88691.502730]  do_syscall_64+0x38/0x90
[88691.502734]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[88691.502738] RIP: 0033:0x7f52d25b1b07
[88691.502741] RSP: 002b:00007ffe695daaf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[88691.502744] RAX: ffffffffffffffda RBX: 000055dd275d12a0 RCX: 00007f52d25b1b07
[88691.502746] RDX: 000055dd257b7a05 RSI: 00000000000001ff RDI: 000055dd2b63a400
[88691.502747] RBP: 0000000000000000 R08: 000055dd2ba7f228 R09: 0000000000000000
[88691.502748] R10: 0000000000000008 R11: 0000000000000246 R12: 000055dd2b63a400
[88691.502750] R13: 000055dd288ca5f8 R14: 000055dd2b862e58 R15: 00000000000001ff
[88812.332695] INFO: task pvesr:503547 blocked for more than 604 seconds.
[88812.332725]       Tainted: P           O      5.11.22-3-pve #1
[88812.332743] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[88812.332765] task:pvesr           state:D stack:    0 pid:503547 ppid:     1 flags:0x00000000
[88812.332769] Call Trace:
[88812.332773]  __schedule+0x2ca/0x880
[88812.332779]  schedule+0x4f/0xc0
[88812.332782]  rwsem_down_write_slowpath+0x212/0x590
[88812.332788]  down_write+0x43/0x50
[88812.332790]  filename_create+0x7e/0x160
[88812.332812]  do_mkdirat+0x58/0x140
[88812.332815]  __x64_sys_mkdir+0x1b/0x20
[88812.332819]  do_syscall_64+0x38/0x90
[88812.332822]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[88812.332827] RIP: 0033:0x7f52d25b1b07
[88812.332829] RSP: 002b:00007ffe695daaf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[88812.332832] RAX: ffffffffffffffda RBX: 000055dd275d12a0 RCX: 00007f52d25b1b07
[88812.332833] RDX: 000055dd257b7a05 RSI: 00000000000001ff RDI: 000055dd2b63a400
[88812.332834] RBP: 0000000000000000 R08: 000055dd2ba7f228 R09: 0000000000000000
[88812.332836] R10: 0000000000000008 R11: 0000000000000246 R12: 000055dd2b63a400
[88812.332837] R13: 000055dd288ca5f8 R14: 000055dd2b862e58 R15: 00000000000001ff
[88933.162930] INFO: task pvesr:503547 blocked for more than 724 seconds.
[88933.163015]       Tainted: P           O      5.11.22-3-pve #1
[88933.163074] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[88933.163131] task:pvesr           state:D stack:    0 pid:503547 ppid:     1 flags:0x00000000
[88933.163135] Call Trace:
[88933.163139]  __schedule+0x2ca/0x880
[88933.163146]  schedule+0x4f/0xc0
[88933.163149]  rwsem_down_write_slowpath+0x212/0x590
[88933.163155]  down_write+0x43/0x50
[88933.163158]  filename_create+0x7e/0x160
[88933.163164]  do_mkdirat+0x58/0x140
[88933.163167]  __x64_sys_mkdir+0x1b/0x20
[88933.163170]  do_syscall_64+0x38/0x90
[88933.163174]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[88933.163178] RIP: 0033:0x7f52d25b1b07
[88933.163181] RSP: 002b:00007ffe695daaf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[88933.163183] RAX: ffffffffffffffda RBX: 000055dd275d12a0 RCX: 00007f52d25b1b07
[88933.163185] RDX: 000055dd257b7a05 RSI: 00000000000001ff RDI: 000055dd2b63a400
[88933.163186] RBP: 0000000000000000 R08: 000055dd2ba7f228 R09: 0000000000000000
[88933.163188] R10: 0000000000000008 R11: 0000000000000246 R12: 000055dd2b63a400
[88933.163189] R13: 000055dd288ca5f8 R14: 000055dd2b862e58 R15: 00000000000001ff
[89053.993122] INFO: task pvesr:503547 blocked for more than 845 seconds.
[89053.993153]       Tainted: P           O      5.11.22-3-pve #1
[89053.993172] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[89053.993196] task:pvesr           state:D stack:    0 pid:503547 ppid:     1 flags:0x00000000
[89053.993200] Call Trace:
[89053.993203]  __schedule+0x2ca/0x880
[89053.993210]  schedule+0x4f/0xc0
[89053.993213]  rwsem_down_write_slowpath+0x212/0x590
[89053.993218]  down_write+0x43/0x50
[89053.993221]  filename_create+0x7e/0x160
[89053.993227]  do_mkdirat+0x58/0x140
[89053.993230]  __x64_sys_mkdir+0x1b/0x20
[89053.993233]  do_syscall_64+0x38/0x90
[89053.993237]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[89053.993241] RIP: 0033:0x7f52d25b1b07
[89053.993243] RSP: 002b:00007ffe695daaf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[89053.993246] RAX: ffffffffffffffda RBX: 000055dd275d12a0 RCX: 00007f52d25b1b07
[89053.993248] RDX: 000055dd257b7a05 RSI: 00000000000001ff RDI: 000055dd2b63a400
[89053.993249] RBP: 0000000000000000 R08: 000055dd2ba7f228 R09: 0000000000000000
[89053.993250] R10: 0000000000000008 R11: 0000000000000246 R12: 000055dd2b63a400
[89053.993252] R13: 000055dd288ca5f8 R14: 000055dd2b862e58 R15: 00000000000001ff
[89174.823608] INFO: task pvesr:503547 blocked for more than 966 seconds.
[89174.823692]       Tainted: P           O      5.11.22-3-pve #1
[89174.823752] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[89174.823819] task:pvesr           state:D stack:    0 pid:503547 ppid:     1 flags:0x00000000
[89174.823823] Call Trace:
[89174.823827]  __schedule+0x2ca/0x880
[89174.823834]  schedule+0x4f/0xc0
[89174.823836]  rwsem_down_write_slowpath+0x212/0x590
[89174.823843]  down_write+0x43/0x50
[89174.823845]  filename_create+0x7e/0x160
[89174.823851]  do_mkdirat+0x58/0x140
[89174.823854]  __x64_sys_mkdir+0x1b/0x20
[89174.823857]  do_syscall_64+0x38/0x90
[89174.823861]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[89174.823865] RIP: 0033:0x7f52d25b1b07
[89174.823868] RSP: 002b:00007ffe695daaf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[89174.823870] RAX: ffffffffffffffda RBX: 000055dd275d12a0 RCX: 00007f52d25b1b07
[89174.823872] RDX: 000055dd257b7a05 RSI: 00000000000001ff RDI: 000055dd2b63a400
[89174.823873] RBP: 0000000000000000 R08: 000055dd2ba7f228 R09: 0000000000000000
[89174.823875] R10: 0000000000000008 R11: 0000000000000246 R12: 000055dd2b63a400
[89174.823876] R13: 000055dd288ca5f8 R14: 000055dd2b862e58 R15: 00000000000001ff
[89295.654089] INFO: task pvesr:503547 blocked for more than 1087 seconds.
[89295.654136]       Tainted: P           O      5.11.22-3-pve #1
[89295.654155] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[89295.654178] task:pvesr           state:D stack:    0 pid:503547 ppid:     1 flags:0x00000000
[89295.654182] Call Trace:
[89295.654186]  __schedule+0x2ca/0x880
[89295.654193]  schedule+0x4f/0xc0
[89295.654195]  rwsem_down_write_slowpath+0x212/0x590
[89295.654202]  down_write+0x43/0x50
[89295.654205]  filename_create+0x7e/0x160
[89295.654211]  do_mkdirat+0x58/0x140
[89295.654214]  __x64_sys_mkdir+0x1b/0x20
[89295.654217]  do_syscall_64+0x38/0x90
[89295.654221]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[89295.654226] RIP: 0033:0x7f52d25b1b07
[89295.654228] RSP: 002b:00007ffe695daaf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[89295.654231] RAX: ffffffffffffffda RBX: 000055dd275d12a0 RCX: 00007f52d25b1b07
[89295.654232] RDX: 000055dd257b7a05 RSI: 00000000000001ff RDI: 000055dd2b63a400
[89295.654234] RBP: 0000000000000000 R08: 000055dd2ba7f228 R09: 0000000000000000
[89295.654235] R10: 0000000000000008 R11: 0000000000000246 R12: 000055dd2b63a400
[89295.654236] R13: 000055dd288ca5f8 R14: 000055dd2b862e58 R15: 00000000000001ff
[89416.488541] INFO: task pvesr:503547 blocked for more than 1208 seconds.
[89416.488627]       Tainted: P           O      5.11.22-3-pve #1
[89416.488685] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[89416.488758] task:pvesr           state:D stack:    0 pid:503547 ppid:     1 flags:0x00000000
[89416.488770] Call Trace:
[89416.488777]  __schedule+0x2ca/0x880
[89416.488792]  schedule+0x4f/0xc0
[89416.488800]  rwsem_down_write_slowpath+0x212/0x590
[89416.488824]  down_write+0x43/0x50
[89416.488827]  filename_create+0x7e/0x160
[89416.488833]  do_mkdirat+0x58/0x140
[89416.488836]  __x64_sys_mkdir+0x1b/0x20
[89416.488839]  do_syscall_64+0x38/0x90
[89416.488843]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[89416.488847] RIP: 0033:0x7f52d25b1b07
[89416.488849] RSP: 002b:00007ffe695daaf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[89416.488852] RAX: ffffffffffffffda RBX: 000055dd275d12a0 RCX: 00007f52d25b1b07
[89416.488853] RDX: 000055dd257b7a05 RSI: 00000000000001ff RDI: 000055dd2b63a400
[89416.488855] RBP: 0000000000000000 R08: 000055dd2ba7f228 R09: 0000000000000000
[89416.488856] R10: 0000000000000008 R11: 0000000000000246 R12: 000055dd2b63a400
[89416.488857] R13: 000055dd288ca5f8 R14: 000055dd2b862e58 R15: 00000000000001ff
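
The trace above is the same hung-task report repeated roughly every two minutes. A small sketch like the following could condense such dmesg output to one entry per blocked task. The regular expression is an assumption based only on the message format shown above, and summarize_hung_tasks is a hypothetical helper name.

```python
import re

# Matches lines such as:
# [88329.013489] INFO: task pvesr:503547 blocked for more than 120 seconds.
HUNG_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\] INFO: task (?P<name>[^:]+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def summarize_hung_tasks(lines):
    """Map (task name, pid) to the report count and longest blocked time."""
    summary = {}
    for line in lines:
        m = HUNG_RE.match(line)
        if m is None:
            continue  # call-trace and register lines are skipped
        key = (m.group("name"), int(m.group("pid")))
        entry = summary.setdefault(key, {"reports": 0, "max_blocked_secs": 0})
        entry["reports"] += 1
        entry["max_blocked_secs"] = max(entry["max_blocked_secs"],
                                        int(m.group("secs")))
    return summary


log = [
    "[88329.013489] INFO: task pvesr:503547 blocked for more than 120 seconds.",
    "[88329.013570] Call Trace:",
    "[89416.488541] INFO: task pvesr:503547 blocked for more than 1208 seconds.",
]
print(summarize_hung_tasks(log))
```

Fed the full log, this would show a single pvesr process (pid 503547) stuck in uninterruptible sleep inside mkdir for the whole period, rather than many separate failures.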

Kernel used:
Linux proxmox 5.11.22-3-pve #1 SMP PVE 5.11.22-6 (Wed, 28 Jul 2021 10:51:12 +0200) x86_64 GNU/Linux

CPU:
Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz

Memory:
64GB

Updated the virtio drivers in the VM to the latest version, as recommended.

I have 3 servers with Proxmox 7; only the one that sometimes has high I/O (ZFS replication and sync) malfunctions.

I think the "pvesr" (Proxmox VE Storage Replication) process isn't working as expected.

passedpawn1986 · Aug 2, 2021

t.lamprecht said:
FYI, there's a newer kernel available as package pve-kernel-5.11.22-3-pve, version 5.11.22-6, which solves an issue with some unexpected EAGAINs that the io_uring kernel code got from some subsystems' softirq code paths.

Anyhow, please try to upgrade to that kernel and reboot into it; the package is available in the pve-no-subscription repository at the time of writing.
It's a bit hard to tell in general, as there's quite a mix of kernel oopses posted in this thread, but at least some of the issues reported here should be gone.

Thank you! I updated to this yesterday and removed ",aio=native" from my VM drive config.
So far it seems to be running well.

galeido · Aug 2, 2021

We also have the same problem with version 7. Machines running kernel version 5.11.22-6 crash constantly, while 5.11.22-5 seems to be working properly. Crashing also happens with the other 5.11.22-<minors>.

Intel-based hypervisors

fiona · Aug 2, 2021

Hi,

galeido said:
We also have the same problem with version 7. Machines running kernel version 5.11.22-6 crash constantly, while 5.11.22-5 seems to be working properly. Crashing also happens with the other 5.11.22-<minors>.

Intel-based hypervisors

So the only 5.11.22 kernel that's working is 5.11.22-5? Do you have a crash trace from syslog? If you can't get one otherwise, try using netconsole.

galeido · Aug 2, 2021

Yes, the only stable kernel seems to be 5.11.22-5; e.g. hosts on 5.11.22-1 seem to get stuck under load. To add to the oddity, we have one host that doesn't work stably even on the -5 kernel and crashes when load is generated over ZFS or NFS. I can't replicate that problem on other machines with the same kernel version.

I'll look for a crash trace later this week; there's nothing significant visible in syslog. The machine just freezes.

PS: The problem also seems to be limited to machines running ZFS+NFS. Machines that have hardware RAID+NFS work normally.

cocoboig · Aug 2, 2021

With kernel 5.11.22-6 two days running without problems.
