Proxmox 6.2 Kernel Error Bad RIP Value

I can confirm that there is a problem with the 3108 ROC chipset and/or the megaraid_sas kernel module.
But this problem already exists in kernel 5.3.18-2.
Also, if I use LVM instead of LVM-Thin, this problem does not occur.
So I guess it is related to the cache of the RAID controller and LVM-Thin.

I will run some more tests.
 
Hi Wolfgang,
after the RAID extension had finished, I updated the BIOS yesterday evening and booted the "old" kernel 5.3.13-3-pve, and everything looked fine.
But only until 6:50 today, because around that time (6:30) many daily jobs start and generate I/O.
Curiously, most of the I/O is done on the zfs-raid1 (NVMe), where the DB slave servers run and take a backup during this time, and on a ceph-osd (also NVMe).
Some jobs also run on the LVM, which on these nodes is mostly written to, because the NFS VMs are all DRBD slaves now.

Yesterday evening the wait (atop) was around 1%; now, with LVM hanging, the wait is 98%. Could this be related to the pve-storage processes?
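
For anyone who wants to cross-check the wait value without atop, here is a minimal sketch that derives an I/O-wait percentage from two samples of the aggregate "cpu" line in /proc/stat. The function name iowait_pct is made up for this example, and atop may calculate its "wait" figure differently, so treat the result as a rough comparison only.

```shell
# Compute the I/O-wait share between two "cpu ..." lines from /proc/stat.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal ...
# iowait_pct is a hypothetical helper name, not part of any package.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user .. steal
            io = $6                                # iowait column
        }
        NR == 1 { t0 = total; w0 = io }
        NR == 2 {
            if (total > t0) printf "%.1f\n", 100 * (io - w0) / (total - t0)
            else print "0.0"
        }'
}

# Typical use on the host, with two samples taken 5 seconds apart:
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$a" "$b"
```

A value close to the one atop shows would suggest that the CPUs really are idle while waiting on outstanding block I/O, rather than busy in user or system time.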

Kernel Traces since 06:50h (the last one was at 07:08h)
Code:
Oct  5 06:50:31 pve02 kernel: [32145.498742] INFO: task lvs:899716 blocked for more than 120 seconds.
Oct  5 06:50:31 pve02 kernel: [32145.505582]       Tainted: P           O      5.3.13-3-pve #1
Oct  5 06:50:31 pve02 kernel: [32145.511786] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  5 06:50:31 pve02 kernel: [32145.520058] lvs             D    0 899716   5792 0x00000000
Oct  5 06:50:31 pve02 kernel: [32145.526027] Call Trace:
Oct  5 06:50:31 pve02 kernel: [32145.528865]  __schedule+0x2bb/0x660
Oct  5 06:50:31 pve02 kernel: [32145.532745]  ? __switch_to_asm+0x34/0x70
Oct  5 06:50:31 pve02 kernel: [32145.537058]  ? __switch_to_asm+0x40/0x70
Oct  5 06:50:31 pve02 kernel: [32145.541363]  schedule+0x33/0xa0
Oct  5 06:50:31 pve02 kernel: [32145.544886]  schedule_timeout+0x205/0x300
Oct  5 06:50:31 pve02 kernel: [32145.549276]  wait_for_completion+0xb7/0x140
Oct  5 06:50:31 pve02 kernel: [32145.553853]  ? wake_up_q+0x80/0x80
Oct  5 06:50:31 pve02 kernel: [32145.557657]  __flush_work+0x131/0x1e0
Oct  5 06:50:31 pve02 kernel: [32145.561725]  ? worker_detach_from_pool+0xb0/0xb0
Oct  5 06:50:31 pve02 kernel: [32145.566741]  ? work_busy+0x90/0x90
Oct  5 06:50:31 pve02 kernel: [32145.570550]  __cancel_work_timer+0x115/0x190
Oct  5 06:50:31 pve02 kernel: [32145.575488]  ? exact_lock+0x11/0x20
Oct  5 06:50:31 pve02 kernel: [32145.579388]  ? kobj_lookup+0xec/0x160
Oct  5 06:50:31 pve02 kernel: [32145.583441]  cancel_delayed_work_sync+0x13/0x20
Oct  5 06:50:32 pve02 kernel: [32145.588385]  disk_block_events+0x78/0x80
Oct  5 06:50:32 pve02 kernel: [32145.592724]  __blkdev_get+0x73/0x550
Oct  5 06:50:32 pve02 kernel: [32145.596709]  blkdev_get+0xe0/0x140
Oct  5 06:50:32 pve02 kernel: [32145.600652]  ? bd_acquire+0xd0/0xd0
Oct  5 06:50:32 pve02 kernel: [32145.604543]  blkdev_open+0x92/0x100
Oct  5 06:50:32 pve02 kernel: [32145.608435]  do_dentry_open+0x143/0x3a0
Oct  5 06:50:32 pve02 kernel: [32145.612671]  vfs_open+0x2d/0x30
Oct  5 06:50:32 pve02 kernel: [32145.616234]  path_openat+0x2bf/0x1570
Oct  5 06:50:32 pve02 kernel: [32145.620315]  ? filename_lookup.part.62+0xe0/0x170
Oct  5 06:50:32 pve02 kernel: [32145.625415]  ? strncpy_from_user+0x57/0x1b0
Oct  5 06:50:32 pve02 kernel: [32145.630000]  do_filp_open+0x93/0x100
Oct  5 06:50:32 pve02 kernel: [32145.633976]  ? strncpy_from_user+0x57/0x1b0
Oct  5 06:50:32 pve02 kernel: [32145.638561]  ? __alloc_fd+0x46/0x150
Oct  5 06:50:32 pve02 kernel: [32145.642531]  do_sys_open+0x177/0x280
Oct  5 06:50:32 pve02 kernel: [32145.646506]  __x64_sys_openat+0x20/0x30
Oct  5 06:50:32 pve02 kernel: [32145.650765]  do_syscall_64+0x5a/0x130
Oct  5 06:50:32 pve02 kernel: [32145.654847]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct  5 06:50:32 pve02 kernel: [32145.660305] RIP: 0033:0x7f5950fea1ae
Oct  5 06:50:32 pve02 kernel: [32145.664294] Code: Bad RIP value.
Oct  5 06:50:32 pve02 kernel: [32145.667934] RSP: 002b:00007ffd1a7b4220 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
Oct  5 06:50:32 pve02 kernel: [32145.675937] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5950fea1ae
Oct  5 06:50:32 pve02 kernel: [32145.683504] RDX: 0000000000044000 RSI: 000055bdbb1f6448 RDI: 00000000ffffff9c
Oct  5 06:50:32 pve02 kernel: [32145.691082] RBP: 00007ffd1a7b4380 R08: 000055bdba8bfa17 R09: 00007ffd1a7b4450
Oct  5 06:50:32 pve02 kernel: [32145.698677] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd1a7b6e5a
Oct  5 06:50:32 pve02 kernel: [32145.706264] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
The old kernel isn't enough to solve this issue…
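
To see at a glance which tasks the hung-task detector reports and how often, a small filter like the one below can help. It is only a sketch: the function name hung_task_summary is invented here, and it assumes the message format shown in the trace above ("INFO: task NAME:PID blocked for more than N seconds").

```shell
# Summarise hung-task reports from a kernel log read on stdin.
# Prints "<count> <task name>" per task, most frequent first.
# hung_task_summary is a hypothetical helper name.
hung_task_summary() {
    grep -E 'INFO: task .+:[0-9]+ blocked for more than [0-9]+ seconds' \
        | sed -E 's/.*INFO: task (.+):[0-9]+ blocked.*/\1/' \
        | sort | uniq -c | sort -rn \
        | sed -E 's/^ +//'
}

# Typical use on the host (needs read access to the kernel log):
#   dmesg | hung_task_summary
#   journalctl -k --since "06:30" | hung_task_summary
# Processes currently in uninterruptible sleep (state D) can be listed with:
#   ps -eo pid,stat,comm | awk '$2 ~ /^D/'
```

If the list is dominated by LVM tools such as lvs, that would match the picture above, where the open of the block device never returns.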

Code:
pveversion -v
proxmox-ve: 6.2-2 (running kernel: 5.3.13-3-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.11-pve1
ceph-fuse: 14.2.11-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.0.0-13
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1
Udo
 

Attachments

  • 20201005_114332_pve2.png
Can you please retest with pve-qemu-kvm=5.0.0-9? Here it looks good with all kernels.
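
One possible way to do the downgrade and keep the package from being upgraded again is sketched below. The wrapper pin_package and the APPLY switch are made up for this example; the apt-get and apt-mark calls are the standard ones. By default it only prints the commands so they can be reviewed first.

```shell
# Downgrade a package to a fixed version and put it on hold.
# Prints the commands by default; runs them only when APPLY=1 is set.
# pin_package and APPLY are hypothetical names used for this sketch.
pin_package() {
    pkg=$1
    ver=$2
    if [ "${APPLY:-0}" = "1" ]; then
        apt-get install --allow-downgrades "$pkg=$ver" && apt-mark hold "$pkg"
    else
        echo "apt-get install --allow-downgrades $pkg=$ver"
        echo "apt-mark hold $pkg"
    fi
}

# Dry run:   pin_package pve-qemu-kvm 5.0.0-9
# Apply:     APPLY=1 pin_package pve-qemu-kvm 5.0.0-9
# Undo hold: apt-mark unhold pve-qemu-kvm
```

Keep in mind that already running VMs continue to use the QEMU binary they were started with, so they presumably need a restart or a migration before the downgraded version is actually in use.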
 
Hi,
with pve-qemu-kvm 5.0.0-9, LVM looks good so far.
Code:
pveversion -v
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
ceph: 14.2.11-pve1
ceph-fuse: 14.2.11-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.0.0-9
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
Yesterday it stopped working at this time.

Udo
 

Attachments

  • 20201006_070920_qemu-kvm_5.0.0-9.png
Thanks for this feedback.

Please report back if something changes.
I will start to check the changes between these versions.
 
Hi Wolfgang,
any news on this topic?
I now have two hosts with pinned qemu-kvm (both hosts have the same hardware/config).
With pve-qemu-kvm 5.0.0-9 it's stable, but I don't think that is a long-term solution.


Udo
 
Hi Udo,

No, there is no news on this problem.
I found no obvious patches in the kernel or QEMU that could cause this problem.
 
I'm still working on this.

It looks like the kernel caching/allocation behavior has changed.
I am trying to adapt lvm.conf so that it is compatible with the RAID cache.

I can see that the kernel is waiting for an ack, but it never comes or it is masked.
Everything works fine as long as the RAID cache and the kernel cache are not filled up.
But I do not completely understand this behavior.
 
This setting should fix the hanging I/O.

Create an LVM profile at /etc/lvm/profile/thin-raid6.profile:

Code:
allocation {
    thin_pool_chunk_size_policy = "performance"
    thin_pool_zero = 0
    cache_mode = "writethrough"
}

Then apply the profile to the VG that contains the LVs. All LVs in the VG will inherit the profile.
Code:
vgchange Raid6 --profile thin-raid6

The profile only takes effect after the LVs have been reactivated or the host has been rebooted.
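
To check whether the profile is really attached before and after the reactivation, something like the following could be used. It is a sketch only: the helper vgs_missing_profile is invented for this example and simply filters the two-column output of vgs, whose vg_profile report field shows the profile attached to each VG.

```shell
# Read "vg_name vg_profile" pairs on stdin and print every VG that does
# not carry the expected profile (profile column empty or different).
# vgs_missing_profile is a hypothetical helper name.
vgs_missing_profile() {
    awk -v want="$1" 'NF >= 1 && $2 != want { print $1 }'
}

# Typical use on the host (as root):
#   vgs --noheadings -o vg_name,vg_profile | vgs_missing_profile thin-raid6
# Reactivate without a reboot; this fails while LVs are still in use,
# so the guests on that VG have to be stopped or moved away first:
#   vgchange -an Raid6 && vgchange -ay Raid6
# Per-LV view:
#   lvs -o lv_name,lv_profile Raid6
```

If Raid6 still shows up in the output of the filter, the vgchange call did not attach the profile.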
 
I removed the cache mode because it can differ from controller to controller, and I guess the default is the better choice.
 
Hi Wolfgang,
unfortunately this doesn't help in every case.
I have a system with a RAID1 (very simple 24/7 SSDs on a Dell PERC H730 Mini). After the issue occurred, I rebooted, assigned the LVM profile and rebooted again.
11 days later the same issue happened again…
It looks like I must switch from lvm-thin to thick LVM…

Udo