Problem with Ceph and latest kernel

Dear Proxmox team,

After the last upgrade of one node of our cluster, I had a big issue with the Ceph cluster. I got a lot of "slow requests are blocked" messages, and the number of slow requests kept rising.
In Cacti, I saw that the traffic dropped and the virtual machines became very, very slow. After some research, I changed the boot kernel from pve-kernel-4.15.18-5-pve back to pve-kernel-4.15.17-3-pve and the problem was solved.
What could be the problem with the new kernel?
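For reference, the symptom was visible with the standard Ceph status commands while it happened (a generic sketch, nothing specific to our setup):

Code:
~# ceph -s              # overall status, shows the slow/blocked requests warning
~# ceph -w              # follow the cluster log; the rising slow request count shows up here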

The server is an HP DL380 Gen9. Each OSD is a single RAID 0 volume, but this has not been a problem until now.

Code:
~# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.17-3-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-8
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.4.117-1-pve: 4.4.117-109
pve-kernel-4.4.98-4-pve: 4.4.98-104
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-38
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-29
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-27
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-35
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1

Best regards,
Patrick
 
The server is an HP DL380 Gen9. Each OSD is a single RAID 0 volume, but this has not been a problem until now.
RAID is not good for Ceph; see the link for further explanation.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition

After some research, I changed the boot kernel from pve-kernel-4.15.18-5-pve back to pve-kernel-4.15.17-3-pve and the problem was solved.
Which RAID controller do you use? There may have been a change in the driver. Are the VMs using krbd for their images? This can also be a factor if only the VMs on that node were affected.
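The krbd setting is visible per storage in the cluster-wide storage config, and the controller model shows up on the PCI bus; roughly (paths are the PVE defaults):

Code:
~# cat /etc/pve/storage.cfg              # rbd storages list "krbd 1" when the kernel client is used
~# lspci | grep -iE 'raid|smart array'   # identify the controller model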
 
Yes, I know that RAID 0 is not a good idea. I ordered 4 HBAs and additional drive cages. Today, I will add the first HBA card and try the new kernel again.
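To switch between the two kernels for the test, I plan to pin the boot entry via GRUB, roughly like this (the exact menu entry names have to be taken from grub.cfg, the one below is only an example of the form):

Code:
~# grep "menuentry '" /boot/grub/grub.cfg   # list the exact boot entry names
~# nano /etc/default/grub                   # set GRUB_DEFAULT to the wanted entry, e.g.
#   GRUB_DEFAULT="Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux 4.15.18-5-pve"
~# update-grub
~# reboot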

The RAID controller is a "Smart Array P440ar" (which I will use exclusively for OS disks in the future).

I have two pools, one for LXC containers with krbd enabled and the other for KVM VMs without krbd. VMs on both pools were affected.
 
The RAID controller is a "Smart Array P440ar" (which I will use exclusively for OS disks in the future).
Between those kernel versions, there seems to have been an update to the hpsa driver. This could be the cause.
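If you want to verify that, you can compare the hpsa module shipped with both kernels as long as they are still installed (a quick check; the exact version strings may look different):

Code:
~# modinfo -k 4.15.18-5-pve hpsa | grep -i version
~# modinfo -k 4.15.17-3-pve hpsa | grep -i version
~# dmesg | grep -i hpsa              # driver messages from the currently running kernel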

I have two pools, one for LXC containers with krbd enabled and the other for KVM VMs without krbd. VMs on both pools were affected.
But only on that updated node or on the whole cluster?
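If in doubt, the health output names the OSDs with blocked requests and the OSD tree maps them to their hosts (standard Ceph commands):

Code:
~# ceph health detail | grep -i slow     # which OSDs report blocked/slow requests
~# ceph osd tree                         # map those OSD ids to their hosts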
 
