Problem with Ceph and latest kernel

Dear Proxmox team,

After the last upgrade of one node of our cluster, I had a big issue with the Ceph cluster. I got a lot of "slow requests are blocked" messages, and the number of slow requests kept rising.
In Cacti, I saw that the traffic dropped and the virtual machines became very, very slow. After some research, I changed the boot kernel from pve-kernel-4.15.18-5-pve back to pve-kernel-4.15.17-3-pve and the problem was solved.
What could be the problem with the new kernel?
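For reference, the symptom was visible with the standard Ceph status commands while it happened (a generic sketch, nothing specific to our setup):

Code:
~# ceph -s              # overall status, shows the slow/blocked requests warning
~# ceph -w              # follow the cluster log; the rising slow request count shows up here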

The server is an HP DL380 Gen9. Each OSD is a single RAID 0 volume, but this has not been a problem until now.

Code:
~# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.17-3-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-8
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.4.117-1-pve: 4.4.117-109
pve-kernel-4.4.98-4-pve: 4.4.98-104
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-38
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-29
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-27
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-35
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1

Best regards,
Patrick
 
The server is an HP DL380 Gen9. Each OSD is a single RAID 0 volume, but this has not been a problem until now.
RAID is not good for Ceph; see the link for further explanation.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition

After some research, I changed the boot kernel from pve-kernel-4.15.18-5-pve back to pve-kernel-4.15.17-3-pve and the problem was solved.
Which RAID controller do you use? There may have been a change in the driver. Are the VMs using krbd for their images? This can also be a factor if only the VMs on that node were affected.
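The krbd setting is visible per storage in the cluster-wide storage config, and the controller model shows up on the PCI bus; roughly (paths are the PVE defaults):

Code:
~# cat /etc/pve/storage.cfg              # rbd storages list "krbd 1" when the kernel client is used
~# lspci | grep -iE 'raid|smart array'   # identify the controller model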
 
Yes, I know that RAID 0 is not a good idea. I ordered 4 HBAs and additional drive cages. Today, I will add the first HBA card and try the new kernel again.
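To switch between the two kernels for the test, I plan to pin the boot entry via GRUB, roughly like this (the exact menu entry names have to be taken from grub.cfg, the one below is only an example of the form):

Code:
~# grep "menuentry '" /boot/grub/grub.cfg   # list the exact boot entry names
~# nano /etc/default/grub                   # set GRUB_DEFAULT to the wanted entry, e.g.
#   GRUB_DEFAULT="Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux 4.15.18-5-pve"
~# update-grub
~# reboot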

The RAID controller is a "Smart Array P440ar" (which I will use exclusively for OS disks in the future).

I have two pools, one for LXC containers with krbd enabled and the other for KVM VMs without krbd. VMs on both pools were affected.
 
The RAID controller is a "Smart Array P440ar" (which I will use exclusively for OS disks in the future).
Between those kernel versions, there seems to have been an update to the hpsa driver. This could be the cause.
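If you want to verify that, you can compare the hpsa module shipped with both kernels as long as they are still installed (a quick check; the exact version strings may look different):

Code:
~# modinfo -k 4.15.18-5-pve hpsa | grep -i version
~# modinfo -k 4.15.17-3-pve hpsa | grep -i version
~# dmesg | grep -i hpsa              # driver messages from the currently running kernel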

I have two pools, one for LXC containers with krbd enabled and the other for KVM VMs without krbd. VMs on both pools were affected.
But only on that updated node or on the whole cluster?
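If in doubt, the health output names the OSDs with blocked requests and the OSD tree maps them to their hosts (standard Ceph commands):

Code:
~# ceph health detail | grep -i slow     # which OSDs report blocked/slow requests
~# ceph osd tree                         # map those OSD ids to their hosts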
 
