Hi all!
I'm hitting a kernel bug, possibly linked to Ceph/CephFS, that doesn't quite match the other threads I've come across, and I'd like some help.
Setup is: three servers, two of which carry OSDs for Ceph (I know, we'll add a third OSD node one day; the third server, used more for dev and local NVMe for databases, counts towards quorum). The web-serving containers have mount points into a CephFS folder, so files stay in sync across nodes. It works pretty darn well, apart from these recurring hiccups.
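To be concrete about what I mean by mount points: the containers use plain Proxmox bind mounts into the CephFS directory, something along these lines (the paths here are placeholders, not our real ones):
Code:
# excerpt from /etc/pve/lxc/<vmid>.conf -- paths are examples only
mp0: /mnt/pve/cephfs/webroot,mp=/var/www/shared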
Randomly, anywhere between once a day and once a week, at any time of day, one or the other of the Ceph-holding nodes craps out on us.
It partially stalls some of the workload, mostly the php-fpm processes, and the node climbs to load averages of 400-800 (roughly the number of PHP children). Those processes can't be killed; the trick is to kill the container's lxc-start process, otherwise you'll never get the node to reboot. The neighbouring node then gradually starts stalling as a side effect.
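For the record, the unblock procedure looks roughly like this (container ID 101 is made up, the PID has to be looked up each time):
Code:
# find the lxc-start process belonging to the stuck container (101 is an example ID)
ps -ef | grep lxc-start | grep 101
# only killing lxc-start itself lets the node reach a clean reboot
kill -9 <pid-of-lxc-start>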
Once the afflicted node has been rebooted, everything comes back to normal.
I recall ceph status complaining about the MDS having a stale state or about clients not reporting; I'll try to capture the exact output the next time it happens.
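Next time it happens I'll try to grab something like the following before rebooting, so there's more than my recollection to go on:
Code:
ceph status
ceph health detail
ceph fs status
dmesg | tail -n 200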
I've also tried kernel 5.15 on one of the nodes; same issue.
Hardware is AMD EPYC 7402 on Tyan motherboards.
Does this ring any bells?
Thanks in advance.
This morning:
5.13: https://gist.github.com/happyjaxx/ce5df5535718c0d8dfd881283777ce22
Ceph logs from around the same time:
Code:
2022-03-07T08:33:58.933369+0100 mds.xxxx (mds.0) 2061 : cluster [WRN] client.10344135 isn't responding to mclientcaps(revoke), ino 0x1000331fdaa pending pAsLsXsFr issued pAsLsXsFscr, sent 63.387455 seconds ago
2022-03-07T08:34:00.012795+0100 mgr.xxxx (mgr.10335497) 863605 : cluster [DBG] pgmap v865668: 193 pgs: 193 active+clean; 845 GiB data, 1.7 TiB used, 12 TiB / 14 TiB avail; 4.4 MiB/s rd, 84 MiB/s wr, 199 op/s
2022-03-07T08:34:02.707226+0100 mon.xxxx (mon.0) 772386 : cluster [WRN] Health check failed: 1 clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE)
2022-03-07T08:34:02.715373+0100 mon.xxxx (mon.0) 772387 : cluster [DBG] mds.0 [v2:10.xxx.xx.x:6800/2040158580,v1:10.xxx.xx.x:6801/2040158580] up:active
2022-03-07T08:34:02.715401+0100 mon.xxxx (mon.0) 772388 : cluster [DBG] fsmap cephfs:1 {0=xxxx=up:active} 1 up:standby
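If I read the Ceph docs right, the client in that MDS_CLIENT_LATE_RELEASE warning (client.10344135) should be traceable back to a host/mount via the MDS session list, with something like this (MDS rank/name to be adjusted for your setup):
Code:
ceph tell mds.0 session ls
ceph health detail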
Yesterday on the other node:
5.15: https://gist.github.com/happyjaxx/04514e2494e4d5fbf26db6ab959d315f
Versions (all up to date, Ceph on Pacific):
Code:
proxmox-ve: 7.1-1 (running kernel: 5.15.19-2-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-12
pve-kernel-5.15: 7.1-11
pve-kernel-5.13: 7.1-7
pve-kernel-5.15.19-2-pve: 5.15.19-2
pve-kernel-5.13.19-4-pve: 5.13.19-9
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-3
libpve-guest-common-perl: 4.1-1
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.1-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-2
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-6
pve-cluster: 7.1-3
pve-container: 4.1-4
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-5
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-2
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1