cephfs clients failing to respond to capability release

layer7.net

Hi,

I created a CephFS to hold ISO images for a 3-node (n1, n2, n3) Proxmox cluster.

After mounting an ISO image on two nodes at the same time, Ceph shows:

HEALTH_WARN 1 clients failing to respond to capability release
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
mds.n1(mds.0): Client n2: failing to respond to capability release client_id: 84379
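For reference, the client id from the warning can be matched against the MDS session list from any node with an admin keyring; something along these lines (mds.n1 and the id 84379 are just the names from the warning above):

Code:
ceph health detail
ceph tell mds.n1 session ls    # look for the session with "id": 84379 to see which mount is holding the caps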

Changing ISO images while a VM is running is also not possible (connection timeout).

Stopping this VM after trying to change (or even completely unmount) the ISO results in:

trying to acquire lock...
TASK ERROR: can't lock file '/var/lock/qemu-server/lock-103.conf' - got timeout

Restarting the MDS on the reporting node n1 "fixes" the problem (though the first ISO mount dies).

The problem is reproducible.

Are there any known issues? The ISOs are obviously read-only, so there should not be any locks, but apparently there are.

Code:
pveversion --verbose
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-7 (running version: 7.1-7/df5740ad)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3
 
Hello.
Having the same issue here.

When you said:
Restarting the MDS on the reporting node n1 "fixes" the problem (though the first ISO mount dies).
Do you mean that you had to restart only a service (which one?), or reboot the whole server?
Also, how many MDS daemons are configured on your cluster (ceph status)? I have only one MDS, so I'm not quite sure about the next steps.
 
Hi,

I have 3 MDS daemons configured for redundancy.

And yes, you simply restart the MDS. But doing so might make you lose the client sessions temporarily.
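On a standard Proxmox setup, where pveceph typically names the MDS instance after the node, that boils down to something like:

Code:
systemctl restart ceph-mds@n1.service   # restart the MDS daemon on node n1 (assuming the MDS id matches the node name)
systemctl restart ceph-mds.target       # or restart all MDS instances on that node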
 
Thanks! I restarted the MDS target service on the affected node, which is also the active MDS, and the warning came back after about a minute.
I then restarted the MDS target on the affected node again, and also on the client node, and the warning is gone, at least so far.

Regards.
 
Hi there

I have 3 Proxmox nodes (Supermicro SYS-120C-TN10R) connected via Mellanox 100GbE ConnectX-6 Dx cards in cross-connect mode using MCP1600-C00AE30N DAC cables, no switch.
I followed the guide Full Mesh Network for Ceph Server and in particular used Open vSwitch to configure the network.
I use this network as the Ceph public (storage) network; the disks are NVMe (KCD6XLUL960G), 4 per node.

Proxmox version 7.2-4 updated
Ceph version 16.2.7

I created a CephFS in Proxmox but did not add it as Proxmox storage. I installed an Ubuntu 20.04 LTS VM on each cluster node and built a Docker Swarm that uses the CephFS, mounted in each VM via fstab, as shared storage:
Code:
#cephfs
10.15.15.1,10.15.15.2,10.15.15.3:/ /swarm ceph  name=admin,secret=XXX  0 0
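One small aside on that fstab line: the kernel client also accepts a secretfile= option, so the key does not have to sit in /etc/fstab in plain text. The key file path below is just an example:

Code:
#cephfs (same mount, but reading the key from a file instead of inline)
10.15.15.1,10.15.15.2,10.15.15.3:/ /swarm ceph  name=admin,secretfile=/etc/ceph/admin.secret,_netdev  0 0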

I use the Docker Swarm to host Apache, PHP, and MySQL containers for website hosting.

Everything works very well and fast, but once or twice a day, always at different times, Ceph goes into HEALTH_WARN because a client is failing to respond to capability release:
Code:
2022-06-19T10:50:00.000703+0200 mon.server01px (mon.0) 630593 : cluster [WRN] Health detail: HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
2022-06-19T10:50:00.000718+0200 mon.server01px (mon.0) 630594 : cluster [WRN] [WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
2022-06-19T10:50:00.000725+0200 mon.server01px (mon.0) 630595 : cluster [WRN]     mds.server03px(mds.0): Client docker103.newlogic.it failing to respond to capability release client_id: 5182258
2022-06-19T10:50:00.000737+0200 mon.server01px (mon.0) 630596 : cluster [WRN] [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
2022-06-19T10:50:00.000744+0200 mon.server01px (mon.0) 630597 : cluster [WRN]     mds.server03px(mds.0): 1 slow requests are blocked > 30 secs
This state blocks everything that tries to access the CephFS; for example, the websites stop working and I can't even run an ls on the folder mounted on the server.

After 5 minutes, evidently a timeout, the client is evicted and everything goes back to working normally.

As described here: https://docs.ceph.com/en/quincy/cephfs/eviction/, setting the option mds_session_blocklist_on_timeout to false drops the clients' MDS sessions but permits them to re-open sessions and to continue talking to OSDs.
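In case it helps, what the eviction docs describe translates to roughly the following (the MDS name and client id are taken from the log above; the config option is the documented mds_session_blocklist_on_timeout switch):

Code:
# let timed-out clients reconnect on their own instead of being blocklisted
ceph config set mds mds_session_blocklist_on_timeout false
# or evict the stuck client manually instead of waiting for the 5 minute timeout
ceph tell mds.server03px client evict id=5182258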

But I would like to find the cause of this problem. What can I do?
 
Good morning,
an update on my problem that I managed to solve.
I isolated the pve cluster traffic with one VLAN and the docker cluster traffic with a second VLAN.
For a few months now the CephFS storage has had no further problems and performance has been excellent.
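For anyone running into the same thing: the separation is basically two tagged OVSIntPorts on the existing OVS full-mesh bridge. A rough sketch of the /etc/network/interfaces excerpt, where the bridge name, VLAN tags and subnets are placeholders, not my exact values:

Code:
auto vmbr1
iface vmbr1 inet manual
    ovs_type OVSBridge
    ovs_ports pvecluster docker

# PVE cluster (corosync) traffic, VLAN 10 (placeholder tag/subnet)
auto pvecluster
iface pvecluster inet static
    address 10.15.10.1/24
    ovs_type OVSIntPort
    ovs_bridge vmbr1
    ovs_options tag=10

# Docker swarm traffic, VLAN 20 (placeholder tag/subnet)
auto docker
iface docker inet static
    address 10.15.20.1/24
    ovs_type OVSIntPort
    ovs_bridge vmbr1
    ovs_options tag=20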
 
