cephfs clients failing to respond to capability release

layer7.net

Hi,

I created a CephFS to hold ISO images for a 3-node (n1, n2, n3) Proxmox cluster.

After mounting an ISO image on two nodes at the same time, Ceph shows:

HEALTH_WARN 1 clients failing to respond to capability release
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
mds.n1(mds.0): Client n2: failing to respond to capability release client_id: 84379
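For reference, the client id from the warning can be matched against the MDS session list from any node with an admin keyring; something along these lines (mds.n1 and the id 84379 are just the names from the warning above):

Code:
ceph health detail
ceph tell mds.n1 session ls    # look for the session with "id": 84379 to see which mount is holding the caps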

Changing ISO images while a VM is running is also not possible (connection timeout).

Stopping this VM after trying to change (or even completely unmount) the ISO results in:

trying to acquire lock...
TASK ERROR: can't lock file '/var/lock/qemu-server/lock-103.conf' - got timeout

Restarting the MDS on the reporting node n1 "fixes" the problem (though the first ISO mount dies).

The problem is reproducible.

Are there any known issues? The ISOs are obviously read-only, so there should not be any locks, but apparently there are.

Code:
pveversion --verbose
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-7 (running version: 7.1-7/df5740ad)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3
 
Hello.
Having the same issue here.

When you said:
Restarting the MDS on the reporting node n1 "fixes" the problem (though the first ISO mount dies).
Do you mean that you had to restart only a service (which one?), or reboot the whole server?
Also, how many MDS daemons are configured on your cluster (ceph status)? I have only one MDS, so I'm not quite sure about the next steps.
 
Hi,

I have 3 MDS daemons configured for redundancy.

And yes, you simply restart the MDS. But doing so might make you lose the client sessions temporarily.
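On a standard Proxmox setup, where pveceph typically names the MDS instance after the node, that boils down to something like:

Code:
systemctl restart ceph-mds@n1.service   # restart the MDS daemon on node n1 (assuming the MDS id matches the node name)
systemctl restart ceph-mds.target       # or restart all MDS instances on that node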
 
Thanks! I restarted the MDS target service on the affected node, which is also the active MDS, and the warning came back after about a minute.
I then restarted the MDS target on the affected node again, and also on the client node, and the warning is gone, at least so far.

Regards.
 
Hi there

I have 3 Proxmox nodes (Supermicro SYS-120C-TN10R) connected via Mellanox 100GbE ConnectX-6 Dx cards in cross-connect mode using MCP1600-C00AE30N DAC cables, no switch.
I followed the guide Full Mesh Network for Ceph Server and in particular used Open vSwitch to configure the network.
I use this network as the Ceph public (storage) network; the disks are NVMe (KCD6XLUL960G), 4 per node.

Proxmox version 7.2-4 updated
Ceph version 16.2.7

I created a CephFS in Proxmox but did not add it as Proxmox storage. I installed an Ubuntu 20.04 LTS VM on each cluster node and built a Docker Swarm that uses the CephFS, mounted in each VM via fstab, as shared storage:
Code:
#cephfs
10.15.15.1,10.15.15.2,10.15.15.3:/ /swarm ceph  name=admin,secret=XXX  0 0
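One small aside on that fstab line: the kernel client also accepts a secretfile= option, so the key does not have to sit in /etc/fstab in plain text. The key file path below is just an example:

Code:
#cephfs (same mount, but reading the key from a file instead of inline)
10.15.15.1,10.15.15.2,10.15.15.3:/ /swarm ceph  name=admin,secretfile=/etc/ceph/admin.secret,_netdev  0 0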

I use the Docker Swarm to host Apache, PHP, and MySQL containers for website hosting.

Everything works very well and fast, but once or twice a day, always at different times, Ceph goes into HEALTH_WARN because a client is failing to respond to capability release:
Code:
2022-06-19T10:50:00.000703+0200 mon.server01px (mon.0) 630593 : cluster [WRN] Health detail: HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
2022-06-19T10:50:00.000718+0200 mon.server01px (mon.0) 630594 : cluster [WRN] [WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
2022-06-19T10:50:00.000725+0200 mon.server01px (mon.0) 630595 : cluster [WRN]     mds.server03px(mds.0): Client docker103.newlogic.it failing to respond to capability release client_id: 5182258
2022-06-19T10:50:00.000737+0200 mon.server01px (mon.0) 630596 : cluster [WRN] [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
2022-06-19T10:50:00.000744+0200 mon.server01px (mon.0) 630597 : cluster [WRN]     mds.server03px(mds.0): 1 slow requests are blocked > 30 secs
This state blocks everything that tries to access the CephFS; for example, the websites stop working and I can't even run an ls on the folder mounted on the server.

After 5 minutes, evidently a timeout, the client is evicted and everything goes back to working normally.

As described here: https://docs.ceph.com/en/quincy/cephfs/eviction/, setting the option mds_session_blocklist_on_timeout to false drops the clients' MDS sessions but permits them to re-open sessions and to continue talking to OSDs.
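In case it helps, what the eviction docs describe translates to roughly the following (the MDS name and client id are taken from the log above; the config option is the documented mds_session_blocklist_on_timeout switch):

Code:
# let timed-out clients reconnect on their own instead of being blocklisted
ceph config set mds mds_session_blocklist_on_timeout false
# or evict the stuck client manually instead of waiting for the 5 minute timeout
ceph tell mds.server03px client evict id=5182258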

But I would like to find the cause of this problem. What can I do?
 
Good morning,
an update on my problem that I managed to solve.
I isolated the pve cluster traffic with one VLAN and the docker cluster traffic with a second VLAN.
For a few months now the CephFS storage has had no further problems and performance has been excellent.
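For anyone running into the same thing: the separation is basically two tagged OVSIntPorts on the existing OVS full-mesh bridge. A rough sketch of the /etc/network/interfaces excerpt, where the bridge name, VLAN tags and subnets are placeholders, not my exact values:

Code:
auto vmbr1
iface vmbr1 inet manual
    ovs_type OVSBridge
    ovs_ports pvecluster docker

# PVE cluster (corosync) traffic, VLAN 10 (placeholder tag/subnet)
auto pvecluster
iface pvecluster inet static
    address 10.15.10.1/24
    ovs_type OVSIntPort
    ovs_bridge vmbr1
    ovs_options tag=10

# Docker swarm traffic, VLAN 20 (placeholder tag/subnet)
auto docker
iface docker inet static
    address 10.15.20.1/24
    ovs_type OVSIntPort
    ovs_bridge vmbr1
    ovs_options tag=20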
 
