Hanging MDS after an unexpected reboot

jmi

New Member
Jul 31, 2024
One of our cluster nodes rebooted unexpectedly. The node is part of our Ceph setup.
About one minute after corosync reported that a new membership had formed, the log of the host running the active metadata server (MDS) was flooded with lines like:

Code:
Aug 12 10:55:43 arthur ceph-mds[2717]: 2024-08-12T10:55:43.444+0200 7f0fe91c96c0 -1 failed to decode message of type 24 v6: End of buffer
Aug 12 10:55:43 arthur ceph-mds[2717]: 2024-08-12T10:55:43.444+0200 7f0fe89c86c0 -1 failed to decode message of type 24 v6: End of buffer
Aug 12 10:55:43 arthur ceph-mds[2717]: 2024-08-12T10:55:43.444+0200 7f0fe81c76c0 -1 failed to decode message of type 24 v6: End of buffer
Aug 12 10:55:43 arthur ceph-mds[2717]: 2024-08-12T10:55:43.444+0200 7f0fe89c86c0 -1 failed to decode message of type 24 v6: End of buffer
Aug 12 10:55:43 arthur kernel: libceph: mds0 (1)172.16.2.2:6801 socket closed (con state OPEN)
Aug 12 10:55:43 arthur ceph-mds[2717]: 2024-08-12T10:55:43.696+0200 7f0fe91c96c0 -1 failed to decode message of type 24 v6: End of buffer
Aug 12 10:55:43 arthur ceph-mds[2717]: 2024-08-12T10:55:43.696+0200 7f0fe89c86c0 -1 failed to decode message of type 24 v6: End of buffer
Aug 12 10:55:43 arthur kernel: libceph: mds0 (1)172.16.2.2:6801 socket closed (con state OPEN)
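
To get a feel for how large the flood is and which message type it involves, lines like the ones above can be tallied. The following is only a minimal sketch: the regular expression is derived from the excerpt shown here and nothing else, and the function name `summarize_decode_failures` is a hypothetical helper, not part of Ceph or Proxmox.

```python
import re
from collections import Counter

# Pattern derived from the journal excerpt above (an assumption, not an
# official Ceph log format definition).
DECODE_FAILURE = re.compile(
    r"ceph-mds\[\d+\]: (?P<ts>\S+) (?P<thread>[0-9a-f]+) -1 "
    r"failed to decode message of type (?P<type>\d+) v(?P<version>\d+)"
)


def summarize_decode_failures(lines):
    """Count decode failures per (message type, version) and per thread id.

    Lines that do not match (for example the kernel libceph lines) are skipped.
    """
    by_type = Counter()
    by_thread = Counter()
    for line in lines:
        match = DECODE_FAILURE.search(line)
        if match is None:
            continue
        by_type[(int(match["type"]), int(match["version"]))] += 1
        by_thread[match["thread"]] += 1
    return by_type, by_thread


if __name__ == "__main__":
    import sys

    # Example: journalctl -u 'ceph-mds@*' | python3 this_script.py
    types, threads = summarize_decode_failures(sys.stdin)
    for (msg_type, version), count in types.most_common():
        print(f"type {msg_type} v{version}: {count} failures")
    for thread, count in threads.most_common():
        print(f"thread {thread}: {count} failures")
```

Whether the failures all share one message type and how they spread across threads is useful detail to include when asking for help or filing a bug report.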

The logs of the other nodes contain lines like:

Code:
Aug 12 10:56:02 trillian kernel: libceph: mds0 (1)172.16.2.2:6801 socket error on write

Interestingly enough, only the guests on some nodes were unreachable during the incident.
The Ceph status was healthy when all nodes were up.

The MDS could not be stopped via the GUI.
Everything went back to normal after the node holding the active MDS was rebooted.

Has anybody experienced a situation like this?
What could be the reason why the MDS stopped working correctly?
 