We had an unexpected reboot of a cluster node. The node is part of our ceph setup.
Approx. one minute after corosync reported that a new membership was formed, the log of the host holding the active metadata service was flooded with lines like:
Code:
Aug 12 10:55:43 arthur ceph-mds[2717]: 2024-08-12T10:55:43.444+0200 7f0fe91c96c0 -1 failed to decode message of type 24 v6: End of buffer
Aug 12 10:55:43 arthur ceph-mds[2717]: 2024-08-12T10:55:43.444+0200 7f0fe89c86c0 -1 failed to decode message of type 24 v6: End of buffer
Aug 12 10:55:43 arthur ceph-mds[2717]: 2024-08-12T10:55:43.444+0200 7f0fe81c76c0 -1 failed to decode message of type 24 v6: End of buffer
Aug 12 10:55:43 arthur ceph-mds[2717]: 2024-08-12T10:55:43.444+0200 7f0fe89c86c0 -1 failed to decode message of type 24 v6: End of buffer
Aug 12 10:55:43 arthur kernel: libceph: mds0 (1)172.16.2.2:6801 socket closed (con state OPEN)
Aug 12 10:55:43 arthur ceph-mds[2717]: 2024-08-12T10:55:43.696+0200 7f0fe91c96c0 -1 failed to decode message of type 24 v6: End of buffer
Aug 12 10:55:43 arthur ceph-mds[2717]: 2024-08-12T10:55:43.696+0200 7f0fe89c86c0 -1 failed to decode message of type 24 v6: End of buffer
Aug 12 10:55:43 arthur kernel: libceph: mds0 (1)172.16.2.2:6801 socket closed (con state OPEN)
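For what it's worth, the MDS state at that point can be inspected with the standard Ceph CLI from any node, along these lines:

Code:
# which MDS is active and in what state (up:active, up:replay, ...)
ceph fs status
# overall cluster health, including any MDS-related warnings
ceph health detail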
The logs of the other nodes contain lines like:
Code:
Aug 12 10:56:02 trillian kernel: libceph: mds0 (1)172.16.2.2:6801 socket error on write
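On the client side, the kernel CephFS mounts and their MDS sessions can be checked roughly like this (assuming debugfs is mounted at /sys/kernel/debug):

Code:
# list the kernel CephFS mounts on the node
mount -t ceph
# state of the MDS sessions held by the kernel client (needs debugfs)
cat /sys/kernel/debug/ceph/*/mds_sessions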
Interestingly enough, only the guests on some nodes were unreachable during the incident.
The ceph status was healthy when all nodes were up.
The MDS could not be stopped via the GUI.
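As far as I understand, the CLI alternative would be to stop the daemon via systemd or to fail it so a standby takes over, roughly like this (assuming the MDS id matches the hostname):

Code:
# stop the MDS daemon directly on the node that runs it
systemctl stop ceph-mds@arthur.service
# or mark it failed so a standby MDS takes over
ceph mds fail arthur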
Everything went back to normal after the node holding the active MDS was rebooted.
Has anybody experienced a situation like this?
What could be the reason why the MDS stopped working correctly?