[SOLVED] CephFS read-only - MDS don't report being in read only mode?

scyto

I had a weird CephFS issue whose root cause I couldn't find.
I only noticed it on the two nodes that had opt-in 6.14 kernels, though that could just be coincidence.
I have reverted to the 6.8 kernel on those two nodes.
I am back up and working, but I want to do some log archaeology to see if I can figure out what went wrong.
This all happened while I was playing with bridging two Thunderbolt ports from my ceph-mesh-network on just a single Proxmox/Ceph node.

Symptoms:
  • on any PVE host I could ls /mnt/pve/cephfs-name
  • I could read files
  • I could touch a new file and see the 0-byte file get created
  • I could start nano to create a file, but when I went to save it the write hung (and hung the SSH session running nano too)
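For anyone wanting to reproduce that last symptom without locking up their SSH session, here is a minimal sketch of a write probe. `probe_write` is a made-up helper name (not a Proxmox or Ceph tool), and the 10 second default is an arbitrary assumption. It runs the write in the background and polls it, so the shell stays usable even if the write never returns.

```shell
# Hypothetical helper (not part of Proxmox or Ceph): probe_write DIR [SECONDS]
# Writes one 4 KiB block with fsync in the background and polls for completion,
# so a hung write does not hang the calling shell.
probe_write() {
    dir=$1
    limit=${2:-10}                      # seconds to wait; 10 is an arbitrary default
    probe="$dir/.write-probe.$$"
    dd if=/dev/zero of="$probe" bs=4k count=1 conv=fsync 2>/dev/null &
    pid=$!
    i=0
    # Poll once per second until dd exits or the limit is reached
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$i" -ge "$limit" ]; then
            # dd is left running in the background; it may be stuck in the kernel
            echo "write to $dir still pending after ${limit}s (pid $pid)"
            return 1
        fi
        sleep 1
        i=$((i + 1))
    done
    if wait "$pid"; then
        rm -f "$probe"
        echo "write to $dir completed"
    else
        echo "write to $dir failed"
        return 2
    fi
}
```

Something like `probe_write /mnt/pve/docker-cephFS 15` on each node would show which hosts can still complete a write, as opposed to only creating the 0-byte file.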
What i tried:
  1. I tried restarting the MDS
  2. I tried restarting the OSDs
  3. I tried rebooting nodes 1 and 3 without the opt-in kernel (node 2 always had the 6.8.x kernel, so it didn't get rebooted as part of that test)
  4. I removed the PWL cache settings and the rbd plugin setting (I had introduced these a few days ago)
  5. in the end I shut down the cluster gracefully and then brought the nodes up one by one, waiting for things to converge each time - this seemed to do the trick
None of my MDSs ever went into read-only mode (I grepped the journal to be sure).

I am in a stable position now. Before I try to get Thunderbolt bridging working again (and risk tanking the system), what can I search for in the journal that might help?
I have already tried:
Code:
journalctl | grep -i "docker-cephFS"
journalctl -u 'ceph-mds@*' | grep -i "read only"
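One more place worth searching, as a sketch rather than a known fix: the CephFS kernel client logs to the kernel ring buffer under the `libceph` and `ceph:` prefixes, so those messages never appear under the ceph-mds units. `ceph_client_lines` below is a made-up helper name, and whether it turns up anything useful for this particular hang is an assumption.

```shell
# Hypothetical helper: keep only lines that look like kernel CephFS client messages.
# "libceph" and "ceph: " are the prefixes the in-kernel client logs under.
ceph_client_lines() {
    grep -iE 'libceph|ceph: '
}

# Usage: kernel messages from the current boot, filtered for the Ceph client
# journalctl -k -b --no-pager | ceph_client_lines
```

The trailing space in `ceph: ` is deliberate so that the pvestatd lines mentioning 'docker-cephFS' are not matched.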

I have seen a bunch of these from the time I was having issues, but they only confirm that there was a problem:

Apr 25 15:06:00 pve3 pvestatd[1835]: unable to activate storage 'docker-cephFS' - directory '/mnt/pve/docker-cephFS' does not exist or is unreachable
Apr 25 15:06:08 pve3 pvestatd[1835]: unable to activate storage 'docker-cephFS' - directory '/mnt/pve/docker-cephFS' does not exist or is unreachable
Apr 25 15:06:45 pve3 systemd[1]: mnt-pve-docker\x2dcephFS.mount: Directory /mnt/pve/docker-cephFS to mount over is not empty, mounting anyway.
Apr 25 15:06:45 pve3 systemd[1]: Mounting mnt-pve-docker\x2dcephFS.mount - /mnt/pve/docker-cephFS...
Apr 25 15:06:45 pve3 systemd[1]: Mounted mnt-pve-docker\x2dcephFS.mount - /mnt/pve/docker-cephFS.
Apr 25 15:09:44 pve3 pvestatd[1835]: unable to activate storage 'docker-cephFS' - directory '/mnt/pve/docker-cephFS' does not exist or is unreachable
Apr 25 15:10:20 pve3 pvestatd[1835]: unable to activate storage 'docker-cephFS' - directory '/mnt/pve/docker-cephFS' does not exist or is unreachable
Apr 25 15:10:22 pve3 pvestatd[1835]: unable to activate storage 'docker-cephFS' - directory '/mnt/pve/docker-cephFS' does not exist or is unreachable
Apr 25 15:38:42 pve3 pvestatd[1835]: unable to activate storage 'docker-cephFS' - directory '/mnt/pve/docker-cephFS' does not exist or is unreachable
Apr 25 15:38:59 pve3 pvestatd[1835]: unable to activate storage 'docker-cephFS' - directory '/mnt/pve/docker-cephFS' does not exist or is unreachable
Apr 25 15:39:01 pve3 pvestatd[1835]: unable to activate storage 'docker-cephFS' - directory '/mnt/pve/docker-cephFS' does not exist or is unreachable
Apr 25 15:39:18 pve3 pvestatd[1835]: unable to activate storage 'docker-cephFS' - directory '/mnt/pve/docker-cephFS' does not exist or is unreachable
Apr 25 15:39:32 pve3 pvestatd[1835]: unable to activate storage 'docker-cephFS' - directory '/mnt/pve/docker-cephFS' does not exist or is unreachable
Apr 25 15:40:00 pve3 pvedaemon[197243]: unable to activate storage 'docker-cephFS' - directory '/mnt/pve/docker-cephFS' does not exist or is unreachable
Apr 25 15:40:02 pve3 pvedaemon[187221]: unable to activate storage 'docker-cephFS' - directory '/mnt/pve/docker-cephFS' does not exist or is unreachable
Apr 25 15:40:03 pve3 pvestatd[1835]: unable to activate storage 'docker-cephFS' - directory '/mnt/pve/docker-cephFS' does not exist or is unreachable
Apr 25 15:40:20 pve3 pvestatd[1835]: unable to activate storage 'docker-cephFS' - directory '/mnt/pve/docker-cephFS' does not exist or is unreachable
Apr 25 15:40:35 pve3 systemd[1]: mnt-pve-docker\x2dcephFS.mount: Directory /mnt/pve/docker-cephFS to mount over is not empty, mounting anyway.
Apr 25 15:40:35 pve3 systemd[1]: Mounting mnt-pve-docker\x2dcephFS.mount - /mnt/pve/docker-cephFS...
Apr 25 15:42:05 pve3 systemd[1]: Failed to mount mnt-pve-docker\x2dcephFS.mount - /mnt/pve/docker-cephFS.
Apr 25 15:42:05 pve3 systemd[1]: mnt-pve-docker\x2dcephFS.mount: Directory /mnt/pve/docker-cephFS to mount over is not empty, mounting anyway.

I updated from Reef to Squid this morning as it was on my todo list, and everything seems great so far - this is just about trying to figure out what happened and why.
OK, with the help of ChatGPT I think this was ultimately caused by invalid settings I had added to frr.conf, which made the network unstable and caused OSDs and mons to get into weird states.

It's odd that the cluster came back up with those settings still in place... but I think the issue may only manifest on an frr.service reload - I even had a stack trace or two.

As such, marking as resolved (and I removed the bad settings).

[For anyone interested - ChatGPT helped me create Ceph- and time-filtered log queries, and I narrowed down the time when things went from good to bad. I then had it analyze the logs from all events in that window, and it found the root cause: the frr crash / flapping and the bad frr settings.]
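A rough sketch of what such a time-filtered query can look like (not the exact queries I ran). `ceph_window` is a made-up helper name, and the timestamps in the usage line are assumptions: the log excerpt above only shows Apr 25, so the year is a guess.

```shell
# Hypothetical helper: ceph_window SINCE UNTIL
# Shows journal entries from all ceph-* units plus frr.service inside one time window.
ceph_window() {
    journalctl --no-pager -o short-iso \
        --since "$1" --until "$2" \
        -u 'ceph-*' -u frr.service
}

# Usage (the year is an assumption, the source logs only show "Apr 25"):
# ceph_window "2025-04-25 15:00:00" "2025-04-25 15:45:00"
```

Starting with a wide window and shrinking it until the first bad entry is at the top is how the good-to-bad transition can be narrowed down.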