[SOLVED] Ceph not mounting

meisuser · Jun 3, 2024

Hi, I have a 3 node cluster and things were running well for several months, suddenly I had found that some VMs were fenced (probably using that term wrong) and in my effort to troubleshoot found that Ceph was not mounting. Running ceph -s doesn't work, the GUI 500 errors if I try viewing any Ceph stuff and most of the errors I can find seem to not really turn up much or if they do the problem seems to be different. The monitors seem to be running but are spamming with:
[ 3725.264440] ceph: No mds server is up or the cluster is laggy
[ 3734.447225] libceph: mon1 (1)192.168.6.21:6789 socket closed (con state V1_BANNER)
[ 3734.704469] libceph: mon1 (1)192.168.6.21:6789 socket closed (con state V1_BANNER)
[ 3735.208468] libceph: mon1 (1)192.168.6.21:6789 socket closed (con state V1_BANNER)
[ 3736.656702] libceph: mon1 (1)192.168.6.21:6789 socket closed (con state V1_BANNER)
[ 3740.304488] libceph: mon0 (1)192.168.6.20:6789 socket closed (con state OPEN)

So I looked to see if mds is running and it seems to be restarting over and over with this in the logs
ceph-mds[9702]: failed to fetch mon config (--no-mon-config to skip)

Some other troubleshooting I tried was seeing if I could communicate over the storage network to all the nodes, pings work, connecting to 6789 on each node to itself works but fails between nodes, even with the pve-firewall disabled. I have no idea if that is normal but since I normally work on ESXi and am evaluating this as a replacement I am totally a noob.

I am on PVE 8.2.2 upgraded a few weeks before, I also at one point seemed to upgrade Ceph to the newer version successfully. The storage network is separate, and I am not sure what other details one might need to help but let me know if I didn't provide enough info.

rit1001 · Jun 3, 2024

What do you get returned if you run ceph -s from the console shell on each of your nodes?

meisuser · Jun 3, 2024

It just hangs, I haven't tested for longer than 10-15 mins, but I can fire up tmux and leave it if it might eventually return.

rit1001 · Jun 3, 2024

This thread may help as it has two ideas

https://forum.proxmox.com/threads/ceph-status-times-out.68173/

- check how much disk space is free on (I guess) the proxmox system volume of each node as the poster found that ceph-mon could not load due to lack of space.

- link to proxmox support notes on ceph

meisuser · Jun 3, 2024

Okay seems to have plenty of space on all 3:

Filesystem Size Used Avail Use% Mounted on
udev 95G 0 95G 0% /dev
tmpfs 19G 1.8M 19G 1% /run
/dev/mapper/pve-root 78G 17G 57G 24% /
tmpfs 95G 66M 95G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
efivarfs 64K 45K 15K 76% /sys/firmware/efi/efivars
/dev/sda2 1022M 356K 1022M 1% /boot/efi
/dev/fuse 128M 80K 128M 1% /etc/pve
tmpfs 95G 24K 95G 1% /var/lib/ceph/osd/ceph-1
tmpfs 95G 24K 95G 1% /var/lib/ceph/osd/ceph-2
tmpfs 95G 24K 95G 1% /var/lib/ceph/osd/ceph-0
tmpfs 19G 0 19G 0% /run/user/0

Filesystem Size Used Avail Use% Mounted on
udev 95G 0 95G 0% /dev
tmpfs 19G 1.8M 19G 1% /run
/dev/mapper/pve-root 78G 9.6G 65G 13% /
tmpfs 95G 51M 95G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
efivarfs 64K 17K 43K 28% /sys/firmware/efi/efivars
/dev/sda2 1022M 356K 1022M 1% /boot/efi
/dev/fuse 128M 80K 128M 1% /etc/pve
tmpfs 95G 24K 95G 1% /var/lib/ceph/osd/ceph-4
tmpfs 95G 24K 95G 1% /var/lib/ceph/osd/ceph-3
tmpfs 95G 24K 95G 1% /var/lib/ceph/osd/ceph-5
tmpfs 19G 0 19G 0% /run/user/0

Filesystem Size Used Avail Use% Mounted on
udev 95G 0 95G 0% /dev
tmpfs 19G 1.8M 19G 1% /run
/dev/mapper/pve-root 78G 9.5G 65G 13% /
tmpfs 95G 51M 95G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
efivarfs 64K 12K 48K 19% /sys/firmware/efi/efivars
/dev/sda2 1022M 356K 1022M 1% /boot/efi
/dev/fuse 128M 80K 128M 1% /etc/pve
tmpfs 95G 24K 95G 1% /var/lib/ceph/osd/ceph-7
tmpfs 95G 24K 95G 1% /var/lib/ceph/osd/ceph-6
tmpfs 95G 24K 95G 1% /var/lib/ceph/osd/ceph-8
tmpfs 19G 0 19G 0% /run/user/0

I did get the ceph -s to eventually timeout but I think this is just a standard timeout message:
2024-06-03T10:53:48.383-0400 778ad60006c0 0 monclient(hunting): authenticate timed out after 300
[errno 110] RADOS timed out (error connecting to the cluster)

rit1001 · Jun 3, 2024

What output do you get from running

pveceph status

and

pvecm status

meisuser · Jun 3, 2024

The first one returns pretty quick that "ceph -s" timed out. The other returns this on all 3 (the "local" at the bottom just moves between whichever I run it on):

Cluster information
-------------------
Name: node0
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Jun 3 17:49:22 2024
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.292
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.64.6.x (local)
0x00000002 1 10.64.6.x
0x00000003 1 10.64.6.x

rit1001 · Jun 4, 2024

OK, I can honestly say that I have reached the limits of my understanding of the processes involved, but hopefully, someone else can build on the info you have provided.

One last thing can you post the output of

ps -A | grep ceph

As that will show which ceph processes are running

meisuser · Jun 4, 2024

Thanks for the help anyway! The one that stands out is ceph-crash, but I think that is a crash dump generator and is probably triggered by the part that is restarting over and over:

ceph 2212 1 0 10:14 ? 00:00:00 /usr/bin/python3 /usr/bin/ceph-crash
ceph 3095 1 0 10:14 ? 00:02:04 /usr/bin/ceph-mon -f --cluster ceph --id localhost --setuser ceph --setgroup ceph
root 3598 2 0 10:14 ? 00:00:00 [kworker/R-ceph-]
root 28226 2 0 15:40 ? 00:00:00 [kworker/23:0-ceph-msgr]
ceph 40822 1 0 18:54 ? 00:00:00 /usr/bin/ceph-mds -f --cluster ceph --id pmox0 --setuser ceph --setgroup ceph
root 40869 3356 0 18:55 ? 00:00:00 systemctl start mnt-pve-cephfs.mount
root 40870 1 0 18:55 ? 00:00:00 /bin/mount 192.168.6.x,192.168.6.x,192.168.6.x:/ /mnt/pve/cephfs -t ceph -o name=admin,secretfile=/etc/pve/priv/ceph/cephfs.secret,conf=/etc/pve/ceph.conf,fs=cephfs
root 40871 40870 0 18:55 ? 00:00:00 /sbin/mount.ceph 192.168.6.x,192.168.6.x,192.168.6.x:/ /mnt/pve/cephfs -o rw name admin secretfile /etc/pve/priv/ceph/cephfs.secret conf /etc/pve/ceph.conf fs cephfs
root 40877 2 0 18:55 ? 00:00:00 [kworker/R-ceph-]
root 40878 2 0 18:55 ? 00:00:00 [kworker/R-ceph-]
ceph 40898 1 2 18:56 ? 00:00:00 /usr/bin/ceph-mgr -f --cluster ceph --id localhost --setuser ceph --setgroup ceph

meisuser · Jun 5, 2024

If I was getting successful backups to a Proxmox Backup server, can I safely blow this cluster away and rebuild it? I think the issue is that the Ceph monitors can't connect node to node even though there does not appear to be any network issues other than the remote monitors rejecting connections. Locally from each node I get this output from nmap:
PORT STATE SERVICE
22/tcp open ssh
25/tcp open smtp
111/tcp open rpcbind
1311/tcp open rxmon
3128/tcp open squid-http
3300/tcp open ceph
6789/tcp open ibm-db2-admin

ss shows the Ceph monitor listening on 0.0.0.0

However to the other remote nodes:
PORT STATE SERVICE
22/tcp open ssh
111/tcp open rpcbind
1311/tcp open rxmon
3128/tcp open squid-http
MAC Address: BC:30:5B:EE:04:B8 (Dell)

I tried disabling the pve firewall with pve-firewall stop, that didn't seem to do anything, iptables and nft don't seem to show any rules at all. I can ssh to each node from each other node over the storage network so I am pretty sure the networking is okay. I can't seem to find any actual reason or error log that shows why the monitors can't communicate.

meisuser · Jun 5, 2024

Oddly when I navigate in the UI to Ceph on the individual nodes, two of them claim Ceph is not installed at all, even though apt search ceph | grep installed shows it is.

meisuser · Jun 5, 2024

Well that was the answer, I don't know what the heck happened, but everything *except* the ceph-mon package was somehow installed on two of the nodes. This was previously working so that does not make much sense. On top of that the apt source for ceph on one of the nodes went back to the enterprise subscription and even after editing it to match the others I couldn't get apt to behave until I copied the lists from the one functional node.

After reinstalling the missing package on one node it was able to get a quorum and all the storage appeared. I have one VM (that I know of) that seems to have become corrupted so I might have to restore that from backup.

Search

Search

[SOLVED] Ceph not mounting

meisuser

Member

rit1001

New Member

meisuser

Member

rit1001

New Member

meisuser

Member

rit1001

New Member

meisuser

Member

rit1001

New Member

meisuser

Member

meisuser

Member

meisuser

Member

meisuser

Member

We value your privacy