[SOLVED] Ceph Mount Issues on Additional PVE Cluster Servers – Need Help

skydiablo

Hello!

I have a PVE cluster of 8 servers that has been running for quite some time (since the PVE v5 days) and has always been kept up to date. A big shoutout to the developers, everything runs very smoothly!

Three of the servers act as a Ceph cluster (with 10 disks each), and two additional servers are used for VMs. Everything is working without any issues! I've also always had a CephFS running, and the two VM servers have this mounted as well, which works perfectly. Now, I have three more servers where I wanted to mount the CephFS too. So, I went into the storage settings and added the additional servers there. In the GUI, I can see that the CephFS is listed, but it shows a "?" and I can't access the CephFS from these servers.

I've tried and tested many things, but haven't had any real success yet. From the systemd unit "mnt-pve-cephfs.mount" I could see in the journal that it couldn't find the config file (/etc/ceph/ceph.conf). I made the file available via a symlink to the one under /etc/pve/, and then it reported:

Code:
auth: unable to find a keyring on /etc/pve/priv/ceph.client.guest.keyring: (2) No such file or directory

This is correct; the file doesn't exist. However, there is a file with "admin" instead of "guest." So, for testing purposes, I copied the "admin.keyring" file as "guest.keyring." Now it gives me this message:

Code:
mount error: no mds (Metadata Server) is up. The cluster might be laggy, or you may not be authorized
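
If it helps with debugging: I assume the relevant things to look at for an error like this are whether an MDS is up at all and which client keys/caps exist, roughly like this on one of the Ceph nodes (nothing specific to my setup in here):

Code:
# overall cluster health and MDS/filesystem state
ceph -s
ceph fs status
# which client keys exist and what caps they carry
ceph auth ls
ceph auth get client.admin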

Somehow, it's not working as it should. Interestingly, none of this is necessary on the VM servers where CephFS runs without any issues. There is no "ceph.conf" under "/etc/ceph/" nor any other special settings I can see. I can remove and re-add the CephFS in the GUI on the VM servers, and it always works seamlessly.

I should mention that the other Ceph storages don't work either: I can't mount the plain (RBD) volume storages on the new servers, although this works fine on the VM servers. What am I doing wrong? The additional servers are in the same network, and I can ping the Ceph servers from them. Somehow, it all feels a bit strange. It's probably something very simple that I overlooked, something I did on the VM servers back then and have since forgotten.
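
For what it's worth, the storage definition itself is cluster-wide in /etc/pve/storage.cfg, and a "nodes" line in an entry restricts which nodes activate it. Comparing that and the per-node status between a working VM node and one of the new nodes would look roughly like this (assuming the default storage ID cephfs; adjust to the actual IDs):

Code:
# cluster-wide storage definitions; check the nodes/content lines of the ceph entries
cat /etc/pve/storage.cfg
# per-node view; run on a working VM node and on one of the new nodes and compare
pvesm status
pvesm status --storage cephfs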

Does anyone have an idea what might be causing this? I'm grateful for any hints!

Best regards and thanks in advance,
Volker
 
A bit more info would be useful. Are the new nodes part of the same cluster?
Can they reach the Ceph nodes on the Ceph public network?
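
Something like this, run on one of the new nodes, should answer both:

Code:
# is the node part of the PVE cluster and does it see the shared config?
pvecm status
ls -l /etc/pve/ceph.conf /etc/pve/priv/ceph/
# which network is used as the Ceph public network?
grep -E 'public_network|mon_host' /etc/pve/ceph.conf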
 
I have tried every solution I could find, but I cannot identify the source of the problem.

To recap: I have 3 servers running as a Ceph cluster and 2 other servers dedicated to VMs. This is the existing setup, running on the latest versions (PVE 8.3 and Ceph Squid), and everything works fine.

I added new servers to the cluster, which Proxmox detects and manages without any issues. However, when trying to integrate Ceph on the new servers, I always get a timeout.

I created a symlink on the new server:
/etc/ceph/ceph.conf -> /etc/pve/ceph.conf
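
In other words, something like:

Code:
ln -s /etc/pve/ceph.conf /etc/ceph/ceph.conf
ls -l /etc/ceph/ceph.conf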

The systemd service mnt-pve-cephfs.mount fails to mount CephFS, and when I run the command manually, I get a timeout:

Code:
# /bin/mount -v 100.64.0.50,100.64.0.51,100.64.0.52:/ /mnt/pve/cephfs -t ceph -o name=admin,secretfile=/etc/pve/priv/ceph/cephfs.secret
parsing options: rw,name=admin,secretfile=/etc/pve/priv/ceph/cephfs.secret
mount.ceph: options "name=admin".
invalid new device string format
mount.ceph: resolved to: "100.64.0.50,100.64.0.51,100.64.0.52"
mount.ceph: trying mount with old device syntax: 100.64.0.50,100.64.0.51,100.64.0.52:/
mount.ceph: options "name=admin,key=admin,fsid=00d789f8-da3b-4a4c-902a-14f9bcc48ec8" will pass to kernel
mount error 110 = Connection timed out
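
For comparison, this is roughly what I look at on one of the working VM nodes versus a new node (nothing assumed beyond the mount path and secret file from above):

Code:
# how the kernel actually mounted CephFS on a working node
findmnt -t ceph
grep ceph /proc/mounts
# the secret file referenced by the mount; /etc/pve is shared, so it should exist on every cluster node
ls -l /etc/pve/priv/ceph/cephfs.secret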

The dmesg output shows repeated "session lost, hunting for new mon" messages:


Code:
# dmesg | grep ceph
[69275.160742] libceph: mon2 (1)100.64.0.52:6789 session established
[69306.761069] libceph: mon2 (1)100.64.0.52:6789 session lost, hunting for new mon
...
[69549.969729] libceph: mon1 (1)100.64.0.51:3300 socket closed (con state V1_BANNER)
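
I am not sure whether it matters, but 3300 is normally the msgr2 port, and the log above shows a connection in a V1_BANNER state being closed there, so it is probably worth comparing what the monitors actually advertise with what the clients are told (run on one of the Ceph nodes):

Code:
# addresses/protocols the monitors advertise (v2 is usually :3300, v1 is :6789)
ceph mon dump
# the mon_host line handed to clients on this node
grep mon_host /etc/pve/ceph.conf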

The same setup works fine on the older servers.

From the new servers, I can ping and telnet to the Ceph servers on ports 6789 and 3300, and tcpdump shows traffic in both directions on port 3300.
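
For reference, these are the kinds of checks I did from one of the new nodes (shown here only for the first monitor):

Code:
ping -c 3 100.64.0.50
nc -vz -w 3 100.64.0.50 6789
nc -vz -w 3 100.64.0.50 3300
# watch the monitor traffic while retrying the mount in a second shell
tcpdump -ni any port 3300 and host 100.64.0.50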

I am unable to pinpoint the issue. What can I check or configure to resolve this?
 
