Cannot connect to external Ceph Nautilus "rbd error: rbd: listing images failed"

Leo David

Member
Apr 25, 2017
102
3
23
41
Hi,
We had a pretty stable 5-node Proxmox cluster running "pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)".
Things turned bad right after one of the external Ceph monitors died; after reinstalling it, its version went from 14.2.9 to 14.2.16.
For consistency, all the Ceph nodes (including the other 2 monitors) were then upgraded to 14.2.16.
Now the situation is like this:
- One node for some reason still has the Luminous client (not sure why, since all PVE hosts are running "6.3-3/eee5f901" and were updated at the same time from the same repositories). This PVE host is the only one that can list RBD images and run the VMs.
- The other 4 PVE nodes, which have the Nautilus client, can no longer list RBD images or run VMs.
The only temporary workaround to make these PVE nodes connect to the external Ceph cluster and start VMs was to copy ceph.conf and the keyring to the /etc/ceph directory (this was just a desperate measure to get the critical VMs back online).
The message displayed in the UI when trying to list the disk images from the nodes with the Nautilus client is:

"rbd error: rbd: listing images failed: (2) No such file or directory (500)"

Any thoughts about this weird behaviour?
Does the new Nautilus client not work with the usual way of adding an external Ceph keyring in "/etc/pve/priv/ceph"?
Please let me know your thoughts, I am out of ideas...
Thank you so much !

Leo
 

Leo David

Hi,
This is getting way too weird, and I just can't find the actual cause. Ceph is healthy and one of the PVE nodes can connect normally, but the other PVE servers throw that error. Has anyone experienced something similar?
Also, I am not sure which logs to check to find out what is going wrong here...
Thank you,

Leo
 

Leo David

Hi,
Placing ceph.conf and the keyring for the external cluster on the nodes was not such a good idea: the VMs work fine for a while, then they get stuck.
So at the moment the only PVE host that can run VMs and also list images from the external Ceph cluster is the one with the Luminous client.
On the other PVE nodes, which have the Nautilus client, I can only run VMs placed on local storage.
Any chance that it is a version mismatch? I just do not understand how to address these issues; the behaviour is way too odd, and a lot of VMs cannot be started...
Cheers,

Leo
 

Leo David

Hi,
Just an update, in case this info helps with any suggestions.
The nodes that are not able to list or run VM disks are still able to display the summary page with the external Ceph usage.
They just throw the "rbd error: rbd: listing images failed: (2) No such file or directory (500)" error, as if the "rbd" pool did not exist. But the rbd pool does exist, of course, and it is normally accessible from the node that runs the v12 (Luminous) Ceph client.
Any thoughts ?

Cheers,

Leo
 

Alwin

Proxmox Staff Member
Staff member
Aug 1, 2017
4,617
431
88
> - One node for some reason still has the Luminous client (not sure why, since all PVE hosts are running "6.3-3/eee5f901" and were updated at the same time from the same repositories). This PVE host is the only one that can list RBD images and run the VMs.
As the installed version or only as the running version? If the latter, you will need to migrate any VM/CT to another node and back, since the existing connection is kept across an update.
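
One way to compare the two, as a minimal sketch: `ceph -v` reports the version of the client binary on disk, and the package manager shows which packages provide it. The helper name `ceph_release` is made up for illustration, and the sample line is copied from output posted later in this thread.

```shell
#!/bin/sh
# Hypothetical helper: reduce `ceph -v` output to "<version> <release>".
ceph_release() {
    awk '$1 == "ceph" && $2 == "version" { print $3, $5 }'
}

# Sample line copied from this thread; on a node, pipe in `ceph -v` instead.
echo 'ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)' | ceph_release

# On a PVE node, the installed packages can be compared with the binary:
#   dpkg-query -W ceph-common librbd1
#   ceph -v | ceph_release
```

Guests that were started before the update keep using the client they were started with until they are migrated or restarted, which is why the installed and running versions can differ.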

> The only temporary workaround to make these PVE nodes connect to the external Ceph cluster and start VMs was to copy ceph.conf and the keyring to the /etc/ceph directory (this was just a desperate measure to get the critical VMs back online).
Usually ceph.conf is a symlink to /etc/pve/ceph.conf, and ceph.client.admin.keyring is copied into the /etc/ceph folder on creation. But this is only needed if the Ceph cluster is installed (hyper-converged) on that PVE cluster. Otherwise the storage config is sufficient.

> The nodes that are not able to list or run VM disks are still able to display the summary page with the external Ceph usage.
> They just throw the "rbd error: rbd: listing images failed: (2) No such file or directory (500)" error, as if the "rbd" pool did not exist. But the rbd pool does exist, of course, and it is normally accessible from the node that runs the v12 (Luminous) Ceph client.
What version is the Ceph cluster? And what do you mean by external Ceph usage? Isn't the cluster hyper-converged?
 

Leo David

Hi Alwin,
Thank you so much for helping me out with this issue.
Ceph is running on a 7-node external cluster, attached to the PVE cluster as per the documentation. Everything worked normally until we ran into this problem.
As for versions:

[root@ceph1 ~]# ceph mon versions
{
    "ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 3
}

ceph mon dump
dumped monmap epoch 3
epoch 3
fsid 053b3af8-f1a6-4ebd-9512-bf5fe1341c6a
last_changed 2021-01-07 14:04:38.082529
created 2019-09-24 16:48:27.114309
min_mon_release 14 (nautilus)
0: [v2:10.10.6.2:3300/0,v1:10.10.6.2:6789/0] mon.ceph2
1: [v2:10.10.6.3:3300/0,v1:10.10.6.3:6789/0] mon.ceph3
2: [v2:10.10.6.1:3300/0,v1:10.10.6.1:6789/0] mon.ceph1

Also regarding pve cluster details:

pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 c6220-ch1-node1
         2          1 c6220-ch1-node2
         3          1 c6220-ch1-node3
         4          1 c6220-ch1-node4
         5          1 dell-r630-2 (local)

Node "dell-r630-2" is the only node that still works normally.

In "/etc/pve/storage.cfg" we have:

rbd: external-ceph-rbd
        content images
        krbd 1
        monhost 10.10.6.1 10.10.6.2 10.10.6.3
        pool rbd
        username admin

File /etc/pve/priv/ceph/external-ceph-rbd.keyring is also present.

By "display the summary page of the external ceph usage" I mean that on any PVE node, when I select the "external-ceph-rbd" storage and go to "Summary", it displays the space used.
But "VM Disks" only lists the disks from "dell-r630-2". When trying to list the disks from any other node, it throws that error. Somehow, it's as if the first 4 nodes are not able to list the content of the "rbd" pool, or just can't find that pool at all.

Now, taking each pve node:

root@dell-r630-2:~# pveversion
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)
root@dell-r630-2:~# ceph -v
ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)

root@c6220-ch1-node1:~# pveversion
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)
root@c6220-ch1-node1:~# ceph -v
ceph version 14.2.16 (5d5ae817209e503a412040d46b3374855b7efe04) nautilus (stable)

root@c6220-ch1-node2:~# pveversion
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)
root@c6220-ch1-node2:~# ceph -v
ceph version 14.2.16 (5d5ae817209e503a412040d46b3374855b7efe04) nautilus (stable)

root@c6220-ch1-node3:~# pveversion
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)
root@c6220-ch1-node3:~# ceph -v
ceph version 14.2.16 (5d5ae817209e503a412040d46b3374855b7efe04) nautilus (stable)

root@c6220-ch1-node4:~# pveversion
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)
root@c6220-ch1-node4:~# ceph -v
ceph version 14.2.16 (5d5ae817209e503a412040d46b3374855b7efe04) nautilus (stable)

The external ceph also looks happy:

[root@ceph1 ~]# ceph -s
  cluster:
    id:     053b3af8-f1a6-4ebd-9512-bf5fe1341c6a
    health: HEALTH_OK

  services:
    mon:     3 daemons, quorum ceph2,ceph3,ceph1 (age 30h)
    mgr:     ceph1(active, since 30h), standbys: ceph3, ceph2
    mds:     cephfs:1 {0=ceph3=up:active} 2 up:standby
    osd:     28 osds: 28 up (since 30h), 28 in (since 30h)
    rgw:     2 daemons active (ceph1.rgw0, ceph2.rgw0)
    rgw-nfs: 1 daemon active (ceph2)

  task status:
    scrub status:
        mds.ceph3: idle

  data:
    pools:   17 pools, 1073 pgs
    objects: 3.21M objects, 6.6 TiB
    usage:   20 TiB used, 18 TiB / 38 TiB avail
    pgs:     1072 active+clean
             1    active+clean+scrubbing+deep

  io:
    client:   6.2 KiB/s rd, 1.8 MiB/s wr, 4 op/s rd, 69 op/s wr

Any thoughts ?
Again, thank you so much for help !

Leo
 

Alwin

> monhost 10.10.6.1 10.10.6.2 10.10.6.3
Can these MONs be reached by all nodes in the cluster?

Anything in the log files of those nodes?
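
As a hedged starting point for that: the storage content listing can also be requested from a shell with `pvesm list`, which should surface the same error outside the GUI, and the journal of the PVE daemons can then be read for the same time window. The wrapper `run_if_present` is a made-up helper that only skips tools missing on the machine; the storage name is the one used in this thread.

```shell
#!/bin/sh
# Hypothetical helper: run a command only if its binary exists on this host.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skip: $1 not installed"
    fi
}

# List the storage content from the CLI (storage name from this thread).
run_if_present pvesm list external-ceph-rbd

# Recent messages from the PVE API and status daemons.
run_if_present journalctl -u pvedaemon -u pveproxy -u pvestatd \
    --since "15 minutes ago" --no-pager || echo "journalctl returned an error"
```

Running this on one failing node and on the working node makes it easier to spot what differs between them.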
 

Leo David

> Can these MONs be reached by all nodes in the cluster?
Yes, they can definitely be reached. I assume that displaying the external storage capacity usage would not be possible otherwise.

> Anything in the log files of those nodes?
I am not sure which logs to check that would reveal storage accessibility issues...
 

Leo David

I have just tried to connect to the same Ceph cluster from a different, single-instance PVE node that has:

pveversion
pve-manager/6.3-2/22f57405 (running kernel: 5.4.78-1-pve)

ceph -v
ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)

Everything works normally...

Again, the issue seems to be isolated to the nodes with the 14.x Nautilus client installed.
I was even thinking of downgrading the Ceph client on the affected nodes back to the 12.x client, but I am not sure how to do it safely without breaking the nodes (it would also remove important PVE packages that depend on it).
Of course, the main goal should be to upgrade, not to downgrade.
My issue is that I am just out of ideas about what the root cause could be.
Any chance this is related to the new v2 monitor protocol introduced in newer Ceph versions, where ceph.conf also looks a bit different because it includes the newly introduced port 3300/tcp along with the old, well-known 6789/tcp?
Thank you,

Leo
 

Alwin

> Again, the issue seems to be isolated to the nodes with the 14.x Nautilus client installed.
Yes, but your cluster is Nautilus as well. Hence the suggestion to look in the logs (any or all of them), to see what might be logged that can explain it.

> Any chance this is related to the new v2 monitor protocol introduced in newer Ceph versions, where ceph.conf also looks a bit different because it includes the newly introduced port 3300/tcp along with the old, well-known 6789/tcp?
After a restart of the MONs, they are listening on both ports, assuming the msgr v2 protocol has been activated. Is there a firewall in between?
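
A rough way to test both ports from a PVE node, as a sketch only: the function names and the script name in the usage line are made up, and the check relies on bash's `/dev/tcp` and the coreutils `timeout` command being available on the node.

```shell
#!/bin/sh
# Print host:port pairs for both monitor protocols:
# 3300/tcp (msgr v2) and 6789/tcp (msgr v1).
mon_endpoints() {
    for host in "$@"; do
        for port in 3300 6789; do
            echo "$host:$port"
        done
    done
}

# Attempt a TCP connect via bash's /dev/tcp; give up after 3 seconds.
check_endpoint() {
    ep_host=${1%:*}
    ep_port=${1##*:}
    if timeout 3 bash -c "exec 3<>/dev/tcp/$ep_host/$ep_port" 2>/dev/null; then
        echo "$1 open"
    else
        echo "$1 unreachable"
    fi
}

# Usage (hypothetical script name): sh check-mons.sh 10.10.6.1 10.10.6.2 10.10.6.3
for ep in $(mon_endpoints "$@"); do
    check_endpoint "$ep"
done
```

An "open" result only shows that the TCP port accepts connections; it does not prove that authentication or the messenger protocol negotiation succeeds.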
 

Leo David

> Is there a firewall in between?
No firewall in between, and iptables on the Ceph nodes is not blocking anything either.
What is a bit puzzling is that Ceph can be queried about its usage, but it cannot show information about the pool and its content...

Tailing "/var/log/pveproxy/access.log" when selecting "external-ceph-rbd" (which displays the current usage):

10.10.2.14 - root@pam [12/01/2021:11:37:57 +0000] "GET /api2/json/nodes/c6220-ch1-node1/storage/external-ceph-rbd/status HTTP/1.1" 200 141
10.10.2.14 - root@pam [12/01/2021:11:37:57 +0000] "GET /api2/json/cluster/tasks HTTP/1.1" 200 4935
10.10.2.14 - root@pam [12/01/2021:11:37:57 +0000] "GET /api2/json/cluster/resources HTTP/1.1" 200 4459
10.10.2.14 - root@pam [12/01/2021:11:37:59 +0000] "GET /api2/json/nodes/c6220-ch1-node1/storage/external-ceph-rbd/status HTTP/1.1" 200 141

When selecting "VM Disks":

10.10.2.14 - root@pam [12/01/2021:11:38:51 +0000] "GET /api2/json/nodes/c6220-ch1-node1/storage/external-ceph-rbd/content?content=images HTTP/1.1" 500 13
10.10.2.14 - root@pam [12/01/2021:11:38:52 +0000] "GET /api2/json/cluster/tasks HTTP/1.1" 200 4933

Then the UI displays: "rbd error: rbd: listing images failed: (2) No such file or directory (500)"

Are there any other PVE logs to check that would reveal something about storage connectivity?

Thank you so much for help !

Leo
 

Alwin

> Then the UI displays: "rbd error: rbd: listing images failed: (2) No such file or directory (500)"
Does a manual listing on the cli work? You will need to specify the MONs, user and key by hand.
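
A minimal sketch of such a manual listing, assuming the monitor addresses, pool, user and keyring path quoted from the storage definition earlier in this thread. The helper name `rbd_ls_cmd` is made up for illustration; it only prints the command so it can be reviewed before it is run on a node.

```shell
#!/bin/sh
# Hypothetical helper: print an `rbd ls` command that passes the MONs, user
# and keyring explicitly instead of relying on /etc/ceph/ceph.conf.
rbd_ls_cmd() {
    pool=$1
    mons=$2
    user=$3
    keyring=$4
    printf 'rbd ls -p %s -m %s -n client.%s --keyring %s\n' \
        "$pool" "$mons" "$user" "$keyring"
}

# Values taken from the storage definition posted in this thread.
rbd_ls_cmd rbd 10.10.6.1,10.10.6.2,10.10.6.3 admin \
    /etc/pve/priv/ceph/external-ceph-rbd.keyring
```

Running the printed command on a failing node and on dell-r630-2 should show whether the error comes from the rbd client itself or from the way PVE invokes it.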
 

Leo David

OK,
So I have followed the same path we previously used for the servers with issues.
1. Installed PVE 6.0.1 on a single node.
2. Upgraded to the latest 6.3-3 and rebooted.
   - Noticed that Ceph client v12 is still present.
3. Added the external Ceph cluster.
   - Everything is fine, VM disks are listed.
4. Created "/etc/apt/sources.list.d/ceph.list" and added the Ceph Nautilus repo to it:
   deb http://download.proxmox.com/debian/ceph-nautilus buster main
5. Issued "Upgrade" from the UI, which installed the 14.x (Nautilus) Ceph client.
6. Now the external Ceph storage behaves like on the other nodes.
   - Ceph storage usage is displayed, but trying to list the disks gives "rbd error: rbd: listing images failed: (2) No such file or directory (500)"

Any thoughts ?
 

Leo David

Hi,
Desperately looking for a solution, I have reinstalled a fresh single node with the latest ISO image.
The result was:

pveversion
pve-manager/6.3-2/22f57405 (running kernel: 5.4.73-1-pve)
ceph -v
ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)

After adding:

deb http://download.proxmox.com/debian/pve buster pve-no-subscription

and updating the system, I have:

pveversion
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)
ceph -v
ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)

So it is still on the Luminous client.

1. Is it expected that the Ceph v12 client stays installed even with the latest PVE version, and that it is only updated after adding the Ceph repository?

Now, coming to my root problem, it seems that the only way to get things working again would be to downgrade the Ceph client back to v12, so:

2. Is there a safe way to downgrade Ceph back to Luminous without breaking PVE components? If so, what would the steps be?

Thank you so much for helping me out !

Cheers,

Leo
 

Leo David

Hello everyone,
I have tried to downgrade the Ceph client on a PVE node that is a cluster member, but it does not seem possible. I ended up uninstalling PVE core packages, and when I reinstalled them, the v14 Ceph client was pulled back in, which got me into the same situation.
Any thoughts about a procedure to downgrade the Ceph client to v12? Is it even possible?
I am just trying to avoid reinstalling all 4 nodes from scratch just because of the Ceph client version.
Thank you,

Leo
 
