Cannot connect to external Ceph Nautilus "rbd error: rbd: listing images failed"

Leo David · Jan 8, 2021

Hi,
We had a pretty stable 5-node Proxmox cluster running "pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)".
The situation became very bad right after one of the external ceph monitors decided to die; after reinstalling it, its version went from 14.2.9 to 14.2.16.
So, for consistency, all the ceph nodes (including the other 2 monitors) were upgraded to 14.2.16.
Now the situation is like this:
- one node for some reason still has the luminous client (not sure why this is happening, since all pve hosts are running "6.3-3/eee5f901" and were updated at the same time, with the same source repos). This pve host is the only host that can list rbd images and also run the vms.
- the other 4 pve nodes, which have the nautilus client, are no longer able to list the rbd images or to run vms
The only temporary workaround to make these pve nodes connect to the external ceph and start vms was to copy ceph.conf and the keyring to the /etc/ceph directory (this was just a desperate measure to get the critical vms back online).
The message displayed in the UI when trying to list the disk images from the nodes with the nautilus client is:

"rbd error: rbd: listing images failed: (2) No such file or directory (500)"

Any thoughts about this weird behaviour?
Does the new nautilus client not work with the usual way of adding the external ceph keyring in "/etc/pve/priv/ceph"?
Please let me know your thoughts, I am just out of ideas...
Thank you so much !

Leo

Leo David · Jan 9, 2021

Hi,
This is getting way too weird, I just can't find a real cause. Ceph is healthy and one of the pve nodes can connect normally, but the other pve servers throw that error. Has anyone experienced something similar?
Also, I am not sure which logs to check to find out what is going wrong here...
Thank you,

Leo

Leo David · Jan 10, 2021

Hi,
Placing ceph.conf and the keyring for the external cluster on the nodes was not such a good idea: the vms work fine for a while, then they get stuck.
So at the moment the only pve host that can run vms and also list images from the external ceph cluster is the one with the luminous client.
On the other pve nodes, which have the nautilus client, I can only run vms placed on the local storage.
Any chance that it is a version mismatch? I just do not understand how to address these issues, the behaviour is way too odd, and a lot of vms cannot be started...
Cheers,

Leo

Leo David · Jan 11, 2021

Hi,
Just an update, in case this info helps with any suggestion.
The nodes that are not able to list or run vm disks are still able to display the summary page with the external ceph usage.
They just throw the "rbd error: rbd: listing images failed: (2) No such file or directory (500)" error, as if the "rbd" pool did not exist. But the rbd pool exists of course, and it is normally accessible from the node that runs the v12.x ceph client.
Any thoughts ?

Cheers,

Leo

Alwin · Jan 11, 2021

Leo David said:
- one node for some reason still has the luminous client (not sure why this is happening, since all pve hosts are running "6.3-3/eee5f901" and were updated at the same time, with the same source repos). This pve host is the only host that can list rbd images and also run the vms.

As installed version or only as running version? If the latter, you will need to migrate any VM/CT to another node and back, since the existing connection is kept across an update.

Leo David said:
The only temporary workaround to make these pve nodes connect to the external ceph and start vms was to copy ceph.conf and the keyring to the /etc/ceph directory (this was just a desperate measure to get the critical vms back online).

Usually the ceph.conf is a symlink to /etc/pve/ceph.conf and the ceph.client.admin.keyring is copied on creation into the /etc/ceph folder. But this is only needed if the Ceph cluster is installed (hyper-converged) on that cluster. Otherwise the storage config is sufficient.

Leo David said:
The nodes that are not able to list or run vm disks are still able to display the summary page with the external ceph usage.
They just throw the "rbd error: rbd: listing images failed: (2) No such file or directory (500)" error, as if the "rbd" pool did not exist. But the rbd pool exists of course, and it is normally accessible from the node that runs the v12.x ceph client.

What version is the Ceph cluster? And what do you mean by external ceph usage? Isn't the cluster hyper-converged?

Leo David · Jan 11, 2021

Hi Alwin,
Thank you so much for helping me out with this issue.
Ceph is running on an external 7-node cluster, attached to the pve cluster as per the documentation. Everything worked normally until we ran into this problem.
As for the versions:

[root@ceph1 ~]# ceph mon versions
{
    "ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 3
}

ceph mon dump
dumped monmap epoch 3
epoch 3
fsid 053b3af8-f1a6-4ebd-9512-bf5fe1341c6a
last_changed 2021-01-07 14:04:38.082529
created 2019-09-24 16:48:27.114309
min_mon_release 14 (nautilus)
0: [v2:10.10.6.2:3300/0,v1:10.10.6.2:6789/0] mon.ceph2
1: [v2:10.10.6.3:3300/0,v1:10.10.6.3:6789/0] mon.ceph3
2: [v2:10.10.6.1:3300/0,v1:10.10.6.1:6789/0] mon.ceph1

Also regarding pve cluster details:

pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 c6220-ch1-node1
         2          1 c6220-ch1-node2
         3          1 c6220-ch1-node3
         4          1 c6220-ch1-node4
         5          1 dell-r630-2 (local)

Node "dell-r630-2" is the only node that still works normally.

In "/etc/pve/storage.cfg" we have:

rbd: external-ceph-rbd
        content images
        krbd 1
        monhost 10.10.6.1 10.10.6.2 10.10.6.3
        pool rbd
        username admin

File /etc/pve/priv/ceph/external-ceph-rbd.keyring is also present.
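For an external RBD storage, Proxmox VE looks for the keyring under the storage ID, so the file name has to match the `rbd:` entry in storage.cfg exactly. A minimal sketch for checking that on each node; `keyring_path` and `check_keyring` are hypothetical helper names, and the optional second argument exists only so the check can be tried against another file:

```shell
# Print the keyring path Proxmox VE expects for an external RBD storage ID.
keyring_path() {
    printf '/etc/pve/priv/ceph/%s.keyring\n' "$1"
}

# Report whether the keyring is present and non-empty.
# $1 = storage ID, $2 = optional path override (hypothetical, for dry runs).
check_keyring() {
    path="$2"
    [ -n "$path" ] || path=$(keyring_path "$1")
    if [ -s "$path" ]; then
        echo "ok: $path"
    else
        echo "missing or empty: $path"
        return 1
    fi
}
```

Running `check_keyring external-ceph-rbd` on every node shows whether the file is visible there. It does not validate the key itself.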

By "display the summary page of the external ceph usage" I mean that on any pve node, when I select the "external-ceph-rbd" storage and go to "Summary", it displays the space used.
But "VM Disks" only lists the disks on "dell-r630-2". When trying to list the disks from any other node, it throws that error. Somehow it's as if the first 4 nodes are not able to list the "rbd" pool content, or just can't find that pool at all.

Now, taking each pve node:

root@dell-r630-2:~# pveversion
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)
root@dell-r630-2:~# ceph -v
ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)

root@c6220-ch1-node1:~#  pveversion
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)
root@c6220-ch1-node1:~#  ceph -v
ceph version 14.2.16 (5d5ae817209e503a412040d46b3374855b7efe04) nautilus (stable)

root@c6220-ch1-node2:~# pveversion
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)
root@c6220-ch1-node2:~# ceph -v
ceph version 14.2.16 (5d5ae817209e503a412040d46b3374855b7efe04) nautilus (stable)

root@c6220-ch1-node3:~# pveversion
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)
root@c6220-ch1-node3:~# ceph -v
ceph version 14.2.16 (5d5ae817209e503a412040d46b3374855b7efe04) nautilus (stable)

root@c6220-ch1-node4:~# pveversion
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)
root@c6220-ch1-node4:~# ceph -v
ceph version 14.2.16 (5d5ae817209e503a412040d46b3374855b7efe04) nautilus (stable)

The external ceph also looks happy:

[root@ceph1 ~]# ceph -s
  cluster:
    id:     053b3af8-f1a6-4ebd-9512-bf5fe1341c6a
    health: HEALTH_OK

  services:
    mon:     3 daemons, quorum ceph2,ceph3,ceph1 (age 30h)
    mgr:     ceph1(active, since 30h), standbys: ceph3, ceph2
    mds:     cephfs:1 {0=ceph3=up:active} 2 up:standby
    osd:     28 osds: 28 up (since 30h), 28 in (since 30h)
    rgw:     2 daemons active (ceph1.rgw0, ceph2.rgw0)
    rgw-nfs: 1 daemon active (ceph2)

  task status:
    scrub status:
        mds.ceph3: idle

  data:
    pools:   17 pools, 1073 pgs
    objects: 3.21M objects, 6.6 TiB
    usage:   20 TiB used, 18 TiB / 38 TiB avail
    pgs:     1072 active+clean
             1    active+clean+scrubbing+deep

  io:
    client:   6.2 KiB/s rd, 1.8 MiB/s wr, 4 op/s rd, 69 op/s wr

Any thoughts ?
Again, thank you so much for help !

Leo

Alwin · Jan 12, 2021

Leo David said:
monhost 10.10.6.1 10.10.6.2 10.10.6.3

Can these MONs be reached by all nodes in the cluster?

Anything in the log files of those nodes?

Leo David · Jan 12, 2021

Alwin said:
Can these MONs be reached by all nodes in the cluster?

Yes, they definitely can be reached. I assume that displaying the external storage capacity usage would not be possible otherwise.

Alwin said:
Anything in the log files of those nodes?

I am not sure which logs to check that would reveal storage accessibility issues...

Leo David · Jan 12, 2021

I have just tried to connect to the same ceph cluster from a different, single-instance pve node that has:

pveversion
pve-manager/6.3-2/22f57405 (running kernel: 5.4.78-1-pve)

ceph -v
ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)

Everything works normally...

Again, the issue seems to be isolated to the nodes with 14.x Nautilus client installed.
I was even thinking of downgrading the ceph client on the affected nodes back to the 12.x client, but I am not sure how I should do it safely without breaking the nodes (this would also remove important pve packages that depend on it).
Of course, the main goal should be not to downgrade, but to upgrade.
My issue is that I am just out of ideas about what could be the root cause.
Any chance it is related to the new v2 monitor protocol introduced in newer ceph versions, and to ceph.conf now looking a bit different to include the newly introduced port 3300/tcp along with the old well-known 6789/tcp?
Thank you,

Leo

Alwin · Jan 12, 2021

Leo David said:
Again, the issue seems to be isolated to the nodes with 14.x Nautilus client installed.

Yes, but your cluster is Nautilus as well. Hence the suggestion to look in the logs (any / all of them), to see whether something is logged that can explain it.

Leo David said:
Any chance it is related to the new v2 monitor protocol introduced in newer ceph versions, and to ceph.conf now looking a bit different to include the newly introduced port 3300/tcp along with the old well-known 6789/tcp?

After a restart of the MONs, they are listening on both ports, assuming the msgrv2 protocol has been activated. Is there a firewall in between?
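A quick way to rule out a port-level problem is to probe both messenger ports on every monitor from each pve node. A rough sketch, assuming IPv4 monitor addresses and that `nc` is installed; `mon_endpoints` and `probe_mons` are hypothetical helper names:

```shell
# List the host:port pairs a Nautilus client may use for each monitor:
# 3300/tcp (msgr v2) and 6789/tcp (msgr v1).
# $1 = space-separated monitor addresses, as in the storage.cfg monhost line.
mon_endpoints() {
    for host in $1; do
        for port in 3300 6789; do
            echo "$host:$port"
        done
    done
}

# Try a plain TCP connect to every endpoint (3 second timeout each).
probe_mons() {
    mon_endpoints "$1" | while IFS=: read -r host port; do
        if nc -z -w 3 "$host" "$port" 2>/dev/null; then
            echo "open   $host:$port"
        else
            echo "closed $host:$port"
        fi
    done
}
```

For the cluster in this thread that would be `probe_mons "10.10.6.1 10.10.6.2 10.10.6.3"`. An open port only proves that something answers on that address, not that it is the expected monitor.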

Leo David · Jan 12, 2021

Alwin said:
Is there a firewall in between?

No firewall in between, and iptables on the ceph nodes is not blocking anything either.
What is a bit puzzling is that ceph can be queried about its usage, but it cannot show information about the pool and the pool content...

Tailing "/var/log/pveproxy/access.log" when selecting "external-ceph-rbd" (which displays the current usage):

10.10.2.14 - root@pam [12/01/2021:11:37:57 +0000] "GET /api2/json/nodes/c6220-ch1-node1/storage/external-ceph-rbd/status HTTP/1.1" 200 141
10.10.2.14 - root@pam [12/01/2021:11:37:57 +0000] "GET /api2/json/cluster/tasks HTTP/1.1" 200 4935
10.10.2.14 - root@pam [12/01/2021:11:37:57 +0000] "GET /api2/json/cluster/resources HTTP/1.1" 200 4459
10.10.2.14 - root@pam [12/01/2021:11:37:59 +0000] "GET /api2/json/nodes/c6220-ch1-node1/storage/external-ceph-rbd/status HTTP/1.1" 200 141

When selecting "VM Disks":

10.10.2.14 - root@pam [12/01/2021:11:38:51 +0000] "GET /api2/json/nodes/c6220-ch1-node1/storage/external-ceph-rbd/content?content=images HTTP/1.1" 500 13
10.10.2.14 - root@pam [12/01/2021:11:38:52 +0000] "GET /api2/json/cluster/tasks HTTP/1.1" 200 4933

Then UI displays : "rbd error: rbd: listing images failed: (2) No such file or directory (500)"
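The pveproxy access log only records the HTTP status of each API call, not the underlying rbd error, but filtering it for 5xx responses shows which storage endpoints fail on a node. A small sketch, assuming the whitespace-separated layout of the lines quoted above (status in field 9, request path in field 7); `failed_requests` is a hypothetical helper name:

```shell
# Print "<status> <request path>" for every 5xx response.
# Reads the files given as arguments, or stdin when there are none.
failed_requests() {
    awk '$9 ~ /^5[0-9][0-9]$/ { print $9, $7 }' "$@"
}
```

For example, `failed_requests /var/log/pveproxy/access.log | sort | uniq -c` groups the failing calls per endpoint.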

Are there any other pve logs to check that would reveal something about storage connectivity?

Thank you so much for help !

Leo

Alwin · Jan 12, 2021

Leo David said:
Then UI displays : "rbd error: rbd: listing images failed: (2) No such file or directory (500)"

Does a manual listing on the cli work? You will need to specify the MONs, user and key by hand.
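A manual listing of this kind could look roughly like the sketch below, which bypasses the pve storage layer and calls `rbd` with the monitors, user and keyring given explicitly. It assumes the keyring sits at /etc/pve/priv/ceph/<storage-id>.keyring; `rbd_ls_external` is a hypothetical helper name, and `RBD_BIN` is a hypothetical variable that only exists so the command line can be inspected without running it:

```shell
# List images in a pool on an external cluster, passing everything by hand.
# $1 = storage ID, $2 = pool, $3 = ceph user (without "client."),
# $4 = space-separated monitor addresses.
rbd_ls_external() {
    storage_id="$1"; pool="$2"; user="$3"; mons="$4"
    # rbd takes the monitor list comma-separated with -m
    mon_list=$(printf '%s' "$mons" | tr -s ' ' ',')
    ${RBD_BIN:-rbd} ls -p "$pool" -m "$mon_list" --id "$user" \
        --keyring "/etc/pve/priv/ceph/${storage_id}.keyring"
}
```

With the values from the storage.cfg quoted earlier: `rbd_ls_external external-ceph-rbd rbd admin "10.10.6.1 10.10.6.2 10.10.6.3"`. Prefixing the call with `RBD_BIN=echo` prints the arguments instead of contacting the cluster.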

Leo David · Jan 12, 2021

Alwin said:
Does a manual listing on the cli work? You will need to specify the MONs, user and key by hand.

I am not sure how to accomplish this; is it related to the pvesm command?

Leo David · Jan 12, 2021

Ok,
So I followed the same path we previously used for the servers with issues.
1. Installed pve 6.0.1 on a single node.
2. Upgraded to the latest 6.3-3 && reboot
- noticed that the v12 ceph client is still present
3. Added the external ceph cluster
- everything is fine, vm disks are listed
4. Created "/etc/apt/sources.list.d/ceph.list" and added the ceph nautilus repo to it:
deb http://download.proxmox.com/debian/ceph-nautilus buster main
5. Issued "Upgrade" from the UI - now it installed the 14.x ceph client
6. Now the external ceph behaves like on the other nodes
- Ceph storage usage is displayed, but trying to list the disks gives "rbd error: rbd: listing images failed: (2) No such file or directory (500)"

Any thoughts ?

Leo David · Jan 13, 2021

Hi,
Desperately looking for a solution, I have reinstalled a fresh single node from the latest iso image.
The result was:

pveversion
pve-manager/6.3-2/22f57405 (running kernel: 5.4.73-1-pve)
ceph -v
ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)

After adding:

deb http://download.proxmox.com/debian/pve buster pve-no-subscription

and updating the system, I have:

pveversion
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)
ceph -v
ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)

So still on Luminous client.

1. Is this expected: to have the v12 ceph client installed even with the latest pve version, with the client only being updated once the ceph repo is added?

Now, coming to my root problem, it seems that the only way to get things working again would be to downgrade the ceph client back to v12, so:

2. Is there a safe way to downgrade ceph back to Luminous without breaking pve components? If so, what would the steps be?

Thank you so much for helping me out !

Cheers,

Leo

Leo David · Jan 14, 2021

Hello everyone,
I have tried to downgrade the ceph client on a pve node that is a cluster member, but it does not seem possible. I ended up uninstalling pve core packages, and when I re-installed them, the v14 ceph client was added back, which put me in the same situation.
Any thoughts about a procedure to downgrade the ceph client to v12? Is it even possible?
I am just trying to avoid reinstalling all 4 nodes from scratch just because of the ceph client version.
Thank you,

Leo

jandoe88 · Mar 20, 2021

Hi. We have the same problem. I had to reboot one of our Proxmox nodes and now it cannot use our external Ceph. There is nothing in the logs. All network connections are okay. It can reach the monitors. I can open the summary for the Ceph storage without any problem. But I cannot list content / VM disks, and I cannot migrate the VMs back from the other cluster nodes. I also cannot create new VMs on the Ceph storage.

Code:

root@pve-03:~# pveversion
pve-manager/6.3-6/2184247e (running kernel: 5.4.103-1-pve)
root@pve-03:~# ceph -v
ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (s

Ceph Version at Ceph-Cluster:
ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)

Problem still persists after upgrade to ceph 14 at the proxmox node:

Code:

root@pve-03:~# ceph -v
ceph version 14.2.16 (5d5ae817209e503a412040d46b3374855b7efe04) nautilus (stable)

Regards,
Jan

jandoe88 · Apr 21, 2021

Small update: in my case it was a simple IP address conflict. One of the Ceph cluster's IPs was in use twice.

Regards, Jan
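A duplicated IP address like this can be spotted by sending ARP requests for each Ceph address and checking whether more than one MAC address answers. A rough sketch that counts the distinct MAC addresses in whatever `arping` prints; it assumes the command runs on a host in the same layer-2 segment as the Ceph nodes, and `count_macs` is a hypothetical helper name:

```shell
# Count the distinct MAC addresses found in the text on stdin.
# Matching is case-insensitive, so AA:BB:... and aa:bb:... count once.
count_macs() {
    grep -oiE '([0-9a-f]{2}:){5}[0-9a-f]{2}' \
        | tr 'a-f' 'A-F' | sort -u | wc -l | tr -d ' '
}
```

A possible use, assuming the iputils flavour of `arping` and a bridge named `vmbr0` (both assumptions): `arping -c 3 -I vmbr0 10.10.6.1 | count_macs`. A result above 1 suggests that two machines claim the address.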
