[SOLVED] ceph storage not available to a node

hi!
We have a 4-node Proxmox 6 cluster: 3 nodes run Proxmox 6 with Ceph Luminous (stable), and 1 additional node runs just Proxmox 6 with no Ceph.
The thing is, the Ceph storage used to be available to that 4th node, but it suddenly became "status unknown" in the GUI while remaining "available" to the other 3.
Code:
root@srv-X:/etc/pve# pveversion
pve-manager/6.4-5/6c7bf5de (running kernel: 5.4.65-1-pve)
 
PVE 6 and Ceph Luminous?

If you run ceph versions, what output do you get? If it says anything with 12.x.x you really should upgrade your Ceph installation!

https://pve.proxmox.com/wiki/Ceph_Luminous_to_Nautilus
https://pve.proxmox.com/wiki/Ceph_Nautilus_to_Octopus
Sorry, I made a typo; the 3 nodes have Ceph Nautilus:

Code:
root@srv-Y:/home/xxx# ceph version
ceph version 14.2.20 (886a8c9442681274213d1c7e897b12624edf6c8a) nautilus (stable)

The 4th node is an additional node we recently purchased; it has no Ceph installed, it only accesses Ceph.
 
Moreover, the Ceph storage (named "storage") is marked as inactive on the 4th node by the pvesm status command:

Code:
root@srv-X:/etc/pve# sudo pvesm status
got timeout
Name               Type     Status           Total            Used       Available        %
backsrv1            dir   disabled               0               0               0      N/A
backsrv2            dir   disabled               0               0               0      N/A
backsrv3            dir   disabled               0               0               0      N/A
local               dir     active       130177108        14775416       108745948   11.35%
nfs-storage         nfs     active      6831481856      5628288000      1203193856   82.39%
storage             rbd   inactive               0               0               0    0.00%
 
Did you follow the recent issue and disable "auth_allow_insecure_global_id_reclaim"?

Code:
ceph config set mon auth_allow_insecure_global_id_reclaim false
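
To check the current value before changing it, something like this should work on a Nautilus cluster (a quick sanity check, run on one of the Ceph nodes):
Code:
ceph config get mon auth_allow_insecure_global_id_reclaim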


Then it is likely that the new node, which does not have Ceph installed yet, is still using the older v12 (Luminous) client that comes with Debian Buster.

As a workaround, configure the ceph repository as on the other nodes (/etc/apt/sources.list.d/ceph.list). Then run
apt update && apt full-upgrade

The Ceph client should update to a more recent version that can deal with the changed cluster settings.
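
For reference, the repository file on the Ceph nodes should contain something along these lines for Nautilus on Buster (adjust it to match what your Ceph nodes actually use):
Code:
# /etc/apt/sources.list.d/ceph.list
deb http://download.proxmox.com/debian/ceph-nautilus buster main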
 
Did you follow the recent issue and disable "auth_allow_insecure_global_id_reclaim"?

Code:
ceph config set mon auth_allow_insecure_global_id_reclaim false
We did apply it on the three Proxmox nodes with Ceph enabled, but not on the last one.

Then it is likely that the new node, which does not have Ceph installed yet, is still using the older v12 (Luminous) client that comes with Debian Buster.

As a workaround, configure the ceph repository as on the other nodes (/etc/apt/sources.list.d/ceph.list). Then run
apt update && apt full-upgrade

The Ceph client should update to a more recent version that can deal with the changed cluster settings.
We had to add the Ceph repo and update+upgrade to fix it. Thanks a lot @aaron!
 
We are having more or less this same problem. We just updated all nodes to the latest Ceph 14.2.22 with the latest kernel, and already restarted the services on the Ceph nodes (5 nodes are Ceph, 7 nodes are only for running VMs with the Ceph client). After restarting 2 of the 7 nodes that only run the latest Ceph client software, neither of them can now access the RBD storage. There is only one Ceph pool and it is represented by just one storage entry. 5 nodes are still accessing Ceph fine, but these other 2 are unable to do so after the reboot.

Did something change with Ceph Nautilus 14.2.22 that could cause this? I have also looked at the logs and there is nothing in them about this. On those two nodes the RBD storage is simply "inactive", although of course still enabled.
 
Have you tried my previous answer?
 
Please bear in mind what I mentioned in my first post. All machines were running fine. All machines (12) have been updated with the latest Proxmox software. Only two have been rebooted, and now the two that have been rebooted can no longer talk to Ceph. The only thing that has changed is the software versions.

pve-kernel-5.4.124-1-pve
Nautilus 14.2.22

The nodes were previously running kernel 5.4.119-1-pve

I tried going back to the 119 kernel on one of the nodes and that made no difference.

This led me to think that it has something to do with the new Ceph 14.2.22 packages vs the 14.2.20 packages that were previously installed.

This is a production cluster that has been running successfully for many years, with a number of customers on it using our services.

Obviously my dilemma at this point is that if one of the current machines goes down for any reason, how can I be sure it won't come up with the same problem and then lock all of those customers out of their VMs?

We are Proxmox subscribers to make sure we are getting the Enterprise software for the most stable results. These situations are extremely rare, so please understand that we are in no way blaming Proxmox for anything; we just need to get to the bottom of this in a hurry before the problem gets worse.

We are prepping to get Ceph up to Octopus so that we can then schedule the Proxmox 7.0 and Ceph Pacific upgrade, but this is a huge snag right now.
 
Yes, of course.
If you run rbd --version on the VM-only nodes, do you get the same version as on the actual Ceph nodes? Any difference between the 2 freshly booted nodes and the other 5?

What if you try to manually connect to the cluster? Any errors that will give us a hint of what is going wrong?
Code:
rbd -p <pool> ls -m <ip of mon1>,<ip of mon2>,... -n client.admin --keyring /etc/pve/priv/ceph/<storage name>.keyring
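
Filled in with the storage name from your pvesm output and placeholder monitor addresses (replace them with your real monitor IPs, and use your actual pool name if it differs from the storage name), that would look something like:
Code:
rbd -p storage ls -m 192.0.2.1,192.0.2.2,192.0.2.3 -n client.admin --keyring /etc/pve/priv/ceph/storage.keyring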
 
The version is the same:
ceph version 14.2.22 (877fa256043e4743620f4677e72dee5e738d1226) nautilus (stable)

Trying the rbd -p command now... so far it's just hanging. The only errors so far are about parsing the config file, which is the same on all client nodes because there is obviously no Ceph config file on those nodes.

I will write back if and when this command times out and let you know if there is an error
 
This is the next error that came up:
2021-07-22 09:24:10.478 7f346cfc00c0 0 monclient(hunting): authenticate timed out after 300

This makes no sense to me though; I provided the same IPs that the other nodes that ARE connecting use, and I also provided the same keyring the other clients use.
 
Does the RBD command work on the other 5 nodes? The warning about the missing ceph.conf is normal in that case.

Use
Code:
nc -zv <mon host ip> 3300
nc -zv <mon host ip> 6789

to check if the node can open a connection to the monitors. If that times out, there might be some network or firewall issue preventing the connection.
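
On a node that can reach the monitor, you should see something like this (placeholder address; the exact wording depends on the netcat variant installed):
Code:
# nc -zv 192.0.2.10 3300
Connection to 192.0.2.10 3300 port [tcp/*] succeeded!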
 
So, the other nodes do just fine; again, they mention the ceph.conf parsing error but then list the virtual disks available on that Ceph pool.

For your example of connecting using netcat, is there something special to do with IPv6 addresses? This ceph cluster is IPv6 only.
 
As a side note, we switched all Ceph networking over from 10Gb copper to 25Gb fiber. This was done on July 5th, 2021, and it switched over smoothly.

Has anyone had issues using fiber? Has anyone had issues with Mellanox fiber NICs and a Dell 48-port 25Gb switch?
 
netcat is timing out trying to connect to the monitor node, but pinging that same address works fine... there are no firewalls in between any nodes. The Ceph network has its own switch and all nodes talk over that switch for Ceph communications.
 
