[SOLVED] ceph storage not available to a node

walter.egosson

hi!
We have a 4-node Proxmox 6 cluster: 3 nodes are Proxmox 6 with Ceph Luminous (stable) and 1 additional node runs just Proxmox 6, no Ceph.
The Ceph storage used to be available to that 4th node, but it suddenly became "status unknown" in the GUI while remaining available to the other 3.
Code:
root@srv-X:/etc/pve# pveversion
pve-manager/6.4-5/6c7bf5de (running kernel: 5.4.65-1-pve)
 
PVE 6 and Ceph Luminous?

If you run ceph versions, what output do you get? If it says anything with 12.x.x you really should upgrade your Ceph installation!

https://pve.proxmox.com/wiki/Ceph_Luminous_to_Nautilus
https://pve.proxmox.com/wiki/Ceph_Nautilus_to_Octopus
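As a side note, ceph version (singular) only reports the locally installed binary, while ceph versions (plural) asks the cluster for the versions of every running daemon, so both are worth checking; a quick sketch:
Code:
# version of the locally installed ceph binary / client
ceph version
# versions of all running daemons in the cluster (needs a working connection to the monitors)
ceph versions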
Sorry, I made a typo; the 3 nodes have Ceph Nautilus:

Code:
root@srv-Y:/home/xxx# ceph version
ceph version 14.2.20 (886a8c9442681274213d1c7e897b12624edf6c8a) nautilus (stable)

The 4th node is just an additional node we recently purchased; it has no Ceph installed and only accesses Ceph as a client.
 
Moreover, the Ceph storage (named "storage") is marked as inactive on the 4th node by the pvesm status command:

Code:
root@srv-X:/etc/pve# sudo pvesm status
got timeout
Name               Type     Status           Total            Used       Available        %
backsrv1            dir   disabled               0               0               0      N/A
backsrv2            dir   disabled               0               0               0      N/A
backsrv3            dir   disabled               0               0               0      N/A
local               dir     active       130177108        14775416       108745948   11.35%
nfs-storage         nfs     active      6831481856      5628288000      1203193856   82.39%
storage             rbd   inactive               0               0               0    0.00%
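For context, an RBD storage that an extra node only consumes as a client is defined in /etc/pve/storage.cfg roughly like the sketch below; the pool name, monitor addresses and username are illustrative assumptions, not values from this cluster:
Code:
rbd: storage
        content images,rootdir
        krbd 0
        monhost 10.0.0.1 10.0.0.2 10.0.0.3
        pool storage
        username admin
The matching keyring is expected at /etc/pve/priv/ceph/<storage name>.keyring.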
 
Did you follow the recent issue and disable "auth_allow_insecure_global_id_reclaim"?

Code:
ceph config set mon auth_allow_insecure_global_id_reclaim false
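Whether the flag is already set, and whether any clients are still reclaiming global IDs insecurely, can be checked with something like the following (a generic sketch, not output from this cluster):
Code:
# show the current value of the option
ceph config get mon auth_allow_insecure_global_id_reclaim
# clients that still need updating show up as AUTH_INSECURE_GLOBAL_ID_RECLAIM health warnings
ceph health detail | grep -i global_id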


Then it is likely, that the new node, which does not have Ceph installed yet, is still using the older v12 (Luminous) client that comes with Debian Buster.

As a workaround, configure the ceph repository as on the other nodes (/etc/apt/sources.list.d/ceph.list). Then run
Code:
apt update
apt full-upgrade

The Ceph client should update to a more recent version that can deal with the changed cluster settings.
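For reference, the repository file on the other nodes probably looks something like this, assuming the public (no-subscription) Ceph Nautilus repository for Buster:
Code:
# /etc/apt/sources.list.d/ceph.list
deb http://download.proxmox.com/debian/ceph-nautilus buster main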
 
We did apply the auth_allow_insecure_global_id_reclaim change on the three Proxmox nodes with Ceph enabled, but not on the last one.

We had to add the Ceph repo and run the update + full-upgrade to fix it. Thanks a lot @aaron
 
We are having more or less this same problem. We just updated all nodes to the latest Ceph 14.2.22 with the latest kernel and have already restarted the services on the Ceph nodes (5 nodes run Ceph, 7 nodes are only for running VMs with the Ceph client).

After restarting 2 of the 7 nodes that only run the latest Ceph client software, neither of them can now access the RBD storage. There is only one Ceph pool and it is represented by just one storage entry. 5 nodes are still accessing Ceph fine, but these other 2 are unable to do so after the reboot.

Did something change with Ceph Nautilus 14.2.22 that could cause this? I have also looked at the logs and there is nothing in them about this. On those two nodes the RBD storage is simply "inactive", although of course still enabled.
 
Have you tried my previous answer?
 
Please bear in mind what I mentioned in my first post. All machines were running fine. All machines (12) have been updated with the latest Proxmox software. Only two have been rebooted, and now the two that have been rebooted can no longer talk to Ceph. The only thing that has changed is the software versions.

pve-kernel-5.4.124-1-pve
Nautilus 14.2.22

The nodes were previously running kernel 5.4.119-1-pve

I tried going back to the 119 kernel on one of the nodes and that made no difference.

This led me to think that it is something to do with the new Ceph packages, 14.2.22, versus the original 14.2.20 packages that were previously installed.

This cluster has been in production for many years, successfully serving a number of customers who use our services.

Obviously my dilemma at this point is that if one of the current machines goes down for any reason, how can I be sure it won't come up with the same problem and then lock all of those customers out of their VMs?

We are Proxmox subscribers to make sure we are getting the Enterprise software for the most stable results. These situations are extremely rare, so please understand that in no way are we blaming Proxmox for anything; we just need to get to the bottom of this in a hurry before the problem gets worse.

We are prepping to get Ceph up to Octopus so that we can then schedule the Proxmox 7.0 and Ceph Pacific upgrade, but this is a huge snag right now.
 
Yes, of course.
If you run rbd --version on the VM-only nodes, do you get the same version as on the actual Ceph nodes? Any difference between the 2 freshly rebooted nodes and the other 5?

What if you try to manually connect to the cluster? Any errors that will give us a hint of what is going wrong?
Code:
rbd -p <pool> ls -m <ip of mon1>,<ip of mon2>,... -n client.admin --keyring /etc/pve/priv/ceph/<storage name>.keyring
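Filled in with purely hypothetical values (pool name, monitor addresses and storage name are placeholders, not taken from this cluster), the call would look roughly like:
Code:
rbd -p storage ls -m 10.0.0.1,10.0.0.2,10.0.0.3 -n client.admin --keyring /etc/pve/priv/ceph/storage.keyring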
 
The version is the same:
ceph version 14.2.22 (877fa256043e4743620f4677e72dee5e738d1226) nautilus (stable)

Trying the rbd -p command now ... so far it is just hanging ... the only errors so far are about parsing the config file, which is the same on all client nodes because there is obviously no Ceph config file on those nodes.

I will write back if and when this command times out and let you know if there is an error
 
This is the next error that came up:
2021-07-22 09:24:10.478 7f346cfc00c0 0 monclient(hunting): authenticate timed out after 300

This makes no sense to me though; I provided the same monitor IPs that the other nodes which ARE connecting use, and I also provided the same keyring the other clients use.
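One way to see where the handshake stalls (a generic debugging sketch, not something suggested in the thread) is to re-run the same rbd command with the messenger and mon client debug levels raised, which logs each connection attempt to the monitors on stderr:
Code:
# same command as before, with verbose client-side logging
rbd -p <pool> ls -m <ip of mon1>,<ip of mon2> -n client.admin --keyring /etc/pve/priv/ceph/<storage name>.keyring --debug-ms 1 --debug-monc 20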
 
Does the RBD command work on the other 5 nodes? The warning about the missing ceph.conf is normal in that case.

Use
Code:
nc -zv <mon host ip> 3300
nc -zv <mon host ip> 6789

to check if the node can open a connection to the monitors. If that times out, there might be some network or firewall issue preventing the connection.
 
So, the other nodes do just fine .. again, they mention the ceph.conf parsing error but then list the virtual disks available on that ceph pool

For your example of connecting using netcat, is there something special to do with IPv6 addresses? This ceph cluster is IPv6 only.
 
As a side note, we switched all Ceph networking over from 10Gb copper to 25Gb fiber .. this was done on July 5th, 2021 .. it switched over smoothly ..

Has anyone had issues using fiber? Has anyone had issues with Mellanox fiber NICs and a Dell 48port 25Gb switch?
 
netcat is timing out trying to connect to the monitor node, but pinging that same address works fine ... there are no firewalls between any of the nodes. The Ceph network has its own switch and all nodes talk over that switch for Ceph communications.
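Given that ping works but the TCP connection does not, a reasonable next check (again a generic sketch, not advice from the thread) is to confirm on the monitor host itself that ceph-mon is listening on the messenger ports, and to repeat the nc test with the literal IPv6 address; the address below is a documentation placeholder:
Code:
# on the monitor host: is ceph-mon bound to the v2 (3300) and v1 (6789) ports?
ss -tlnp | grep ceph-mon
# from the client node: OpenBSD netcat accepts a bare IPv6 literal
nc -zv 2001:db8::10 3300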
 
