hi,
I have had a problem with one node in my cluster for months now, with no idea how to fix it. No one on this forum or on the Ceph mailing list has ever replied, which makes me think no one has seen this before and no one has any idea how to troubleshoot it. So I am trying to figure out how to wipe a node that is functioning as a Ceph node and reinstall it without losing any Ceph data.
The issue: about six months ago, this node started refusing to respond when I tried to stop or move containers and VMs on it. It will start them after a reboot, but any attempt to do anything else to them results in a long wait and finally "
TASK ERROR: rbd error: 'storage-fastwrx'-locked command timed out - aborting
" when the task eventually fails. ('fastwrx' being the name of the all-SSD pool I use for block devices for rootvols) I get the same error if I try to create any containers or VMs on it. If there is anywhere else that errors related to this are found, I haven't found them yet. There is nothing notable in syslog, dmesg, or /var/log/ceph.log, and all of the OSDs seem to be ticking along without incident. `ceph health` says everything is fine with the exception of there not being enough standby MDSes, which is something that I think changed when I updated to Quincy -- I have two cephfses and I guess it wants a standby MDS for each now. I only have 3 nodes so that's not an option for now.Most importantly, you cannot create any new VMs or containers on the node's Ceph resources, with the same complaint about the fastwrx pool. (It does work using "local" and "local-lvm" storage.) What this means is that all my workloads are slowly shifting to the other two nodes. This is not a sustainable situation and is a waste of resources on the "broken" node. There is nothing running on the "broken" node I need to keep, but I do need the data on the OSDs.
The other day this situation became worse, where even though
pvecm status
showed everything as fine, the node showed all "question marks" for itself and all of its VMs and containers, and it wouldn't serve the web interface. A restart seems to have temporarily fixed this, but I'm afraid the situation is continuing to deteriorate.

I am guessing no one knows how to help me troubleshoot this, but I'm hoping someone has an idea of how I can nuke the node and reinstall it without screwing up the rest of the cluster.
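To be concrete, the rough sequence I imagine (pieced together from the Ceph docs -- please correct me if any of this would endanger the data on the OSDs) is:

```shell
# On a healthy node, before taking the broken node down:
# stop CRUSH from marking its OSDs "out" and rebalancing data away.
ceph osd set noout

# ...reinstall Proxmox on the broken node, rejoin the cluster, and
# reinstall the Ceph packages -- WITHOUT touching the OSD disks...

# Back on the reinstalled node: scan the intact OSD disks and bring
# their daemons back up from the data already on them.
ceph-volume lvm activate --all

# Once all OSDs are back in and the cluster is healthy again:
ceph osd unset noout
```

Is that roughly right, or is there more state (keyrings, the monitor/manager on that node, etc.) that I would have to preserve or recreate by hand?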
TIA.