PVE 3.4 CEPH cluster - failed node recovery

ronsrussell

I have a four-node PVE / Ceph cluster with three OSDs on each node. All nodes are licensed with a PVE Community Subscription.
One node has failed and must have PVE reinstalled.
The cluster and all VMs are working fine on the remaining three nodes.
Please describe the best method for replacing the failed node along with its three OSDs.
 
Wolfgang -
Thanks for your reference to that link.
There are two sections on that page that may apply here.
Remove a cluster node -
- This looks like it will work for our case.
Re-installing a cluster node -
- In this case the node has failed and is not accessible for copying the specified files. So how do we recreate these files?
- /var/lib/pve-cluster/
- /root/.ssh/
thanks - Ron
 
The best solution that I know of is this:
Assuming you still have those 3 OSDs from the damaged node, simply put one OSD in each of the remaining nodes and let Ceph rebalance itself. While it is rebalancing, reinstall Proxmox on the damaged node and rejoin it to the cluster. Then add the node as a Ceph MON, and finally move the OSDs back to the node one by one and let it rebalance.

I know it is a somewhat lengthy process, but that's how I do it in these cases. So far I haven't lost anything. The main reason I do it this way is to make Ceph happy as soon as possible so the rest of the cluster can continue and other users can go on with life. Then fix the node without pressure.
Hope it makes sense.
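Roughly, that sequence looks like this on the CLI (a sketch to adapt rather than an exact recipe; the node name and IP below are placeholders, and pvecm/pveceph are the standard PVE tools):

# ceph -w                            (watch recovery while the relocated OSDs rebalance)
# pvecm delnode pmc1                 (on a healthy node: drop the dead node from the Proxmox cluster)
# pvecm add <ip-of-a-healthy-node>   (on the freshly reinstalled node: rejoin the Proxmox cluster)
# pveceph createmon                  (on the rejoined node: recreate the Ceph monitor)
# ceph osd tree                      (after moving each OSD disk back, confirm it shows up under the right host)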
 
Thanks much for your quick response.
At this point the server and its associated three OSDs have been offline for three weeks.
Everything seems healthy other than the three missing OSDs.
If I put the three OSDs into the other nodes, do I have to do anything to move them, or do they get recognized somehow?
And in regards to rejoining the failed node - when I reinstall Proxmox, do I name it the same as the failed node (PMC1) or give it a new name?
thanks again - Ron

BTW - nice book but I did not see the answer there
 
BTW - nice book but I did not see the answer there
Are you referring to the book Mastering Proxmox? Back when the book was written, Ceph was not part of Proxmox yet. But I added an entire chapter on Ceph installation in that book because I believe Ceph is the greatest thing in the universe, after Proxmox :) Also, the book is Proxmox focused, so a lot of Ceph-related material doesn't fit well.

Does your current Ceph cluster say 3 OSDs missing, or did you already remove them from the Ceph cluster? If it still says 3 OSDs down, then just add them to the remaining nodes and they will be automatically recognized. If you already took them out of the cluster using the GUI or the CLI, then you don't need to worry about anything.
Just rejoin the Proxmox node and add the OSDs. You can keep the same node name or change it to something else; if you are reinstalling Proxmox on the node from scratch, it really does not matter.
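If it helps, a quick way to tell which of those two situations you are in (just the generic Ceph status commands, nothing specific to your cluster):

# ceph osd stat    (summary of how many OSDs exist and how many are up/in)
# ceph osd tree    (OSDs that are still registered but dead will show as down under the failed host)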
 
It says 3 OSDs down/out.
I'm attaching a screenshot.
 

Attachments

  • PPC_PMcephOSD.png
Are you referring to the book Mastering Proxmox? Back when the book was written, Ceph was not part of Proxmox yet. But I added an entire chapter on Ceph installation in that book because I believe Ceph is the greatest thing in the universe, after Proxmox :) Also, the book is Proxmox focused, so a lot of Ceph-related material doesn't fit well.

Yes, that's the book. It's probably difficult to write a book on a technology that is changing so rapidly.
 
It says 3 OSDs down/out.
I'm attaching a screenshot.
Put one of the OSDs in one of the Proxmox nodes. I believe the Ceph cluster will automatically recognize the OSD and move it to the proper node in the CRUSH map. If it doesn't, there is a command to manually move it, but I doubt you will have to.
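If I remember correctly, the manual command is along these lines - treat the OSD ID, weight, and host below as placeholders, not something to paste as-is:

# ceph osd crush create-or-move osd.0 0.87 root=default host=pmc2    (place or relocate osd.0 under host pmc2 in the CRUSH map)
# ceph osd tree                                                      (confirm the OSD ended up under the expected host)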
 
I will try moving one OSD either tonight or tomorrow night. But it may not work for another reason - the failed node was running on a Dell R620 with an H710 controller, which did not allow pass-through control of the HDs. We had to create single-drive RAID0 arrays. The other three nodes are Dell R610s with SAS 6iR controllers, which allow pass-through mode. We are replacing the failed node with an R610.
 
Put one of the OSDs in one of the Proxmox nodes. I believe the Ceph cluster will automatically recognize the OSD and move it to the proper node in the CRUSH map. If it doesn't, there is a command to manually move it, but I doubt you will have to.

Before I go to the trouble of travelling to the colo at night for three days in a row to move OSDs, perhaps someone can provide some clarification - with the cluster in this state (12 OSDs: 9 up, 9 in, and the GUI showing 3 OSDs down/out), is the data contained on the Ceph SAN in danger of being lost? If it is, I will make haste to move the OSDs. If not, I will wait until I get the new server hardware.
thanks much - Ron
 
I will try moving one OSD either tonight or tomorrow night. But it may not work for another reason - the failed node was running on a Dell R620 with an H710 controller, which did not allow pass-through control of the HDs. We had to create single-drive RAID0 arrays. The other three nodes are Dell R610s with SAS 6iR controllers, which allow pass-through mode. We are replacing the failed node with an R610.

Do you mean you have a RAID array in between Ceph and the drives? Nothing can be as risky as putting a Ceph cluster on top of a RAID array. RAID controllers can be used to add performance, but they should be in JBOD mode for Ceph OSDs.
If the OSDs in your dead node were part of an array, then the best thing at this point would be to remove those OSDs from the cluster, then add them again later when your node is up and running. You can remove them from the Proxmox GUI one at a time. Each time you remove one, Ceph will rebalance, and after it is all done your Ceph health should be "OK" instead of the current warning. Especially since you are replacing the RAID card in the dead node, there is no guarantee that Ceph will accept those drives seamlessly, because the RAID array first needs to rebuild itself, if the new RAID card can even recognize the array.

I personally do not use any RAID except for OS drives, and we manage a large Ceph environment, so somebody can correct me on this RAID behavior.
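On the data-safety question raised above: before and after each removal you can sanity-check that every placement group still has enough copies (generic commands; 'rbd' is only an example pool name):

# ceph -s                       (everything should report active+clean before you remove the next OSD)
# ceph osd pool get rbd size    (shows the pool's replication count; as long as ceph -s reports all PGs active+clean, all replicas live on the surviving OSDs)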
 
Do you mean you have a RAID array in between Ceph and the drives? Nothing can be as risky as putting a Ceph cluster on top of a RAID array. RAID controllers can be used to add performance, but they should be in JBOD mode for Ceph OSDs.
Yes, that is what I meant. There were three single-drive arrays in node 1.

If the OSDs in your dead node were part of an array, then the best thing at this point would be to remove those OSDs from the cluster, then add them again later when your node is up and running. You can remove them from the Proxmox GUI one at a time. Each time you remove one, Ceph will rebalance, and after it is all done your Ceph health should be "OK" instead of the current warning.

My attempt to remove an OSD resulted in an error pop-up saying "Connection error 595: No route to host".
Just to reiterate - node 1, with three OSDs, has failed and is offline.
The only Ceph warning is that one monitor is down.
 
Could you please post the output for:
# ceph -s
# ceph osd tree

From node 2

root@pmc2:~# ceph -s
cluster 773e19fe-60e7-427d-bdc9-6cbcc1301f6e
health HEALTH_WARN
1 mons down, quorum 1,2,3 1,2,3
monmap e4: 4 mons at {0=10.10.10.1:6789/0,1=10.10.10.2:6789/0,2=10.10.10.3:6789/0,3=10.10.10.4:6789/0}
election epoch 120, quorum 1,2,3 1,2,3
osdmap e6726: 12 osds: 9 up, 9 in
pgmap v6419672: 1088 pgs, 2 pools, 2537 GB data, 640 kobjects
5061 GB used, 2937 GB / 7999 GB avail
1088 active+clean
client io 2994 kB/s rd, 1206 kB/s wr, 254 op/s
root@pmc2:~#

root@pmc2:~# ceph osd tree
2015-09-04 05:57:22.736976 7fad6c616700 0 -- :/1667490 >> 10.10.10.1:6789/0 pipe(0x15ea050 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x15e70d0).fault
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 10.43994 root default
-2 2.60999 host pmc1
0 0.87000 osd.0 down 0 1.00000
1 0.87000 osd.1 down 0 1.00000
2 0.87000 osd.2 down 0 1.00000
-3 2.60999 host pmc2
3 0.87000 osd.3 up 1.00000 1.00000
4 0.87000 osd.4 up 1.00000 1.00000
5 0.87000 osd.5 up 1.00000 1.00000
-4 2.60999 host pmc3
6 0.87000 osd.6 up 1.00000 1.00000
7 0.87000 osd.7 up 1.00000 1.00000
8 0.87000 osd.8 up 1.00000 1.00000
-5 2.60999 host pmc4
9 0.87000 osd.9 up 1.00000 1.00000
10 0.87000 osd.10 up 1.00000 1.00000
11 0.87000 osd.11 up 1.00000 1.00000
root@pmc2:~#
 
Looks like your cluster is fine. Just remove the OSDs with the following CLI command:

# ceph osd rm 0

Then remove the MON:
# ceph mon remove <host>

You should then get an OK health status.
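One related note, for completeness: ceph osd rm only deletes the OSD from the OSD map; the CRUSH map and the cephx keys keep their own entries until removed separately. Roughly, using osd.0 as the example (repeat for osd.1 and osd.2):

# ceph osd crush remove osd.0    (drop the OSD's entry from the CRUSH map)
# ceph auth del osd.0            (drop the OSD's authentication key)
# ceph osd rm 0                  (drop the OSD from the OSD map, as above)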
 
This worked perfectly !
Thanks so much - Ron
Although you got Ceph back to a healthy state, I would like to point out that due to the RAID array in the other 3 nodes, your data is not 100% safe. The whole point of using Ceph is the amazing drive-level and even node-level redundancy it provides. When a RAID card fails, you are literally going to lose all OSDs attached to it. Even in day-to-day operation, Ceph cannot quite work the way it is intended to when a RAID array is in the mix. Something to think about for your future setups.
 
Although you got Ceph back to a healthy state, I would like to point out that due to the RAID array in the other 3 nodes, your data is not 100% safe. The whole point of using Ceph is the amazing drive-level and even node-level redundancy it provides. When a RAID card fails, you are literally going to lose all OSDs attached to it. Even in day-to-day operation, Ceph cannot quite work the way it is intended to when a RAID array is in the mix. Something to think about for your future setups.

Not to worry - the other three nodes in the cluster do not use a RAID array. Only node 1 was configured this way because it was different hardware (a Dell R620). The other three nodes are Dell R610s with an HD controller that supports pass-through mode. We have ordered another R610 to rebuild the failed node.

Although I reported that the removal of the OSDs worked perfectly, there remains one issue. It seems that the Proxmox GUI still shows the mon & OSDs. I suspect that I must manually edit the ceph.conf file to remove them? Here is the ceph.conf file -

[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
auth supported = cephx
cluster network = 10.10.10.0/24
filestore xattr use omap = true
fsid = 773e19fe-60e7-427d-bdc9-6cbcc1301f6e
keyring = /etc/pve/priv/$cluster.$name.keyring
osd journal size = 5120
osd pool default min size = 1
public network = 10.10.10.0/24

[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.0]
host = pmc1
mon addr = 10.10.10.1:6789

[mon.1]
host = pmc2
mon addr = 10.10.10.2:6789

[mon.3]
host = pmc4
mon addr = 10.10.10.4:6789

[mon.2]
host = pmc3
mon addr = 10.10.10.3:6789


And the CRUSH map still lists the OSDs that I removed, along with the failed host - but I do not know where to edit this?

thanks - Ron
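For reference, a likely cleanup for those leftovers, based on the IDs shown above (a sketch, not a verified procedure): delete the [mon.0] block for pmc1 from /etc/pve/ceph.conf (the file is shared cluster-wide, so editing it on one node is enough), then clear the stale CRUSH entries:

# ceph osd crush remove osd.0    (repeat for osd.1 and osd.2 if they are still listed)
# ceph osd crush remove pmc1     (removes the now-empty host bucket from the CRUSH map)
# ceph mon stat                  (if mon.0 still appears in the monmap, remove it with 'ceph mon remove 0')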
 
