PVE 3.4 CEPH cluster - failed node recovery

Not to worry - the other three nodes in the cluster do not use a RAID array. Only node 1 was configured this way because it was different hardware (a Dell R620). The other three nodes are Dell R610s with a hard-drive controller that supports pass-through mode. We have ordered another R610 to rebuild the failed node.
Ah, you are fine then. I think I misread your original post.

Although I reported that the removal of the OSDs worked perfectly, there remains one issue. It seems that the Proxmox GUI still shows the MON & OSDs. I suspect that I must manually edit the ceph.conf file to remove them?
You can remove the OSD from crush using this:
# ceph osd crush remove osd.ID

You may also have to remove the OSD authentication key. But since the original node no longer exists, I am not sure if you have to. Here is the command:
# ceph auth del osd.ID

Just remove the MON using the ceph mon remove command?
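
For reference, the usual full sequence for purging a dead node's OSDs and its monitor in this generation of Ceph is roughly the following; osd.ID and MON_ID are placeholders, so double-check against the documentation for your exact release:

# ceph osd out osd.ID          (skip, or a no-op, if the OSD is already marked out)
# ceph osd crush remove osd.ID
# ceph auth del osd.ID
# ceph osd rm osd.ID
# ceph mon remove MON_ID       (e.g. "0" for mon.0)

These commands only clean up the cluster maps; the Proxmox GUI also reads /etc/pve/ceph.conf for the monitor list, so a leftover [mon.X] section there can still show up in the GUI.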
 

Both of those ceph commands worked but the Proxmox GUI still shows -
- in the Config panel "mon.0"
- in the Monitor panel "mon.0"
- in the OSD panel "pmc1" (this is node 1)
- in the Crush panel
- - "device 0"
- - "device 1"
- - "device 2"
- - under buckets - "host pmc1"

See attached screenshots.

Is this going to be a problem when I re-install node 1 (pmc1)?
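
One way to cross-check whether these are only stale GUI entries or objects Ceph itself still knows about is to query the cluster directly from a surviving node; the exact output will vary, but for example:

# ceph osd tree
# ceph mon dump
# ceph -s

If pmc1 still appears there (for example as a host bucket in the osd tree), the leftovers are in Ceph's own maps and not just in the GUI.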

thanks - Ron
 

Attachments

  • PPC_PMceph-Crush2.png
  • PPC_PMceph-Crush1.png
  • PPC_PMceph-OSD.png
  • PPC_PMceph-Monitor.png
  • PPC_PMceph-config.png
I see you still have pmc1 joined in the Proxmox cluster. Remove it with:
# pvecm delnode pmc1

If the host is still in the crushmap after that, then just remove it from ceph.conf.
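
If you do end up editing ceph.conf, the dead node typically appears there as its own monitor section, something like the illustrative block below (the address is made up; use whatever your config actually contains), plus any reference to the old node in a mon_host / mon initial members line under [global], if your config has one:

[mon.0]
        host = pmc1
        mon addr = 10.10.10.1:6789

Deleting that section from /etc/pve/ceph.conf is enough for the change to reach all nodes, since the file lives on the shared pmxcfs filesystem.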
 
I see you still have pmc1 joined in the Proxmox cluster. Remove it with:
# pvecm delnode pmc1
OK, this removed pmc1 from showing up in the GUI under Datacenter.

If the host is still in the crushmap after that, then just remove it from ceph.conf.

And this took care of pmc1 showing up in the GUI under Ceph > Config & Monitor.
But it still exists in the GUI under Ceph > OSD & Crush.

thanks - Ron
 
And this took care of pmc1 showing up in the GUI under Ceph > Config & Monitor.
But it still exists in the GUI under Ceph > OSD & Crush.
Run the following command to remove the bucket/item from the CrushMAP:

# ceph osd crush remove host=pmc1

Since your cluster is perfectly fine at this point, it is safe to remove the dead host from the CrushMAP.
 
Run the following command to remove the bucket/item from the CrushMAP:

# ceph osd crush remove host=pmc1

This command failed with -
Invalid command: invalid chars = in host=pmc1
osd crush remove <name> (<ancestor>) : remove <name> from crush map (everywhere, or just at <ancestor>)

So I tried this command -
ceph osd crush remove pmc1
which seemed to work as it returned -
removed item id -2 name 'pmc1' from crush map

In the Proxmox GUI the only invalid info now shown is in Ceph > Crush which still shows
# devices
device 0 device0
device 1 device1
device 2 device2
device 3 osd.3
thru
device 11 osd.11
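
Those "device N deviceN" lines are placeholders for OSD IDs that no longer exist; they are generally harmless and the IDs get reused when new OSDs are created. If you do want to clean them out, the usual (careful) route is to edit the decompiled CRUSH map and inject it back, roughly:

# ceph osd getcrushmap -o crush.bin
# crushtool -d crush.bin -o crush.txt
  (edit crush.txt and delete the unwanted device lines)
# crushtool -c crush.txt -o crush.new
# ceph osd setcrushmap -i crush.new

The file names are arbitrary, and setcrushmap replaces the live map, so only do this on a healthy cluster and keep crush.bin as a backup.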

thanks - Ron
 
I have attempted to add the node back to the PVE cluster using pvecm add ip-of-node2, but it failed at 'Waiting for quorum...'

And this shows up in the syslog -
Sep 27 18:59:35 pmc1 pmxcfs[3872]: [main] notice: teardown filesystem
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [quorum] crit: quorum_initialize failed: 6
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [quorum] crit: can't initialize service
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [confdb] crit: confdb_initialize failed: 6
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [quorum] crit: can't initialize service
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [dcdb] crit: cpg_initialize failed: 6
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [quorum] crit: can't initialize service
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [dcdb] crit: cpg_initialize failed: 6
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [quorum] crit: can't initialize service
Sep 27 18:59:47 pmc1 kernel: SCTP: Hash tables configured (established 65536 bind 65536)
Sep 27 18:59:47 pmc1 kernel: DLM (built Sep 12 2015 12:55:41) installed
Sep 27 18:59:48 pmc1 corosync[551708]: [MAIN ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
Sep 27 18:59:48 pmc1 corosync[551708]: [MAIN ] Corosync built-in features: nss
Sep 27 18:59:48 pmc1 corosync[551708]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
Sep 27 18:59:48 pmc1 corosync[551708]: [MAIN ] Successfully parsed cman config
Sep 27 18:59:48 pmc1 corosync[551708]: [MAIN ] Successfully configured openais services to load
Sep 27 18:59:48 pmc1 corosync[551708]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Sep 27 18:59:48 pmc1 corosync[551708]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Sep 27 18:59:48 pmc1 corosync[551708]: [TOTEM ] The network interface is down.
Sep 27 18:59:48 pmc1 corosync[551708]: [QUORUM] Using quorum provider quorum_cman
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Sep 27 18:59:48 pmc1 corosync[551708]: [CMAN ] CMAN 1364188437 (built Mar 25 2013 06:14:01) started
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: openais cluster membership service B.01.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: openais event service B.01.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: openais message service B.03.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: openais distributed locking service B.03.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: openais timer service A.01.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync configuration service
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync profile loading service
Sep 27 18:59:48 pmc1 corosync[551708]: [QUORUM] Using quorum provider quorum_cman
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Sep 27 18:59:48 pmc1 corosync[551708]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] CLM CONFIGURATION CHANGE
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] New Configuration:
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] Members Left:
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] Members Joined:
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] CLM CONFIGURATION CHANGE
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] New Configuration:
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] #011r(0) ip(127.0.0.1)
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] Members Left:
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] Members Joined:
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] #011r(0) ip(127.0.0.1)
Sep 27 18:59:48 pmc1 corosync[551708]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 27 18:59:48 pmc1 corosync[551708]: [QUORUM] Members[1]: 1
Sep 27 18:59:48 pmc1 corosync[551708]: [QUORUM] Members[1]: 1
Sep 27 18:59:48 pmc1 corosync[551708]: [CPG ] chosen downlist: sender r(0) ip(127.0.0.1) ; members(old:0 left:0)
Sep 27 18:59:48 pmc1 corosync[551708]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 27 18:59:49 pmc1 pmxcfs[551627]: [status] crit: cpg_send_message failed: 9
Sep 27 18:59:49 pmc1 pmxcfs[551627]: [status] crit: cpg_send_message failed: 9
Sep 27 18:59:49 pmc1 pmxcfs[551627]: [status] crit: cpg_send_message failed: 9
Sep 27 18:59:49 pmc1 pmxcfs[551627]: [status] crit: cpg_send_message failed: 9
Sep 27 18:59:49 pmc1 pmxcfs[551627]: [status] crit: cpg_send_message failed: 9
Sep 27 18:59:49 pmc1 pmxcfs[551627]: [status] crit: cpg_send_message failed: 9
Sep 27 18:59:53 pmc1 pmxcfs[551627]: [status] notice: update cluster info (cluster name PPC-Office, version = 6)
Sep 27 18:59:53 pmc1 pmxcfs[551627]: [dcdb] notice: members: 1/551627
Sep 27 18:59:53 pmc1 pmxcfs[551627]: [dcdb] notice: all data is up to date
Sep 27 18:59:53 pmc1 pmxcfs[551627]: [dcdb] notice: members: 1/551627
Sep 27 18:59:53 pmc1 pmxcfs[551627]: [dcdb] notice: all data is up to date
Sep 27 19:00:01 pmc1 /usr/sbin/cron[4004]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Sep 27 19:01:01 pmc1 /usr/sbin/cron[4004]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Sep 27 19:02:01 pmc1 /usr/sbin/cron[4004]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Sep 27 19:03:01 pmc1 /usr/sbin/cron[4004]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Sep 27 19:04:01 pmc1 /usr/sbin/cron[4004]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Sep 27 19:05:01 pmc1 /usr/sbin/cron[4004]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Sep 27 19:06:01 pmc1 /usr/sbin/cron[4004]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)

I rebooted the node and now get this in the syslog -
Sep 27 19:15:10 pmc1 pveproxy[4786]: worker exit
Sep 27 19:15:10 pmc1 pveproxy[4787]: worker exit
Sep 27 19:15:10 pmc1 pveproxy[4788]: worker exit
Sep 27 19:15:10 pmc1 pveproxy[4315]: worker 4786 finished
Sep 27 19:15:10 pmc1 pveproxy[4315]: starting 1 worker(s)
Sep 27 19:15:10 pmc1 pveproxy[4315]: worker 4789 started
Sep 27 19:15:10 pmc1 pveproxy[4315]: worker 4787 finished
Sep 27 19:15:10 pmc1 pveproxy[4315]: starting 1 worker(s)
Sep 27 19:15:10 pmc1 pveproxy[4315]: worker 4790 started
Sep 27 19:15:10 pmc1 pveproxy[4789]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/HTTPServer.pm line 1634

The re-added node shows up red in the PVE GUI.
 
We need to get this node back online so any assistance would be appreciated.
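
Two hints from the log above, offered as an educated guess rather than a confirmed diagnosis: corosync reports "The network interface is down" and only ever forms a membership with ip(127.0.0.1), which usually means the node's own hostname resolves to a loopback address in /etc/hosts, or the cluster interface was not up when corosync started. The pve-ssl.key error is what you typically see when /etc/pve stays read-only because the node never reached quorum. Things worth checking from pmc1 (node names other than pmc1 below are placeholders for your actual hosts):

# cat /etc/hosts               (pmc1 should map to its real cluster IP, not 127.0.0.1 or 127.0.1.1)
# ip addr show                 (confirm the cluster NIC is up with the expected address)
# pvecm status                 (compare the view from pmc1 with the view from a healthy node)
# omping -c 10 pmc1 pmc2 pmc3 pmc4   (run on all nodes at once; PVE 3.x clustering needs working multicast)

If the node was previously a cluster member under the same name, re-joining may also need the force option of pvecm add; check the pvecm documentation for your version before using it.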
 
