PVE 3.4 CEPH cluster - failed node recovery

Not to worry - the other three nodes in the cluster do not use a RAID array. Only node 1 was configured this way because it was different hardware (a Dell R620). The other three nodes are Dell R610s with a hard-drive controller that supports pass-through mode. We have ordered another R610 to rebuild the failed node.
Ah, you are fine then. I think I misread your original post.

Although I reported that the removal of the OSDs worked perfectly, there remains one issue. It seems that the Proxmox GUI still shows the MON & OSDs. I suspect that I must manually edit the ceph.conf file to remove them?
You can remove the OSD from crush using this:
# ceph osd crush remove osd.ID

You may also have to remove the OSD authentication key. But since the original node no longer exists, I am not sure if you have to. Here is the command:
# ceph auth del osd.ID

Just remove the MON using the ceph mon remove command?
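
For reference, the usual full sequence for purging a dead node's OSDs and its monitor in this generation of Ceph is roughly the following; osd.ID and MON_ID are placeholders, so double-check against the documentation for your exact release:

# ceph osd out osd.ID          (skip, or a no-op, if the OSD is already marked out)
# ceph osd crush remove osd.ID
# ceph auth del osd.ID
# ceph osd rm osd.ID
# ceph mon remove MON_ID       (e.g. "0" for mon.0)

These commands only clean up the cluster maps; the Proxmox GUI also reads /etc/pve/ceph.conf for the monitor list, so a leftover [mon.X] section there can still show up in the GUI.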
 

Both of those ceph commands worked but the Proxmox GUI still shows -
- in the Config panel "mon.0"
- in the Monitor panel "mon.0"
- in the OSD panel "pmc1" (this is node 1)
- in the Crush panel
- - "device 0"
- - "device 1"
- - "device 2"
- - under buckets - "host pmc1"

See attached screenshots.

Is this going to be a problem when I re-install node 1 (pmc1)?
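
One way to cross-check whether these are only stale GUI entries or objects Ceph itself still knows about is to query the cluster directly from a surviving node; the exact output will vary, but for example:

# ceph osd tree
# ceph mon dump
# ceph -s

If pmc1 still appears there (for example as a host bucket in the osd tree), the leftovers are in Ceph's own maps and not just in the GUI.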

thanks - Ron
 

Attachments

  • PPC_PMceph-Crush2.png
  • PPC_PMceph-Crush1.png
  • PPC_PMceph-OSD.png
  • PPC_PMceph-Monitor.png
  • PPC_PMceph-config.png
I see you still have pmc1 joined in the Proxmox cluster. Remove it with:
# pvecm delnode pmc1

If the host is still in the crushmap after that, then just remove it from ceph.conf.
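
If you do end up editing ceph.conf, the dead node typically appears there as its own monitor section, something like the illustrative block below (the address is made up; use whatever your config actually contains), plus any reference to the old node in a mon_host / mon initial members line under [global], if your config has one:

[mon.0]
        host = pmc1
        mon addr = 10.10.10.1:6789

Deleting that section from /etc/pve/ceph.conf is enough for the change to reach all nodes, since the file lives on the shared pmxcfs filesystem.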
 
I see you still have pmc1 joined in the Proxmox cluster. Remove it with:
# pvecm delnode pmc1
OK, this removed pmc1 from showing up in the GUI under Datacenter.

If the host is still in the crushmap after that, then just remove it from ceph.conf.

And this took care of pmc1 showing up in the GUI under Ceph > Config & Monitor.
But it still exists in the GUI under Ceph > OSD & Crush.

thanks - Ron
 
And this took care of pmc1 showing up in the GUI under Ceph > Config & Monitor.
But it still exists in the GUI under Ceph > OSD & Crush.
Run the following command to remove the bucket/item from the CrushMAP:

# ceph osd crush remove host=pmc1

Since your cluster is perfectly fine at this point, it is safe to remove the dead host from the CrushMAP.
 
Run the following command to remove the bucket/item from the CrushMAP:

# ceph osd crush remove host=pmc1

This command failed with -
Invalid command: invalid chars = in host=pmc1
osd crush remove <name> (<ancestor>) : remove <name> from crush map (everywhere, or just at <ancestor>)

So I tried this command -
ceph osd crush remove pmc1
which seemed to work as it returned -
removed item id -2 name 'pmc1' from crush map

In the Proxmox GUI the only invalid info now shown is in Ceph > Crush which still shows
# devices
device 0 device0
device 1 device1
device 2 device2
device 3 osd.3
thru
device 11 osd.11
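
Those "device N deviceN" lines are placeholders for OSD IDs that no longer exist; they are generally harmless and the IDs get reused when new OSDs are created. If you do want to clean them out, the usual (careful) route is to edit the decompiled CRUSH map and inject it back, roughly:

# ceph osd getcrushmap -o crush.bin
# crushtool -d crush.bin -o crush.txt
  (edit crush.txt and delete the unwanted device lines)
# crushtool -c crush.txt -o crush.new
# ceph osd setcrushmap -i crush.new

The file names are arbitrary, and setcrushmap replaces the live map, so only do this on a healthy cluster and keep crush.bin as a backup.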

thanks - Ron
 
I have attempted to add the node back to the PVE cluster using pvecm add ip-of-node2, but it failed at 'Waiting for quorum...'

And this shows up in the syslog -
Sep 27 18:59:35 pmc1 pmxcfs[3872]: [main] notice: teardown filesystem
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [quorum] crit: quorum_initialize failed: 6
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [quorum] crit: can't initialize service
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [confdb] crit: confdb_initialize failed: 6
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [quorum] crit: can't initialize service
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [dcdb] crit: cpg_initialize failed: 6
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [quorum] crit: can't initialize service
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [dcdb] crit: cpg_initialize failed: 6
Sep 27 18:59:47 pmc1 pmxcfs[551627]: [quorum] crit: can't initialize service
Sep 27 18:59:47 pmc1 kernel: SCTP: Hash tables configured (established 65536 bind 65536)
Sep 27 18:59:47 pmc1 kernel: DLM (built Sep 12 2015 12:55:41) installed
Sep 27 18:59:48 pmc1 corosync[551708]: [MAIN ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
Sep 27 18:59:48 pmc1 corosync[551708]: [MAIN ] Corosync built-in features: nss
Sep 27 18:59:48 pmc1 corosync[551708]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
Sep 27 18:59:48 pmc1 corosync[551708]: [MAIN ] Successfully parsed cman config
Sep 27 18:59:48 pmc1 corosync[551708]: [MAIN ] Successfully configured openais services to load
Sep 27 18:59:48 pmc1 corosync[551708]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Sep 27 18:59:48 pmc1 corosync[551708]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Sep 27 18:59:48 pmc1 corosync[551708]: [TOTEM ] The network interface is down.
Sep 27 18:59:48 pmc1 corosync[551708]: [QUORUM] Using quorum provider quorum_cman
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Sep 27 18:59:48 pmc1 corosync[551708]: [CMAN ] CMAN 1364188437 (built Mar 25 2013 06:14:01) started
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: openais cluster membership service B.01.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: openais event service B.01.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: openais message service B.03.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: openais distributed locking service B.03.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: openais timer service A.01.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync configuration service
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync profile loading service
Sep 27 18:59:48 pmc1 corosync[551708]: [QUORUM] Using quorum provider quorum_cman
Sep 27 18:59:48 pmc1 corosync[551708]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Sep 27 18:59:48 pmc1 corosync[551708]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] CLM CONFIGURATION CHANGE
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] New Configuration:
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] Members Left:
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] Members Joined:
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] CLM CONFIGURATION CHANGE
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] New Configuration:
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] #011r(0) ip(127.0.0.1)
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] Members Left:
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] Members Joined:
Sep 27 18:59:48 pmc1 corosync[551708]: [CLM ] #011r(0) ip(127.0.0.1)
Sep 27 18:59:48 pmc1 corosync[551708]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 27 18:59:48 pmc1 corosync[551708]: [QUORUM] Members[1]: 1
Sep 27 18:59:48 pmc1 corosync[551708]: [QUORUM] Members[1]: 1
Sep 27 18:59:48 pmc1 corosync[551708]: [CPG ] chosen downlist: sender r(0) ip(127.0.0.1) ; members(old:0 left:0)
Sep 27 18:59:48 pmc1 corosync[551708]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 27 18:59:49 pmc1 pmxcfs[551627]: [status] crit: cpg_send_message failed: 9
Sep 27 18:59:49 pmc1 pmxcfs[551627]: [status] crit: cpg_send_message failed: 9
Sep 27 18:59:49 pmc1 pmxcfs[551627]: [status] crit: cpg_send_message failed: 9
Sep 27 18:59:49 pmc1 pmxcfs[551627]: [status] crit: cpg_send_message failed: 9
Sep 27 18:59:49 pmc1 pmxcfs[551627]: [status] crit: cpg_send_message failed: 9
Sep 27 18:59:49 pmc1 pmxcfs[551627]: [status] crit: cpg_send_message failed: 9
Sep 27 18:59:53 pmc1 pmxcfs[551627]: [status] notice: update cluster info (cluster name PPC-Office, version = 6)
Sep 27 18:59:53 pmc1 pmxcfs[551627]: [dcdb] notice: members: 1/551627
Sep 27 18:59:53 pmc1 pmxcfs[551627]: [dcdb] notice: all data is up to date
Sep 27 18:59:53 pmc1 pmxcfs[551627]: [dcdb] notice: members: 1/551627
Sep 27 18:59:53 pmc1 pmxcfs[551627]: [dcdb] notice: all data is up to date
Sep 27 19:00:01 pmc1 /usr/sbin/cron[4004]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Sep 27 19:01:01 pmc1 /usr/sbin/cron[4004]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Sep 27 19:02:01 pmc1 /usr/sbin/cron[4004]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Sep 27 19:03:01 pmc1 /usr/sbin/cron[4004]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Sep 27 19:04:01 pmc1 /usr/sbin/cron[4004]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Sep 27 19:05:01 pmc1 /usr/sbin/cron[4004]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Sep 27 19:06:01 pmc1 /usr/sbin/cron[4004]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)

I rebooted the node and now get this in the syslog -
Sep 27 19:15:10 pmc1 pveproxy[4786]: worker exit
Sep 27 19:15:10 pmc1 pveproxy[4787]: worker exit
Sep 27 19:15:10 pmc1 pveproxy[4788]: worker exit
Sep 27 19:15:10 pmc1 pveproxy[4315]: worker 4786 finished
Sep 27 19:15:10 pmc1 pveproxy[4315]: starting 1 worker(s)
Sep 27 19:15:10 pmc1 pveproxy[4315]: worker 4789 started
Sep 27 19:15:10 pmc1 pveproxy[4315]: worker 4787 finished
Sep 27 19:15:10 pmc1 pveproxy[4315]: starting 1 worker(s)
Sep 27 19:15:10 pmc1 pveproxy[4315]: worker 4790 started
Sep 27 19:15:10 pmc1 pveproxy[4789]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/HTTPServer.pm line 1634

The re-added node shows up red in the PVE GUI.
 
We need to get this node back online so any assistance would be appreciated.
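
Two hints from the log above, offered as an educated guess rather than a confirmed diagnosis: corosync reports "The network interface is down" and only ever forms a membership with ip(127.0.0.1), which usually means the node's own hostname resolves to a loopback address in /etc/hosts, or the cluster interface was not up when corosync started. The pve-ssl.key error is what you typically see when /etc/pve stays read-only because the node never reached quorum. Things worth checking from pmc1 (node names other than pmc1 below are placeholders for your actual hosts):

# cat /etc/hosts               (pmc1 should map to its real cluster IP, not 127.0.0.1 or 127.0.1.1)
# ip addr show                 (confirm the cluster NIC is up with the expected address)
# pvecm status                 (compare the view from pmc1 with the view from a healthy node)
# omping -c 10 pmc1 pmc2 pmc3 pmc4   (run on all nodes at once; PVE 3.x clustering needs working multicast)

If the node was previously a cluster member under the same name, re-joining may also need the force option of pvecm add; check the pvecm documentation for your version before using it.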
 
