Broken Cluster

Ashley

Member
Jun 28, 2016
Hello,

Over the weekend one of our switches had a period of high CPU, which caused our cluster to flap. Once this was resolved we ended up with a form of broken cluster:

  • All nodes could communicate with each other via the WebGUI if selected directly, even though they all showed as red
  • VMs could not be started, as /etc/pve appeared to be read-only
  • Ceph continued to operate without issue (it runs on a separate network)
To get our services back online I rebuilt our compute nodes, attached them to our RBD pool on Ceph, and copied across the VM config files. This brought our VMs back online and working, but it has left us with a broken cluster across Proxmox and Ceph; roughly what I did is sketched below.
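For reference, this is approximately what I ran on each rebuilt node. I'm typing from memory, so the storage ID, pool name, monitor IPs and VMID below are placeholders, not the exact values:

# Re-attach the existing Ceph pool as RBD storage (IDs/IPs are examples)
pvesm add rbd ceph-rbd --pool rbd --monhost "10.10.10.1 10.10.10.2 10.10.10.3" --content images --username admin
# Copy the admin keyring so the node can authenticate against the monitors
mkdir -p /etc/pve/priv/ceph
cp /root/ceph.client.admin.keyring /etc/pve/priv/ceph/ceph-rbd.keyring
# Restore the saved VM config files and start the guests
cp /root/vm-configs/*.conf /etc/pve/qemu-server/
qm start 101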

I would like to repair the cluster to the point where I can use it just to manage the Ceph side. Currently I am unable to do anything Ceph-related in the GUI, as Proxmox thinks only the node itself is online, even though cluster status shows all the storage nodes as online. Ideally I would remove the 6 offline servers so that only the online Ceph nodes are left, then try to bring the cluster back up, along the lines of the sketch below.
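On the surviving storage nodes I was thinking of something like the following (untested on my side; the node names are examples, and the expected count would be the number of nodes actually up):

# Tell corosync to expect only the surviving storage nodes
pvecm expected 3
# Remove each dead compute node from the cluster configuration
pvecm delnode cn1
pvecm delnode cn2
# ...repeat for the remaining offline nodes, then re-check membership
pvecm status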

[Three screenshots attached: cluster and node status in the web GUI]

Output on one storage node:

service pve-cluster status
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: active (running) since Fri 2017-03-24 09:57:20 GMT; 2 days ago
Main PID: 3104 (pmxcfs)
CGroup: /system.slice/pve-cluster.service
└─3104 /usr/bin/pmxcfs

Mar 26 15:08:28 sn7 pmxcfs[3104]: [status] notice: members: 3/17125, 5/2149, 6/2057, 7/2168, 8/2153, 10/3104, 11/2760, 12/2751, 14/2990
Mar 26 15:08:28 sn7 pmxcfs[3104]: [status] notice: queue not emtpy - resening 9846 messages
Mar 26 15:08:28 sn7 pmxcfs[3104]: [dcdb] notice: received sync request (epoch 3/17125/00000001)
Mar 26 15:08:28 sn7 pmxcfs[3104]: [dcdb] notice: cpg_send_message retried 2 times
Mar 26 15:08:28 sn7 pmxcfs[3104]: [status] notice: received sync request (epoch 3/17125/00000001)
Mar 26 15:08:28 sn7 pmxcfs[3104]: [status] notice: cpg_send_message retried 2 times
Mar 26 15:26:23 sn7 pmxcfs[3104]: [dcdb] notice: members: 5/2149, 6/2057, 7/2168, 8/2153, 10/3104, 11/2760, 12/2751, 14/2990
Mar 26 15:26:23 sn7 pmxcfs[3104]: [status] notice: members: 5/2149, 6/2057, 7/2168, 8/2153, 10/3104, 11/2760, 12/2751, 14/2990
Mar 26 15:26:23 sn7 pmxcfs[3104]: [dcdb] notice: received sync request (epoch 5/2149/000000D1)
Mar 26 15:26:23 sn7 pmxcfs[3104]: [status] notice: received sync request (epoch 5/2149/00000142)

To me it looks like it has just stopped sending syncs, even though the services are online.
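To double-check that, I assume the corosync membership can be inspected directly on the node, independent of the GUI:

# Quorum and member count as corosync sees it
corosync-quorumtool -s
# Ring/link status for this node
corosync-cfgtool -s
# Recent corosync and pmxcfs messages around the time it stalled
journalctl -u corosync -u pve-cluster --since "2017-03-26 15:00"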
 
The pveproxy service has just died on the remaining nodes and I am unable to start it; listing /etc/pve or running pvecm status just hangs.
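My understanding (untested here) is that pmxcfs can be forced into local mode to at least get /etc/pve readable again on a hung node:

# Stop the cluster filesystem service and any stuck daemon
systemctl stop pve-cluster
killall pmxcfs
# -l starts pmxcfs in local mode, mounting /etc/pve without the cluster
pmxcfs -l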

If I'm correct, Ceph will not be manageable without the Proxmox wrapper, as Proxmox sets Ceph up differently from a vanilla Ceph install.
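For now I can at least query Ceph from the CLI on a storage node (the pool name below is an example):

# Overall cluster health and OSD status, independent of the Proxmox GUI
ceph -s
ceph osd tree
# List the RBD images in the pool holding the VM disks
rbd -p rbd ls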
 
