Broken Cluster

Ashley

Member
Jun 28, 2016
Hello,

Over the weekend one of our switches had a period of high CPU, which caused our cluster to flap. Once this was resolved we ended up with a form of broken cluster:

  • All nodes could communicate with each other via the WebGUI if selected directly, even though they all showed as red
  • VMs could not be started, as /etc/pve appeared to be read-only
  • Ceph continued to operate without issue (it runs on a separate network)
To get our services back online I rebuilt our compute nodes, attached them to our RBD pool on Ceph, and copied across the VM config files. This brought our VMs back online and working, but it has left us with a broken cluster across Proxmox and Ceph; roughly what I did is sketched below.
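For reference, this is approximately what I ran on each rebuilt node. I'm typing from memory, so the storage ID, pool name, monitor IPs and VMID below are placeholders, not the exact values:

# Re-attach the existing Ceph pool as RBD storage (IDs/IPs are examples)
pvesm add rbd ceph-rbd --pool rbd --monhost "10.10.10.1 10.10.10.2 10.10.10.3" --content images --username admin
# Copy the admin keyring so the node can authenticate against the monitors
mkdir -p /etc/pve/priv/ceph
cp /root/ceph.client.admin.keyring /etc/pve/priv/ceph/ceph-rbd.keyring
# Restore the saved VM config files and start the guests
cp /root/vm-configs/*.conf /etc/pve/qemu-server/
qm start 101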

I would like to repair the cluster to the point where I can use it just to manage the Ceph side. Currently I am unable to do anything Ceph-related in the GUI, as Proxmox thinks only the node itself is online, even though cluster status shows all the storage nodes as online. Ideally I would remove the 6 offline servers so that only the online Ceph nodes are left, then try to bring the cluster back up, along the lines of the sketch below.
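On the surviving storage nodes I was thinking of something like the following (untested on my side; the node names are examples, and the expected count would be the number of nodes actually up):

# Tell corosync to expect only the surviving storage nodes
pvecm expected 3
# Remove each dead compute node from the cluster configuration
pvecm delnode cn1
pvecm delnode cn2
# ...repeat for the remaining offline nodes, then re-check membership
pvecm status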

[Three screenshots attached: cluster and node status in the web GUI]

Output on one storage node:

service pve-cluster status
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: active (running) since Fri 2017-03-24 09:57:20 GMT; 2 days ago
Main PID: 3104 (pmxcfs)
CGroup: /system.slice/pve-cluster.service
└─3104 /usr/bin/pmxcfs

Mar 26 15:08:28 sn7 pmxcfs[3104]: [status] notice: members: 3/17125, 5/2149, 6/2057, 7/2168, 8/2153, 10/3104, 11/2760, 12/2751, 14/2990
Mar 26 15:08:28 sn7 pmxcfs[3104]: [status] notice: queue not emtpy - resening 9846 messages
Mar 26 15:08:28 sn7 pmxcfs[3104]: [dcdb] notice: received sync request (epoch 3/17125/00000001)
Mar 26 15:08:28 sn7 pmxcfs[3104]: [dcdb] notice: cpg_send_message retried 2 times
Mar 26 15:08:28 sn7 pmxcfs[3104]: [status] notice: received sync request (epoch 3/17125/00000001)
Mar 26 15:08:28 sn7 pmxcfs[3104]: [status] notice: cpg_send_message retried 2 times
Mar 26 15:26:23 sn7 pmxcfs[3104]: [dcdb] notice: members: 5/2149, 6/2057, 7/2168, 8/2153, 10/3104, 11/2760, 12/2751, 14/2990
Mar 26 15:26:23 sn7 pmxcfs[3104]: [status] notice: members: 5/2149, 6/2057, 7/2168, 8/2153, 10/3104, 11/2760, 12/2751, 14/2990
Mar 26 15:26:23 sn7 pmxcfs[3104]: [dcdb] notice: received sync request (epoch 5/2149/000000D1)
Mar 26 15:26:23 sn7 pmxcfs[3104]: [status] notice: received sync request (epoch 5/2149/00000142)

To me it looks like it has just stopped sending syncs, even though the services are online.
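To double-check that, I assume the corosync membership can be inspected directly on the node, independent of the GUI:

# Quorum and member count as corosync sees it
corosync-quorumtool -s
# Ring/link status for this node
corosync-cfgtool -s
# Recent corosync and pmxcfs messages around the time it stalled
journalctl -u corosync -u pve-cluster --since "2017-03-26 15:00"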
 
The pveproxy service has just died on the remaining nodes and I am unable to start it; listing /etc/pve or running pvecm status just hangs.
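My understanding (untested here) is that pmxcfs can be forced into local mode to at least get /etc/pve readable again on a hung node:

# Stop the cluster filesystem service and any stuck daemon
systemctl stop pve-cluster
killall pmxcfs
# -l starts pmxcfs in local mode, mounting /etc/pve without the cluster
pmxcfs -l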

If I'm correct, Ceph will not be manageable without the Proxmox wrapper, as Proxmox sets Ceph up differently from a vanilla Ceph install.
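For now I can at least query Ceph from the CLI on a storage node (the pool name below is an example):

# Overall cluster health and OSD status, independent of the Proxmox GUI
ceph -s
ceph osd tree
# List the RBD images in the pool holding the VM disks
rbd -p rbd ls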
 
