Hello,
Over the weekend, one of our switches had a period of high CPU usage, which caused our cluster to flap. Once this was resolved, we were left with a partially broken cluster:
- All nodes could be reached through the web GUI when selected directly, even though they all showed as red
- VMs could not be started, as /etc/pve appeared to be read-only
- Ceph continued to operate without issue (separate network)
I would like to repair the cluster to the point where I can use it just to manage Ceph. Currently I am unable to do anything Ceph-related in the GUI, as Proxmox thinks only the node itself is online, even though the cluster status shows all the storage nodes online. Ideally, I would remove the 6 offline servers so that only the online Ceph nodes remain, and then try to bring the cluster up.
Output on one storage node:
service pve-cluster status
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: active (running) since Fri 2017-03-24 09:57:20 GMT; 2 days ago
Main PID: 3104 (pmxcfs)
CGroup: /system.slice/pve-cluster.service
└─3104 /usr/bin/pmxcfs
Mar 26 15:08:28 sn7 pmxcfs[3104]: [status] notice: members: 3/17125, 5/2149, 6/2057, 7/2168, 8/2153, 10/3104, 11/2760, 12/2751, 14/2990
Mar 26 15:08:28 sn7 pmxcfs[3104]: [status] notice: queue not emtpy - resening 9846 messages
Mar 26 15:08:28 sn7 pmxcfs[3104]: [dcdb] notice: received sync request (epoch 3/17125/00000001)
Mar 26 15:08:28 sn7 pmxcfs[3104]: [dcdb] notice: cpg_send_message retried 2 times
Mar 26 15:08:28 sn7 pmxcfs[3104]: [status] notice: received sync request (epoch 3/17125/00000001)
Mar 26 15:08:28 sn7 pmxcfs[3104]: [status] notice: cpg_send_message retried 2 times
Mar 26 15:26:23 sn7 pmxcfs[3104]: [dcdb] notice: members: 5/2149, 6/2057, 7/2168, 8/2153, 10/3104, 11/2760, 12/2751, 14/2990
Mar 26 15:26:23 sn7 pmxcfs[3104]: [status] notice: members: 5/2149, 6/2057, 7/2168, 8/2153, 10/3104, 11/2760, 12/2751, 14/2990
Mar 26 15:26:23 sn7 pmxcfs[3104]: [dcdb] notice: received sync request (epoch 5/2149/000000D1)
Mar 26 15:26:23 sn7 pmxcfs[3104]: [status] notice: received sync request (epoch 5/2149/00000142)
To me, it looks like it has simply stopped sending syncs, even though the services are online.
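To make the membership change in the log above easier to spot, here is a minimal Python sketch that pulls the node IDs out of the pmxcfs "members:" lines and reports which IDs appear or disappear per channel. It assumes the log line format shown above; the function names (parse_members, membership_changes) are my own, and it only reads log text, it does not touch the cluster.

```python
import re

# Matches pmxcfs membership lines in the format shown in the output above, e.g.
# "Mar 26 15:26:23 sn7 pmxcfs[3104]: [dcdb] notice: members: 5/2149, 6/2057"
MEMBERS_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ [\d:]{8}) \S+ pmxcfs\[\d+\]: "
    r"\[(?P<channel>\w+)\] notice: members: (?P<members>.+)$"
)


def parse_members(line):
    """Return (timestamp, channel, set of node IDs), or None for other lines."""
    match = MEMBERS_RE.match(line.strip())
    if match is None:
        return None
    # Each entry is "<node id>/<pid>"; only the node ID is kept.
    node_ids = {
        int(pair.strip().split("/")[0])
        for pair in match.group("members").split(",")
    }
    return match.group("ts"), match.group("channel"), node_ids


def membership_changes(lines):
    """List (timestamp, channel, left, joined) whenever a channel's members change."""
    last_seen = {}
    changes = []
    for line in lines:
        parsed = parse_members(line)
        if parsed is None:
            continue
        ts, channel, nodes = parsed
        previous = last_seen.get(channel)
        if previous is not None and previous != nodes:
            changes.append(
                (ts, channel, sorted(previous - nodes), sorted(nodes - previous))
            )
        last_seen[channel] = nodes
    return changes


# Two of the [status] lines from the output above.
SAMPLE = [
    "Mar 26 15:08:28 sn7 pmxcfs[3104]: [status] notice: members: 3/17125, 5/2149, "
    "6/2057, 7/2168, 8/2153, 10/3104, 11/2760, 12/2751, 14/2990",
    "Mar 26 15:26:23 sn7 pmxcfs[3104]: [status] notice: members: 5/2149, 6/2057, "
    "7/2168, 8/2153, 10/3104, 11/2760, 12/2751, 14/2990",
]

if __name__ == "__main__":
    for ts, channel, left, joined in membership_changes(SAMPLE):
        print(f"{ts} [{channel}] left={left} joined={joined}")
```

Run against the two [status] lines above, it reports that node ID 3 left the group at 15:26:23, which matches what I see when comparing the lines by eye. The same function could be fed the full journal output for pve-cluster from each node to compare their views.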