Hello,
I had an issue in which an extended period of network problems on the cluster communication network caused the Proxmox cluster to break.
I have brought all VMs online on a new Proxmox cluster; however, the old broken cluster still has the Ceph cluster attached to it. Ceph itself is running fine, but I am no longer able to add, remove, or make any changes to it.
Is it possible to convert a Proxmox Ceph cluster to a standalone Ceph setup? Or is it possible to repair the cluster so I can continue to operate Ceph on it?
Checking "service pve-cluster status" outputs:
service pve-cluster status
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: active (running) since Fri 2017-03-24 09:57:20 GMT; 2 weeks 3 days ago
Main PID: 3104 (pmxcfs)
CGroup: /system.slice/pve-cluster.service
└─3104 /usr/bin/pmxcfs
Mar 28 06:27:53 sn7 pmxcfs[3104]: [status] notice: cpg_send_message retried 2 times
Mar 28 06:27:54 sn7 pmxcfs[3104]: [status] notice: cpg_send_message retried 2 times
Mar 28 06:27:54 sn7 pmxcfs[3104]: [status] notice: cpg_send_message retried 2 times
Mar 28 06:27:54 sn7 pmxcfs[3104]: [status] notice: cpg_send_message retried 2 times
Mar 28 06:27:54 sn7 pmxcfs[3104]: [status] notice: cpg_send_message retried 2 times
Mar 28 06:27:54 sn7 pmxcfs[3104]: [status] notice: cpg_send_message retry 30
Mar 28 06:27:54 sn7 pmxcfs[3104]: [status] notice: cpg_send_message retried 2 times
Mar 28 06:27:55 sn7 pmxcfs[3104]: [status] notice: cpg_send_message retried 2 times
Mar 28 06:27:55 sn7 pmxcfs[3104]: [status] notice: cpg_send_message retried 2 times
Mar 28 06:27:55 sn7 pmxcfs[3104]: [status] notice: cpg_send_message retried 35 times
The corosync service on every node is using one full CPU core, and pmxcfs is using two full CPU cores.
It seems the nodes tried to communicate on the day of the issue and have given up since. I can still ping between all the remaining nodes in the cluster. Each node's cluster filesystem seems to have gone read-only, which is why Ceph is still able to read the ceph.conf file.
Any cluster commands just hang and produce no output. I think I need to somehow take one copy of the cluster filesystem, replicate it to all nodes, and clear any pending messages that are stuck, then remove the four nodes that are now in the new cluster so that the old cluster comes back online.
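For reference, this is roughly the recovery sequence I'm considering on one of the surviving nodes. It's only a sketch, assuming pvecm and pmxcfs's local mode behave as documented; the node name below is a placeholder for one of the four departed nodes.

```shell
# On one surviving node: lower the expected vote count so the
# remaining nodes can regain quorum despite the departed nodes
pvecm expected 1

# Once quorate, remove each node that has moved to the new cluster
# ("oldnode1" is a placeholder; repeat for all four departed nodes)
pvecm delnode oldnode1

# If /etc/pve is still read-only after that, stop the cluster
# services and start pmxcfs in local mode (-l) to get a writable
# local copy of the cluster filesystem on this node
systemctl stop pve-cluster corosync
pmxcfs -l
```

Does that sound like the right order of operations, or is local mode needed first before pvecm will respond at all?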
Any help would be appreciated.