Hi Everyone,
Our cluster (Version: 3.3-5/bfebec03) occasionally has communication problems between nodes. We notice this in the web GUI when browsing a machine on a different node from the one we're logged in to.
It shows up as:
Code:
Communication Failure (0)
The VM Summary page also shows "no name specified" and "status unknown".
We can clear the error by restarting the cluster services (cman, pve-cluster, pvedaemon, pvestatd, pveproxy and pve-manager) on the node we're logged in to, but the fix is only temporary.
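For anyone wanting to reproduce the workaround, here is a rough sketch of it as a script. It assumes the restarts go through the standard `service <name> restart` wrapper and that the services are restarted in the order listed above; both are assumptions for illustration, and the function names are made up. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Service names exactly as listed in the post; the order is simply the listed order.
SERVICES = ["cman", "pve-cluster", "pvedaemon", "pvestatd", "pveproxy", "pve-manager"]


def restart_commands(services=SERVICES):
    """Return one `service <name> restart` argv list per service (hypothetical helper)."""
    return [["service", name, "restart"] for name in services]


def restart_all(services=SERVICES, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for argv in restart_commands(services):
        print(" ".join(argv))
        if not dry_run:
            # check=True raises and stops at the first service that fails to restart
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    restart_all()  # dry run: prints the commands without touching anything
```

Calling `restart_all(dry_run=False)` as root on the affected node would actually run the restarts, so it is worth checking what each init script does on your version before doing that.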
I've seen three other threads that seem relevant:
1) Cluster Connection Errors, "Error Connection error 596" "Communication Failure (0)"
http://forum.proxmox.com/threads/14...-596-quot-quot-Communication-Failure-(0)-quot
2) there are 'communication failure (0)' in proxmox ve's status box
http://forum.proxmox.com/threads/18...cation-failure-(0)-in-proxmox-ve-s-status-box
3) Communication Failure (0) Cluster Node
http://forum.proxmox.com/threads/14361-Communication-Failure-(0)-Cluster-Node
However, these don't help in our particular case:
1) is resolved by using unicast UDP instead of multicast. We have more than four machines in our cluster, so this isn't an option (the Multicast Notes suggest a maximum of 4 nodes for unicast).
2) doesn't seem to have any solution, and all the reports seem to be from small 2 or 3 node clusters.
3) is resolved, but the issue there is a constant one, not intermittent like ours.
One possibility is that it is caused by latency on the Ceph pool: the issue seems to happen more often as we have added more VMs and got close to capacity on Ceph. It also seems to happen when the Proxmox GUI is busy, i.e. when users are doing many machine operations.
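One way we could test the Ceph theory is to note the times the error appears and compare Ceph latency around those times with latency the rest of the time. A minimal sketch, assuming we can collect failure timestamps and (timestamp, latency) samples from somewhere; the function name, the 60 second window and the numbers below are all made up for illustration.

```python
from statistics import mean


def latency_near_failures(failures, samples, window=60.0):
    """Compare latency close to failures with latency elsewhere.

    failures: list of failure timestamps (seconds)
    samples:  list of (timestamp, latency_ms) pairs
    Returns (mean_near, mean_far); a side with no samples is None.
    """
    near, far = [], []
    for ts, latency in samples:
        # A sample counts as "near" if it lies within `window` seconds of any failure.
        if any(abs(ts - f) <= window for f in failures):
            near.append(latency)
        else:
            far.append(latency)
    return (mean(near) if near else None, mean(far) if far else None)


# Made-up numbers, purely to show the shape of the input.
failures = [1000.0, 5000.0]
samples = [(950.0, 80.0), (1020.0, 120.0), (3000.0, 10.0), (4000.0, 20.0), (5030.0, 100.0)]
near, far = latency_near_failures(failures, samples)
print(near, far)
```

If the mean near failures is consistently much higher than the mean elsewhere, that would support the latency explanation; if the two are similar, the cause is probably somewhere else.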
Any thoughts or possible solutions welcomed.
Note: rebooting the whole cluster is not an option because it is in production. Rebooting individual nodes is possible but time-consuming, because a large number of machines would need to be migrated.
Thanks,
Rich