GeorgeRay

New Member
Dec 4, 2019
Hello everyone. I am a bit new to the Proxmox community so please pardon me should I miss something regarding forum or community etiquette/decorum.

A Proxmox cluster I have been charged with supporting has developed a strange and unexplained issue that I have been unable to find a solution for. Several days ago all of the nodes in the cluster (7 in total) became unresponsive to command input from either the web GUI or the CLI (via both SSH and KVM). The issue primarily manifests with Proxmox-related CLI commands such as 'qm status', or when attempting to delete nodes using the GUI. Attached are screenshots of attempts to destroy a VM via the CLI and perform a backup via the GUI. Both hang indefinitely as depicted, and both are actions that previously took less than five minutes in total on this same cluster.
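
For reference, the kinds of commands that hang look like the following (VMID 104 here is just an illustrative placeholder, not one of our actual guests):

# Query a guest's status - hangs with no output
qm status 104

# Destroy a guest from the CLI - also hangs indefinitely
qm destroy 104

# GUI backups are driven by vzdump under the hood; presumably running it
# directly would stall in the same way
vzdump 104 --mode snapshot --storage local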

Please note that the actual VMs themselves appear to be functioning normally. A reboot was attempted on the least critical node, 0x00000005; however, after the reboot the situation was not resolved and it was not possible to execute commands locally on that server. Additionally, post-reboot the VMs on that server began exhibiting abnormal behavior.

To my knowledge no hardware or software changes have been made to the cluster in a time frame that would explain this recent abnormal behavior. However, given that the entire cluster is affected, it seems it has to be something with the clustering itself. To that end I have looked into the quorum with the 'pvecm status' command and appended the output below. From these results the cluster appears to be quorate, and I am at a loss to explain this behavior.

Any ideas about what could cause this, or where to look, would be much appreciated. Even just getting backups running again would be fantastic, as that would make it possible to reload Proxmox and migrate to fresh installs.


root@pve22:~# pvecm status
Quorum information
------------------
Date: Wed Dec 4 05:39:50 2019
Quorum provider: corosync_votequorum
Nodes: 7
Node ID: 0x00000006
Ring ID: 4/12612
Quorate: Yes

Votequorum information
----------------------
Expected votes: 7
Highest expected: 7
Total votes: 7
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000004 1 192.168.2.15
0x00000008 1 192.168.2.63
0x00000001 1 192.168.2.99
0x00000002 1 192.168.2.104
0x00000003 1 192.168.2.111
0x00000005 1 192.168.2.115
0x00000006 1 192.168.2.126 (local)


Thank you for taking the time to read this and for any insight shared, it is much appreciated.


Addendum: All servers in the cluster are running Proxmox Virtual Environment 5.4-6.
 

Attachments: BackupEx1.PNG, DestroyEx1.PNG, QuorumEx1.PNG
Quorum seems ok.

Check the logs of the servers - `journalctl -r` gives you all logs in reverse order - that should provide a hint as to where the problem is.
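
If the full journal is too noisy, you can also narrow it to the cluster-related units (the unit names below are the standard PVE service names - adjust as needed):

# Newest entries first, limited to the clustering services
journalctl -r -u corosync -u pve-cluster

# Or follow the logs live while reproducing one of the hanging commands
journalctl -f -u corosync -u pve-cluster -u pvedaemon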

I hope this helps!
 
This issue has been resolved. The root cause appears to be something one of our switches did improperly, which brought down another Ethernet interface on one of our servers as well as causing this cluster issue. Corosync was running at full load on one of the nodes; once that process was restarted and behaving normally, the cluster seemed to settle into a more stable state. Restarting a few PVE-related services on a couple of other nodes restored GUI access (sorry, I can't remember/don't know which restarts really fixed this besides corosync) and all appears well. Thank you Stoiko for your help.
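
For anyone who runs into this later, spotting the misbehaving corosync came down to something like the following on the affected node (a rough sketch, not an exact transcript of what we ran):

# Check whether corosync is pegging a CPU
top -b -n 1 | grep corosync

# Look at the daemon's state and recent log lines
systemctl status corosync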
 
Thank you for the recommendation, Stoiko! As TonyBucci mentioned, we have resolved this. The resolution came in the form of restarting the corosync and pvestatd services. It appears that something hit one of the nodes in the cluster and corrupted its corosync process, and the corruption then appears to have propagated to all of the other nodes in the cluster. One thing not mentioned in the original post is that a noticeable symptom was high server load despite there being no actual load (CPUs literally at 100% idle) and no VMs running on a new server. Similar behavior was exhibited across the cluster, but it was most easily noticed on the server with no real load. Once each node's corosync service was restarted, the load on all nodes within the cluster markedly decreased.
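
For completeness, the restarts on each node were roughly along these lines (a sketch from memory rather than an exact transcript; which of the pve* services actually needed restarting varied by node):

# On each affected node, one at a time
systemctl restart corosync
systemctl restart pve-cluster
systemctl restart pvestatd pvedaemon pveproxy

# Confirm the node is quorate again and that the load has dropped
pvecm status
uptime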

While I am not 100% sure, it appears that this started when a minor network blip occurred a little over a week ago. The blip has not been firmly identified and has been chalked up as an anomaly.
 
