[SOLVED] updated Proxmox on one host, caused havoc in the cluster

kyriazis

Oct 28, 2019
Hi there.

We have 2 Proxmox clusters. I updated packages on one cluster with no problems. Then I started updating packages on the 2nd cluster, and after I upgraded the 1st node, the whole cluster went berserk.

Initially, some of the nodes appeared up, some down, and some with a "?", even though all nodes appear operational. The cluster has quorum. Some pve commands hang on the node that I updated, and a reboot does not help. After a while the web interface of that node stopped accepting my credentials, and I cannot log in. In fact, I cannot log in to any of the nodes' web interfaces. Still, "pvecm status" claims I have quorum.

Containers start, but VMs hang on startup.
The cluster has Ceph installed and running, and Ceph does not seem to be affected.
The cluster has 12 nodes.

Any help is appreciated!

George
 

Attachment: Image.png (781.1 KB)
Hmm - on a hunch - try restarting pve-cluster.service (`systemctl restart pve-cluster`) on one node and see if this improves things.

If not, please post a larger part of the journal as text (instead of a screenshot), e.g. the output of `journalctl --since today`.
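In case it is useful, here is a minimal sketch of those two steps as a script. The names `run`, `restart_and_collect`, `DRY_RUN` and the output file name are my own placeholders, not Proxmox tooling. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch only: run(), restart_and_collect() and DRY_RUN are illustrative names.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = actually run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

restart_and_collect() {
    # Output file is an assumption; pass another path as the first argument.
    outfile="${1:-journal-$(uname -n).txt}"
    # Restart the cluster filesystem service on this node only.
    run systemctl restart pve-cluster
    # Show whether the service came back up.
    run systemctl status pve-cluster --no-pager
    # Save today's journal as plain text so it can be attached to a post.
    if [ "$DRY_RUN" = "1" ]; then
        printf 'journalctl --since today --no-pager > %s\n' "$outfile"
    else
        journalctl --since today --no-pager > "$outfile"
    fi
}
```

Calling `restart_and_collect` with `DRY_RUN=0` as root on the affected node runs the commands for real; the resulting text file can be compressed before attaching it.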

I hope this helps!
 
Restarting pve-cluster on 2 nodes (one with the update and one without) did not help.

Attaching journals from both nodes.

As a note, the screenshot was from the console, not any logs.

Thank you!

George
 

Attachments: journal-nonupdated.gz (537.7 KB), journal-updated.gz (359 KB)
Hm - quite a few messages from corosync indicate that there might be an issue with the cluster network.
Additionally, the issues seem to have started well before 00:00 today (at least I did not see a clear point where they began).

a) make sure you don't have HA enabled (else your nodes might fence themselves)
b) check the journal of the upgraded node from the boot after the upgrade (`journalctl -b`) for any messages relating to problems with the new kernel
c) try restarting corosync on all nodes, and watch for issues in the journal
d) the output of `corosync-cfgtool -s` might be helpful
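
The per-node checks from (a), (b) and (d) could be bundled roughly like this. It is only a sketch: `run`, `check_node` and `DRY_RUN` are names I made up, and the `pvecm status` call at the end is an extra I added to compare the quorum view between nodes. By default it prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: run(), check_node() and DRY_RUN are illustrative names.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = actually run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

check_node() {
    # (a) Show the HA manager state; active HA resources mean nodes may fence.
    run ha-manager status
    # (b) Messages from the current boot, to look for kernel-related problems.
    run journalctl -b --no-pager
    # (d) Link status as corosync sees it on this node.
    run corosync-cfgtool -s
    # Extra (not in the list above): quorum as seen from this node.
    run pvecm status
}
```

Run `check_node` with `DRY_RUN=0` as root on each node and compare the output between an updated and a non-updated node.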

I hope this helps!
 
Thanks Stoiko,

(c) and (d) helped. It turned out that corosync was frozen on some of the other nodes, and restarting corosync on those nodes brought the cluster back to life.
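
For anyone landing here with the same problem, this is roughly what restarting corosync across the cluster looks like as a loop. It is a sketch under assumptions: the hostnames in `NODES` are hypothetical, root SSH between nodes is assumed to work, and `run`, `restart_corosync_all` and `DRY_RUN` are just names I picked. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: NODES, run() and restart_corosync_all() are illustrative names.
NODES="${NODES:-pve1 pve2 pve3}"   # hypothetical hostnames; use your own
DRY_RUN="${DRY_RUN:-1}"            # 1 = print the commands, 0 = run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

restart_corosync_all() {
    for node in $NODES; do
        # Restart corosync on one node at a time.
        run ssh "root@$node" systemctl restart corosync
        # Print the link status that node reports afterwards.
        run ssh "root@$node" corosync-cfgtool -s
    done
    # Finally, check quorum from the node the script runs on.
    run pvecm status
}
```

With `DRY_RUN=0` and `NODES` set to the real node names, `restart_corosync_all` goes through the nodes one by one; watching the journal on each node while it runs shows whether corosync rejoins cleanly.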

I went ahead and did updates on the rest of the nodes without problems, so I am back in business now.

Thank you for the help!

George
 
Glad that worked out. If possible, please mark the thread as 'SOLVED'; this helps others running into similar issues.

Thanks!
 
