[SOLVED] updated Proxmox on one host, caused havoc in the cluster

kyriazis · Mar 1, 2022

Hi there.

We have 2 Proxmox clusters. I updated packages on one cluster with no problems. Then I started updating packages on the 2nd cluster, and after I upgraded the 1st node, the whole cluster went berserk.

Initially, some of the nodes appeared up, some down, and some with a "?", even though all nodes appear operational and the cluster has quorum. Some pve commands hang on the node that I updated, and a reboot does not help. After a while the web interface of that node stopped accepting my credentials, and I cannot log in. In fact, I cannot log in to any of the nodes' web interfaces. Still, "pvecm status" claims I have quorum.

Containers start, but VMs hang on startup.
The cluster has Ceph installed and running. Ceph does not seem to be affected.
The cluster has 12 nodes.

Any help is appreciated!

George

Stoiko Ivanov · Mar 1, 2022

Hmm - on a hunch - try restarting pve-cluster.service (`systemctl restart pve-cluster`) on one node and see if this improves things.

If not, please post a larger part of the journal as text (instead of a screenshot), e.g. the output of `journalctl --since today`.

I hope this helps!

kyriazis · Mar 1, 2022

Restarting pve-cluster on 2 nodes (one with the update and one without) did not help.

Attaching journals from both nodes.

As a note, the screenshot was from the console, not any logs.

Thank you!

George

Stoiko Ivanov · Mar 1, 2022

Hm - quite a few messages from corosync indicate that there might be an issue with the cluster network.
Additionally, the issues seem to have started well before 00:00 today (at least I did not see a clear point where they started).

a) make sure you don't have HA enabled (else your nodes might fence themselves)
b) check the journal of the upgraded node for the boot after the upgrade (`journalctl -b`) for any messages relating to problems with the new kernel
c) try restarting corosync on all nodes, and watch for issues in the journal
d) the output of `corosync-cfgtool -s` might be helpful
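
As a rough sketch, steps (b) through (d) on a single node could look like the script below. The `run` helper, the `triage_node` function and the `DRY_RUN` variable are made up for illustration and are not part of any Proxmox tooling. With the default `DRY_RUN=1` the script only prints the commands. Step (a) is left out on purpose, since it needs a manual check.

```shell
#!/bin/sh
# Sketch only: walks through steps (b)-(d) on the local node.
# DRY_RUN, run and triage_node are illustrative names, not Proxmox tools.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'would run: %s\n' "$*"
    else
        "$@"
    fi
}

triage_node() {
    # (b) journal of the current boot, to look for kernel-related messages
    run journalctl -b --no-pager
    # (d) link status as corosync sees it from this node
    run corosync-cfgtool -s
    # (c) restart corosync, then read what it logged afterwards
    run systemctl restart corosync
    run journalctl -u corosync --since "5min ago" --no-pager
}

triage_node
```

Run it with `DRY_RUN=0` as root to actually execute the commands, one node at a time.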

I hope this helps!

kyriazis · Mar 1, 2022

Thanks Stoiko,

(c) and (d) helped. It turned out that corosync was frozen on some of the other nodes, and restarting corosync on those nodes brought the cluster back to life.
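
For anyone landing here later, restarting corosync on the affected nodes can be sketched roughly as follows. The node names in `NODES` are placeholders, the `restart_corosync_on` function is made up for this example, and root SSH access between nodes is assumed. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch only: restart corosync on a list of nodes, one after the other.
# NODES holds placeholder hostnames; replace them with the affected nodes.
NODES="${NODES:-node1 node2 node3}"
DRY_RUN="${DRY_RUN:-1}"

# Illustrative helper: restart corosync on one node over SSH (assumes root SSH access).
restart_corosync_on() {
    target="$1"
    if [ "$DRY_RUN" = "1" ]; then
        printf 'would run on %s: systemctl restart corosync\n' "$target"
    else
        ssh "root@$target" systemctl restart corosync
    fi
}

for n in $NODES; do
    restart_corosync_on "$n"
done
```

After a real run (`DRY_RUN=0`), `pvecm status` and `corosync-cfgtool -s` on each node show whether the links are back.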

I went ahead and did updates on the rest of the nodes without problems, so I am back in business now.

Thank you for the help!

George

Stoiko Ivanov · Mar 1, 2022

Glad that worked out - if possible, please mark the thread as 'SOLVED' - this helps others running into similar issues.

Thanks!
