Cluster broken after nodes update/upgrade

dancenation

Jul 13, 2020
Hi to All,
I'm writing here since I can't find enough information about the issue I'm facing.

I decided to test a 4-node cluster with Proxmox. Everything was running just fine for the last 4 weeks: Ceph was running well and the VMs were running with no issues.
Yesterday I decided to update/upgrade all machines, because I checked for updates and there were a lot available.
After the first machine update/upgrade and reboot the whole cluster died.

Although I first updated only 1 machine, the whole cluster died.
I'm glad that this was not a production environment!!!

So far I can't get the cluster running.
When I try to log in via SSH to any of the machines, it hangs and I can't continue.
When I try to log in to the web UI, it just stays like this on the login page (see the screenshot below).
I have also tried to check whether the Proxmox cluster PEM certificate that I created during cluster creation is still there, but SSH drops and hangs.

I have tried the following with little to no luck:

Code:
killall -9 corosync
systemctl restart pve-cluster
systemctl restart pvedaemon
systemctl restart pveproxy
systemctl restart pvestatd

It's important to find out whether it is possible to revive a cluster in such a situation if it happens, because imagine hundreds of machines, and you decide to update only 1 machine and the whole cluster dies...
[screenshot: web UI hanging on the login page]
 
After the first machine update/upgrade and reboot the whole cluster died.
Died in which way? Just not working anymore with the symptoms you describe or did all nodes reboot?

Do you have any HA enabled guests?

Can you get the output of pvecm status?

Try stopping pve-cluster, then restart the corosync service and if that is up again, start the pve-cluster service.
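For reference, that sequence would look roughly like this on the affected node (plain systemd commands, using the unit names already mentioned in this thread):

Code:
# stop the cluster filesystem service first
systemctl stop pve-cluster
# restart corosync and check that it comes up cleanly
systemctl restart corosync
systemctl status corosync
# once corosync is running again, start pve-cluster
systemctl start pve-cluster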
 
Hi and thank you for your help.
Yes, all nodes are in a cluster with HA guests enabled. Also, Ceph was installed and running properly with failover.

I will test now and will come back with the output from the commands.
I will need to go to the machines physically because I still can't access them via SSH.
 
Hi, I'm trying to execute the commands, but it's very hard because after a couple of commands everything either freezes or I can't type anything, whether over SSH or directly at the console on the server.

I have managed to get the output from one of the nodes:


Code:
root@proxmox1:~# pvecm status
Cluster information
-------------------
Name:             ******
Config Version:   4
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Jul 13 16:51:06 2020
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000002
Ring ID:          1.5d39
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.11.41
0x00000002          1 10.10.11.21 (local)
0x00000003          1 10.10.11.31
0x00000004          1 10.10.11.11
 
This behavior is not normal. Do you see anything in the syslogs or in the output of dmesg?

The disks and RAM are okay?

The slow, hanging commands and the non-working SSH are very 'interesting'.

Can the nodes ping each other? Can you ssh from one node to another?
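To narrow it down, something along these lines on each node should be enough (the node IPs are taken from your pvecm output above):

Code:
# recent kernel and cluster service messages from the current boot
dmesg -T | tail -n 100
journalctl -b -u corosync -u pve-cluster --no-pager | tail -n 100
# basic connectivity between the nodes
ping -c 3 10.10.11.11
ssh root@10.10.11.11 uptime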
 
The whole cluster was working perfectly. The disks are totally new, the RAM is new and tested. The cluster is made of 4 Dell R720s.
As I mentioned earlier, the moment I started upgrading the first machine, even without a reboot, the whole cluster stopped working and started to act weird. After I upgraded all the machines and rebooted, the cluster was dead.


About the ssh and console:
The issue is that, whether over SSH or on the direct console, after a few commands the console hangs and I cannot type any more commands.
I will try to cross-ping from one node to another.
 

Attachments

  • dmesg.txt (506 KB)
In that dmesg output there are a few worrying things. First, it is filled with these lines:
Code:
[ 7444.146268] vmbr0: received packet on bond1 with own address as source address (addr:ec:f4:bb:d8:0e:14, vlan:0)

And then there is this:
Code:
[ 5922.333371] bond0: (slave eno3): link status definitely down, disabling slave
[ 5922.333378] bond0: now running without any active interface!
[ 5922.333414] bond0: (slave eno4): link status definitely down, disabling slave
[ 5922.333544] vmbr1: port 1(bond0) entered disabled state
[ 5922.357312] bond1: (slave eno1): link status definitely down, disabling slave
[ 5922.357327] bond1: (slave eno2): link status definitely down, disabling slave
[ 5922.357344] bond1: now running without any active interface!
The interfaces come up ~4 seconds later but this should not happen.

How are these bonds set up? Can you show us the /etc/network/interfaces file? (redact any public IP addresses).

Are the switches okay and running the latest firmware?
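It would also help to see the current state of the bonds and their links, for example with something like:

Code:
# bonding state as the kernel sees it
cat /proc/net/bonding/bond0
cat /proc/net/bonding/bond1
# quick overview of all interfaces
ip -br link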
 
Hardware-wise, everything is working just fine.
I'm still trying to understand what is missing in order to get rid of this message when trying to manage the other 3 nodes. It is the same when I connect to a different node and try to manage the others.
It seems like some "keyring" or certificate is missing, or the nodes are not communicating as a cluster.
"Connection refused (595)"

Here is the network config, which was working perfectly before the update:



Code:
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto eno3
iface eno3 inet manual

auto eno4
iface eno4 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves eno3 eno4
        bond-miimon 100
        bond-mode balance-rr

auto bond1
iface bond1 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode balance-rr

auto vmbr0
iface vmbr0 inet static
        address 10.10.11.31/24
        gateway 10.10.11.1
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
 
Maybe try to change the bond-mode from balance-rr to active-backup?
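For bond1 (the one vmbr0 sits on) that would look roughly like this, and the same idea applies to bond0; this is just a sketch based on the config you posted:

Code:
auto bond1
iface bond1 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode active-backup
        # optionally pin the preferred slave:
        # bond-primary eno1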

For some reason you have these log lines a lot, and it is possible that this is causing load on your servers, which in turn behave in this very peculiar way, not being responsive at all anymore.
If you search for that log line, there are a few threads, even in this forum, and the cause was either broken switches or buggy firmware.

It might be worth trying to restart the switches and maybe testing another set of switches.
 
Fixed!!!
[screenshot: all nodes green in the web UI]

The cluster is working now. All nodes are green!
What I did, as aaron suggested, was to ping all nodes from all nodes, which eventually returned duplicated packets.
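Roughly what I ran on each node (the IPs are the same as in the pvecm output above):

Code:
# ping every node; DUP! replies point to a loop somewhere
for ip in 10.10.11.11 10.10.11.21 10.10.11.31 10.10.11.41; do
        ping -c 3 "$ip"
done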

[screenshot: ping output showing duplicate (DUP!) replies]

This is when I removed one of the interfaces from the bond:
[screenshot: ping output after removing one of the interfaces from the bond]

I have removed the vmbr0 bond for local connectivity.
It is strange how it was working perfectly with eno1 and eno2 bonded with balance-rr, without any issues, BEFORE the update.

Anyway ....
After removing each bond and setting vmbr0 to eno1 or eno2, I ran:
service pve-cluster restart

and after that all of these:

Code:
killall -9 corosync
systemctl restart pve-cluster
systemctl restart pvedaemon
systemctl restart pveproxy
systemctl restart pvestatd
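For reference, the vmbr0 part of /etc/network/interfaces now looks roughly like this (bond removed, bridge on a single NIC):

Code:
auto eno1
iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 10.10.11.31/24
        gateway 10.10.11.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0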

And now all nodes are green, with no connectivity issues between them, no duplicate pings, and no connection errors.
I'm waiting for the 40 Gbit cards to arrive so I can test without a bond, with only 40 Gbit connectivity for Ceph, failover, and backup.

Thank you for your help and directions!!
 
