Total Cluster Failure

Perhaps I should start a new thread about the Ceph disconnections etc.

Concerning Ceph:

ceph: 14.2.15-pve3
ceph-fuse: 14.2.15-pve3
on the old cluster nodes

ceph-fuse: 12.2.11+dfsg1-2.1+b1
on the new node 8

Is node 8 connected to your 4 Ceph nodes? You should use the same versions.
Do you get any warnings from Ceph?
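
Both can be checked quickly on any node with the standard Ceph tools, e.g.:

# show which daemon/client versions are actually running in the cluster
ceph versions
# overall status and any active warnings
ceph -s
ceph health detail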


Jan 2 16:46:44 ld4-pve-n3 corosync[2809]: [KNET ] link: host: 6 link: 1 is down
Jan 2 16:46:44 ld4-pve-n3 corosync[2809]: [KNET ] host: host: 6 (passive) best link: 2 (pri: 10)
Jan 2 16:46:45 ld4-pve-n3 corosync[2809]: [KNET ] rx: host: 6 link: 1 is up
Jan 2 16:46:45 ld4-pve-n3 corosync[2809]: [KNET ] host: host: 6 (passive) best link: 2 (pri: 10)

Could be a problem with the bond. You could try to add another separate link if possible.
Have you checked the LACP parameters on the switch and on the bond? (I think you are using IEEE 802.3ad LACP?)
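
On the Proxmox side you can verify the negotiated mode, the hash policy and the LACP partner state of the bond like this (assuming the bond is called bond0):

# shows bonding mode, hash policy and per-slave LACP partner details
cat /proc/net/bonding/bond0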

Read: https://pve.proxmox.com/pve-docs/pve-network-plain.html

If your switch support the LACP (IEEE 802.3ad) protocol then we recommend using the corresponding bonding mode (802.3ad). Otherwise you should generally use the active-backup mode.
If you intend to run your cluster network on the bonding interfaces, then you have to use active-passive mode on the bonding interfaces, other modes are unsupported.
 
Concerning a possible network infrastructure for you:

You do not need 6 switches if there are enough free ports on your existing ones. Separate the corosync links onto 2 switches (maybe also use cheaper switches; even 2 separate 100 Mbit/s or 2 separate 1 Gbit/s interfaces are better than VLANs on a 20 Gbit/s bond, because only low latency matters. Try it with some old switches and test it with omping?!). No bonding, just 2 separate rings. Besides the redundancy if one switch goes down or a network card is damaged, another benefit is that if something goes wrong with the configuration of one switch, the entire cluster does not have the problem. With LACP (and MLAG between the redundant switches) it could, as we saw with your VLAN problem at first.
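
As a rough sketch, two separate rings end up looking like this in /etc/pve/corosync.conf (names and subnets are only placeholders); knet then fails over between the links on its own:

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    # first dedicated corosync subnet (switch A)
    ring0_addr: 10.10.1.1
    # second dedicated corosync subnet (switch B)
    ring1_addr: 10.10.2.1
  }
  # one block per node
}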

For Ceph it would also be better to separate the traffic from the public VM traffic. (Here it is more a performance concern, while with corosync it is necessary for stability.) If you use additional simpler switches for corosync and there are enough ports on your existing 10 Gbit/s switches, add another 2-port 10 Gbit/s card to each node. Then you could separate Public (2 x 10 Gbit/s) / Ceph (2 x 10 Gbit/s) / Corosync (cluster) (2 x 1 Gbit/s or 100 Mbit/s), and you have the right network structure for such an HCI cluster.
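
Once Ceph has its own subnets, the relevant settings live in /etc/pve/ceph.conf, roughly like this (subnets are placeholders; moving existing monitors to a new network is more work than just editing the file):

[global]
  # monitors, clients and VM disk access
  public_network = 10.20.1.0/24
  # OSD replication and heartbeat traffic
  cluster_network = 10.20.2.0/24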

With the right network topology and well-configured nodes/cluster it is a rock-solid system. We have seen nearly all possible failures on our cluster and neither the cluster nor Ceph went down.
 
So I did a full set of pings.

In the short-term test I can see max values above 6 ms, but in the long-term test these appear to vanish.

Looking at it, it would seem using this link would cause issues for corosync.

I also did a "jumbo-frame" ping between all the servers and could confirm that it is working as expected. Not sure why I am getting MTU errors, perhaps because of latency?
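
(For reference, a jumbo-frame ping of that kind looks roughly like this, assuming a 9000 byte MTU, i.e. 8972 bytes of payload after the 28 bytes of IP/ICMP headers:)

# -M do forbids fragmentation, -s sets the ICMP payload size
ping -M do -s 8972 -c 5 172.16.3.27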
 


Thanks for your further replies.

I have checked the switches and they do seem to be working with 802.3ad:

[LN-LD4-SW-S6720]dis lacp stat eth-trunk 16
Eth-Trunk16's PDU statistic is:
------------------------------------------------------------------------------
Port LacpRevPdu LacpSentPdu MarkerRevPdu MarkerSentPdu
XGigabitEthernet0/0/16 250444 266425 0 0
XGigabitEthernet1/0/16 250439 266619 0 0

The bonds on Proxmox are set to the layer2 hash policy.
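
The relevant part of the bond definition in /etc/network/interfaces looks roughly like this (the interface names here are only examples, not our exact ones):

auto bond0
iface bond0 inet manual
    bond-slaves enp3s0f0 enp3s0f1
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2
    bond-miimon 100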

I am wondering if the network drivers need updating. We are using the standard Proxmox/Debian ones currently, but Mellanox does have an apt repository which, when added, offers newer drivers, involving a kernel patch.

Someone else said to avoid anything like this as it could cause issues, so we have not done it, but surely they provide an updated driver for a reason?

What you have proposed is not too far off what I was thinking myself.

We do have a couple of old Dell 1 Gbit/s switches that we use for iDRAC/IPMI access. I was thinking of dedicating one of these to corosync; I could actually dedicate 2 of them and have 2 rings... not as if I will be able to sell them for anything.

We do have enough ports in the switches for another set each, so I guess what you say makes sense: put Ceph onto one bond (no VLANs), then the public and live-migration traffic onto another bond... this is going to be a lot of cables!

Most of our hosts are Dell R820s at the moment (4 sockets). We are in the process of changing over to newer 2-socket ones, at which point I think we will have to get support on this.

In terms of network cards, as mentioned we have Mellanox ConnectX-4 cards. Do you think these are OK? Any other one that is battle-tested beyond belief to consider?


Overall I have 3 issues with this setup:

1) Adding a node killed the cluster <-- confident we have gotten to the bottom of this

2) Servers getting stuck in POST after a reboot <-- the new Supermicro server we have installed is not doing this, so hopefully it is just a Dell BIOS issue on the old servers

3) Individual nodes locking up: can't even type on the physical console, nothing logged in the logs, CPUs maxed out at 100%, need a full reset. <-- this one is still outstanding. My gut says it's something to do with either the old hardware (Sandy Bridge generation) or the network card. This was one of the reasons to get the newer Supermicro in place, to try and rule that out.

Thanks for all your assistance thus far!
 
Looking at it, it would seem using this link would cause issues for corosync.

You did the ping over one of the VLAN networks, which is on the bond?
You can see that the ping to NODE7 (CEPH OSDs & VMs), 172.16.3.27, is much higher than to a node which only has VMs on it. Because it has the highest latency of all the Ceph nodes, I think that this host was the master when you did the test?!
The first test, with 10000 pings, is a type of stress test: you put a high load on the link and see what happens with the latency. The 10 minute test lets you know what happens over time; here the packet loss is more relevant. Even when the latency is good there, it may get into trouble when there is load on the network. At the moment Ceph is working normally, no recovery or such things?!
Your network connection and switches are not bad, but you can see the latency spikes (more than 7 ms), which come purely from the networks not being separated. Because of this the nodes with Ceph also have a higher latency to all other nodes, and all other nodes to them. (Even if you do not saturate the 20 Gbit/s, you can see the effect I mentioned concerning higher latency when you put load on the network.)
Have you connected the separate 1 Gbit/s link for corosync already? You can run the same test over the IP addresses on that separate interface and should see more or less the same min/avg, but without such high spikes at max.
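
For reference, those are the two tests from the Proxmox docs (hostnames are placeholders, run the command on all nodes in parallel):

# stress test: 10000 packets sent as fast as possible
omping -c 10000 -i 0.001 -F -q node1 node2 node3
# ~10 minute test: here the packet loss is what matters
omping -c 600 -i 1 -q node1 node2 node3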

In terms of network cards, as mentioned we have Mellanox ConnectX-4 cards. Do you think these are OK? Any other one that is battle-tested beyond belief to consider?

We are using three Mellanox CX354A ConnectX-3 cards (2-port 40 Gbit/s) in each node and have had no problems at all. I do not know if the difference is big, but yours are newer, and if the old ones are supported perfectly, why should yours not work? We have not changed the network driver. (Maybe read through the discussions about the network drivers for Mellanox.) Have you checked the firmware of the cards?
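
The driver and firmware version currently in use can be read with ethtool (the interface name is just an example):

# shows driver, driver version and firmware-version of the NIC
ethtool -i enp3s0f0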

Have you connected them via SFP or RJ45? If RJ45, which cables are you using (Cat5e, Cat6, Cat6a, Cat7)?


What you have proposed is not too far off what I was thinking myself.

Great. It could make your life much easier. When you set up something like this, I would recommend not bonding the 2 links of one physical card; instead, bond one link from one card with one link from the other card for the public/VM traffic, and the other 2 links for Ceph. Then you still have 1 link of the LACP bond left even if an entire card fails. Otherwise the VM or Ceph traffic has no link at all if one of the network cards fails.

You can also use (if there are enough ports) your iDRAC/IPMI switches for one of the 2 corosync links. When the switch is powerful enough it is not absolutely necessary to have a dedicated one. It just needs a separate physical link from your nodes to the switch; it could even be a VLAN with a separate subnet where only the corosync traffic is running. It makes a difference whether you have one physical link (an LACP bond is also just one link) with 3 VLANs/subnets on it (like now), or 3 separate physical links with 1 VLAN/subnet each. When the switches are powerful enough you could have the iDRAC/IPMI in one VLAN over one physical interface and the corosync traffic in another VLAN over a 2nd separate physical link on the same switch; then you maybe need just 1 more switch to be redundant. Give it a try and test it with omping like you did already. It is just essential that the switch does not pass any packets from other networks (iDRAC/IPMI) into the corosync network and vice versa.

Most of our hosts are Dell R820s at the moment (4 sockets). We are in the process of changing over to newer 2-socket ones, at which point I think we will have to get support on this.

Concerning hardware you can look at Thomas Krenn. They have systems optimised for Proxmox/Ceph. You do not have to buy from them, but they also publish a detailed list of which hardware is used in their systems, which gives you a good orientation, e.g. which network card or HBA or whatever is good to use. They build tons of systems and have a lot of experience with Proxmox/Ceph and also other cool open source solutions. Reading the benchmark papers from Proxmox also gives you an orientation.
I think at minimum you will need a Community support subscription for the enterprise repository. If you have a concrete idea of how the new nodes should look and a list of which hardware you want to use, you could also send an e-mail to Proxmox with the request to take a short look at your specific hardware. They are all really good guys and I think they will have a comment on this.
If you want to run this system you should also think about a training. The trainings are excellent and you can also bring your own questions, if they are not already answered in the course. Especially now that you have some experience from the good old trial-and-error method, this will push your skills: https://www.proxmox.com/en/training/pve-bundle At the moment they are also held online (because of COVID-19), so you can take part from anywhere.
 
I'm surprised that someone running a cluster of that size in a commercial enterprise is operating without any form of contracted support.
 
3) Individual nodes locking up: can't even type on the physical console, nothing logged in the logs, CPUs maxed out at 100%, need a full reset. <-- this one is still outstanding. My gut says it's something to do with either the old hardware (Sandy Bridge generation) or the network card. This was one of the reasons to get the newer Supermicro in place, to try and rule that out.

If you have updated all firmware and BIOS, I think it could be a problem with the C-states.
Is there any possibility in the BIOS to change things around the C-states? (I have never worked with Dell hardware.)
Best would be to only allow C0 and C1. Maybe try this on one of your nodes.

To disable it in software look at: https://forum.proxmox.com/threads/r...e-6-1-auf-ex62-nvme-hetzner.63597/post-294285
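
A sketch of the kernel command line variant, assuming Intel CPUs (edit /etc/default/grub, then run update-grub and reboot):

# /etc/default/grub - limit the CPUs to C0/C1, no deeper sleep states
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=1 processor.max_cstate=1"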

Maybe read https://forum.proxmox.com/threads/proxmox-host-random-freeze.54721/
It also describes how to use kdump, so that you have some logs when it happens again.
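
Getting kdump in place on Debian/Proxmox is basically (the details are in the linked thread):

apt install kdump-tools
# verify that a crash kernel is loaded and see where the dumps will be written
kdump-config show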
 
