Best Practice Network

felipe

Hi,

we have had a Proxmox / Ceph cluster up and running for 8 and 5 years respectively. Now some of the Ceph servers in particular are end of life and we want to replace them. We also want to redesign the network for the best stability, as we occasionally had problems. Currently we have 2 Juniper switches (all 10G, LACP connected) with VLANs configured and 3 networks:

1: frontend VM and cluster network
2: storage network (Ceph MONs, NFS, etc.)
3: Ceph OSD network

Recently a Juniper switch crashed really badly and all Proxmox nodes as well as Ceph went down. We still don't know why it hit us so hard.

Because of this we want to separate the networks more, physically.
Our plan is to use:

1 OSD NETWORK: 2 x Mellanox 40G, active/passive (or is a stacked Mellanox setup at 40G possible, and is it worth it?)
2 STORAGE NETWORK: 2 x Juniper 10G (for the MONs and NFS shares etc.)
3 FRONTEND NETWORK: 2 x Juniper 10G (VMs limited to 1G in Proxmox; a lot of VLANs in this network)
having corosync configured for the storage and frontend networks as 2 rings.

Another question: is it possible to have a second Ceph cluster (only MONs) in the same network as the old Ceph cluster?

What do you think? What should we consider?

best regards
Philipp
 
having corosync configured for the storage and frontend networks as 2 rings.

I would not do this, especially since you are putting a lot of effort into separating your networks. Corosync should have at least 1 (better 2) separate physical links (not shared with any other traffic) because of latency. Corosync links do not need high bandwidth, but they do need a stable, low-latency connection.
You can get into big trouble if there is an interruption on the cluster (corosync) link and a Ceph node gets restarted: when it comes up again, the storage traffic from Ceph rebalancing is high, your corosync latency goes up again, and maybe the next node restarts...
Separate it onto one 1 Gbit/s connection as another ring on the public network (you named it FRONTEND NETWORK), or better onto 2 links that are dedicated to corosync. (You can also use the same 2 switches for everything, if the switches are powerful enough, but with dedicated ports and network adapters for the different kinds of traffic. If you only separate via VLANs on the same physical connection, latency will grow with the traffic on the link even if the connection is not saturated.)

Proxmox made a perfect graphic of the recommended network structure:


Proxmox Ceph_small.PNG

That graphic comes from a talk with Alwin: https://youtu.be/OeDEsD1NjWI (it is in German).
 
OK, in this chart it seems they share the storage network with the OSD network, which I think is also not best practice.
Generally the question is: with two networks configured for corosync, is it possible to say: this is the first (preferred) network and that is the failover network?

The last few days we had bad luck. We have 4 big Juniper switches and 3 LACP bonds per server, configured with VLANs for VM & cluster, storage/backup, and the OSD network, each with its own 2 network cards for LACP. One Juniper went down and the network across all the other switches (another 2 Junipers) became so buggy that Proxmox lost quorum and Ceph lost quorum as well. Normally switches don't behave that strangely, but it happened.
 
I don't trust only 2 switches anymore. Ours are really powerful, but when they start to flap / route VLANs around, everything still goes bad...
Maybe we just had terribly bad luck... I asked some network specialists and they said they have never seen anything similar.
 
Yes, in this version the Ceph storage frontend and backend are on one link. Separating these two makes sense, but it is not as essential as separating corosync. If Ceph starts to rebalance the PGs, it might have an impact on frontend performance (towards the VMs).
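For reference, this split corresponds to two settings in ceph.conf; the subnets below are only placeholders:

Code:
# /etc/pve/ceph.conf (illustrative subnets, adjust to your environment)
[global]
    public_network  = 10.10.20.0/24   # MONs and client traffic (your STORAGE NETWORK)
    cluster_network = 10.10.30.0/24   # OSD replication and backfill (your OSD NETWORK)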
If rebalancing hurts you, you can set osd_max_backfill, which allows you to limit the rebalance speed. As a result, rebalancing slows down and the network load is reduced. But you cannot relax (well, you can, but it is really not recommended) the latency requirements of the corosync link.
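A minimal sketch of that throttle; note that the option is actually spelled osd_max_backfills in current Ceph releases, and the values below are only examples:

Code:
# Limit backfill/recovery load per OSD at runtime (raise it again once recovery is done)
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1

# Verify the active value
ceph config get osd osd_max_backfills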
So you would like to know whether you can prioritize the corosync links, i.e. prefer the dedicated link and, in case of a switch failure, have corosync fall back to the "failover network" which is on a link shared with other traffic?
It can be done with the parameter "knet_link_priority: X" in corosync.conf, e.g.:

Code:
quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster
  config_version: XX
  interface {
    # dedicated corosync link: the higher knet_link_priority wins in passive mode
    bindnetaddr: XXX.10.XXX.XXX
    ringnumber: 0
    linknumber: 0
    knet_link_priority: 10
  }
  interface {
    # fallback link on the shared network
    bindnetaddr: XXX.20.XXX.XXX
    ringnumber: 1
    linknumber: 1
    knet_link_priority: 1
  }
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
  token: 5000
}

You can also use the active mode; then both links are used simultaneously, somewhat like LACP.
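If I remember correctly, with corosync 3 / kronosnet this is selected with the link_mode option in the totem section (the rrp_mode line above is the older corosync 2 equivalent); a sketch:

Code:
totem {
  # passive (default): only the highest-priority available link carries traffic
  # active: traffic is distributed over all available links
  link_mode: active
  # (rest of the totem section as above)
}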

Using corosync links on an LACP interface is not supported and can lead to problems. (Sometimes it works, but we also had problems in the beginning when using corosync on an LACP link.) Read: https://pve.proxmox.com/pve-docs/pve-network-plain.html

If your switch support the LACP (IEEE 802.3ad) protocol then we recommend using the corresponding bonding mode (802.3ad). Otherwise you should generally use the active-backup mode.
If you intend to run your cluster network on the bonding interfaces, then you have to use active-passive mode on the bonding interfaces, other modes are unsupported.
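As a sketch of that second point, a dedicated cluster (corosync) link in active-backup mode in /etc/network/interfaces (interface names and addresses are made up):

Code:
auto bond2
iface bond2 inet static
    address 10.10.40.11/24
    bond-slaves eno3 eno4          # one port to each switch, no other traffic on them
    bond-miimon 100
    bond-mode active-backup
    bond-primary eno3
# corosync link0 would then be bound to the 10.10.40.0/24 subnet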

Maybe a specific problem with the switch configuration or firmware?

We are running all the traffic of one cluster through 2 powerful Arista switches, but separated as shown in the graphic (6 physical cables to each node, so we get the full performance of the ports and also low latency for each network). The VM network and the Ceph network are LACP links. Corosync has a separate network on each switch: a connection from each node to one switch in one subnet is used as link0, and a separate connection from each node to the other switch in another subnet is used as link1. We use the active mode in corosync.conf, which has been stable for some months now (before that we used the passive mode). We tested nearly every possible failure scenario and had no problems with this setup (switch failure, connection failure, inter-switch connection (MLAG) failure, node failure, ...).
It is essential to test all these scenarios before going into production with the cluster, to avoid such surprises.

I think your problem came from using an LACP link for corosync, which, as written before, is not supported. When your switch went down, did you run any tests to see what the network latency was? Were any network failures reported on the Proxmox nodes? How did you solve the problem? Did your cluster come up again after repairing/replacing the switches?

When we set up new clusters, I prefer using separate switches just for corosync (1 Gbit/s is more than enough), because in our first setups we spent so many of the more expensive ports (10/40 Gbit/s) on corosync, which is not necessary. Another benefit of the separate switch: if you get something wrong in the configuration of the switches that form the MLAG, this may also affect your corosync link, and it is a lot of work to repair anything while the cluster is not quorate and the cluster filesystem is read-only. You can also use some older switches; even 100 Mbit/s is enough when the cluster is not too large. You can add a ring and test the latency on the connection; if it is low enough (max. 6-7 ms), everything will work out fine.
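A sketch of how such a link can be checked (addresses and node names are placeholders):

Code:
# Round-trip latency to another node over the intended corosync subnet
ping -c 100 -i 0.2 10.10.40.12

# After adding the link, corosync reports the per-link status itself
corosync-cfgtool -s

# And the quorum / membership view
pvecm status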
 
each with its own 2 network cards for LACP

Are these really 6 separate network cards, or cards with multiple ports, e.g. 3 network cards with 2 ports each for the 6 connections?
Here you should also spread the connections across different cards (see the sketch below). E.g. if you have one network card with 2 ports linked to 2 switches with LACP (e.g. for storage traffic), you gain almost nothing if the card itself fails: the whole storage network for this node will be down.
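The sketch mentioned above, with hypothetical interface names (ens1f0 on card 1, ens2f0 on card 2):

Code:
auto bond0                          # e.g. the storage traffic bond
iface bond0 inet manual
    bond-slaves ens1f0 ens2f0       # one port from each physical card
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4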
 
At the moment we have 3 network cards with 2 links each, so we use LACP: 1 bond for VM traffic / corosync (not so good), 1 for storage, backups and MONs, and 1 for OSD traffic, all connected with LACP to 2 Juniper 10G switches.

Actually this worked quite well for 7 years.
Our problem was that 1 of our 2 stacked Juniper switches just somehow died (not completely dead, but it did not come up again after a reboot, and before the reboot it was doing some strange things in the network). We also identified an older Cisco stack with 2 switches behaving oddly. It was impossible for us to really identify what happened, because most of it happened during the night and when we checked, all clusters (Proxmox and Ceph) were already broken. I guess we had really bad luck that the switches failed that badly. It is just our guess that it was the switches, as one was dead afterwards and one was behaving strangely (no login via SSH, slow pings, etc.).
But this showed me that relying on only one network path, even with LACP across stacked switches as backbone and even with physically separated network links on the host nodes, is not 100% safe.

What we found on our servers was the following (it just suddenly started in the night, and we don't know exactly what happened): in the morning we saw that the VLANs were being routed around; the traffic was not that extreme anymore. But we had to disconnect all the servers, and then everything started to work again. We then connected one server after the other to rejoin the cluster. For Ceph we had to wait some hours until it functioned normally.

Jan 8 05:25:26 host2 kernel: [2392757.962971] mlx4_en: ens1d1: Steering Mode 1
Jan 8 05:25:26 host2 kernel: [2392757.984195] mlx4_en: ens1d1: Link Down
Jan 8 05:25:26 host2 kernel: [2392757.986922] bond1: (slave ens1d1): speed changed to 0 on port 2
Jan 8 05:25:27 host2 kernel: [2392758.037264] mlx4_en: ens1: Steering Mode 1
Jan 8 05:25:27 host2 kernel: [2392758.058644] mlx4_en: ens1: Link Down
Jan 8 05:25:27 host2 kernel: [2392758.061109] bond1: (slave ens1): speed changed to 0 on port 1
Jan 8 05:25:27 host2 kernel: [2392758.065424] bond1: (slave ens1): link status definitely down, disabling slave
Jan 8 05:25:27 host2 kernel: [2392758.065466] bond1: (slave ens1d1): link status definitely down, disabling slave
Jan 8 05:25:27 host2 kernel: [2392758.065469] bond1: now running without any active interface!
Jan 8 05:25:27 host2 kernel: [2392758.065503] vmbr1: port 1(bond1) entered disabled state
Jan 8 05:25:27 host2 kernel: [2392758.077385] mlx4_core 0000:04:00.0: Failed to bond device: -95
Jan 8 05:25:27 host2 kernel: [2392758.077404] mlx4_en: ens1d1: Fail to bond device
Jan 8 05:25:27 host2 kernel: [2392758.078195] mlx4_core 0000:04:00.0: Failed to bond device: -95
Jan 8 05:25:27 host2 kernel: [2392758.078221] mlx4_en: ens1: Fail to bond device
Jan 8 05:25:29 host2 kernel: [2392760.243975] mlx4_en: ens1d1: Link Up
Jan 8 05:25:29 host2 kernel: [2392760.257529] bond1: (slave ens1d1): link status definitely up, 10000 Mbps full duplex
Jan 8 05:25:29 host2 kernel: [2392760.257535] bond1: active interface up!
Jan 8 05:25:29 host2 kernel: [2392760.257542] mlx4_core 0000:04:00.0: Failed to bond device: -95
Jan 8 05:25:29 host2 kernel: [2392760.257564] mlx4_en: ens1d1: Fail to bond device
Jan 8 05:25:29 host2 kernel: [2392760.257586] vmbr1: port 1(bond1) entered blocking state
Jan 8 05:25:29 host2 kernel: [2392760.257588] vmbr1: port 1(bond1) entered forwarding state
Jan 8 05:25:29 host2 kernel: [2392760.298942] mlx4_en: ens1d1: Link Down
Jan 8 05:25:29 host2 kernel: [2392760.301401] bond1: (slave ens1d1): speed changed to 0 on port 2
Jan 8 05:25:29 host2 kernel: [2392760.329563] vmbr1: port 1(bond1) entered disabled state
Jan 8 05:25:29 host2 kernel: [2392760.365485] bond1: (slave ens1d1): link status definitely down, disabling slave

----------

Jan 8 04:31:46 host4 kernel: [3058091.571664] vmbr0: received packet on bond0 with own address as source address (addr:9c:69:b4:61:83:28, vlan:0)
Jan 8 04:31:46 host4 kernel: [3058091.621607] vmbr0: received packet on bond0 with own address as source address (addr:9c:69:b4:61:83:28, vlan:0)
Jan 8 04:31:46 host4 kernel: [3058091.671612] vmbr0: received packet on bond0 with own address as source address (addr:9c:69:b4:61:83:28, vlan:0)
Jan 8 04:31:46 host4 kernel: [3058091.721640] vmbr0: received packet on bond0 with own address as source address (addr:9c:69:b4:61:83:28, vlan:0)
Jan 8 04:31:46 host4 kernel: [3058091.771671] vmbr0: received packet on bond0 with own address as source address (addr:9c:69:b4:61:83:28, vlan:0)
Jan 8 04:31:46 host4 kernel: [3058091.821696] vmbr0: received packet on bond0 with own address as source address (addr:9c:69:b4:61:83:28, vlan:0)
Jan 8 04:31:46 host4 kernel: [3058091.871694] vmbr0: received packet on bond0 with own address as source address (addr:9c:69:b4:61:83:28, vlan:0)
Jan 8 04:31:51 host4 kernel: [3058096.322864] net_ratelimit: 93 callbacks suppressed

----------

Jan 8 05:09:21 ceph3 kernel: [7650713.839546] nf_conntrack: nf_conntrack: table full, dropping packet
Jan 8 05:09:21 ceph3 kernel: [7650713.839858] nf_conntrack: nf_conntrack: table full, dropping packet
Jan 8 05:09:21 ceph3 kernel: [7650713.839900] nf_conntrack: nf_conntrack: table full, dropping packet
Jan 8 05:09:21 ceph3 kernel: [7650713.839978] nf_conntrack: nf_conntrack: table full, dropping packet
Jan 8 05:09:22 ceph3 kernel: [7650714.171741] nf_conntrack: nf_conntrack: table full, dropping packet
Jan 8 05:09:22 ceph3 kernel: [7650714.172154] nf_conntrack: nf_conntrack: table full, dropping packet
 
Jan 8 05:09:21 ceph3 kernel: [7650713.839546] nf_conntrack: nf_conntrack: table full, dropping packet
Jan 8 05:09:21 ceph3 kernel: [7650713.839858] nf_conntrack: nf_conntrack: table full, dropping packet
Jan 8 05:09:21 ceph3 kernel: [7650713.839900] nf_conntrack: nf_conntrack: table full, dropping packet
Jan 8 05:09:21 ceph3 kernel: [7650713.839978] nf_conntrack: nf_conntrack: table full, dropping packet
Jan 8 05:09:22 ceph3 kernel: [7650714.171741] nf_conntrack: nf_conntrack: table full, dropping packet
Jan 8 05:09:22 ceph3 kernel: [7650714.172154] nf_conntrack: nf_conntrack: table full, dropping packet

I do not know if this is related in any way. Did this happen long after your problems started, while Ceph was recovering?
I wrote a comment in another thread some days ago, where I found something concerning nf_conntrack:

Check nf_conntrack: This connection tracking and limiting system is the bane of many production Ceph clusters, and can be insidious in that everything is fine at first. As cluster topology and client workload grow, mysterious and intermittent connection failures and performance glitches manifest, becoming worse over time and at certain times of day. Check syslog history for table fillage events. You can mitigate this bother by raising nf_conntrack_max to a much higher value via sysctl. Be sure to raise nf_conntrack_buckets accordingly to nf_conntrack_max / 4, which may require action outside of sysctl e.g. "echo 131072 > /sys/module/nf_conntrack/parameters/hashsize More interdictive but fussier is to blacklist the associated kernel modules to disable processing altogether. This is fragile in that the modules vary among kernel versions, as does the order in which they must be listed. Even when blacklisted there are situations in which iptables or docker may activate connection tracking anyway, so a “set and forget” strategy for the tunables is advised. On modern systems this will not consume appreciable resources.
Look at: https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd/
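A sketch of that mitigation (the values are only examples, size them for your traffic):

Code:
# Check the current usage against the limit
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# Raise the table limit and set the bucket count to roughly max / 4
sysctl -w net.netfilter.nf_conntrack_max=524288
echo 131072 > /sys/module/nf_conntrack/parameters/hashsize

# Make the limit persistent across reboots
echo 'net.netfilter.nf_conntrack_max = 524288' > /etc/sysctl.d/99-conntrack.conf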
 
No, this started before Ceph was recovering. We had a really strange network issue; one switch of the 2 in the stack was going mad, but afterwards it was impossible to reconstruct what really happened. We did not have the time to experiment or run further analysis while the problem was ongoing. I am just saying that even with 2 enterprise switches and LACP you can get into big trouble with a Proxmox cluster and a Ceph cluster. So we will go for a separate uplink for the Proxmox cluster traffic on separate switches... we never want to experience that again...
 
