[SOLVED] SDN runs into timeouts

pille99

Hello guys,
I changed from a physical network config (10.10.15.10-13), which was connected to the SDN network (for communication between the servers), to a VLAN config with SDN/EVPN on top of it. Since then, the cluster has not been working properly anymore; I get timeouts.
The new IP range for the VLAN is 10.10.16.10-13.

Here are the configs.

/etc/network/interfaces
auto lo
iface lo inet loopback

auto enp41s0
iface enp41s0 inet manual
#1GB UPLINK

auto enp1s0f0
iface enp1s0f0 inet manual
mtu 9000
#10GB SDN

auto enp1s0f1
iface enp1s0f1 inet static
address 10.10.12.10/24
mtu 9000
#10GB CoroSync

auto enp33s0
iface enp33s0 inet manual
mtu 9000
#10GB Ceph

auto vmbr99
iface vmbr99 inet static
address 10.10.10.10/24
bridge-ports enp33s0
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
bridge-vids 2-4094
mtu 9000
#10GB Ceph Public

auto vmbr98 ---------------------------------------------- this is the changed network
iface vmbr98 inet static
address 10.10.16.10/24
bridge-ports enp1s0f0
bridge-stp off
bridge-fd 0
mtu 9000
#10GB SDN Network

auto vmbr0
iface vmbr0 inet static
address publicIP
gateway PublicGW
bridge-ports enp41s0
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
bridge-vids 2-4094
#1GB UPLINK Public

auto vmbr99.11
iface vmbr99.11 inet static
address 10.10.11.10/24
mtu 9000
#10GB Ceph Cluster

datacenter.cfg
crs: ha=static
email_from: email
ha: shutdown_policy=migrate
keyboard: en-gb
migration: network=10.10.16.0/24,type=secure
registered-tags: Customer.Adm4u;Customer.Demo;Customer.Eyonis;Infra.Eyonis;Infra.Service;Linux;Openshift;Private.Cloud;Shared.Cloud;Windows2022
tag-style: shape=full

ceph.conf
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.10.11.10/24
fsid = a8939e3c-7fee-484c-826f-29875927cf43
mon_allow_pool_delete = true
mon_host = 10.10.11.10 10.10.11.11 10.10.11.12
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.10.10.10/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.hvirt01]
public_addr = 10.10.11.10

[mon.hvirt02]
public_addr = 10.10.11.11

[mon.hvirt03]
public_addr = 10.10.11.12

controller.cfg
evpn: eVPN_Win
asn 65000
peers 10.10.16.10,10.10.16.11,10.10.16.12,10.10.16.13


zone.cfg
vxlan: IntLAN ------------------------------- current zone; will be replaced with the EVPN zone
peers 10.10.15.10, 10.10.15.11, 10.10.15.12, 10.10.15.13
ipam pve
mtu 1500
nodes hvirt02,hvirt01,hvirt04,hvirt03

evpn: eZnWin
controller eVPN_Win
vrf-vxlan 98
exitnodes hvirt01,hvirt03
ipam pve
mac CA:D7:29:8C:3F:A5
mtu 1450
nodes hvirt01,hvirt02,hvirt03,hvirt04

vnet.cfg
vnet: vNetWin
zone eZnWin
alias vNetWin
tag 98
vlanaware 1

subnet.cfg
empty



Where is the mistake?
 
syslog is full of:

Mar 15 14:00:11 hvirt04 pvestatd[2017]: status update time (10.214 seconds)
Mar 15 14:00:16 hvirt04 pvestatd[2017]: got timeout
Mar 15 14:00:21 hvirt04 pvestatd[2017]: got timeout
Mar 15 14:00:22 hvirt04 pvestatd[2017]: status update time (10.215 seconds)
Mar 15 14:00:27 hvirt04 pvestatd[2017]: got timeout
Mar 15 14:00:32 hvirt04 pvestatd[2017]: got timeout
Mar 15 14:00:32 hvirt04 pvestatd[2017]: status update time (10.216 seconds)
Mar 15 14:00:37 hvirt04 pvestatd[2017]: got timeout
Mar 15 14:00:42 hvirt04 pvestatd[2017]: got timeout
Mar 15 14:00:42 hvirt04 pvestatd[2017]: status update time (10.214 seconds)
Mar 15 14:00:47 hvirt04 pvestatd[2017]: got timeout
Mar 15 14:00:52 hvirt04 pvestatd[2017]: got timeout

The OSDs can't be loaded.
 
So, it seems related to your Ceph network, not the SDN.


ceph.conf
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.10.11.10/24
fsid = a8939e3c-7fee-484c-826f-29875927cf43
mon_allow_pool_delete = true
mon_host = 10.10.11.10 10.10.11.11 10.10.11.12
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.10.10.10/24

The monitors need to run on the public network, not the cluster network.
The cluster network is only an (optional) separate network for OSD replication. If it is not defined, the OSDs replicate through the public network.

(And you can't change monitor IPs after they are built.)
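For reference, a consistent layout would keep the monitors on the public network. This is only a sketch: the 10.10.10.11/.12 monitor addresses are hypothetical, and it assumes the monitors get destroyed and recreated one by one on the public network, since their IPs can't be edited in place:

[global]
    public_network = 10.10.10.0/24     # monitors, clients, OSD front traffic
    cluster_network = 10.10.11.0/24    # optional, OSD replication only
    mon_host = 10.10.10.10 10.10.10.11 10.10.10.12

[mon.hvirt01]
    public_addr = 10.10.10.10

[mon.hvirt02]
    public_addr = 10.10.10.11

[mon.hvirt03]
    public_addr = 10.10.10.12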
 
No, it wasn't. I made an update of Proxmox; since then, MTU 9000 didn't work anymore. It ran smoothly for the last half a year, then suddenly everything crashed. After changing the MTU to 8972, everything started to work again.
 
Mmmm, VLAN needs an extra 4 bytes to work. Generally, the kernel generates frames of about 9004 bytes if you use a tagged interface.

Also, for VXLAN you need roughly another 50 bytes for the VXLAN encapsulation.

And your physical switch needs to handle it (generally, the jumbo frame limit on a physical switch is a little more than 9000, something like 9216).
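To put numbers on that overhead (a back-of-the-envelope sketch, assuming an IPv4 underlay without extra options):

    inner frame MTU            9000
    + inner Ethernet header     +14
    + VXLAN header               +8
    + outer UDP header           +8
    + outer IPv4 header         +20
    = outer IP packet          9050   -> the underlay must carry >= 9050

So on an underlay capped at 9000 bytes, a VXLAN/EVPN VNet can use at most 9000 - 50 = 8950, while a plain VLAN only adds the 4-byte tag on the wire. A switch limit of 9216 leaves room for all of this.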
 
I don't have access to the switch.
Anyway, with the new MTU value it works again. Can't explain why, to be honest.
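For anyone debugging the same thing without switch access: you can probe what actually fits through the path using ping with the don't-fragment flag set. A sketch only; the target IP is just one of the peers from this thread:

    # 8972-byte ICMP payload + 8 (ICMP header) + 20 (IPv4 header) = 9000-byte packet
    ping -M do -s 8972 10.10.16.11

If that fails while a smaller payload (e.g. -s 8922) gets through, the path is losing roughly 50 bytes of headroom somewhere, which matches the VXLAN overhead above.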
 
