i just got an experimental firmware release from mellanox for this specific card - and the "link down" issue is now gone.
The startup of the vmbr with VLANs is still somewhat sluggish, as it still takes about a minute to bring up the NIC (before it was about 3 minutes) - is this...
i'd appreciate it if you could send me the scripts.
I have a bunch of other servers, all with those standard PCIe ConnectX-4 Lx EN cards running with PVE, but none uses a VLAN-aware bridge - so i never bothered ;)
Thanks for your help :cool:
just set up a cluster with a weird behaviour:
it takes over 5 minutes to bring up the bridge after clicking "Apply Configuration"
iface lo inet loopback
iface enp65s0f0 inet manual
iface enp65s0f1 inet manual
iface enp65s0f2 inet manual
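The bridge stanza itself isn't quoted above; for reference, a minimal VLAN-aware bridge on top of one of these ports looks roughly like the following (ifupdown2 syntax; the address, gateway and port are examples, not taken from the thread):

```
auto vmbr0
iface vmbr0 inet static
    address 192.0.2.10/24
    gateway 192.0.2.1
    bridge-ports enp65s0f0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
```

Note that a wide `bridge-vids` range like 2-4094 means the bridge has to program thousands of VLANs on the port at ifup, which can be one reason bring-up takes minutes with some NICs; narrowing it to the VLANs actually in use is worth trying.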
Sorry, but without an error message i can't help you ¯\_(ツ)_/¯
try poking around in /var/log/ceph and look at the logfiles, or try pinging nodes on the ceph public network, or check if the interface came up at all .. or or or
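Those checks can be sketched as a few shell commands. Interface names and the IP here are examples, not from the thread - adjust them to your setup:

```shell
#!/bin/sh
# Print the kernel's view of an interface's state ("up"/"down"/"unknown"),
# or a note if the interface doesn't exist at all.
iface_state() {
    if [ -r "/sys/class/net/$1/operstate" ]; then
        echo "$1: $(cat "/sys/class/net/$1/operstate")"
    else
        echo "$1: no such interface"
    fi
}

iface_state vmbr0        # the bridge itself
iface_state enp65s0f0    # its physical port

# Can we reach another node on the ceph public network? (IP is an example)
ping -c 1 -W 2 10.10.10.2 >/dev/null 2>&1 \
    && echo "ceph net: reachable" \
    || echo "ceph net: unreachable"

# Anything recent in the ceph logs?
tail -n 20 /var/log/ceph/*.log 2>/dev/null
```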
Do you have any error message?
We have no problems with the same kernel and also mlx cards:
root@prox2:~# dmesg |grep -e mlx
[ 3.724951] mlx5_core 0000:af:00.0: firmware version: 14.23.1020
[ 3.724981] mlx5_core 0000:af:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
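To compare firmware versions across nodes you can also query the driver directly with `ethtool -i <iface>` (the `firmware-version` field), or pull it out of dmesg. A small sketch of the latter, using the sample line from above:

```shell
#!/bin/sh
# Extract the firmware version from a mlx5_core dmesg line.
# The sample line is copied from the post; on a live system you would
# pipe `dmesg` into fw_version instead of echoing a sample.
line='[    3.724951] mlx5_core 0000:af:00.0: firmware version: 14.23.1020'

fw_version() {
    sed -n 's/.*firmware version: \([0-9.]*\).*/\1/p'
}

echo "$line" | fw_version    # -> 14.23.1020
```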
We did some adjustments (mainly LimitNOFILE set to a higher value) and observed the following behaviour:
Everything works well until we run about 220 guests per node (~20-25 FDs used) - then the prometheus node_exporter in every running guest produces too much "noise" (scraped every 120...
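For reference, raising LimitNOFILE for a single service is done with a systemd drop-in; the unit name and the value below are examples (adjust to whatever service is hitting the limit), not taken from the thread:

```
# /etc/systemd/system/lxcfs.service.d/override.conf  (path is an example)
[Service]
# Raise the open-file-descriptor limit; 1048576 is an arbitrary high value.
LimitNOFILE=1048576
```

After adding the drop-in, run `systemctl daemon-reload` and restart the service so the new limit takes effect.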
we just upgraded to the 6.2 release with LXC 4.0 and, after running about 250 containers on each node, we now get the following error (previously: https://forum.proxmox.com/threads/lxcfs-br0ke-cgroup-limit.69015/#post-309442)
root@lxc-prox4:~# grep -A 5 -B 5 lxcfs /var/log/messages...