Sorry, but without an error message I can't help you ¯\_(ツ)_/¯
Try poking around in /var/log/ceph and look at the log files, or try pinging the other nodes on the Ceph public network, or check whether the interface came up ... and so on.
Do you have any error message?
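In the meantime, something like this gets you the basics quickly (just a sketch; the interface name and IP are placeholders you have to adapt):

# recent errors in the Ceph logs
grep -iE 'error|fail' /var/log/ceph/*.log | tail -n 50

# is the interface on the Ceph public network up?
ip -br link show eno1        # replace eno1 with your interface

# are the other nodes reachable on the Ceph public network?
ping -c 3 10.10.10.2         # replace with another node's Ceph public IP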
We have no problems with the same kernel and mlx cards either:
root@prox2:~# dmesg |grep -e mlx
[ 3.724951] mlx5_core 0000:af:00.0: firmware version: 14.23.1020
[ 3.724981] mlx5_core 0000:af:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
We made some adjustments (mainly setting LimitNOFILE to a higher value) and observed the following behaviour:
Everything works well until we reach about 220 guests per node (~20-25 FDs used); then the Prometheus node_exporter in every running guest produces too much "noise" (scraped every 120...
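For reference, the LimitNOFILE bump mentioned above can be applied as a systemd drop-in; this is only a sketch, and the service name (lxcfs.service) and the value are assumptions to adapt to your own setup:

# /etc/systemd/system/lxcfs.service.d/limits.conf
[Service]
LimitNOFILE=1048576

# then reload and restart:
systemctl daemon-reload
systemctl restart lxcfs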
We just upgraded to the 6.2 release with LXC 4.0, and after running about 250 containers on each node we now get the following error (previously: https://forum.proxmox.com/threads/lxcfs-br0ke-cgroup-limit.69015/#post-309442):
root@lxc-prox4:~# grep -A 5 -B 5 lxcfs /var/log/messages...
We're currently running a four-node cluster with about 250 LXC containers on each node (evenly distributed). Primary storage for almost all containers (except 4) is on the integrated Ceph within Proxmox.
Linux 5.3.13-1-pve #1 SMP PVE 5.3.13-1 (Thu, 05 Dec 2019...
We implemented it like this (pve-guests still takes care of all containers as well, but we don't mind ;) ):
Description=PVE startup booster
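A minimal sketch of such a booster (the unit name, ordering, parallelism of 8 and script path are just examples to adapt; pct list / pct start are the stock Proxmox CLI commands):

# /etc/systemd/system/pve-startup-booster.service
[Unit]
Description=PVE startup booster
After=pve-cluster.service network-online.target
Before=pve-guests.service
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/start-containers-parallel.sh

[Install]
WantedBy=multi-user.target

# /usr/local/bin/start-containers-parallel.sh
#!/bin/bash
# start every stopped container on this node, 8 at a time
pct list | awk 'NR > 1 && $2 == "stopped" { print $1 }' \
    | xargs -r -n 1 -P 8 pct start

Enable it with systemctl enable pve-startup-booster.service. With xargs -P the containers come up in batches instead of strictly one after the other, and pve-guests later just finds most of them already running.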
Having >200 LXC containers on a server, it takes quite some time to bring them all up after a node reboot, especially when they are started serially.
Is there a way to use parallel starts to bring up the containers faster?