[SOLVED] New Cluster Reboot Problems: NTP, Adding a new Node

LBX_Blackjack

Member
Jul 24, 2020
Hello all,
[Background] We recently set up our first Proxmox cluster with four new HP ProLiant DL360 Gen9 servers and an HP 2530-48G switch (J9775A). The servers' NICs are configured as LACP bonds to the switch, and we have set up CEPH and HA in Proxmox.

[Issue 1] We were getting random reboots of individual nodes or the whole cluster, but we put a band-aid on it by creating a job that updates the time/date from a public NTP server every minute. Clearly this isn't ideal, so we were hoping for some advice on how to properly fix this issue.

[Issue 2] The second issue we're having is that we're trying to add a fifth server (on a second switch) into the cluster and CEPH pool, but it's causing nodes to spontaneously reboot.

Any leads would be greatly appreciated.
 
These problems are most likely due to congestion/saturation of your network. LACP can only balance sessions across links according to the hashing policy configured on the nodes AND the switch; any mismatch produces out-of-order packets and therefore re-transmissions and/or added latency.
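If you want to confirm what each node is actually doing, the kernel exposes the bonding mode and transmit hash policy; a quick check (assuming the Ceph-facing bond is called bond1 -- adjust to your naming):

Code:
# On each PVE node: show bonding mode, LACP aggregator info and hash policy
grep -iE 'mode|hash|aggregator' /proc/net/bonding/bond1

Whatever that reports has to line up with the trunk/LACP configuration on the 2530 side as well.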

I have built a similar cluster on like hardware for testing/proof of concept, but you MUST separate networks by function -- VLANs are not sufficient. A rough per-node layout is sketched after the list below:

1. Front-end Network -- vmbr0; node mgmt GUI/SSH; VM networking (with VLANs, etc.) -- can be an active-backup or LACP bond
2. PVE Cluster (Corosync/Kronosnet) -- dedicated -- nothing else runs on this network -- 1G is sufficient [0]
3. Ceph Public/Cluster -- separate network for CEPH -- can be LACP with layer3+4 hashing on BOTH switch and nodes. [1], [2]

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
[1] https://docs.ceph.com/docs/master/rados/configuration/network-config-ref/
[2] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_precondition
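For reference, here is a rough /etc/network/interfaces layout for one node along those lines. This is only a sketch -- the NIC names (eno1..eno5), bond names and addresses are placeholders, not anything from your actual setup:

Code:
# /etc/network/interfaces (sketch) -- adjust NIC names, bonds and subnets
auto lo
iface lo inet loopback

# 1. Front-end bond: node mgmt + VM bridge
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet static
    address 192.168.10.11/24
    gateway 192.168.10.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes

# 2. Corosync/Kronosnet: one dedicated 1G NIC, nothing else on this subnet
auto eno3
iface eno3 inet static
    address 10.10.20.11/24

# 3. Ceph public/cluster: LACP bond with layer3+4 hashing (switch must match)
auto bond1
iface bond1 inet static
    address 10.10.30.11/24
    bond-slaves eno4 eno5
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4

The exact addressing obviously depends on your environment; the point is that Corosync never shares a wire with Ceph or VM traffic.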

With some tweaking of scrubbing, logging, etc., CEPH will perform minimally over 1G links/bonds, but keep in mind that replicating 1TB of data over 1Gbps takes about 3 HOURS (1TB is roughly 8,000Gb, so even at full line rate that is over two hours before any protocol overhead).

Also, systemd-timesyncd never performed well enough for me -- I replaced it with Chrony and pointed all nodes at the same upstream NTP server, with the nodes peering with each other as fallback (so loss of the upstream causes the nodes to drift TOGETHER rather than apart).
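If it helps, this is roughly what the Chrony side looks like on each node -- the upstream server and peer addresses below are placeholders for your own NTP source and the other nodes' cluster addresses:

Code:
# apt install chrony   (on PVE/Debian this takes over from systemd-timesyncd)
# /etc/chrony/chrony.conf (sketch)
server ntp.example.com iburst     # the SAME upstream server on every node
peer 10.10.20.12                  # the other nodes as peers, so if the
peer 10.10.20.13                  # upstream disappears the cluster
peer 10.10.20.14                  # drifts together instead of apart
driftfile /var/lib/chrony/chrony.drift
makestep 1.0 3                    # step the clock on the first few large offsets

Running "chronyc sources -v" on each node shows whether it is tracking the upstream or has fallen back to a peer.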

As I said, this will work for a PoC -- but you absolutely MUST upgrade the CEPH NICs and switch to 10G before going to production, or performance will degrade badly as you add load/VMs to the cluster!
 
Well that makes a lot of sense. Our network unfortunately can't support those requirements, so we will have to do without CEPH and HA. Thank you very much!