Help inquiry... I don't know what else to try to get a small cluster working at my home lab

uktwfhpd · New Member · Sep 17, 2023

So I have two home labs: one at my parents' house and one at my house. Some months ago, when I first built a cluster of two Proxmox nodes plus a QDevice, everything worked with just a few clicks for the two PVE nodes and a couple of commands for the QDevice.

A few weeks ago I wanted to build a similar cluster at my parents' house, so that the other "site" could also drop VMware (free ESXi) and run a two-node Proxmox cluster as well.

Everything looks normal up until I try to join the second node to the cluster... the join command always hangs, and no matter what I try I cannot get the cluster established.

I am copying my own home lab's setup, with separate VLANs and subnets for Corosync traffic. That works fine at my place but not at my parents': Corosync retransmits non-stop, and the contents of /etc/pve/ never converge between the two nodes of the "half-established" cluster.

After consulting some LLMs (o1-preview and o1-mini), I put together the following report describing what I have tried so far.

Thank you in advance for your help!




Proxmox Cluster Health and Troubleshooting

1. Introduction

This report outlines the diagnostic steps and configurations reviewed to address Corosync retransmission issues in a Proxmox VE cluster with two nodes: grskg01pve01 and grskg01pve02.

2. Pre-Cluster Joining Health Checks

2.1. System Information Verification

  • Proxmox VE Version: Both nodes run the same version.
  • Hardware Specifications: Verified network interfaces and bonding capabilities.

2.2. Network Configuration Review

  • Bonding Configuration:
    • grskg01pve01: Bonded interfaces enp1s0f0 and enp1s0f1 as bond0 using LACP.
    • grskg01pve02: Bonded interfaces enp3s0f0 and enp3s0f1 as bond0 using LACP.
  • Switch (grskg01sw02):
    • Port-channels configured for the corresponding Proxmox bonded interfaces.
    • Port-channel5 and Port-channel6 connected to the nodes as PVE-TRUNK.
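For reference, the bond on grskg01pve01 was configured along these lines in /etc/network/interfaces (a sketch based on the interface names above; the bridge name vmbr0, the bond options, and the management address are assumptions, not taken from the report):

```
# Sketch only — bond options and addresses are illustrative
auto bond0
iface bond0 inet manual
    bond-slaves enp1s0f0 enp1s0f1
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet static
    address 192.168.2.11/24
    gateway 192.168.2.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
```

The bond-xmit-hash-policy must match the load-balancing method on the switch's port-channel for LACP to behave predictably.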

2.3. VLAN Configuration

  • VLANs Defined:
    • VLAN 187: Dedicated for Corosync traffic.
    • VLAN 188: Dedicated for NFS traffic.
    • VLAN 2: Management VLAN.
    • Additional VLANs for specific services.
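On the node side, the dedicated Corosync VLAN can be realized as a tagged sub-interface of the bond; a sketch (the host address within 192.168.187.0/29 is an assumption):

```
# Sketch only — Corosync link on VLAN 187, no gateway needed on this subnet
auto bond0.187
iface bond0.187 inet static
    address 192.168.187.1/29
```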

2.4. Firewall Settings

  • Proxmox Nodes: iptables policies set to ACCEPT; no blocking rules.
  • Switch (grskg01sw02): ACL for VLAN 187 permits all IP traffic from 192.168.187.0/29.
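The VLAN 187 ACL on the switch would look something like this in Cisco-style syntax (illustrative only; the ACL name and the assumption that the switch uses IOS-like configuration are mine — the report only states that the ACL permits the /29):

```
! Sketch only — 0.0.0.7 is the wildcard mask for a /29
ip access-list extended VLAN187-COROSYNC
 permit ip 192.168.187.0 0.0.0.7 any
!
interface Vlan187
 ip access-group VLAN187-COROSYNC in
```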

2.5. Time Synchronization

  • Both nodes and the switch synchronize time using NTP servers to prevent time drift issues.

2.6. Corosync Configuration Consistency

  • Verified identical corosync.conf files on both nodes.
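For comparison, a healthy two-node corosync.conf as generated by pvecm looks roughly like this (the ring0 addresses on the VLAN 187 subnet and the cluster name are assumptions):

```
nodelist {
  node {
    name: grskg01pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.187.1
  }
  node {
    name: grskg01pve02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.187.2
  }
}

totem {
  cluster_name: grskg01
  config_version: 2
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}
```

Any difference between the two copies — including config_version — can leave the nodes unable to form a membership.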

2.7. Basic Network Connectivity Tests

  • Ping Tests: 0% packet loss with ~0.2 ms latency between nodes.
  • MTR Tests: 0.0% packet loss with stable latency.
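Plain pings succeed, but a worthwhile extra check is a do-not-fragment ping at the full MTU, which catches path-MTU mismatches that small pings miss. A minimal sketch (the peer address 192.168.187.2 is an assumption; the script prints the command rather than running it, so the payload arithmetic is visible):

```shell
# For MTU 1500: max ICMP payload = 1500 - 20 (IP header) - 8 (ICMP header) = 1472
MTU=1500
PAYLOAD=$((MTU - 28))
# -M do sets the don't-fragment bit; a 1472-byte payload must pass unfragmented
echo "ping -M do -s ${PAYLOAD} -c 3 192.168.187.2"
```

If this ping fails while small pings succeed, some hop on the VLAN is fragmenting or dropping full-size frames — exactly the kind of fault that shows up as Corosync retransmits.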

3. Post-Cluster Joining Checks and Troubleshooting

3.1. Corosync and Proxmox Cluster Service Status

  • Services active and running.
  • Corosync logs displayed recurring retransmission lists.

3.2. Detailed Network Interface Examination

  • Proxmox Nodes:
    • Bonded interfaces operational at 1000 Mbps, Full duplex.
    • No link failures detected.
  • Switch (grskg01sw02):
    • Port-channels correctly assigned.
    • LACP active with no partner churn states.

3.3. MTU and Duplex Settings Verification

  • All interfaces set to MTU 1500.
  • Physical interfaces confirmed as Full Duplex.

3.4. Firewall and Security Settings Review

  • No active firewall rules blocking traffic; default policies set to ACCEPT.

3.5. Corosync Configuration Assessment

  • Consistent corosync.conf files on both nodes.
  • Transport mode initially knet; changed to udpu to address retransmission issues (against Proxmox documentation).

3.6. Switch Configuration Review

  • STP Mode: PVST with point-to-point link-type for LACP.
  • QoS Policies: Prioritized Corosync traffic via DSCP settings.

3.7. Log Examinations

  • Corosync Logs: Recurring retransmission lists indicating communication issues.
  • Proxmox Cluster Logs: Notices about data synchronization and quorum status.

4. Configuration Changes and Actions Taken

4.1. Switch Configuration Adjustments

  • No immediate changes; initial assessments indicated correct settings.

4.2. Corosync Transport Mode Modification

  • Changed transport from knet to udpu to potentially resolve retransmission issues.
  • Important: This change is against Proxmox documentation and may lead to instability.
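For the record, the change amounts to adding a transport line to the totem section of /etc/pve/corosync.conf and incrementing config_version (shown only to document what was done — knet is the supported default and this was later reverted):

```
totem {
  cluster_name: grskg01
  config_version: 3
  transport: udpu
  # remaining totem options unchanged
}
```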

4.3. Persistent MTU and Duplex Settings

  • Ensured consistent settings via /etc/network/interfaces and ethtool commands.

5. Observations and Findings

5.1. Post-Configuration Change Outcomes

  • Corosync Transport Switch:
    • Awaiting confirmation from logs to determine impact on retransmission issues.
    • Noted that changing transport mode manually is not recommended.

5.2. Network Health Post-Troubleshooting

  • Ping and MTR Tests: Remain healthy with 0% packet loss.
  • Bonding and Interface Status: All bonded interfaces operational.

5.3. Log Examinations

  • Corosync Logs: Monitoring for any changes post-configuration.
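The pattern being watched for is the TOTEM retransmit list. A minimal sketch of the check against a sample log line (in practice the input would come from `journalctl -u corosync`; the sample line below is fabricated for illustration):

```shell
# Real usage: journalctl -u corosync --since "1 hour ago" | grep -c 'Retransmit List'
sample='Oct 05 10:00:00 grskg01pve01 corosync[1234]: [TOTEM ] Retransmit List: 2a 2b 2c'
hits=$(printf '%s\n' "$sample" | grep -c 'Retransmit List')
echo "$hits"
```

A count that keeps climbing after a configuration change means the retransmission problem is still present.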

6. Recommendations and Next Steps

6.1. Immediate Actions

  • Revert Corosync Transport Mode to knet:
    • Since changing to udpu is against Proxmox guidelines, it's recommended to revert.
    • Consult Proxmox Support before making such changes.

6.2. Continued Monitoring

  • Monitor Corosync logs and cluster status regularly.

6.3. Validate Switch Port Configurations

  • Ensure all switch ports connected to Proxmox nodes are consistently configured for LACP.

6.4. Implement Continuous Monitoring

  • Set up tools like Prometheus, Grafana, or Zabbix to monitor network performance and Corosync health.

6.5. Review and Update Firmware and Software

  • Ensure all network hardware and Proxmox nodes are running the latest firmware and updates.

6.6. Conduct Physical Layer Checks

  • Inspect and test physical connections; replace network cables if necessary.

6.7. Consult Proxmox Support

  • Share the technical report and seek expert guidance on resolving Corosync retransmission issues within supported configurations.

7. Summary and Conclusions

The health checks and troubleshooting above addressed the likely causes of the Corosync retransmission issues. Network configurations were verified to be consistent across both nodes and the switch, and Corosync's transport mode was cautiously adjusted as an experiment. However, since changing the transport mode goes against Proxmox guidelines, it is recommended to revert the change and consult support.
 
OK, found it... misconfigured QoS on the switch was mangling Corosync traffic instead of prioritizing it (facepalm).
 
