[SOLVED] IMPOSSIBLE to join node after cluster is created

CieNTi

Hello there, I hope to find some kind of solution because this is driving me crazy after ~1 week of different tests.

I have two servers at Hetzner with public IPs, and also two VLANs. I'm posting the latest config I have, but I tested a lot of other combinations (like only interfaces, only bridges, ...).

Code:
auto enp4s0
iface enp4s0 inet static
  address [PUBLIC IP]/32
  gateway [GATEWAY]
  pointtopoint [GATEWAY]
  hwaddress [MAC]
  pre-up    ebtables -t nat -A POSTROUTING -j snat --to-src [MAC] -o enp4s0
  post-down ebtables -t nat -D POSTROUTING -j snat --to-src [MAC] -o enp4s0
# Main interface (Public IP) - IPv4

auto enp4s0.4025
iface enp4s0.4025 inet static
  address 10.250.10.1/17
  mtu 1400
  post-up  ip route add 10.250.128.0/17 via 10.250.0.1 dev enp4s0.4025
  pre-down ip route del 10.250.128.0/17 via 10.250.0.1 dev enp4s0.4025
  post-up  iptables -t nat -A POSTROUTING -j SNAT -s 10.200.0.0/16 -o enp4s0.4025 --to 10.250.10.1
  pre-down iptables -t nat -D POSTROUTING -j SNAT -s 10.200.0.0/16 -o enp4s0.4025 --to 10.250.10.1
# Management interface (VLAN 4025)

auto enp4s0.4020
iface enp4s0.4020 inet manual
  mtu 1400
# Servers/VMs interface (VLAN 4020)

auto vmbr0
iface vmbr0 inet static
  address 10.200.10.1/17
  bridge-ports enp4s0.4020
  bridge_waitport 0
  bridge-stp off
  bridge-fd 0
  mtu 1400
  up   ip route add 10.200.128.0/17 via 10.200.0.1 dev vmbr0
  down ip route del 10.200.128.0/17 via 10.200.0.1 dev vmbr0
  post-up   iptables -t nat -A POSTROUTING -j SNAT -s 10.200.0.0/16 -o enp4s0 --to [PUBLIC IP]
  post-down iptables -t nat -D POSTROUTING -j SNAT -s 10.200.0.0/16 -o enp4s0 --to [PUBLIC IP]
# Servers/VMs bridge (VLAN 4020)

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: root1
    nodeid: 1
    quorum_votes: 2
    ring0_addr: 10.250.10.1
  }
  node {
    name: root2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.250.10.2
  }
}

quorum {
  expected_votes: 2
  provider: corosync_votequorum
  two_nodes: 2
}

totem {
  cluster_name: RooT
  config_version: 6
  crypto_cipher: none
  crypto_hash: none
  interface {
    linknumber: 0
  }
  ip_version: ipv4
  link_mode: passive
  secauth: on
  transport: udpu
  netmtu: 1400
  version: 2
}

Code:
127.0.0.1 localhost.localdomain localhost
10.250.10.1 root1.[DOMAIN] root1
10.250.10.2 root2.[DOMAIN] root2

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

With this config I have full connectivity between cloud machines (no Proxmox), Robot machines (Proxmox) and VMs within the Robot machines.

The firewall is currently disabled at both cluster and host level, but I also tried enabling it with defaults and enabling it with custom rules. ALL FAILED

I have already tried, like in the 1000 other threads, videos, posts and comments, the simplest "GUI, create + join, super easy, see?" approach. ALL FAILED

Somewhere out there I found that Hetzner may have issues related to multicast, and the golden unicorn in these cases is supposedly to use "udpu" so that unicast is used. ALL FAILED

I also found that maybe "two_nodes" or "netmtu" can help. ALL FAILED

I tried to join via public IP, 200 range and 250 range. ALL FAILED

I tested via the GUI, via default pvecm commands and via customized pvecm (-link0, -nodeid). ALL FAILED

In the config I'm posting, more weight is also given to the first node (2 votes), so it can fake a quorum by itself. ALL FAILED

IT NEVER EVER EVER EVER WORKED, EVERY SINGLE TIME IT FAILS AS FOLLOWS:

If joining via GUI:

- On the root1 interface: root2 is shown, but always with a red X. Nothing can be used except the console.
- On the root2 interface: some log output is shown during the initial process, and after that an "invalid ticket 401" error appears.
- Depending on my modifications, the GUI is either just broken (login always fails) or not reachable at all.

If joining via command:

Code:
root@root2:~# pvecm add root1 -link0 10.250.10.2 -nodeid 2
Please enter superuser (root) password for 'root1': *************
Establishing API connection with host 'root1'
The authenticity of host 'root1' can't be established.
X509 SHA256 key fingerprint is 3E:CB:58:13:59:67:B6:1D:AF:DF:F1:65:48:EC:03:24:7A:01:A6:54:E5:A3:50:00:2A:2C:D8:8A:45:16:14:08.
Are you sure you want to continue connecting (yes/no)? yes
Login succeeded.
check cluster join API version
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1662960617.sql.gz'
waiting for quorum...

(Screenshots attached: 2022-09-12_08-36.png, 2022-09-12_08-38.png, 2022-09-12_08-38_1.png)

And the fact that, as all the documentation states, the server is kind of broken once the attempt is done does not help at all. Server 1 stays reachable, but server 2 becomes half-dead.

Code:
Sep 12 08:11:40 root1 corosync[1616]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
Sep 12 08:11:40 root1 corosync[1616]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Sep 12 08:11:40 root1 corosync[1616]:   [TOTEM ] Initializing transport (UDP/IP Unicast).
Sep 12 08:11:40 root1 corosync[1616]:   [TOTEM ] The network interface [10.250.10.1] is now up.
Sep 12 08:11:40 root1 corosync[1616]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Sep 12 08:11:40 root1 corosync[1616]:   [QB    ] server name: cmap
Sep 12 08:11:40 root1 corosync[1616]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Sep 12 08:11:40 root1 corosync[1616]:   [QB    ] server name: cfg
Sep 12 08:11:40 root1 corosync[1616]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 12 08:11:40 root1 corosync[1616]:   [QB    ] server name: cpg
Sep 12 08:11:40 root1 corosync[1616]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Sep 12 08:11:40 root1 corosync[1616]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Sep 12 08:11:40 root1 corosync[1616]:   [WD    ] Watchdog not enabled by configuration
Sep 12 08:11:40 root1 corosync[1616]:   [WD    ] resource load_15min missing a recovery key.
Sep 12 08:11:40 root1 corosync[1616]:   [WD    ] resource memory_used missing a recovery key.
Sep 12 08:11:40 root1 corosync[1616]:   [WD    ] no resources configured.
Sep 12 08:11:40 root1 corosync[1616]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Sep 12 08:11:40 root1 corosync[1616]:   [QUORUM] Using quorum provider corosync_votequorum
Sep 12 08:11:40 root1 corosync[1616]:   [QUORUM] This node is within the primary component and will provide service.
Sep 12 08:11:40 root1 corosync[1616]:   [QUORUM] Members[0]:
Sep 12 08:11:40 root1 corosync[1616]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Sep 12 08:11:40 root1 corosync[1616]:   [QB    ] server name: votequorum
Sep 12 08:11:40 root1 corosync[1616]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Sep 12 08:11:40 root1 corosync[1616]:   [QB    ] server name: quorum
Sep 12 08:11:40 root1 corosync[1616]:   [TOTEM ] Configuring link 0
Sep 12 08:11:40 root1 corosync[1616]:   [TOTEM ] adding new UDPU member {10.250.10.1}
Sep 12 08:11:40 root1 corosync[1616]:   [TOTEM ] adding new UDPU member {10.250.10.2}
Sep 12 08:11:40 root1 corosync[1616]:   [QUORUM] Sync members[1]: 1
Sep 12 08:11:40 root1 corosync[1616]:   [QUORUM] Sync joined[1]: 1
Sep 12 08:11:40 root1 corosync[1616]:   [TOTEM ] A new membership (1.23) was formed. Members joined: 1
Sep 12 08:11:40 root1 corosync[1616]:   [QUORUM] Members[1]: 1
Sep 12 08:11:40 root1 corosync[1616]:   [MAIN  ] Completed service synchronization, ready to provide service.

Code:
Sep 12 08:14:59 root2 corosync[1570]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
Sep 12 08:14:59 root2 corosync[1570]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Sep 12 08:14:59 root2 corosync[1570]:   [TOTEM ] Initializing transport (UDP/IP Unicast).
Sep 12 08:14:59 root2 corosync[1570]:   [TOTEM ] The network interface [10.250.10.2] is now up.
Sep 12 08:14:59 root2 corosync[1570]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Sep 12 08:14:59 root2 corosync[1570]:   [QB    ] server name: cmap
Sep 12 08:14:59 root2 corosync[1570]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Sep 12 08:14:59 root2 corosync[1570]:   [QB    ] server name: cfg
Sep 12 08:14:59 root2 corosync[1570]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 12 08:14:59 root2 corosync[1570]:   [QB    ] server name: cpg
Sep 12 08:14:59 root2 corosync[1570]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Sep 12 08:14:59 root2 corosync[1570]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Sep 12 08:14:59 root2 corosync[1570]:   [WD    ] Watchdog not enabled by configuration
Sep 12 08:14:59 root2 corosync[1570]:   [WD    ] resource load_15min missing a recovery key.
Sep 12 08:14:59 root2 corosync[1570]:   [WD    ] resource memory_used missing a recovery key.
Sep 12 08:14:59 root2 corosync[1570]:   [WD    ] no resources configured.
Sep 12 08:14:59 root2 corosync[1570]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Sep 12 08:14:59 root2 corosync[1570]:   [QUORUM] Using quorum provider corosync_votequorum
Sep 12 08:14:59 root2 corosync[1570]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Sep 12 08:14:59 root2 corosync[1570]:   [QB    ] server name: votequorum
Sep 12 08:14:59 root2 corosync[1570]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Sep 12 08:14:59 root2 corosync[1570]:   [QB    ] server name: quorum
Sep 12 08:14:59 root2 corosync[1570]:   [TOTEM ] Configuring link 0
Sep 12 08:14:59 root2 corosync[1570]:   [TOTEM ] adding new UDPU member {10.250.10.1}
Sep 12 08:14:59 root2 corosync[1570]:   [TOTEM ] adding new UDPU member {10.250.10.2}
Sep 12 08:14:59 root2 corosync[1570]:   [QUORUM] Sync members[1]: 2
Sep 12 08:14:59 root2 corosync[1570]:   [QUORUM] Sync joined[1]: 2
Sep 12 08:14:59 root2 corosync[1570]:   [TOTEM ] A new membership (2.2d) was formed. Members joined: 2
Sep 12 08:14:59 root2 corosync[1570]:   [QUORUM] Members[1]: 2
Sep 12 08:14:59 root2 corosync[1570]:   [MAIN  ] Completed service synchronization, ready to provide service.

Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.53-1-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-5.15: 7.2-10
pve-kernel-helper: 7.2-10
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-8
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.5-1
proxmox-backup-file-restore: 2.2.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1

What am I missing!?

Thanks in advance,
CieNTi
 
You can ping between the hosts right? Did you check if the configured MTU is actually correct?

That could be one reason. But there might be other reasons and Hetzner AFAIK does have some peculiar network setups that are not encountered often.

You can try to test it using ping with the following extra parameters: -M do -s <size>.

The -M do will not fragment the packet, and the size needs to be 28 bytes smaller than the intended MTU to account for the IP and ICMP overhead.

So if the working MTU is 1500, you should be able to get a response with ping <ip> -M do -s 1472. If you do not get a response, lower the size. Once you get a response, the MTU you need to configure is the size used + 28.
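
For reference, a minimal sketch against the 10.250.10.x management addresses from your corosync.conf (1372 = 1400 - 28; adjust the target IP to whichever link you actually want to test):

Code:
# should get replies if the path MTU towards the peer really is 1400 (1372 + 28)
ping 10.250.10.2 -M do -s 1372
# should fail with "message too long" (or get no reply) if anything in the path is smaller
ping 10.250.10.2 -M do -s 1400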
 
there is no need to use udpu in PVE 7.x, the default 'knet' transport already uses unicast and UDP. furthermore, udpu doesn't support encryption or authentication, it's only there to support legacy setups! additionally, knet will also give you much more info about what's going on.

please give the full output of journalctl -u pve-cluster -u corosync from both nodes covering the period of attempting to join the cluster.
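
Something like the following should cover it, assuming you adjust the --since timestamp to just before the join attempt (the output path is only an example):

Code:
journalctl -u pve-cluster -u corosync --since "2022-09-15 10:55" > /root/join-$(hostname).log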

but please be aware that corosync requires a fast (as in, low-latency), stable link, ideally redundant. users have reported issues with attempts to run a cluster over public links at hosting providers in the past.
 
Hi @aaron, thanks for your reply

About MTU and ping:

If you look at `/etc/network/interfaces` in my first post, you can see that the main interface has no MTU set (so it defaults to 1500) and all the VLAN-related interfaces are set to 1400.

I can ping between both servers via all three interfaces, and connectivity between them has low latency.

The following spoilers should confirm both topics:

Code:
root@root1:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether [enp4s0_MAC] brd ff:ff:ff:ff:ff:ff
    inet [root1_PUB_IP]/32 scope global enp4s0
       valid_lft forever preferred_lft forever
    inet6 [enp4s0_MAC_as_IPv6]/64 scope link
       valid_lft forever preferred_lft forever
3: enp4s0.4025@enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default qlen 1000
    link/ether [enp4s0_MAC] brd ff:ff:ff:ff:ff:ff
    inet 10.250.10.1/17 scope global enp4s0.4025
       valid_lft forever preferred_lft forever
    inet6 [enp4s0_MAC_as_IPv6]/64 scope link
       valid_lft forever preferred_lft forever
4: enp4s0.4020@enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether [enp4s0_MAC] brd ff:ff:ff:ff:ff:ff
5: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default qlen 1000
    link/ether [enp4s0_MAC] brd ff:ff:ff:ff:ff:ff
    inet 10.200.10.1/17 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 [enp4s0_MAC_as_IPv6]/64 scope link
       valid_lft forever preferred_lft forever
root@root1:~# ping 10.200.10.2
PING 10.200.10.2 (10.200.10.2) 56(84) bytes of data.
64 bytes from 10.200.10.2: icmp_seq=1 ttl=64 time=0.388 ms
64 bytes from 10.200.10.2: icmp_seq=2 ttl=64 time=0.352 ms
64 bytes from 10.200.10.2: icmp_seq=3 ttl=64 time=0.386 ms
64 bytes from 10.200.10.2: icmp_seq=4 ttl=64 time=0.375 ms
^C
--- 10.200.10.2 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3061ms
rtt min/avg/max/mdev = 0.352/0.375/0.388/0.014 ms
root@root1:~# ping 10.250.10.2
PING 10.250.10.2 (10.250.10.2) 56(84) bytes of data.
64 bytes from 10.250.10.2: icmp_seq=1 ttl=64 time=0.422 ms
64 bytes from 10.250.10.2: icmp_seq=2 ttl=64 time=0.382 ms
64 bytes from 10.250.10.2: icmp_seq=3 ttl=64 time=0.400 ms
64 bytes from 10.250.10.2: icmp_seq=4 ttl=64 time=0.342 ms
^C
--- 10.250.10.2 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3059ms
rtt min/avg/max/mdev = 0.342/0.386/0.422/0.029 ms
root@root1:~# ping [root2_PUB_IP]
PING [root2_PUB_IP] ([root2_PUB_IP]) 56(84) bytes of data.
64 bytes from [root2_PUB_IP]: icmp_seq=1 ttl=61 time=0.370 ms
64 bytes from [root2_PUB_IP]: icmp_seq=2 ttl=61 time=0.338 ms
64 bytes from [root2_PUB_IP]: icmp_seq=3 ttl=61 time=0.333 ms
64 bytes from [root2_PUB_IP]: icmp_seq=4 ttl=61 time=0.316 ms
^C
--- [root2_PUB_IP] ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3066ms
rtt min/avg/max/mdev = 0.316/0.339/0.370/0.019 ms
root@root1:~#
Code:
root@root2:/etc/network# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether [enp4s0_MAC] brd ff:ff:ff:ff:ff:ff
    inet [root2_PUB_IP]/32 scope global enp4s0
       valid_lft forever preferred_lft forever
    inet6 [enp4s0_MAC_as_IPv6]/64 scope link
       valid_lft forever preferred_lft forever
3: enp4s0.4025@enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default qlen 1000
    link/ether [enp4s0_MAC] brd ff:ff:ff:ff:ff:ff
    inet 10.250.10.2/17 scope global enp4s0.4025
       valid_lft forever preferred_lft forever
    inet6 [enp4s0_MAC_as_IPv6]/64 scope link
       valid_lft forever preferred_lft forever
4: enp4s0.4020@enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether [enp4s0_MAC] brd ff:ff:ff:ff:ff:ff
5: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default qlen 1000
    link/ether [enp4s0_MAC] brd ff:ff:ff:ff:ff:ff
    inet 10.200.10.2/17 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 [enp4s0_MAC_as_IPv6]/64 scope link
       valid_lft forever preferred_lft forever
root@root2:/etc/network# ping 10.200.10.1
PING 10.200.10.1 (10.200.10.1) 56(84) bytes of data.
64 bytes from 10.200.10.1: icmp_seq=1 ttl=64 time=0.400 ms
64 bytes from 10.200.10.1: icmp_seq=2 ttl=64 time=0.389 ms
64 bytes from 10.200.10.1: icmp_seq=3 ttl=64 time=0.360 ms
64 bytes from 10.200.10.1: icmp_seq=4 ttl=64 time=0.407 ms
^C
--- 10.200.10.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3072ms
rtt min/avg/max/mdev = 0.360/0.389/0.407/0.017 ms
root@root2:/etc/network# ping 10.250.10.1
PING 10.250.10.1 (10.250.10.1) 56(84) bytes of data.
64 bytes from 10.250.10.1: icmp_seq=1 ttl=64 time=0.315 ms
64 bytes from 10.250.10.1: icmp_seq=2 ttl=64 time=0.317 ms
64 bytes from 10.250.10.1: icmp_seq=3 ttl=64 time=0.334 ms
64 bytes from 10.250.10.1: icmp_seq=4 ttl=64 time=0.373 ms
^C
--- 10.250.10.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3066ms
rtt min/avg/max/mdev = 0.315/0.334/0.373/0.023 ms
root@root2:/etc/network# ping [root1_PUB_IP]
PING [root1_PUB_IP] ([root1_PUB_IP]) 56(84) bytes of data.
64 bytes from [root1_PUB_IP]: icmp_seq=1 ttl=61 time=0.313 ms
64 bytes from [root1_PUB_IP]: icmp_seq=2 ttl=61 time=0.333 ms
64 bytes from [root1_PUB_IP]: icmp_seq=3 ttl=61 time=0.344 ms
64 bytes from [root1_PUB_IP]: icmp_seq=4 ttl=61 time=0.330 ms
^C
--- [root1_PUB_IP] ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3057ms
rtt min/avg/max/mdev = 0.313/0.330/0.344/0.011 ms
root@root2:/etc/network#

About the non-fragmented pings, here are the outputs that validate 1400 for the VLAN and 1500 for the main interface:

Code:
root@root1:~# ping 10.200.10.2 -M do -s 1500
PING 10.200.10.2 (10.200.10.2) 1500(1528) bytes of data.
ping: local error: message too long, mtu=1400
ping: local error: message too long, mtu=1400
^C
--- 10.200.10.2 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1026ms
root@root1:~# ping 10.200.10.2 -M do -s 1450
PING 10.200.10.2 (10.200.10.2) 1450(1478) bytes of data.
ping: local error: message too long, mtu=1400
ping: local error: message too long, mtu=1400
^C
--- 10.200.10.2 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1029ms
root@root1:~# ping 10.200.10.2 -M do -s 1373
PING 10.200.10.2 (10.200.10.2) 1373(1401) bytes of data.
ping: local error: message too long, mtu=1400
ping: local error: message too long, mtu=1400
^C
--- 10.200.10.2 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1026ms
root@root1:~# ping 10.200.10.2 -M do -s 1372
PING 10.200.10.2 (10.200.10.2) 1372(1400) bytes of data.
1380 bytes from 10.200.10.2: icmp_seq=1 ttl=64 time=0.515 ms
1380 bytes from 10.200.10.2: icmp_seq=2 ttl=64 time=0.491 ms
^C
--- 10.200.10.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1018ms
rtt min/avg/max/mdev = 0.491/0.503/0.515/0.012 ms
Code:
root@root2:/etc/network# ping 10.200.10.1 -M do -s 1500
PING 10.200.10.1 (10.200.10.1) 1500(1528) bytes of data.
ping: local error: message too long, mtu=1400
ping: local error: message too long, mtu=1400
^C
--- 10.200.10.1 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1021ms
root@root2:/etc/network# ping 10.200.10.1 -M do -s 1450
PING 10.200.10.1 (10.200.10.1) 1450(1478) bytes of data.
ping: local error: message too long, mtu=1400
ping: local error: message too long, mtu=1400
^C
--- 10.200.10.1 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1008ms
root@root2:/etc/network# ping 10.200.10.1 -M do -s 1373
PING 10.200.10.1 (10.200.10.1) 1373(1401) bytes of data.
ping: local error: message too long, mtu=1400
ping: local error: message too long, mtu=1400
^C
--- 10.200.10.1 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1003ms
root@root2:/etc/network# ping 10.200.10.1 -M do -s 1372
PING 10.200.10.1 (10.200.10.1) 1372(1400) bytes of data.
1380 bytes from 10.200.10.1: icmp_seq=1 ttl=64 time=0.478 ms
1380 bytes from 10.200.10.1: icmp_seq=2 ttl=64 time=0.458 ms
^C
--- 10.200.10.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1030ms
rtt min/avg/max/mdev = 0.458/0.468/0.478/0.010 ms
Code:
root@root1:~# ping [root2_PUB_IP] -M do -s 1473
PING [root2_PUB_IP] ([root2_PUB_IP]) 1473(1501) bytes of data.
ping: local error: message too long, mtu=1500
ping: local error: message too long, mtu=1500
^C
--- [root2_PUB_IP] ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1032ms
root@root1:~# ping [root2_PUB_IP] -M do -s 1472
PING [root2_PUB_IP] ([root2_PUB_IP]) 1472(1500) bytes of data.
1480 bytes from [root2_PUB_IP]: icmp_seq=1 ttl=61 time=0.453 ms
1480 bytes from [root2_PUB_IP]: icmp_seq=2 ttl=61 time=0.432 ms
^C
--- [root2_PUB_IP] ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1012ms
rtt min/avg/max/mdev = 0.432/0.442/0.453/0.010 ms
Code:
root@root2:/etc/network# ping [root1_PUB_IP] -M do -s 1473
PING [root1_PUB_IP] ([root1_PUB_IP]) 1473(1501) bytes of data.
ping: local error: message too long, mtu=1500
ping: local error: message too long, mtu=1500
^C
--- [root1_PUB_IP] ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1012ms
root@root2:/etc/network# ping [root1_PUB_IP] -M do -s 1472
PING [root1_PUB_IP] ([root1_PUB_IP]) 1472(1500) bytes of data.
1480 bytes from [root1_PUB_IP]: icmp_seq=1 ttl=61 time=0.404 ms
1480 bytes from [root1_PUB_IP]: icmp_seq=2 ttl=61 time=0.340 ms
^C
--- [root1_PUB_IP] ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1032ms
rtt min/avg/max/mdev = 0.340/0.372/0.404/0.032 ms
 
Hi @fabian, thanks for your reply.

The udpu tests started only after not a single success with knet, which was my first option (sticking to the Proxmox defaults as much as possible).

I agree that udpu is not recommended, but after 4 days of reinstall, test, fail, reinstall ... I didn't really care about encryption or authentication as long as I could get a single working point to start from. udpu seemed older and simpler than knet, so it was worth a test, but it is only one of the tests I carried out, not the central focus.

About the latency, I hope < 1ms is enough, as you can see in the previous spoilers.

Also keep in mind that these are plain empty servers, with no VMs yet, so the full gigabit connection is dedicated to corosync at the moment.

About the network, I tested both using public IPs and internal VLAN IPs.

Only when not a single attempt succeeded did I start to mess with "udpu", "two_nodes", "expect 1 vote", "2 votes for root1", "netmtu" ... each test involving a full reinstallation of both servers after the failure, and a single change per test, just to be sure.

As for the output you requested, I just did a new attempt using the IPs intended for management (10.250.x.x). It was done after the previous information was taken, with no modifications or actions other than going to the GUI to create + join, then to the CLI to retrieve the information.

Code:
Sep 13 12:15:53 root1 systemd[1]: Starting The Proxmox VE cluster filesystem...
Sep 13 12:15:54 root1 systemd[1]: Started The Proxmox VE cluster filesystem.
Sep 13 12:15:54 root1 systemd[1]: Condition check resulted in Corosync Cluster Engine being skipped.
Sep 15 10:59:10 root1 systemd[1]: Condition check resulted in Corosync Cluster Engine being skipped.
Sep 15 10:59:10 root1 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Sep 15 10:59:10 root1 pmxcfs[1770]: [main] notice: teardown filesystem
Sep 15 10:59:20 root1 systemd[1]: pve-cluster.service: State 'stop-sigterm' timed out. Killing.
Sep 15 10:59:20 root1 systemd[1]: pve-cluster.service: Killing process 1770 (pmxcfs) with signal SIGKILL.
Sep 15 10:59:20 root1 systemd[1]: pve-cluster.service: Killing process 1771 (cfs_loop) with signal SIGKILL.
Sep 15 10:59:20 root1 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=9/KILL
Sep 15 10:59:20 root1 systemd[1]: pve-cluster.service: Failed with result 'timeout'.
Sep 15 10:59:20 root1 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Sep 15 10:59:20 root1 systemd[1]: pve-cluster.service: Consumed 2min 4.092s CPU time.
Sep 15 10:59:20 root1 systemd[1]: Starting The Proxmox VE cluster filesystem...
Sep 15 10:59:20 root1 pmxcfs[2718536]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 1)
Sep 15 10:59:20 root1 pmxcfs[2718536]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 1)
Sep 15 10:59:20 root1 pmxcfs[2718538]: [quorum] crit: quorum_initialize failed: 2
Sep 15 10:59:20 root1 pmxcfs[2718538]: [quorum] crit: can't initialize service
Sep 15 10:59:20 root1 pmxcfs[2718538]: [confdb] crit: cmap_initialize failed: 2
Sep 15 10:59:20 root1 pmxcfs[2718538]: [confdb] crit: can't initialize service
Sep 15 10:59:20 root1 pmxcfs[2718538]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 10:59:20 root1 pmxcfs[2718538]: [dcdb] crit: can't initialize service
Sep 15 10:59:20 root1 pmxcfs[2718538]: [status] crit: cpg_initialize failed: 2
Sep 15 10:59:20 root1 pmxcfs[2718538]: [status] crit: can't initialize service
Sep 15 10:59:21 root1 systemd[1]: Started The Proxmox VE cluster filesystem.
Sep 15 10:59:21 root1 systemd[1]: Starting Corosync Cluster Engine...
Sep 15 10:59:21 root1 corosync[2718543]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
Sep 15 10:59:21 root1 corosync[2718543]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzl>
Sep 15 10:59:21 root1 corosync[2718543]:   [TOTEM ] Initializing transport (Kronosnet).
Sep 15 10:59:22 root1 corosync[2718543]:   [TOTEM ] totemknet initialized
Sep 15 10:59:22 root1 corosync[2718543]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/c>
Sep 15 10:59:22 root1 corosync[2718543]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Sep 15 10:59:22 root1 corosync[2718543]:   [QB    ] server name: cmap
Sep 15 10:59:22 root1 corosync[2718543]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Sep 15 10:59:22 root1 corosync[2718543]:   [QB    ] server name: cfg
Sep 15 10:59:22 root1 corosync[2718543]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 15 10:59:22 root1 corosync[2718543]:   [QB    ] server name: cpg
Sep 15 10:59:22 root1 corosync[2718543]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Sep 15 10:59:22 root1 corosync[2718543]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Sep 15 10:59:22 root1 corosync[2718543]:   [WD    ] Watchdog not enabled by configuration
Sep 15 10:59:22 root1 corosync[2718543]:   [WD    ] resource load_15min missing a recovery key.
Sep 15 10:59:22 root1 corosync[2718543]:   [WD    ] resource memory_used missing a recovery key.
Sep 15 10:59:22 root1 corosync[2718543]:   [WD    ] no resources configured.
Sep 15 10:59:22 root1 corosync[2718543]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Sep 15 10:59:22 root1 corosync[2718543]:   [QUORUM] Using quorum provider corosync_votequorum
Sep 15 10:59:22 root1 corosync[2718543]:   [QUORUM] This node is within the primary component and will provide service.
Sep 15 10:59:22 root1 corosync[2718543]:   [QUORUM] Members[0]:
Sep 15 10:59:22 root1 corosync[2718543]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Sep 15 10:59:22 root1 corosync[2718543]:   [QB    ] server name: votequorum
Sep 15 10:59:22 root1 corosync[2718543]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Sep 15 10:59:22 root1 corosync[2718543]:   [QB    ] server name: quorum
Sep 15 10:59:22 root1 corosync[2718543]:   [TOTEM ] Configuring link 0
Sep 15 10:59:22 root1 corosync[2718543]:   [TOTEM ] Configured link number 0: local addr: 10.250.10.1, port=5405
Sep 15 10:59:22 root1 corosync[2718543]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Sep 15 10:59:22 root1 corosync[2718543]:   [KNET  ] host: host: 1 has no active links
Sep 15 10:59:22 root1 corosync[2718543]:   [QUORUM] Sync members[1]: 1
Sep 15 10:59:22 root1 corosync[2718543]:   [QUORUM] Sync joined[1]: 1
Sep 15 10:59:22 root1 corosync[2718543]:   [TOTEM ] A new membership (1.5) was formed. Members joined: 1
Sep 15 10:59:22 root1 corosync[2718543]:   [QUORUM] Members[1]: 1
Sep 15 10:59:22 root1 corosync[2718543]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 10:59:22 root1 systemd[1]: Started Corosync Cluster Engine.
Sep 15 10:59:26 root1 pmxcfs[2718538]: [status] notice: update cluster info (cluster name  RooT, version = 1)
Sep 15 10:59:26 root1 pmxcfs[2718538]: [status] notice: node has quorum
Sep 15 10:59:26 root1 pmxcfs[2718538]: [dcdb] notice: members: 1/2718538
Sep 15 10:59:26 root1 pmxcfs[2718538]: [dcdb] notice: all data is up to date
Sep 15 10:59:26 root1 pmxcfs[2718538]: [status] notice: members: 1/2718538
Sep 15 10:59:26 root1 pmxcfs[2718538]: [status] notice: all data is up to date
Sep 15 11:01:54 root1 pmxcfs[2718538]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 2)
Sep 15 11:01:55 root1 corosync[2718543]:   [CFG   ] Config reload requested by node 1
Sep 15 11:01:55 root1 corosync[2718543]:   [TOTEM ] Configuring link 0
Sep 15 11:01:55 root1 corosync[2718543]:   [TOTEM ] Configured link number 0: local addr: 10.250.10.1, port=5405
Sep 15 11:01:55 root1 corosync[2718543]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 15 11:01:55 root1 corosync[2718543]:   [QUORUM] Members[1]: 1
Sep 15 11:01:55 root1 pmxcfs[2718538]: [status] notice: node lost quorum
Sep 15 11:01:55 root1 corosync[2718543]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
Sep 15 11:01:55 root1 corosync[2718543]:   [KNET  ] host: host: 2 has no active links
Sep 15 11:01:55 root1 corosync[2718543]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 15 11:01:55 root1 corosync[2718543]:   [KNET  ] host: host: 2 has no active links
Sep 15 11:01:55 root1 corosync[2718543]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 15 11:01:55 root1 corosync[2718543]:   [KNET  ] host: host: 2 has no active links
Sep 15 11:01:55 root1 pmxcfs[2718538]: [status] notice: update cluster info (cluster name  RooT, version = 2)
Code:
Sep 13 12:15:41 root2 systemd[1]: Starting The Proxmox VE cluster filesystem...
Sep 13 12:15:42 root2 systemd[1]: Started The Proxmox VE cluster filesystem.
Sep 13 12:15:42 root2 systemd[1]: Condition check resulted in Corosync Cluster Engine being skipped.
Sep 15 11:01:55 root2 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Sep 15 11:01:55 root2 pmxcfs[1780]: [main] notice: teardown filesystem
Sep 15 11:01:56 root2 pmxcfs[1780]: [main] notice: exit proxmox configuration filesystem (0)
Sep 15 11:01:56 root2 systemd[1]: pve-cluster.service: Succeeded.
Sep 15 11:01:56 root2 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Sep 15 11:01:56 root2 systemd[1]: pve-cluster.service: Consumed 1min 58.596s CPU time.
Sep 15 11:01:56 root2 systemd[1]: Starting Corosync Cluster Engine...
Sep 15 11:01:56 root2 systemd[1]: Starting The Proxmox VE cluster filesystem...
Sep 15 11:01:56 root2 corosync[2601685]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
Sep 15 11:01:56 root2 corosync[2601685]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzl>
Sep 15 11:01:56 root2 pmxcfs[2601687]: [quorum] crit: quorum_initialize failed: 2
Sep 15 11:01:56 root2 pmxcfs[2601687]: [quorum] crit: can't initialize service
Sep 15 11:01:56 root2 pmxcfs[2601687]: [confdb] crit: cmap_initialize failed: 2
Sep 15 11:01:56 root2 pmxcfs[2601687]: [confdb] crit: can't initialize service
Sep 15 11:01:56 root2 pmxcfs[2601687]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 11:01:56 root2 pmxcfs[2601687]: [dcdb] crit: can't initialize service
Sep 15 11:01:56 root2 pmxcfs[2601687]: [status] crit: cpg_initialize failed: 2
Sep 15 11:01:56 root2 pmxcfs[2601687]: [status] crit: can't initialize service
Sep 15 11:01:56 root2 corosync[2601685]:   [TOTEM ] Initializing transport (Kronosnet).
Sep 15 11:01:57 root2 corosync[2601685]:   [TOTEM ] totemknet initialized
Sep 15 11:01:57 root2 corosync[2601685]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/c>
Sep 15 11:01:57 root2 corosync[2601685]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Sep 15 11:01:57 root2 corosync[2601685]:   [QB    ] server name: cmap
Sep 15 11:01:57 root2 corosync[2601685]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Sep 15 11:01:57 root2 corosync[2601685]:   [QB    ] server name: cfg
Sep 15 11:01:57 root2 corosync[2601685]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 15 11:01:57 root2 corosync[2601685]:   [QB    ] server name: cpg
Sep 15 11:01:57 root2 corosync[2601685]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Sep 15 11:01:57 root2 corosync[2601685]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Sep 15 11:01:57 root2 corosync[2601685]:   [WD    ] Watchdog not enabled by configuration
Sep 15 11:01:57 root2 corosync[2601685]:   [WD    ] resource load_15min missing a recovery key.
Sep 15 11:01:57 root2 corosync[2601685]:   [WD    ] resource memory_used missing a recovery key.
Sep 15 11:01:57 root2 corosync[2601685]:   [WD    ] no resources configured.
Sep 15 11:01:57 root2 corosync[2601685]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Sep 15 11:01:57 root2 corosync[2601685]:   [QUORUM] Using quorum provider corosync_votequorum
Sep 15 11:01:57 root2 corosync[2601685]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Sep 15 11:01:57 root2 corosync[2601685]:   [QB    ] server name: votequorum
Sep 15 11:01:57 root2 corosync[2601685]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Sep 15 11:01:57 root2 corosync[2601685]:   [QB    ] server name: quorum
Sep 15 11:01:57 root2 corosync[2601685]:   [TOTEM ] Configuring link 0
Sep 15 11:01:57 root2 corosync[2601685]:   [TOTEM ] Configured link number 0: local addr: 10.250.10.2, port=5405
Sep 15 11:01:57 root2 corosync[2601685]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Sep 15 11:01:57 root2 corosync[2601685]:   [KNET  ] host: host: 1 has no active links
Sep 15 11:01:57 root2 corosync[2601685]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 15 11:01:57 root2 corosync[2601685]:   [KNET  ] host: host: 1 has no active links
Sep 15 11:01:57 root2 corosync[2601685]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 15 11:01:57 root2 corosync[2601685]:   [KNET  ] host: host: 1 has no active links
Sep 15 11:01:57 root2 corosync[2601685]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
Sep 15 11:01:57 root2 corosync[2601685]:   [KNET  ] host: host: 2 has no active links
Sep 15 11:01:57 root2 corosync[2601685]:   [QUORUM] Sync members[1]: 2
Sep 15 11:01:57 root2 corosync[2601685]:   [QUORUM] Sync joined[1]: 2
Sep 15 11:01:57 root2 corosync[2601685]:   [TOTEM ] A new membership (2.5) was formed. Members joined: 2
Sep 15 11:01:57 root2 corosync[2601685]:   [QUORUM] Members[1]: 2
Sep 15 11:01:57 root2 corosync[2601685]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 11:01:57 root2 systemd[1]: Started Corosync Cluster Engine.
Sep 15 11:01:57 root2 systemd[1]: Started The Proxmox VE cluster filesystem.
Sep 15 11:02:02 root2 pmxcfs[2601687]: [status] notice: update cluster info (cluster name  RooT, version = 2)
Sep 15 11:02:02 root2 pmxcfs[2601687]: [dcdb] notice: members: 2/2601687
Sep 15 11:02:02 root2 pmxcfs[2601687]: [dcdb] notice: all data is up to date
Sep 15 11:02:02 root2 pmxcfs[2601687]: [status] notice: members: 2/2601687
Sep 15 11:02:02 root2 pmxcfs[2601687]: [status] notice: all data is up to date

Cluster creation at root1, join at root2:
 

Attachments: 2022-09-15_10-59.png, 2022-09-15_11-00.png, 2022-09-15_11-00_1.png, 2022-09-15_11-01.png, 2022-09-15_11-02.png, 2022-09-15_11-02_1.png
so the logs indicate that the links never become "up" at all.

could you try monitoring the corosync ports with tcpdump on both nodes to see which kind of traffic goes in/out? there must either be an MTU-related problem or a firewall/routing issue preventing the traffic going over that. since SSH and the API seem to work, I rather suspect the former. could you try setting "netmtu" (on both nodes) to 1200? (note on editing corosync.conf - you always need to bump the config_version as well when changing something, and the config needs to be consistent on all nodes).
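
as a rough sketch, with the interface name and port taken from the configuration posted above (so treat them as assumptions for your setup), and using the usual copy-edit-move pattern so that no half-edited corosync.conf is ever active:

Code:
# watch corosync/knet traffic on the management VLAN, on both nodes
tcpdump -n -i enp4s0.4025 udp port 5405

# edit a copy, bump config_version inside totem {}, then move it into place
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
nano /etc/pve/corosync.conf.new    # add "netmtu: 1200", increase config_version
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf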

knet *should* detect and correct many MTU issues on its own (using path-based MTU discovery), but maybe whatever issue your network has is also affecting that..
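
to see what knet itself reports about the links (local node id and per-link state), the status tool bundled with corosync can be used; a minimal sketch:

Code:
# print the local node id and the status of each configured knet link
corosync-cfgtool -s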
 
Just to be precise and to be sure:

1. Could you give me the exact tcpdump command you want?
2. Could you tell me the exact procedure to edit corosync config file?

Assuming ports 5404 and 5405 as per the pve-admin-guide, 5405 seems to be the one to test:

Code:
root@root1:~# netstat -antu | grep 540
udp        0      0 10.250.10.1:5405        0.0.0.0:*

root@root1:~# tcpdump -vv -n port 5405
tcpdump: listening on enp4s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
11:39:59.748875 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0x6ca0!] UDP, length 80
11:40:00.549139 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0xf6f7!] UDP, length 80
11:40:01.349494 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0x2943!] UDP, length 80
11:40:02.149750 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0x53d9!] UDP, length 80
11:40:02.950060 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0xf9e2!] UDP, length 80
11:40:03.750302 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0x58a1!] UDP, length 80
11:40:04.550621 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0xdd70!] UDP, length 80
11:40:05.350924 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0x0e14!] UDP, length 80
^C
8 packets captured
8 packets received by filter
0 packets dropped by kernel

At root1, where the cluster was created, I can force quorum by setting the expected votes to 1, then edit the file, then ... should I assume it is OK? Should I restart corosync.service or pve-cluster.service by hand?

Code:
root@root1:~# pvecm e 1
root@root1:~# nano /etc/pve/corosync.conf
root@root1:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: root1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.250.10.1
  }
  node {
    name: root2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.250.10.2
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: RooT
  config_version: 3
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  netmtu: 1200
}

root@root1:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: root1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.250.10.1
  }
  node {
    name: root2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.250.10.2
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: RooT
  config_version: 3
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  netmtu: 1200
}

root@root1:~# pvecm s
Cluster information
-------------------
Name:             RooT
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Sep 15 11:48:43 2022
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.5
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:         

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.250.10.1 (local)

root@root1:~# tcpdump -vv -n port 5405
tcpdump: listening on enp4s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
11:50:56.778832 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0x2b76!] UDP, length 80
11:50:57.579124 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0x70cf!] UDP, length 80
11:50:58.379370 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0x2a19!] UDP, length 80
11:50:59.179679 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0x1c45!] UDP, length 80
^C
4 packets captured
4 packets received by filter
0 packets dropped by kernel

Code:
root@root2:~# pvecm e 1
root@root2:~# nano /etc/pve/corosync.conf
root@root2:/etc/network# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: root1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.250.10.1
  }
  node {
    name: root2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.250.10.2
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: RooT
  config_version: 3
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  netmtu: 1200
}

root@root2:/etc/network# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: root1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.250.10.1
  }
  node {
    name: root2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.250.10.2
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: RooT
  config_version: 3
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  netmtu: 1200
}

root@root2:/etc/network# pvecm s
Cluster information
-------------------
Name:             RooT
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Sep 15 12:00:28 2022
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2.a
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:          

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.250.10.2 (local)
root@root2:/etc/network# tcpdump -vv -n port 5405
tcpdump: listening on enp4s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:00:58.056592 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.2.5405 > 10.250.10.1.5405: [bad udp cksum 0x2a60 -> 0x8c5e!] UDP, length 80
12:00:58.856925 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.2.5405 > 10.250.10.1.5405: [bad udp cksum 0x2a60 -> 0x72ef!] UDP, length 80
12:00:59.657320 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.2.5405 > 10.250.10.1.5405: [bad udp cksum 0x2a60 -> 0x95f1!] UDP, length 80
^C
3 packets captured
3 packets received by filter
0 packets dropped by kernel

EDIT: I forgot to mention that after the mod, I reverted voting with `pvecm e 2` on both nodes.

Should I restart the corosync service? pve-cluster? I read that it is dangerous to reboot root2 at this moment, is that true?
 
After running `systemctl restart corosync.service` first (with no success) and then `systemctl restart pve-cluster.service`, the new journal looks like this:

Code:
Sep 15 11:01:55 root1 corosync[2718543]:   [KNET  ] host: host: 2 has no active links
Sep 15 11:01:55 root1 pmxcfs[2718538]: [status] notice: update cluster info (cluster name  RooT, version = 2)
Sep 15 11:44:24 root1 corosync[2718543]:   [QUORUM] This node is within the primary component and will provide service.
Sep 15 11:44:24 root1 corosync[2718543]:   [QUORUM] Members[1]: 1
Sep 15 11:44:24 root1 pmxcfs[2718538]: [status] notice: node has quorum
Sep 15 11:48:28 root1 pmxcfs[2718538]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 3)
Sep 15 11:48:29 root1 corosync[2718543]:   [CFG   ] Config reload requested by node 1
Sep 15 11:48:29 root1 corosync[2718543]:   [CFG   ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
Sep 15 11:48:29 root1 corosync[2718543]:   [TOTEM ] Configuring link 0
Sep 15 11:48:29 root1 corosync[2718543]:   [TOTEM ] Configured link number 0: local addr: 10.250.10.1, port=5405
Sep 15 11:48:29 root1 corosync[2718543]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 15 11:48:29 root1 corosync[2718543]:   [QUORUM] Members[1]: 1
Sep 15 11:48:29 root1 pmxcfs[2718538]: [status] notice: node lost quorum
Sep 15 11:48:29 root1 pmxcfs[2718538]: [status] notice: update cluster info (cluster name  RooT, version = 3)
Sep 15 11:54:19 root1 systemd[1]: Stopping Corosync Cluster Engine...
Sep 15 11:54:19 root1 corosync-cfgtool[2768946]: Shutting down corosync
Sep 15 11:54:19 root1 corosync[2718543]:   [MAIN  ] Node was shut down by a signal
Sep 15 11:54:19 root1 corosync[2718543]:   [SERV  ] Unloading all Corosync service engines.
Sep 15 11:54:19 root1 corosync[2718543]:   [QB    ] withdrawing server sockets
Sep 15 11:54:19 root1 corosync[2718543]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Sep 15 11:54:19 root1 corosync[2718543]:   [CFG   ] Node 1 was shut down by sysadmin
Sep 15 11:54:19 root1 pmxcfs[2718538]: [confdb] crit: cmap_dispatch failed: 2
Sep 15 11:54:19 root1 corosync[2718543]:   [QB    ] withdrawing server sockets
Sep 15 11:54:19 root1 corosync[2718543]:   [SERV  ] Service engine unloaded: corosync configuration map access
Sep 15 11:54:19 root1 corosync[2718543]:   [QB    ] withdrawing server sockets
Sep 15 11:54:19 root1 corosync[2718543]:   [SERV  ] Service engine unloaded: corosync configuration service
Sep 15 11:54:19 root1 corosync[2718543]:   [QB    ] withdrawing server sockets
Sep 15 11:54:19 root1 corosync[2718543]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Sep 15 11:54:19 root1 corosync[2718543]:   [QB    ] withdrawing server sockets
Sep 15 11:54:19 root1 corosync[2718543]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Sep 15 11:54:19 root1 corosync[2718543]:   [SERV  ] Service engine unloaded: corosync profile loading service
Sep 15 11:54:19 root1 corosync[2718543]:   [SERV  ] Service engine unloaded: corosync resource monitoring service
Sep 15 11:54:19 root1 corosync[2718543]:   [SERV  ] Service engine unloaded: corosync watchdog service
Sep 15 11:54:19 root1 pmxcfs[2718538]: [quorum] crit: quorum_dispatch failed: 2
Sep 15 11:54:19 root1 pmxcfs[2718538]: [dcdb] crit: cpg_dispatch failed: 2
Sep 15 11:54:19 root1 pmxcfs[2718538]: [dcdb] crit: cpg_leave failed: 2
Sep 15 11:54:19 root1 pmxcfs[2718538]: [quorum] crit: quorum_initialize failed: 2
Sep 15 11:54:19 root1 pmxcfs[2718538]: [quorum] crit: can't initialize service
Sep 15 11:54:19 root1 pmxcfs[2718538]: [confdb] crit: cmap_initialize failed: 2
Sep 15 11:54:19 root1 pmxcfs[2718538]: [confdb] crit: can't initialize service
Sep 15 11:54:19 root1 pmxcfs[2718538]: [dcdb] notice: start cluster connection
Sep 15 11:54:19 root1 pmxcfs[2718538]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 11:54:19 root1 pmxcfs[2718538]: [dcdb] crit: can't initialize service
Sep 15 11:54:19 root1 pmxcfs[2718538]: [status] crit: cpg_dispatch failed: 2
Sep 15 11:54:19 root1 pmxcfs[2718538]: [status] crit: cpg_leave failed: 2
Sep 15 11:54:19 root1 corosync[2718543]:   [MAIN  ] Corosync Cluster Engine exiting normally
Sep 15 11:54:19 root1 systemd[1]: corosync.service: Succeeded.
Sep 15 11:54:19 root1 systemd[1]: Stopped Corosync Cluster Engine.
Sep 15 11:54:19 root1 systemd[1]: corosync.service: Consumed 20.129s CPU time.
Sep 15 11:54:19 root1 systemd[1]: Starting Corosync Cluster Engine...
Sep 15 11:54:20 root1 corosync[2768949]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
Sep 15 11:54:20 root1 corosync[2768949]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzl>
Sep 15 11:54:20 root1 corosync[2768949]:   [TOTEM ] Initializing transport (Kronosnet).
Sep 15 11:54:20 root1 corosync[2768949]:   [TOTEM ] totemknet initialized
Sep 15 11:54:20 root1 corosync[2768949]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/c>
Sep 15 11:54:20 root1 pmxcfs[2718538]: [status] notice: start cluster connection
Sep 15 11:54:20 root1 pmxcfs[2718538]: [status] crit: cpg_initialize failed: 2
Sep 15 11:54:20 root1 pmxcfs[2718538]: [status] crit: can't initialize service
Sep 15 11:54:20 root1 corosync[2768949]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Sep 15 11:54:20 root1 corosync[2768949]:   [QB    ] server name: cmap
Sep 15 11:54:20 root1 corosync[2768949]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Sep 15 11:54:20 root1 corosync[2768949]:   [QB    ] server name: cfg
Sep 15 11:54:20 root1 corosync[2768949]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 15 11:54:20 root1 corosync[2768949]:   [QB    ] server name: cpg
Sep 15 11:54:20 root1 corosync[2768949]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Sep 15 11:54:20 root1 corosync[2768949]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Sep 15 11:54:20 root1 corosync[2768949]:   [WD    ] Watchdog not enabled by configuration
Sep 15 11:54:20 root1 corosync[2768949]:   [WD    ] resource load_15min missing a recovery key.
Sep 15 11:54:20 root1 corosync[2768949]:   [WD    ] resource memory_used missing a recovery key.
Sep 15 11:54:20 root1 corosync[2768949]:   [WD    ] no resources configured.
Sep 15 11:54:20 root1 corosync[2768949]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Sep 15 11:54:20 root1 corosync[2768949]:   [QUORUM] Using quorum provider corosync_votequorum
Sep 15 11:54:20 root1 corosync[2768949]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Sep 15 11:54:20 root1 corosync[2768949]:   [QB    ] server name: votequorum
Sep 15 11:54:20 root1 corosync[2768949]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Sep 15 11:54:20 root1 corosync[2768949]:   [QB    ] server name: quorum
Sep 15 11:54:20 root1 corosync[2768949]:   [TOTEM ] Configuring link 0
Sep 15 11:54:20 root1 corosync[2768949]:   [TOTEM ] Configured link number 0: local addr: 10.250.10.1, port=5405
Sep 15 11:54:20 root1 corosync[2768949]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
Sep 15 11:54:20 root1 corosync[2768949]:   [KNET  ] host: host: 2 has no active links
Sep 15 11:54:20 root1 corosync[2768949]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 15 11:54:20 root1 corosync[2768949]:   [KNET  ] host: host: 2 has no active links
Sep 15 11:54:20 root1 corosync[2768949]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 15 11:54:20 root1 corosync[2768949]:   [KNET  ] host: host: 2 has no active links
Sep 15 11:54:20 root1 corosync[2768949]:   [QUORUM] Sync members[1]: 1
Sep 15 11:54:20 root1 corosync[2768949]:   [QUORUM] Sync joined[1]: 1
Sep 15 11:54:20 root1 corosync[2768949]:   [TOTEM ] A new membership (1.a) was formed. Members joined: 1
Sep 15 11:54:20 root1 corosync[2768949]:   [QUORUM] Members[1]: 1
Sep 15 11:54:20 root1 corosync[2768949]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 11:54:20 root1 systemd[1]: Started Corosync Cluster Engine.
Sep 15 11:54:25 root1 pmxcfs[2718538]: [status] notice: update cluster info (cluster name  RooT, version = 3)
Sep 15 11:54:25 root1 pmxcfs[2718538]: [dcdb] notice: members: 1/2718538
Sep 15 11:54:25 root1 pmxcfs[2718538]: [dcdb] notice: all data is up to date
Sep 15 11:54:26 root1 pmxcfs[2718538]: [status] notice: members: 1/2718538
Sep 15 11:54:26 root1 pmxcfs[2718538]: [status] notice: all data is up to date
Sep 15 11:55:00 root1 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Sep 15 11:55:00 root1 pmxcfs[2718538]: [main] notice: teardown filesystem
Sep 15 11:55:01 root1 pmxcfs[2718538]: [main] notice: exit proxmox configuration filesystem (0)
Sep 15 11:55:01 root1 systemd[1]: pve-cluster.service: Succeeded.
Sep 15 11:55:01 root1 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Sep 15 11:55:01 root1 systemd[1]: pve-cluster.service: Consumed 2.387s CPU time.
Sep 15 11:55:01 root1 systemd[1]: Starting The Proxmox VE cluster filesystem...
Sep 15 11:55:01 root1 pmxcfs[2769659]: [status] notice: update cluster info (cluster name  RooT, version = 3)
Sep 15 11:55:01 root1 pmxcfs[2769659]: [dcdb] notice: members: 1/2769659
Sep 15 11:55:01 root1 pmxcfs[2769659]: [dcdb] notice: all data is up to date
Sep 15 11:55:01 root1 pmxcfs[2769659]: [status] notice: members: 1/2769659
Sep 15 11:55:01 root1 pmxcfs[2769659]: [status] notice: all data is up to date
Sep 15 11:55:02 root1 systemd[1]: Started The Proxmox VE cluster filesystem.
Sep 15 11:55:13 root1 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Sep 15 11:55:13 root1 pmxcfs[2769659]: [main] notice: teardown filesystem
Sep 15 11:55:13 root1 pmxcfs[2769659]: [main] notice: exit proxmox configuration filesystem (0)
Sep 15 11:55:13 root1 systemd[1]: pve-cluster.service: Succeeded.
Sep 15 11:55:13 root1 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Sep 15 11:55:13 root1 systemd[1]: Starting The Proxmox VE cluster filesystem...
Sep 15 11:55:13 root1 pmxcfs[2769934]: [status] notice: update cluster info (cluster name  RooT, version = 3)
Sep 15 11:55:13 root1 pmxcfs[2769934]: [dcdb] notice: members: 1/2769934
Sep 15 11:55:13 root1 pmxcfs[2769934]: [dcdb] notice: all data is up to date
Sep 15 11:55:13 root1 pmxcfs[2769934]: [status] notice: members: 1/2769934
Sep 15 11:55:13 root1 pmxcfs[2769934]: [status] notice: all data is up to date
Sep 15 11:55:14 root1 systemd[1]: Started The Proxmox VE cluster filesystem.
 
Code:
Sep 15 11:02:02 root2 pmxcfs[2601687]: [status] notice: all data is up to date
Sep 15 11:52:15 root2 corosync[2601685]:   [QUORUM] This node is within the primary component and will provide service.
Sep 15 11:52:15 root2 corosync[2601685]:   [QUORUM] Members[1]: 2
Sep 15 11:52:15 root2 pmxcfs[2601687]: [status] notice: node has quorum
Sep 15 11:52:32 root2 pmxcfs[2601687]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 3)
Sep 15 11:52:33 root2 corosync[2601685]:   [CFG   ] Config reload requested by node 2
Sep 15 11:52:33 root2 corosync[2601685]:   [CFG   ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
Sep 15 11:52:33 root2 corosync[2601685]:   [TOTEM ] Configuring link 0
Sep 15 11:52:33 root2 corosync[2601685]:   [TOTEM ] Configured link number 0: local addr: 10.250.10.2, port=5405
Sep 15 11:52:33 root2 corosync[2601685]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 15 11:52:33 root2 corosync[2601685]:   [QUORUM] Members[1]: 2
Sep 15 11:52:33 root2 pmxcfs[2601687]: [status] notice: node lost quorum
Sep 15 11:52:33 root2 pmxcfs[2601687]: [status] notice: update cluster info (cluster name  RooT, version = 3)
Sep 15 11:54:22 root2 systemd[1]: Stopping Corosync Cluster Engine...
Sep 15 11:54:22 root2 corosync-cfgtool[2640788]: Shutting down corosync
Sep 15 11:54:22 root2 corosync[2601685]:   [MAIN  ] Node was shut down by a signal
Sep 15 11:54:22 root2 corosync[2601685]:   [SERV  ] Unloading all Corosync service engines.
Sep 15 11:54:22 root2 corosync[2601685]:   [QB    ] withdrawing server sockets
Sep 15 11:54:22 root2 corosync[2601685]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Sep 15 11:54:22 root2 corosync[2601685]:   [CFG   ] Node 2 was shut down by sysadmin
Sep 15 11:54:22 root2 pmxcfs[2601687]: [confdb] crit: cmap_dispatch failed: 2
Sep 15 11:54:22 root2 corosync[2601685]:   [QB    ] withdrawing server sockets
Sep 15 11:54:22 root2 corosync[2601685]:   [SERV  ] Service engine unloaded: corosync configuration map access
Sep 15 11:54:22 root2 corosync[2601685]:   [QB    ] withdrawing server sockets
Sep 15 11:54:22 root2 corosync[2601685]:   [SERV  ] Service engine unloaded: corosync configuration service
Sep 15 11:54:22 root2 corosync[2601685]:   [QB    ] withdrawing server sockets
Sep 15 11:54:22 root2 corosync[2601685]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Sep 15 11:54:22 root2 corosync[2601685]:   [QB    ] withdrawing server sockets
Sep 15 11:54:22 root2 corosync[2601685]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Sep 15 11:54:22 root2 corosync[2601685]:   [SERV  ] Service engine unloaded: corosync profile loading service
Sep 15 11:54:22 root2 corosync[2601685]:   [SERV  ] Service engine unloaded: corosync resource monitoring service
Sep 15 11:54:22 root2 corosync[2601685]:   [SERV  ] Service engine unloaded: corosync watchdog service
Sep 15 11:54:22 root2 pmxcfs[2601687]: [quorum] crit: quorum_dispatch failed: 2
Sep 15 11:54:22 root2 pmxcfs[2601687]: [dcdb] crit: cpg_dispatch failed: 2
Sep 15 11:54:22 root2 pmxcfs[2601687]: [dcdb] crit: cpg_leave failed: 2
Sep 15 11:54:22 root2 pmxcfs[2601687]: [status] crit: cpg_dispatch failed: 2
Sep 15 11:54:22 root2 pmxcfs[2601687]: [status] crit: cpg_leave failed: 2
Sep 15 11:54:23 root2 corosync[2601685]:   [MAIN  ] Corosync Cluster Engine exiting normally
Sep 15 11:54:23 root2 systemd[1]: corosync.service: Succeeded.
Sep 15 11:54:23 root2 systemd[1]: Stopped Corosync Cluster Engine.
Sep 15 11:54:23 root2 systemd[1]: corosync.service: Consumed 18.844s CPU time.
Sep 15 11:54:23 root2 systemd[1]: Starting Corosync Cluster Engine...
Sep 15 11:54:23 root2 corosync[2640812]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
Sep 15 11:54:23 root2 corosync[2640812]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzl>
Sep 15 11:54:23 root2 corosync[2640812]:   [TOTEM ] Initializing transport (Kronosnet).
Sep 15 11:54:23 root2 pmxcfs[2601687]: [quorum] crit: quorum_initialize failed: 2
Sep 15 11:54:23 root2 pmxcfs[2601687]: [quorum] crit: can't initialize service
Sep 15 11:54:23 root2 pmxcfs[2601687]: [confdb] crit: cmap_initialize failed: 2
Sep 15 11:54:23 root2 pmxcfs[2601687]: [confdb] crit: can't initialize service
Sep 15 11:54:23 root2 pmxcfs[2601687]: [dcdb] notice: start cluster connection
Sep 15 11:54:23 root2 pmxcfs[2601687]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 11:54:23 root2 pmxcfs[2601687]: [dcdb] crit: can't initialize service
Sep 15 11:54:23 root2 pmxcfs[2601687]: [status] notice: start cluster connection
Sep 15 11:54:23 root2 pmxcfs[2601687]: [status] crit: cpg_initialize failed: 2
Sep 15 11:54:23 root2 pmxcfs[2601687]: [status] crit: can't initialize service
Sep 15 11:54:23 root2 corosync[2640812]:   [TOTEM ] totemknet initialized
Sep 15 11:54:23 root2 corosync[2640812]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/c>
Sep 15 11:54:23 root2 corosync[2640812]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Sep 15 11:54:23 root2 corosync[2640812]:   [QB    ] server name: cmap
Sep 15 11:54:23 root2 corosync[2640812]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Sep 15 11:54:23 root2 corosync[2640812]:   [QB    ] server name: cfg
Sep 15 11:54:23 root2 corosync[2640812]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 15 11:54:23 root2 corosync[2640812]:   [QB    ] server name: cpg
Sep 15 11:54:23 root2 corosync[2640812]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Sep 15 11:54:23 root2 corosync[2640812]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Sep 15 11:54:23 root2 corosync[2640812]:   [WD    ] Watchdog not enabled by configuration
Sep 15 11:54:23 root2 corosync[2640812]:   [WD    ] resource load_15min missing a recovery key.
Sep 15 11:54:23 root2 corosync[2640812]:   [WD    ] resource memory_used missing a recovery key.
Sep 15 11:54:23 root2 corosync[2640812]:   [WD    ] no resources configured.
Sep 15 11:54:23 root2 corosync[2640812]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Sep 15 11:54:23 root2 corosync[2640812]:   [QUORUM] Using quorum provider corosync_votequorum
Sep 15 11:54:23 root2 corosync[2640812]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Sep 15 11:54:23 root2 corosync[2640812]:   [QB    ] server name: votequorum
Sep 15 11:54:23 root2 corosync[2640812]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Sep 15 11:54:23 root2 corosync[2640812]:   [QB    ] server name: quorum
Sep 15 11:54:23 root2 corosync[2640812]:   [TOTEM ] Configuring link 0
Sep 15 11:54:23 root2 corosync[2640812]:   [TOTEM ] Configured link number 0: local addr: 10.250.10.2, port=5405
Sep 15 11:54:23 root2 corosync[2640812]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Sep 15 11:54:23 root2 corosync[2640812]:   [KNET  ] host: host: 1 has no active links
Sep 15 11:54:23 root2 corosync[2640812]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 15 11:54:23 root2 corosync[2640812]:   [KNET  ] host: host: 1 has no active links
Sep 15 11:54:23 root2 corosync[2640812]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 15 11:54:23 root2 corosync[2640812]:   [KNET  ] host: host: 1 has no active links
Sep 15 11:54:23 root2 corosync[2640812]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
Sep 15 11:54:23 root2 corosync[2640812]:   [KNET  ] host: host: 2 has no active links
Sep 15 11:54:23 root2 corosync[2640812]:   [QUORUM] Sync members[1]: 2
Sep 15 11:54:23 root2 corosync[2640812]:   [QUORUM] Sync joined[1]: 2
Sep 15 11:54:23 root2 corosync[2640812]:   [TOTEM ] A new membership (2.a) was formed. Members joined: 2
Sep 15 11:54:23 root2 corosync[2640812]:   [QUORUM] Members[1]: 2
Sep 15 11:54:23 root2 corosync[2640812]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 11:54:23 root2 systemd[1]: Started Corosync Cluster Engine.
Sep 15 11:54:29 root2 pmxcfs[2601687]: [status] notice: update cluster info (cluster name  RooT, version = 3)
Sep 15 11:54:29 root2 pmxcfs[2601687]: [dcdb] notice: members: 2/2601687
Sep 15 11:54:29 root2 pmxcfs[2601687]: [dcdb] notice: all data is up to date
Sep 15 11:54:29 root2 pmxcfs[2601687]: [status] notice: members: 2/2601687
Sep 15 11:54:29 root2 pmxcfs[2601687]: [status] notice: all data is up to date
Sep 15 11:55:05 root2 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Sep 15 11:55:05 root2 pmxcfs[2601687]: [main] notice: teardown filesystem
Sep 15 11:55:06 root2 pmxcfs[2601687]: [main] notice: exit proxmox configuration filesystem (0)
Sep 15 11:55:06 root2 systemd[1]: pve-cluster.service: Succeeded.
Sep 15 11:55:06 root2 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Sep 15 11:55:06 root2 systemd[1]: pve-cluster.service: Consumed 1.581s CPU time.
Sep 15 11:55:06 root2 systemd[1]: Starting The Proxmox VE cluster filesystem...
Sep 15 11:55:06 root2 pmxcfs[2641503]: [status] notice: update cluster info (cluster name  RooT, version = 3)
Sep 15 11:55:06 root2 pmxcfs[2641503]: [dcdb] notice: members: 2/2641503
Sep 15 11:55:06 root2 pmxcfs[2641503]: [dcdb] notice: all data is up to date
Sep 15 11:55:06 root2 pmxcfs[2641503]: [status] notice: members: 2/2641503
Sep 15 11:55:06 root2 pmxcfs[2641503]: [status] notice: all data is up to date
Sep 15 11:55:07 root2 systemd[1]: Started The Proxmox VE cluster filesystem.
Sep 15 11:55:16 root2 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Sep 15 11:55:16 root2 pmxcfs[2641503]: [main] notice: teardown filesystem
Sep 15 11:55:17 root2 pmxcfs[2641503]: [main] notice: exit proxmox configuration filesystem (0)
Sep 15 11:55:17 root2 systemd[1]: pve-cluster.service: Succeeded.
Sep 15 11:55:17 root2 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Sep 15 11:55:17 root2 systemd[1]: Starting The Proxmox VE cluster filesystem...
Sep 15 11:55:17 root2 pmxcfs[2641689]: [status] notice: update cluster info (cluster name  RooT, version = 3)
Sep 15 11:55:17 root2 pmxcfs[2641689]: [dcdb] notice: members: 2/2641689
Sep 15 11:55:17 root2 pmxcfs[2641689]: [dcdb] notice: all data is up to date
Sep 15 11:55:17 root2 pmxcfs[2641689]: [status] notice: members: 2/2641689
Sep 15 11:55:17 root2 pmxcfs[2641689]: [status] notice: all data is up to date
Sep 15 11:55:18 root2 systemd[1]: Started The Proxmox VE cluster filesystem.
 
Code:
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0x6ca0!] UDP, length 80

the traffic is probably rejected because of that... check your NIC settings, maybe some (buggy) offloading needs to be disabled?
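something along these lines would be a way to check and temporarily toggle the offloads (untested sketch, interface name just taken from your config above - these settings do not survive a reboot):

Code:
# show current offload settings
ethtool -k enp4s0 | grep -i checksum

# temporarily disable checksum / segmentation offloads for testing
ethtool -K enp4s0 tx off rx off tso off gso off gro off

# re-enable once done
ethtool -K enp4s0 tx on rx on tso on gso on gro on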
 
Hi @fabian

It got a connection! It turned out to be a bad firewall configuration on my side:

I did a few more tests on the MTU path with no success, so I moved on to your other suggestion, the firewall and traffic rejection.

The cluster firewall is permissive (just ACCEPT policies for both input and output, no rules), while the node one is restrictive (everything is dropped except a few port rules). No change led to a different behaviour; the cluster stayed broken.

I then looked at the Hetzner side, where I had already configured the firewall, and noticed a generic rule that led me to think Corosync traffic was covered, but it was not. As soon as I explicitly accepted traffic on ports 5404 and 5405, the cluster came to life.
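For reference, the missing piece boils down to allowing corosync's UDP ports between the two nodes. In my case the actual fix was the corresponding incoming rule in the Hetzner firewall, but a rough host-side equivalent (a sketch only, as seen from root1, interface and peer address taken from my config above) would be:

Code:
# allow corosync (UDP 5404-5405) to/from root2 on the management VLAN
iptables -A INPUT  -i enp4s0.4025 -p udp -s 10.250.10.2 --dport 5404:5405 -j ACCEPT
iptables -A OUTPUT -o enp4s0.4025 -p udp -d 10.250.10.2 --dport 5404:5405 -j ACCEPT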

Not everything was working at first: the node showed green and the summary on the cluster side was fine, but the summary on the node itself failed (while the stats kept updating continuously).

I did a full reinstall of both nodes and repeated the process, just to be sure, and it connected without issues from the GUI with no changes to the default options. In fact, as you said, knet self-adapts the MTU (around 1200, is that right?). See my journals in the next posts (I hit the 16k character limit per post).

But when I check tcpdump I get both OK and bad checksums (on both root1 and root2: OK for incoming packets, mismatch for outgoing), see the same time window from both sides:

Code:
root@root1:/etc/network# tcpdump -vv -n port 5405
tcpdump: listening on enp4s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
22:01:20.706213 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 156)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a90 -> 0x848c!] UDP, length 128
22:01:20.706905 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 156)
    10.250.10.2.5405 > 10.250.10.1.5405: [udp sum ok] UDP, length 128
22:01:20.907142 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 124)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a70 -> 0xeb53!] UDP, length 96
22:01:20.961689 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0x0bf0!] UDP, length 80
22:01:20.962177 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.2.5405 > 10.250.10.1.5405: [udp sum ok] UDP, length 80
22:01:21.268158 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 156)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a90 -> 0x2865!] UDP, length 128
22:01:21.268752 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 156)
    10.250.10.2.5405 > 10.250.10.1.5405: [udp sum ok] UDP, length 128
22:01:21.414155 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.2.5405 > 10.250.10.1.5405: [udp sum ok] UDP, length 80
22:01:21.414319 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0x4a7b!] UDP, length 80
22:01:21.469024 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 124)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a70 -> 0xe79a!] UDP, length 96
22:01:21.761947 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [bad udp cksum 0x2a60 -> 0x68df!] UDP, length 80
22:01:21.762437 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.2.5405 > 10.250.10.1.5405: [udp sum ok] UDP, length 80

Code:
root@root2:/etc/network# tcpdump -vv -n port 5405
tcpdump: listening on enp4s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
22:01:20.495824 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 156)
    10.250.10.1.5405 > 10.250.10.2.5405: [udp sum ok] UDP, length 128
22:01:20.496174 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 156)
    10.250.10.2.5405 > 10.250.10.1.5405: [bad udp cksum 0x2a90 -> 0x063d!] UDP, length 128
22:01:20.696761 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 124)
    10.250.10.1.5405 > 10.250.10.2.5405: [udp sum ok] UDP, length 96
22:01:20.751296 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [udp sum ok] UDP, length 80
22:01:20.751450 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.2.5405 > 10.250.10.1.5405: [bad udp cksum 0x2a60 -> 0xefcd!] UDP, length 80
22:01:21.057765 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 156)
    10.250.10.1.5405 > 10.250.10.2.5405: [udp sum ok] UDP, length 128
22:01:21.058026 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 156)
    10.250.10.2.5405 > 10.250.10.1.5405: [bad udp cksum 0x2a90 -> 0xf6d2!] UDP, length 128
22:01:21.203419 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.2.5405 > 10.250.10.1.5405: [bad udp cksum 0x2a60 -> 0xbae8!] UDP, length 80
22:01:21.203919 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [udp sum ok] UDP, length 80
22:01:21.258636 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 124)
    10.250.10.1.5405 > 10.250.10.2.5405: [udp sum ok] UDP, length 96
22:01:21.551556 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.1.5405 > 10.250.10.2.5405: [udp sum ok] UDP, length 80
22:01:21.551724 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.250.10.2.5405 > 10.250.10.1.5405: [bad udp cksum 0x2a60 -> 0x1317!] UDP, length 80

But in terms of functionality I don't see any problem at all... is that normal?
 
Code:
Sep 15 14:51:14 root1 systemd[1]: Condition check resulted in Corosync Cluster Engine being skipped.
Sep 15 15:00:20 root1 systemd[1]: Condition check resulted in Corosync Cluster Engine being skipped.
Sep 15 15:00:20 root1 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Sep 15 15:00:20 root1 pmxcfs[1891]: [main] notice: teardown filesystem
Sep 15 15:00:21 root1 pmxcfs[1891]: [main] notice: exit proxmox configuration filesystem (0)
Sep 15 15:00:21 root1 systemd[1]: pve-cluster.service: Succeeded.
Sep 15 15:00:21 root1 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Sep 15 15:00:21 root1 systemd[1]: Starting The Proxmox VE cluster filesystem...
Sep 15 15:00:21 root1 pmxcfs[12106]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 1)
Sep 15 15:00:21 root1 pmxcfs[12106]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 1)
Sep 15 15:00:21 root1 pmxcfs[12107]: [quorum] crit: quorum_initialize failed: 2
Sep 15 15:00:21 root1 pmxcfs[12107]: [quorum] crit: can't initialize service
Sep 15 15:00:21 root1 pmxcfs[12107]: [confdb] crit: cmap_initialize failed: 2
Sep 15 15:00:21 root1 pmxcfs[12107]: [confdb] crit: can't initialize service
Sep 15 15:00:21 root1 pmxcfs[12107]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 15:00:21 root1 pmxcfs[12107]: [dcdb] crit: can't initialize service
Sep 15 15:00:21 root1 pmxcfs[12107]: [status] crit: cpg_initialize failed: 2
Sep 15 15:00:21 root1 pmxcfs[12107]: [status] crit: can't initialize service
Sep 15 15:00:22 root1 systemd[1]: Started The Proxmox VE cluster filesystem.
Sep 15 15:00:22 root1 systemd[1]: Starting Corosync Cluster Engine...
Sep 15 15:00:22 root1 corosync[12114]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
Sep 15 15:00:22 root1 corosync[12114]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle >
Sep 15 15:00:22 root1 corosync[12114]:   [TOTEM ] Initializing transport (Kronosnet).
Sep 15 15:00:23 root1 corosync[12114]:   [TOTEM ] totemknet initialized
Sep 15 15:00:23 root1 corosync[12114]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/cry>
Sep 15 15:00:23 root1 corosync[12114]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Sep 15 15:00:23 root1 corosync[12114]:   [QB    ] server name: cmap
Sep 15 15:00:23 root1 corosync[12114]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Sep 15 15:00:23 root1 corosync[12114]:   [QB    ] server name: cfg
Sep 15 15:00:23 root1 corosync[12114]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 15 15:00:23 root1 corosync[12114]:   [QB    ] server name: cpg
Sep 15 15:00:23 root1 corosync[12114]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Sep 15 15:00:23 root1 corosync[12114]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Sep 15 15:00:23 root1 corosync[12114]:   [WD    ] Watchdog not enabled by configuration
Sep 15 15:00:23 root1 corosync[12114]:   [WD    ] resource load_15min missing a recovery key.
Sep 15 15:00:23 root1 corosync[12114]:   [WD    ] resource memory_used missing a recovery key.
Sep 15 15:00:23 root1 corosync[12114]:   [WD    ] no resources configured.
Sep 15 15:00:23 root1 corosync[12114]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Sep 15 15:00:23 root1 corosync[12114]:   [QUORUM] Using quorum provider corosync_votequorum
Sep 15 15:00:23 root1 corosync[12114]:   [QUORUM] This node is within the primary component and will provide service.
Sep 15 15:00:23 root1 corosync[12114]:   [QUORUM] Members[0]:
Sep 15 15:00:23 root1 corosync[12114]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Sep 15 15:00:23 root1 corosync[12114]:   [QB    ] server name: votequorum
Sep 15 15:00:23 root1 corosync[12114]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Sep 15 15:00:23 root1 corosync[12114]:   [QB    ] server name: quorum
Sep 15 15:00:23 root1 corosync[12114]:   [TOTEM ] Configuring link 0
Sep 15 15:00:23 root1 corosync[12114]:   [TOTEM ] Configured link number 0: local addr: 10.250.10.1, port=5405
Sep 15 15:00:23 root1 corosync[12114]:   [QUORUM] Sync members[1]: 1
Sep 15 15:00:23 root1 corosync[12114]:   [QUORUM] Sync joined[1]: 1
Sep 15 15:00:23 root1 corosync[12114]:   [TOTEM ] A new membership (1.5) was formed. Members joined: 1
Sep 15 15:00:23 root1 corosync[12114]:   [QUORUM] Members[1]: 1
Sep 15 15:00:23 root1 corosync[12114]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 15:00:23 root1 systemd[1]: Started Corosync Cluster Engine.
Sep 15 15:00:27 root1 pmxcfs[12107]: [status] notice: update cluster info (cluster name  RooT, version = 1)
Sep 15 15:00:27 root1 pmxcfs[12107]: [status] notice: node has quorum
Sep 15 15:00:27 root1 pmxcfs[12107]: [dcdb] notice: members: 1/12107
Sep 15 15:00:27 root1 pmxcfs[12107]: [dcdb] notice: all data is up to date
Sep 15 15:00:27 root1 pmxcfs[12107]: [status] notice: members: 1/12107
Sep 15 15:00:27 root1 pmxcfs[12107]: [status] notice: all data is up to date
Sep 15 15:00:47 root1 pmxcfs[12107]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 2)
Sep 15 15:00:49 root1 corosync[12114]:   [CFG   ] Config reload requested by node 1
Sep 15 15:00:49 root1 corosync[12114]:   [TOTEM ] Configuring link 0
Sep 15 15:00:49 root1 corosync[12114]:   [TOTEM ] Configured link number 0: local addr: 10.250.10.1, port=5405
Sep 15 15:00:49 root1 corosync[12114]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 15 15:00:49 root1 corosync[12114]:   [QUORUM] Members[1]: 1
Sep 15 15:00:49 root1 pmxcfs[12107]: [status] notice: node lost quorum
Sep 15 15:00:49 root1 corosync[12114]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
Sep 15 15:00:49 root1 corosync[12114]:   [KNET  ] host: host: 2 has no active links
Sep 15 15:00:49 root1 corosync[12114]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 15 15:00:49 root1 corosync[12114]:   [KNET  ] host: host: 2 has no active links
Sep 15 15:00:49 root1 corosync[12114]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 15 15:00:49 root1 corosync[12114]:   [KNET  ] host: host: 2 has no active links
Sep 15 15:00:49 root1 pmxcfs[12107]: [status] notice: update cluster info (cluster name  RooT, version = 2)
Sep 15 15:00:53 root1 corosync[12114]:   [KNET  ] rx: host: 2 link: 0 is up
Sep 15 15:00:53 root1 corosync[12114]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 15 15:00:53 root1 corosync[12114]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1285
Sep 15 15:00:53 root1 corosync[12114]:   [KNET  ] pmtud: Global data MTU changed to: 1285
Sep 15 15:00:53 root1 corosync[12114]:   [QUORUM] Sync members[2]: 1 2
Sep 15 15:00:53 root1 corosync[12114]:   [QUORUM] Sync joined[1]: 2
Sep 15 15:00:53 root1 corosync[12114]:   [TOTEM ] A new membership (1.9) was formed. Members joined: 2
Sep 15 15:00:53 root1 corosync[12114]:   [QUORUM] This node is within the primary component and will provide service.
Sep 15 15:00:53 root1 corosync[12114]:   [QUORUM] Members[2]: 1 2
Sep 15 15:00:53 root1 corosync[12114]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 15:00:53 root1 pmxcfs[12107]: [status] notice: node has quorum
Sep 15 15:00:56 root1 pmxcfs[12107]: [dcdb] notice: members: 1/12107, 2/10963
Sep 15 15:00:56 root1 pmxcfs[12107]: [dcdb] notice: starting data syncronisation
Sep 15 15:00:56 root1 pmxcfs[12107]: [dcdb] notice: received sync request (epoch 1/12107/00000002)
Sep 15 15:00:56 root1 pmxcfs[12107]: [status] notice: members: 1/12107, 2/10963
Sep 15 15:00:56 root1 pmxcfs[12107]: [status] notice: starting data syncronisation
Sep 15 15:00:56 root1 pmxcfs[12107]: [status] notice: received sync request (epoch 1/12107/00000002)
Sep 15 15:00:56 root1 pmxcfs[12107]: [dcdb] notice: received all states
Sep 15 15:00:56 root1 pmxcfs[12107]: [dcdb] notice: leader is 1/12107
Sep 15 15:00:56 root1 pmxcfs[12107]: [dcdb] notice: synced members: 1/12107
Sep 15 15:00:56 root1 pmxcfs[12107]: [dcdb] notice: start sending inode updates
Sep 15 15:00:56 root1 pmxcfs[12107]: [dcdb] notice: sent all (49) updates
Sep 15 15:00:56 root1 pmxcfs[12107]: [dcdb] notice: all data is up to date
Sep 15 15:00:56 root1 pmxcfs[12107]: [status] notice: received all states
Sep 15 15:00:56 root1 pmxcfs[12107]: [status] notice: all data is up to date
Sep 15 15:01:00 root1 pmxcfs[12107]: [status] notice: received log
Sep 15 15:01:17 root1 pmxcfs[12107]: [status] notice: received log
Sep 15 15:16:16 root1 pmxcfs[12107]: [status] notice: received log
Sep 15 15:24:33 root1 pmxcfs[12107]: [status] notice: received log
Sep 15 15:24:33 root1 pmxcfs[12107]: [status] notice: received log
Sep 15 15:31:16 root1 pmxcfs[12107]: [status] notice: received log
Sep 15 15:34:27 root1 pmxcfs[12107]: [status] notice: received log
Sep 15 15:36:35 root1 pmxcfs[12107]: [status] notice: received log
Sep 15 15:36:52 root1 pmxcfs[12107]: [status] notice: received log
Sep 15 15:38:02 root1 pmxcfs[12107]: [status] notice: received log
Sep 15 15:53:02 root1 pmxcfs[12107]: [status] notice: received log
Sep 15 16:00:21 root1 pmxcfs[12107]: [dcdb] notice: data verification successful
 
Code:
Sep 15 14:51:01 root2 systemd[1]: Condition check resulted in Corosync Cluster Engine being skipped.
Sep 15 15:00:48 root2 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Sep 15 15:00:48 root2 pmxcfs[1861]: [main] notice: teardown filesystem
Sep 15 15:00:50 root2 pmxcfs[1861]: [main] notice: exit proxmox configuration filesystem (0)
Sep 15 15:00:50 root2 systemd[1]: pve-cluster.service: Succeeded.
Sep 15 15:00:50 root2 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Sep 15 15:00:50 root2 systemd[1]: Starting Corosync Cluster Engine...
Sep 15 15:00:50 root2 systemd[1]: Starting The Proxmox VE cluster filesystem...
Sep 15 15:00:50 root2 pmxcfs[10963]: [quorum] crit: quorum_initialize failed: 2
Sep 15 15:00:50 root2 pmxcfs[10963]: [quorum] crit: can't initialize service
Sep 15 15:00:50 root2 pmxcfs[10963]: [confdb] crit: cmap_initialize failed: 2
Sep 15 15:00:50 root2 pmxcfs[10963]: [confdb] crit: can't initialize service
Sep 15 15:00:50 root2 pmxcfs[10963]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 15:00:50 root2 pmxcfs[10963]: [dcdb] crit: can't initialize service
Sep 15 15:00:50 root2 pmxcfs[10963]: [status] crit: cpg_initialize failed: 2
Sep 15 15:00:50 root2 pmxcfs[10963]: [status] crit: can't initialize service
Sep 15 15:00:50 root2 corosync[10961]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
Sep 15 15:00:50 root2 corosync[10961]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle >
Sep 15 15:00:50 root2 corosync[10961]:   [TOTEM ] Initializing transport (Kronosnet).
Sep 15 15:00:50 root2 corosync[10961]:   [TOTEM ] totemknet initialized
Sep 15 15:00:50 root2 corosync[10961]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/cry>
Sep 15 15:00:50 root2 corosync[10961]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Sep 15 15:00:50 root2 corosync[10961]:   [QB    ] server name: cmap
Sep 15 15:00:50 root2 corosync[10961]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Sep 15 15:00:50 root2 corosync[10961]:   [QB    ] server name: cfg
Sep 15 15:00:50 root2 corosync[10961]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 15 15:00:50 root2 corosync[10961]:   [QB    ] server name: cpg
Sep 15 15:00:50 root2 corosync[10961]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Sep 15 15:00:50 root2 corosync[10961]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Sep 15 15:00:50 root2 corosync[10961]:   [WD    ] Watchdog not enabled by configuration
Sep 15 15:00:50 root2 corosync[10961]:   [WD    ] resource load_15min missing a recovery key.
Sep 15 15:00:50 root2 corosync[10961]:   [WD    ] resource memory_used missing a recovery key.
Sep 15 15:00:50 root2 corosync[10961]:   [WD    ] no resources configured.
Sep 15 15:00:50 root2 corosync[10961]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Sep 15 15:00:50 root2 corosync[10961]:   [QUORUM] Using quorum provider corosync_votequorum
Sep 15 15:00:50 root2 corosync[10961]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Sep 15 15:00:50 root2 corosync[10961]:   [QB    ] server name: votequorum
Sep 15 15:00:50 root2 corosync[10961]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Sep 15 15:00:50 root2 corosync[10961]:   [QB    ] server name: quorum
Sep 15 15:00:50 root2 corosync[10961]:   [TOTEM ] Configuring link 0
Sep 15 15:00:50 root2 corosync[10961]:   [TOTEM ] Configured link number 0: local addr: 10.250.10.2, port=5405
Sep 15 15:00:50 root2 corosync[10961]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Sep 15 15:00:50 root2 corosync[10961]:   [KNET  ] host: host: 1 has no active links
Sep 15 15:00:50 root2 corosync[10961]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 15 15:00:50 root2 corosync[10961]:   [KNET  ] host: host: 1 has no active links
Sep 15 15:00:50 root2 corosync[10961]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 15 15:00:50 root2 corosync[10961]:   [KNET  ] host: host: 1 has no active links
Sep 15 15:00:50 root2 corosync[10961]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
Sep 15 15:00:50 root2 corosync[10961]:   [KNET  ] host: host: 2 has no active links
Sep 15 15:00:50 root2 corosync[10961]:   [QUORUM] Sync members[1]: 2
Sep 15 15:00:50 root2 corosync[10961]:   [QUORUM] Sync joined[1]: 2
Sep 15 15:00:50 root2 corosync[10961]:   [TOTEM ] A new membership (2.5) was formed. Members joined: 2
Sep 15 15:00:50 root2 corosync[10961]:   [QUORUM] Members[1]: 2
Sep 15 15:00:50 root2 corosync[10961]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 15:00:50 root2 systemd[1]: Started Corosync Cluster Engine.
Sep 15 15:00:51 root2 systemd[1]: Started The Proxmox VE cluster filesystem.
Sep 15 15:00:52 root2 corosync[10961]:   [KNET  ] rx: host: 1 link: 0 is up
Sep 15 15:00:52 root2 corosync[10961]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 15 15:00:52 root2 corosync[10961]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1285
Sep 15 15:00:52 root2 corosync[10961]:   [KNET  ] pmtud: Global data MTU changed to: 1285
Sep 15 15:00:53 root2 corosync[10961]:   [QUORUM] Sync members[2]: 1 2
Sep 15 15:00:53 root2 corosync[10961]:   [QUORUM] Sync joined[1]: 1
Sep 15 15:00:53 root2 corosync[10961]:   [TOTEM ] A new membership (1.9) was formed. Members joined: 1
Sep 15 15:00:53 root2 corosync[10961]:   [QUORUM] This node is within the primary component and will provide service.
Sep 15 15:00:53 root2 corosync[10961]:   [QUORUM] Members[2]: 1 2
Sep 15 15:00:53 root2 corosync[10961]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 15:00:56 root2 pmxcfs[10963]: [status] notice: update cluster info (cluster name  RooT, version = 2)
Sep 15 15:00:56 root2 pmxcfs[10963]: [status] notice: node has quorum
Sep 15 15:00:56 root2 pmxcfs[10963]: [dcdb] notice: members: 1/12107, 2/10963
Sep 15 15:00:56 root2 pmxcfs[10963]: [dcdb] notice: starting data syncronisation
Sep 15 15:00:56 root2 pmxcfs[10963]: [status] notice: members: 1/12107, 2/10963
Sep 15 15:00:56 root2 pmxcfs[10963]: [status] notice: starting data syncronisation
Sep 15 15:00:56 root2 pmxcfs[10963]: [dcdb] notice: received sync request (epoch 1/12107/00000002)
Sep 15 15:00:56 root2 pmxcfs[10963]: [status] notice: received sync request (epoch 1/12107/00000002)
Sep 15 15:00:56 root2 pmxcfs[10963]: [dcdb] notice: received all states
Sep 15 15:00:56 root2 pmxcfs[10963]: [dcdb] notice: leader is 1/12107
Sep 15 15:00:56 root2 pmxcfs[10963]: [dcdb] notice: synced members: 1/12107
Sep 15 15:00:56 root2 pmxcfs[10963]: [dcdb] notice: waiting for updates from leader
Sep 15 15:00:56 root2 pmxcfs[10963]: [status] notice: received all states
Sep 15 15:00:56 root2 pmxcfs[10963]: [status] notice: all data is up to date
Sep 15 15:00:56 root2 pmxcfs[10963]: [dcdb] notice: update complete - trying to commit (got 49 inode updates)
Sep 15 15:00:56 root2 pmxcfs[10963]: [dcdb] notice: all data is up to date
Sep 15 15:01:45 root2 pmxcfs[10963]: [status] notice: received log
Sep 15 15:16:46 root2 pmxcfs[10963]: [status] notice: received log
Sep 15 15:24:24 root2 pmxcfs[10963]: [status] notice: received log
Sep 15 15:24:24 root2 pmxcfs[10963]: [status] notice: received log
Sep 15 15:31:47 root2 pmxcfs[10963]: [status] notice: received log
Sep 15 15:34:56 root2 pmxcfs[10963]: [status] notice: received log
Sep 15 15:34:58 root2 pmxcfs[10963]: [status] notice: received log
Sep 15 15:37:00 root2 pmxcfs[10963]: [status] notice: received log
Sep 15 15:37:06 root2 pmxcfs[10963]: [status] notice: received log
Sep 15 15:37:23 root2 pmxcfs[10963]: [status] notice: received log
Sep 15 15:37:27 root2 pmxcfs[10963]: [status] notice: received log
Sep 15 15:37:54 root2 pmxcfs[10963]: [status] notice: received log
Sep 15 15:52:54 root2 pmxcfs[10963]: [status] notice: received log
Sep 15 15:54:55 root2 pmxcfs[10963]: [status] notice: received log
Sep 15 16:00:21 root2 pmxcfs[10963]: [dcdb] notice: data verification successful
 
the UDP csum seems to be a tcpdump artifact (when offloading is enabled, the checksums are calculated by the NIC after tcpdump has captured the packets), so it is likely benign as long as it only shows up on outgoing packets ;)

the MTU part seems to be correct now - knet detects the maximum MTU along the path and automatically uses it (you don't need to configure it manually anywhere on the corosync side).
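if you want to double-check the link and quorum state from time to time, these read-only commands should be enough (nothing beyond the standard PVE/corosync tools):

Code:
# quorum and membership as seen by Proxmox
pvecm status

# per-node link state as seen by corosync/knet
corosync-cfgtool -s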
 
Hello there, sorry to dig out such an old thread, but I had a similar issue with Hetzner which was driving me crazy.

In the end it was also a firewall-related issue, but not on the Hetzner side.
I had to disable the default datacenter and node firewall on freshly installed Proxmox VE 8 nodes to join the freshly created cluster ...
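Concretely, what I toggled boils down to this (a sketch; these are the standard PVE firewall config files, <nodename> is a placeholder for your node's name - the same can be done via the GUI firewall options):

Code:
# datacenter-wide firewall: /etc/pve/firewall/cluster.fw
[OPTIONS]
enable: 0

# per-node firewall: /etc/pve/nodes/<nodename>/host.fw
[OPTIONS]
enable: 0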
 
did you try to join over public IPs?
 
did you try to join over public IPs?
Yes and no. I tried every possible combination.
I have a small VM at Hetzner Cloud as a peer, with just a public IP and no VLANs, plus two host nodes with public IPs and a VLAN with internal IPs for cluster communication.

I was unable to join the cluster between the VM and a host via public IP, and a host <-> host cluster (just the two nodes, for testing purposes) also failed via both public and internal IPs.
After disabling the firewall for the datacenter and for the nodes themselves, all joins were successful (node to VM via public IP and node to node via internal IP).
 
"I have the same problem: the joining node becomes completely unresponsive except for SSH. I can see the join on the cluster's main node, but the joining node makes all directories read-only.

Certificates are not shared, and a bunch of other things are not shared between the main and the joining node.

On the main node, the joined node appears offline from the beginning. Through the GUI of the main node, I can access SSH via the console tab, but that's all.

I tried connecting via public IP and VPN, and even tried copying all necessary files and directories to the joining node before joining, but nothing helped.

After removing the misbehaving node, even the web UI could not be restored.

My first attempt used a main node on Proxmox VE 8.0 and a joining node on 8.2, but I reinstalled the joining node with 8.0 and the same behaviour occurred.

The Corosync log is empty, and I can’t find any logs containing information.

Why is there no CLI-based method of joining with a debug level to check for errors during the join process?
 
please provide logs from both ends (task logs and journal covering the attempted join).

there is a CLI way, but it doesn't generate any more output than the task logs.
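roughly like this (a sketch, placeholders in angle brackets - run it on the node that should join, pointing at an existing cluster member, then inspect the result on both nodes):

Code:
# on the joining node; <existing-member-ip> and <local-link0-ip> are placeholders
pvecm add <existing-member-ip> --link0 <local-link0-ip>

# afterwards, check membership and the relevant journal entries
pvecm status
journalctl -u corosync -u pve-cluster --since "10 min ago"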
 
