We've created a cluster on PVE 7.0-11 and are attempting to join other nodes. The network is up and the hosts can ping each other by hostname. Running "systemctl status corosync" gives the following output:
Code:
root@PROX-02:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-09-14 16:35:40 PDT; 18h ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 2241 (corosync)
Tasks: 9 (limit: 115906)
Memory: 144.6M
CPU: 6min 53.810s
CGroup: /system.slice/corosync.service
└─2241 /usr/sbin/corosync -f
Sep 14 16:35:40 PROX-02 corosync[2241]: [QB ] server name: quorum
Sep 14 16:35:40 PROX-02 corosync[2241]: [TOTEM ] Configuring link 0
Sep 14 16:35:40 PROX-02 corosync[2241]: [TOTEM ] Configured link number 0: local addr: 10.1.10.###, port=5405
Sep 14 16:35:40 PROX-02 corosync[2241]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 0)
Sep 14 16:35:40 PROX-02 corosync[2241]: [KNET ] host: host: 1 has no active links
Sep 14 16:35:40 PROX-02 corosync[2241]: [QUORUM] Sync members[1]: 1
Sep 14 16:35:40 PROX-02 corosync[2241]: [QUORUM] Sync joined[1]: 1
Sep 14 16:35:40 PROX-02 corosync[2241]: [TOTEM ] A new membership (1.f) was formed. Members joined: 1
Sep 14 16:35:40 PROX-02 corosync[2241]: [QUORUM] Members[1]: 1
Sep 14 16:35:40 PROX-02 corosync[2241]: [MAIN ] Completed service synchronization, ready to provide service.
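For reference, I believe the knet link state can also be queried directly on the existing node; I haven't captured that output here, but this is what I'd run:
Code:
# show corosync link/ring status for this node
corosync-cfgtool -s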
Also if I run "pvecm status" I get the following:
Code:
root@PROX-02:~# pvecm status
Cluster information
-------------------
Name: PROX-G11
Config Version: 1
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Wed Sep 15 12:24:54 2021
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1.f
Quorate: Yes
Votequorum information
----------------------
Expected votes: 1
Highest expected: 1
Total votes: 1
Quorum: 1
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.1.10.## (local)
When I try to join another node to the cluster, either in the GUI or from the CLI, I get the following:
Code:
root@PROX-03:~# pvecm add 10.1.10.##2
Please enter superuser (root) password for '10.1.10.##2': ********************
Establishing API connection with host '10.1.10.##2'
500 Can't connect to 10.1.10.##2:8006
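Since the error is about reaching port 8006 on the existing node, my next step is to test that port directly from the joining node, something like this (assuming curl and nc are available):
Code:
# from PROX-03: is the cluster node's web API port reachable at all?
curl -k https://10.1.10.##2:8006
# plain TCP connectivity check to the same port
nc -vz 10.1.10.##2 8006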
From what I've read, the common causes are time/date variance (they're all using the same NTP source and are in sync), DNS issues (we're joining by IP, and all hosts resolve and can ping each other by hostname), and lastly blocked ports. That's where I'm focusing, specifically on corosync's ports 5404 and 5405. When I run "netstat -tulpn" I get the following:
Code:
root@PROX-02:~# netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 1/init
tcp 0 0 127.0.0.1:85 0.0.0.0:* LISTEN 2309/pvedaemon
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 2066/sshd: /usr/sbi
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 2227/master
tcp6 0 0 :::111 :::* LISTEN 1/init
tcp6 0 0 :::22 :::* LISTEN 2066/sshd: /usr/sbi
tcp6 0 0 :::3128 :::* LISTEN 2340/spiceproxy
tcp6 0 0 ::1:25 :::* LISTEN 2227/master
tcp6 0 0 :::1311 :::* LISTEN 1831/dsm_om_connsvc
tcp6 0 0 :::8006 :::* LISTEN 2334/pveproxy
udp 0 0 0.0.0.0:111 0.0.0.0:* 1/init
udp 0 0 127.0.0.1:161 0.0.0.0:* 2026/snmpd
udp 0 0 127.0.0.1:323 0.0.0.0:* 2030/chronyd
udp6 0 0 :::111 :::* 1/init
udp6 0 0 ::1:161 :::* 2026/snmpd
udp6 0 0 ::1:323 :::* 2030/chronyd
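In case netstat is misleading me, I also plan to double-check the UDP listeners with ss and grep for the corosync ports specifically:
Code:
# look for corosync/knet on UDP 5404-5405
ss -ulpn | grep -E '540[45]'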
We have no non-standard firewall rules in place. Since nothing is listening on those ports, I'm wondering what might not be working. I thought I read that any needed firewall rules would be created automatically; there's nothing in the documentation about needing to add any.
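To rule the firewall out completely, I'm planning to check the PVE firewall state and the live ruleset on both nodes (these are the checks I know of, happy to be corrected):
Code:
# PVE firewall status
pve-firewall status
# quick look at the live ruleset
iptables-save | head -n 20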
The other thing I'm concerned about is SSH. We have bonded four NICs and use Open vSwitch to create virtual interfaces. For now there are two networks: a fiber storage network on the 10.1.13.0/24 subnet, and the main network, a VLAN (tag 10) on the 10.1.10.0/24 subnet over the bonded NICs. Here's a sample /etc/network/interfaces (this is mirrored on all hosts):
Code:
root@PROX-02:~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!
auto lo
iface lo inet loopback

auto eno1
allow-vmbr0 eno1
iface eno1 inet manual
    ovs_mtu 9000

auto eno2
allow-vmbr0 eno2
iface eno2 inet manual
    ovs_mtu 9000

auto eno3
allow-vmbr0 eno3
iface eno3 inet manual
    ovs_mtu 9000

auto eno4
allow-vmbr0 eno4
iface eno4 inet manual
    ovs_mtu 9000

auto enp4s0
iface enp4s0 inet manual
    address 10.1.13.##2
    netmask 255.255.255.0

auto bond0
allow-vmbr0 bond0
iface bond0 inet manual
    ovs_bridge vmbr0
    ovs_type OVSBond
    ovs_bonds eno1 eno2 eno3 eno4
    ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast
    ovs_mtu 9000

allow-ovs vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports bond0 vlan10
    ovs_mtu 9000

auto vlan10
allow-vmbr0 vlan10
iface vlan10 inet static
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_options tag=10
    address 10.1.10.##2
    netmask 255.255.255.0
    gateway 10.1.10.253
    ovs_mtu 9000
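If it helps, I can also post the OVS view of the bond; I'd gather it with the following (assuming ovs-appctl bond/show is the right way to see the LACP state):
Code:
# bridge/port layout as OVS sees it
ovs-vsctl show
# LACP / member status for the 4-NIC bond
ovs-appctl bond/show bond0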
The resulting network config is this (MACs and exact IPs obfuscated):
Code:
root@PROX-02:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovs-system state UP group default qlen 1000
link/ether ##:##:##:##:##:## brd ff:ff:ff:ff:ff:ff
altname enp1s0f0
inet6 ####::####:####:####:####/64 scope link
valid_lft forever preferred_lft forever
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9000 qdisc mq master ovs-system state DOWN group default qlen 1000
link/ether ##:##:##:##:##:## brd ff:ff:ff:ff:ff:ff
altname enp1s0f1
4: eno3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9000 qdisc mq master ovs-system state DOWN group default qlen 1000
link/ether ##:##:##:##:##:## brd ff:ff:ff:ff:ff:ff
altname enp2s0f0
5: eno4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovs-system state UP group default qlen 1000
link/ether ##:##:##:##:##:## brd ff:ff:ff:ff:ff:ff
altname enp2s0f1
inet6 ####::####:####:####:####/64 scope link
valid_lft forever preferred_lft forever
6: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether ##:##:##:##:##:## brd ff:ff:ff:ff:ff:ff
inet 10.1.13.##2/24 scope global enp4s0
valid_lft forever preferred_lft forever
inet6 ####::####:####:####:####/64 scope link
valid_lft forever preferred_lft forever
7: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether ##:##:##:##:##:## brd ff:ff:ff:ff:ff:ff
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether ##:##:##:##:##:## brd ff:ff:ff:ff:ff:ff
inet6 ####::####:####:####:####/64 scope link
valid_lft forever preferred_lft forever
9: bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether ##:##:##:##:##:## brd ff:ff:ff:ff:ff:ff
inet6 ####::####:####:####:####/64 scope link
valid_lft forever preferred_lft forever
10: vlan10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether ##:##:##:##:##:## brd ff:ff:ff:ff:ff:ff
inet 10.1.10.##2/24 scope global vlan10
valid_lft forever preferred_lft forever
inet6 ####::####:####:####:####/64 scope link
valid_lft forever preferred_lft forever
I present all this because I think there might also be an SSH issue. We have made no changes to the SSH config. I can SSH to each server on port 22 as root, and the servers can ping each other, but I CANNOT SSH from one server to another as root; I get the following:
Code:
root@PROX-03:~# ssh 10.1.10.##2
Connection closed by 10.1.10.##2 port 22
I'm not sure what else to troubleshoot; there's nothing in syslog or auth.log to indicate an issue. Is it an SSH quirk over the bond? Is it an IPv6 thing where sshd is listening on the wrong address space?
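The next thing I plan to try is a verbose SSH attempt from one node to the other while watching the sshd journal on the target, in case that surfaces something syslog isn't showing (these are just my intended next steps):
Code:
# on PROX-03: verbose client-side output
ssh -vvv root@10.1.10.##2
# on PROX-02: follow sshd's journal while the connection is attempted
journalctl -u ssh -f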