Hi all, long-time reader, first-time poster. Thanks for all you do.
I've got a two-node homelab cluster (nodes named pxhpdesky and pve). I recently switched ISPs and had to move from PPPoE to DHCP, which meant scrapping my router configuration. Starting from scratch, I've done little more on the router than assign my two Proxmox nodes static IPs matching their previous addresses. That didn't seem to satisfy the cluster, and I've had all manner of issues since: at times the web UI is inaccessible, VMs and LXCs won't start at boot, and when I can reach the UI the nodes often show grey question marks or a red X even though they're powered up. Most CLI commands hang. Worst of all, so much is going wrong that I can't pin down a single error to search on. The most common entry I see in journalctl is:
Code:
pve corosync[899]: [TOTEM ] Retransmit List: 14 15 1b 1f 21 22 27 29
If I try to reboot I get
Code:
Failed to set wall message, ignoring: Transport endpoint is not connected
Call to Reboot failed: Transport endpoint is not connected
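If I need the node down at that point, I assume I'd have to fall back to something like the command below (a last resort as I understand it, since it skips a clean shutdown), but I'd rather fix the root cause:
Code:
# Double --force: reboot immediately without cleanly stopping services
# or unmounting filesystems (risk of data loss)
systemctl reboot -ff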
I can ping one node from the other, both nodes can ping the router, and both resolve google.com. I've tried shutting everything down on each node, but the problems persist even with no VMs or LXCs running. I haven't had a reason to dig into logs on non-Proxmox devices to check for connectivity issues, since nothing else in the house has shown any problems.
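For reference, these are roughly the checks I ran from pxhpdesky (the same thing from pve just targets .187 instead of .192):
Code:
ping -c 3 192.168.1.192     # the other node
ping -c 3 192.168.1.1       # the router/gateway
getent hosts google.com     # name resolution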
Here is the output of pveversion (same on both nodes):
Code:
pve-manager/8.4.1/2a5fa54a8503f96d (running kernel: 6.8.12-11-pve)
Here are the /etc/hosts files from pxhpdesky (main) and pve:
Code:
127.0.0.1 localhost.localdomain localhost
192.168.1.187 pxhpdesky.apra pxhpdesky
192.168.1.192 pve pve.beelink.s12
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
Code:
127.0.0.1 localhost.localdomain localhost
192.168.1.192 pve.beelink.s12 pve
192.168.1.187 pxhpdesky.apra pxhpdesky
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
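If it's useful, I can double-check that name resolution on each node actually matches those files with something like:
Code:
# Run on both nodes; each name should come back as its 192.168.1.x address
getent hosts pxhpdesky
getent hosts pve
hostname --ip-address   # I believe this should print the node's LAN IP, not 127.0.0.1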
Here are the results of ip address for pxhpdesky and pve:
Code:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP group default qlen 1000
link/ether 80:e8:2c:d1:b7:47 brd ff:ff:ff:ff:ff:ff
3: wlp3s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1c:bf:c0:85:de:69 brd ff:ff:ff:ff:ff:ff
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 80:e8:2c:d1:b7:47 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.187/24 scope global vmbr0
valid_lft forever preferred_lft forever
5: vmbr0.10@vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 80:e8:2c:d1:b7:47 brd ff:ff:ff:ff:ff:ff
inet 10.0.10.0/24 scope global vmbr0.10
valid_lft forever preferred_lft forever
6: tap102i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master fwbr102i0 state UNKNOWN group default qlen 1000
link/ether 06:74:02:9a:cb:4b brd ff:ff:ff:ff:ff:ff
7: fwbr102i0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether a2:6d:4a:da:f7:a1 brd ff:ff:ff:ff:ff:ff
8: fwpr102p0@fwln102i0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
link/ether 8a:5a:11:9c:58:eb brd ff:ff:ff:ff:ff:ff
9: fwln102i0@fwpr102p0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master fwbr102i0 state UP group default qlen 1000
link/ether a2:6d:4a:da:f7:a1 brd ff:ff:ff:ff:ff:ff
Code:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP group default qlen 1000
link/ether e8:ff:1e:d8:fc:7c brd ff:ff:ff:ff:ff:ff
3: wlo1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether e8:62:be:2f:ae:e5 brd ff:ff:ff:ff:ff:ff
altname wlp0s20f3
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether e8:ff:1e:d8:fc:7c brd ff:ff:ff:ff:ff:ff
inet 192.168.1.192/24 scope global vmbr0
valid_lft forever preferred_lft forever
inet6 fe80::eaff:1eff:fed8:fc7c/64 scope link
valid_lft forever preferred_lft forever
9: veth105i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master fwbr105i0 state UP group default qlen 1000
link/ether fe:7c:9a:de:70:4e brd ff:ff:ff:ff:ff:ff link-netnsid 0
10: fwbr105i0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether aa:9e:f6:db:4e:e6 brd ff:ff:ff:ff:ff:ff
....
Here are the /etc/network/interfaces files for pxhpdesky and pve:
Code:
auto lo
iface lo inet loopback

iface enp2s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.187/24
        gateway 192.168.1.1
        bridge-ports enp2s0
        bridge-stp off
        bridge-fd 0

iface wlp3s0 inet manual

auto vmbr0.10
iface vmbr0.10 inet static
        address 10.0.10.0/24
#edgerouter IoT vlan 10
#post-up iptables-restore < /etc/network/save-iptables
Code:
auto lo
iface lo inet loopback

iface enp1s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.192/24
        gateway 192.168.1.1
        bridge-ports enp1s0
        bridge-stp off
        bridge-fd 0

iface wlo1 inet manual
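If it's relevant, my understanding is that edits to these files can be applied and checked with something like this (assuming ifupdown2, which I believe is the default on PVE 8):
Code:
ifreload -a          # re-apply /etc/network/interfaces without a reboot
ip -br addr show     # quick check that vmbr0 has the expected address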
Here are abridged journalctl outputs for pxhpdesky and pve:
Code:
Jun 12 06:18:10 pxhpdesky systemd[1]: Failed to start pveproxy.service - PVE API Proxy Server.
....
Jun 12 07:21:07 pxhpdesky corosync[1153]: [KNET ] link: host: 2 link: 0 is down
Jun 12 07:21:07 pxhpdesky corosync[1153]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 12 07:21:07 pxhpdesky corosync[1153]: [KNET ] host: host: 2 has no active links
Jun 12 07:21:08 pxhpdesky corosync[1153]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Jun 12 07:21:08 pxhpdesky corosync[1153]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 12 07:21:08 pxhpdesky corosync[1153]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 12 07:21:34 pxhpdesky corosync[1153]: [TOTEM ] FAILED TO RECEIVE
Jun 12 07:21:34 pxhpdesky corosync[1153]: [QUORUM] Sync members[1]: 1
Jun 12 07:21:34 pxhpdesky corosync[1153]: [QUORUM] Sync left[1]: 2
Jun 12 07:21:34 pxhpdesky corosync[1153]: [TOTEM ] A new membership (1.1613) was formed. Members left: 2
Jun 12 07:21:34 pxhpdesky corosync[1153]: [TOTEM ] Failed to receive the leave message. failed: 2
Jun 12 07:21:34 pxhpdesky corosync[1153]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jun 12 07:21:34 pxhpdesky corosync[1153]: [QUORUM] Members[1]: 1
Jun 12 07:21:34 pxhpdesky corosync[1153]: [MAIN ] Completed service synchronization, ready to provide service.
....
Jun 12 07:21:37 pxhpdesky pmxcfs[1069]: [dcdb] crit: cpg_send_message failed: 9
....
Jun 12 07:29:05 pxhpdesky corosync[1153]: [KNET ] link: host: 2 link: 0 is down
Jun 12 07:29:05 pxhpdesky corosync[1153]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 12 07:29:05 pxhpdesky corosync[1153]: [KNET ] host: host: 2 has no active links
Jun 12 07:29:06 pxhpdesky corosync[1153]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Jun 12 07:29:06 pxhpdesky corosync[1153]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 12 07:29:06 pxhpdesky corosync[1153]: [KNET ] pmtud: Global data MTU changed to: 1397
Code:
Jun 12 07:36:24 pve corosync[899]: [KNET ] link: host: 1 link: 0 is down
Jun 12 07:36:24 pve corosync[899]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 12 07:36:24 pve corosync[899]: [KNET ] host: host: 1 has no active links
Jun 12 07:36:24 pve corosync[899]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Jun 12 07:36:24 pve corosync[899]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 12 07:36:24 pve corosync[899]: [KNET ] pmtud: Global data MTU changed to: 1397
....
Jun 12 07:37:14 pve corosync[899]: [TOTEM ] Retransmit List: 14 15 1b 1f 21 22 27 29
Jun 12 07:37:15 pve corosync[899]: [TOTEM ] Retransmit List: 14 15 1b 1f 21 22 27 29
Jun 12 07:37:15 pve corosync[899]: [TOTEM ] Retransmit List: 14 15 1b 1f 21 22 27 29
Jun 12 07:37:15 pve corosync[899]: [TOTEM ] Retransmit List: 14 15 1b 1f 21 22 27 29
Jun 12 07:37:16 pve corosync[899]: [TOTEM ] Retransmit List: 14 15 1b 1f 21 22 27 29
Jun 12 07:37:16 pve corosync[899]: [TOTEM ] Retransmit List: 14 15 1b 1f 21 22 27 29
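Happy to post more output if it helps. Off the top of my head, I could grab the following from each node (assuming the commands don't just hang):
Code:
pvecm status                                     # quorum / membership view
corosync-cfgtool -s                              # knet link status to the other node
systemctl status corosync pve-cluster pveproxy   # service state
cat /etc/pve/corosync.conf                       # cluster config as pmxcfs sees it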
Can anyone help me troubleshoot this?