[SOLVED] Moving one node to a different subnet

bly

Member
Mar 15, 2024
Hi all,
I have to move a 4-node cluster (nodes 1, 2, 3, 4) from subnet A to subnet B. So far this is what I did:

- added firewall rules so that A and B can see each other fully (tested, OK); the rules will be removed after the change succeeds

on node 1:
- edited the network configuration in the web interface of node 1, moving its IP from A to B
- edited /etc/pve/corosync.conf, changing the IP of node 1 and incrementing config_version by 1 (a sketch of the edit is below)
- applied the configuration
- reattached to the web interface on the new IP address and requested a reboot of the node
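
For reference, the corosync.conf change was roughly the following (a sketch only; node name, ID and IPs are the ones that show up later in the thread):

Bash:
# Edit the cluster-wide copy; pmxcfs replicates it to the other nodes while they are quorate
nano /etc/pve/corosync.conf
# In the nodelist section, the moved node's ring0_addr changes to the new IP:
#   node {
#     name: rsthost2
#     nodeid: 2
#     quorum_votes: 1
#     ring0_addr: 192.168.1.22   # was 192.168.165.22
#   }
# and config_version in the totem section is incremented by 1.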

After the reboot I can reach the node 1 web interface and the node is up, but all the other nodes are shown as offline.
[screenshot: cluster view from node 1 with all other nodes shown as offline]
But if I navigate to an "offline" node, I can still open, for example, its shell:
[screenshot: shell of an "offline" node opened from node 1]

What am I still missing or have overlooked? Is some service restricted to the node's own network only?
TIA

edit:
I noticed that, even after rebooting another node, when it tries to connect to node 1 it still uses the old IP.
Did I miss updating some other config?
[screenshot: another node still trying to reach node 1 on the old IP]
 
I manually set the hosts file on all 4 nodes to be sure.

This is the /etc/hosts of a good node (fuji2):
[screenshot: /etc/hosts on fuji2]

This is the /etc/hosts of node 1 (rsthost2):
[screenshot: /etc/hosts on rsthost2]
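
In short, each node's /etc/hosts now maps every node name to its current IP, roughly like this (the domain suffix is only a placeholder; the IPs are the ones reported by pvecm further down):

Bash:
# /etc/hosts entries (same on all four nodes); ".lan" is just an example domain
192.168.165.24  rsthost4.lan  rsthost4
192.168.1.22    rsthost2.lan  rsthost2
192.168.165.25  rsthost5.lan  rsthost5
192.168.165.19  fuji2.lan     fuji2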
 
Have you restarted the services with `systemctl restart pveproxy.service pvestatd.service`?

Are you still unable to SSH to the other nodes by hostname even after you modified /etc/hosts?

Could you please provide us with the syslog from `rsthost2`?
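
For example, something along these lines on rsthost2 should capture the relevant part (adjust the time window as needed):

Bash:
journalctl -u corosync -u pve-cluster --since "1 hour ago" > /tmp/rsthost2-syslog.txt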
 
Have you restarted the services with `systemctl restart pveproxy.service pvestatd.service`?
Do I need to do that on a node of the "good" group? After setting the hosts file on all nodes I rebooted rsthost2.

Are you still unable to SSH to the other nodes by hostname even after you modified /etc/hosts?

After the reboot I can SSH to the other nodes from node 1, but it still sees all of them as offline in the cluster.


Could you please provide us with the syslog from `rsthost2`?

From rsthost2:
Jan 27 15:53:09 rsthost2 pvescheduler[5403]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jan 27 15:53:09 rsthost2 pvescheduler[5402]: replication: cfs-lock 'file-replication_cfg' error: no quorum!

From the fuji2 node I see a flood of:
Jan 27 15:57:16 fuji2 corosync[1358]: [KNET ] rx: Packet rejected from 192.168.1.22:5405
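
For reference, the corosync ring/link status can also be checked directly on a node:

Bash:
corosync-cfgtool -s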
 
OK!
On one of the quorate nodes I see this repeating:

Jan 27 16:34:00 fuji2 corosync[1358]: [QUORUM] Sync members[3]: 1 3 4
Jan 27 16:34:00 fuji2 corosync[1358]: [TOTEM ] A new membership (1.f96) was formed. Members
Jan 27 16:34:00 fuji2 corosync[1358]: [QUORUM] Members[3]: 1 3 4
Jan 27 16:34:00 fuji2 corosync[1358]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 27 16:34:00 fuji2 corosync[1358]: [KNET ] rx: Packet rejected from 192.168.1.22:5405
Jan 27 16:34:02 fuji2 corosync[1358]: [KNET ] rx: Packet rejected from 192.168.1.22:5405
Jan 27 16:34:03 fuji2 corosync[1358]: [KNET ] rx: Packet rejected from 192.168.1.22:5405
Jan 27 16:34:04 fuji2 corosync[1358]: [KNET ] rx: Packet rejected from 192.168.1.22:5405
Jan 27 16:34:05 fuji2 corosync[1358]: [KNET ] rx: Packet rejected from 192.168.1.22:5405
Jan 27 16:34:06 fuji2 corosync[1358]: [KNET ] rx: Packet rejected from 192.168.1.22:5405
Jan 27 16:34:08 fuji2 corosync[1358]: [KNET ] rx: Packet rejected from 192.168.1.22:5405
Jan 27 16:34:09 fuji2 corosync[1358]: [KNET ] rx: Packet rejected from 192.168.1.22:5405
Jan 27 16:34:10 fuji2 corosync[1358]: [KNET ] rx: Packet rejected from 192.168.1.22:5405
Jan 27 16:34:11 fuji2 corosync[1358]: [KNET ] rx: Packet rejected from 192.168.1.22:5405
Jan 27 16:34:12 fuji2 corosync[1358]: [KNET ] rx: Packet rejected from 192.168.1.22:5405
Jan 27 16:34:14 fuji2 corosync[1358]: [KNET ] rx: Packet rejected from 192.168.1.22:5405


While on rsthost2, the node in the new subnet, I see this repeating:

Jan 27 16:35:28 rsthost2 corosync[1090]: [QUORUM] Sync members[1]: 2
Jan 27 16:35:28 rsthost2 corosync[1090]: [TOTEM ] A new membership (2.fde) was formed. Members
Jan 27 16:35:28 rsthost2 corosync[1090]: [QUORUM] Members[1]: 2
Jan 27 16:35:28 rsthost2 corosync[1090]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 27 16:35:31 rsthost2 corosync[1090]: [TOTEM ] Token has not been received in 3226 ms
Jan 27 16:35:35 rsthost2 corosync[1090]: [TOTEM ] Token has not been received in 7527 ms
Jan 27 16:35:42 rsthost2 corosync[1090]: [QUORUM] Sync members[1]: 2
Jan 27 16:35:42 rsthost2 corosync[1090]: [TOTEM ] A new membership (2.fea) was formed. Members
Jan 27 16:35:42 rsthost2 corosync[1090]: [QUORUM] Members[1]: 2
Jan 27 16:35:42 rsthost2 corosync[1090]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 27 16:35:45 rsthost2 corosync[1090]: [TOTEM ] Token has not been received in 3226 ms
Jan 27 16:35:49 rsthost2 corosync[1090]: [TOTEM ] Token has not been received in 7527 ms
Jan 27 16:35:55 rsthost2 corosync[1090]: [QUORUM] Sync members[1]: 2
Jan 27 16:35:55 rsthost2 corosync[1090]: [TOTEM ] A new membership (2.ff6) was formed. Members
Jan 27 16:35:55 rsthost2 corosync[1090]: [QUORUM] Members[1]: 2
Jan 27 16:35:55 rsthost2 corosync[1090]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 27 16:35:59 rsthost2 corosync[1090]: [TOTEM ] Token has not been received in 3225 ms
Jan 27 16:36:03 rsthost2 corosync[1090]: [TOTEM ] Token has not been received in 7526 ms
Jan 27 16:36:09 rsthost2 corosync[1090]: [QUORUM] Sync members[1]: 2
Jan 27 16:36:09 rsthost2 corosync[1090]: [TOTEM ] A new membership (2.1002) was formed. Members
Jan 27 16:36:09 rsthost2 corosync[1090]: [QUORUM] Members[1]: 2
Jan 27 16:36:09 rsthost2 corosync[1090]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 27 16:36:12 rsthost2 corosync[1090]: [TOTEM ] Token has not been received in 3226 ms
Jan 27 16:36:17 rsthost2 corosync[1090]: [TOTEM ] Token has not been received in 7527 ms
Jan 27 16:36:23 rsthost2 corosync[1090]: [QUORUM] Sync members[1]: 2
Jan 27 16:36:23 rsthost2 corosync[1090]: [TOTEM ] A new membership (2.100e) was formed. Members
Jan 27 16:36:23 rsthost2 corosync[1090]: [QUORUM] Members[1]: 2

pvecm status on rsthost2:

root@rsthost2:~# pvecm status
Cluster information
-------------------
Name: restore
Config Version: 11
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Jan 27 16:38:05 2025
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000002
Ring ID: 2.1062
Quorate: No

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 1
Quorum: 3 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 192.168.1.22 (local)
root@rsthost2:~#

pvecm status on fuji2:
root@fuji2:~# pvecm status
Cluster information
-------------------
Name: restore
Config Version: 13
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Jan 27 16:39:48 2025
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000004
Ring ID: 1.10c2
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 4
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.165.24
0x00000003 2 192.168.165.25
0x00000004 1 192.168.165.19 (local)
root@fuji2:~#
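
Side note: the two outputs show different Config Versions (11 on rsthost2 vs 13 on fuji2), so the isolated node apparently never picked up the updated corosync.conf. The local copy and the cluster-wide copy can be compared with:

Bash:
diff /etc/corosync/corosync.conf /etc/pve/corosync.conf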
 
If I try to SSH from fuji2 to rsthost2 via the web interface, it tries the OLD IP address: ssh: connect to host 192.168.165.22 port 22: No route to host

It feels like the old IP is still lingering somewhere.
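
To see where fuji2 gets that address from, a quick check like this should show whether it comes from /etc/hosts:

Bash:
getent hosts rsthost2
grep -n rsthost2 /etc/hosts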
 
Thank you for the logs. Could you please disable the firewall temporarily to see whether the issue is related to the firewall config, especially for the Corosync traffic on UDP ports 5404 and 5405?
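
For a quick test it can also be stopped from the CLI on each node (and re-enabled afterwards):

Bash:
pve-firewall stop
pve-firewall status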
 
OK, firewall disabled on all nodes. On the non-quorate node I had to force the change and reboot it.

[screenshot: firewall disabled on the nodes]

After the rsthost2 reboot the traffic is still rejected; the logs are the same.
 
I did a netstat to see if the port is listening:

root@rsthost2:~# netstat -pln | grep 5405
udp 0 0 192.168.1.22:5405 0.0.0.0:* 1094/corosync
 
Thank you for testing!

Could you please run the following command on the `rsthost2` node and provide us with the output?
Bash:
grep -r "192.168.165.22" /etc/

I would also check which IP Corosync is using on the `rsthost2` node; you can run `ss` with the following command:
Bash:
ss -tulpn | grep corosync

Additionally, please provide us with the output of the following:

Bash:
cat /etc/pve/.members
 
Here are the results:

root@rsthost2:~# grep -r "192.168.165.22" /etc/
/etc/pve/priv/known_hosts:192.168.165.22 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCc7x8Dyy0mtB5KQiftSZHlUzIr/HgrFolsr96r6ClfUma96T7BIK21G4bX/lhZ3Wt3oIw4XCsQbU2CXVKb+rl0iPJWmH0hLqJQS3jgrMdGuLccWbHRNKW59t5UBAlBo1tWiy6LrqNCteg0m2JCWy/rFgm7+HW2mU6QCA9PS/WiZyABii13/QYB7iw1tqT1PDmMGH+3mnZNG35RvCCx6DHmf3jmEiUo5aAIsAct6grTovMTIiIKCHyaxC29V+q3x6i8GTzdLAxP5l/AZ85oUD4MD+Wn4Us94T6gMxOmGcwwWKkSJPwMw9SAh2EaSIAo+etLLwkJc+gMXSt7hTJe+HfVYqz0qJtbgJDSJpYrxz8G1Z5l97mGIUJTnaE1Mh6XcIclXCFC3sPaFFnKvnAL4xzbkCtyL9tT1jE4CmfnYWQFNZA2je4YSRk1pCxQiFNvrI9IzlkPoquYWcfUC+wkmrMJ/fFiWsCRVRvbR5oExenELywLdPOjNcOSXIRPo05CoQU=
root@rsthost2:~#

root@rsthost2:~# ss -tulpn | grep corosync
udp UNCONN 0 0 192.168.1.22:5405 0.0.0.0:* users:(("corosync",pid=1094,fd=28))

root@rsthost2:~# cat /etc/pve/.members
{
"nodename": "rsthost2",
"version": 3,
"cluster": { "name": "restore", "version": 11, "nodes": 4, "quorate": 0 },
"nodelist": {
"rsthost2": { "id": 2, "online": 1, "ip": "192.168.1.22"},
"fuji2": { "id": 4, "online": 0},
"rsthost4": { "id": 1, "online": 0},
"rsthost5": { "id": 3, "online": 0}
}
}
root@rsthost2:~#
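
So the old IP only survives in the cluster-wide known_hosts. If that stale entry should go, something like this ought to remove it once the node is quorate again (while there is no quorum, /etc/pve is read-only):

Bash:
# Remove the stale host key entry for the old IP (illustrative; adjust as needed)
ssh-keygen -R 192.168.165.22 -f /etc/pve/priv/known_hosts
# Refresh node certificates and related cluster key files
pvecm updatecerts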
 
As a side note, I moved all VMs off the node before the changes, so unloading Ceph and removing/rebuilding the node is not a problem.
 
I decided to remove its OSD from Ceph and destroy the node, then recreate it in its intended subnet, just to be sure I have a healthy node; I cannot rule out that something else went wrong on that node that I still haven't noticed :)

If anything comes up I'll update the thread! Thanks a lot for the help so far!
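
For the record, the rough sequence is something like this (a sketch; the OSD id here is hypothetical and the real one comes from `ceph osd tree`):

Bash:
# Hypothetical OSD id of the node being removed
OSD_ID=2
ceph osd out ${OSD_ID}
systemctl stop ceph-osd@${OSD_ID}
ceph osd purge ${OSD_ID} --yes-i-really-mean-it
# Then, from a quorate node, remove the node from the cluster
pvecm delnode rsthost2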
 