[SOLVED] Changed cluster node IP and now it's isolated

SergioRius

Renowned Member
Mar 11, 2015
I've changed the IP of one node in the cluster as instructed here, rebooted, and now the node is out of the cluster and refuses to bring up any VM or CT.

The steps I've done are:
  • Backed up all VMs/CTs.
  • Set the new VLAN tag on all the CTs/VMs (it was a VLAN + IP change).
  • Changed /etc/network/interfaces with the new IP and the new VLAN settings.
  • Changed the IP in /etc/hosts.
  • Changed the IP in /etc/pve/corosync.conf and bumped the config_version by 1 (see the sketch below).
  • Powered off.
  • Changed the switch port to a trunk one (it was in an untagged/fixed vlan)
  • Rebooted.
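For reference, the corosync.conf edit followed the usual copy-edit-move pattern, roughly like this (a sketch; pmxcfs only propagates the file once the finished copy is moved into place):
Bash:
# work on a copy so pmxcfs never sees a half-edited file
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
nano /etc/pve/corosync.conf.new    # set the node's ring0_addr to the new IP
                                   # and increase config_version by 1
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf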
Now the node boots normally, correctly sets the new IP, and is reachable through SSH and its own interface, but it doesn't communicate with the cluster.
Perhaps there's something new and silly that needs to be done in newer versions.
Can anyone please give me a hand to fix it?

Edit:
The changes made in corosync have propagated to the other nodes in the cluster, but there is still no communication.
Bash:
# pvecm status
Cluster information
-------------------
Name:             rainland
Config Version:   7
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Dec 20 21:57:30 2022
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2.6d0
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      1
Quorum:           3 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.1.3.22 (local)
Log (in the last lines I overrode quorum so I could start the services; the command is sketched after the log):
Bash:
Dec 20 20:16:29 core pmxcfs[1156]: [dcdb] notice: data verification successful
Dec 20 20:27:58 core pmxcfs[1156]: [status] notice: node lost quorum
Dec 20 20:27:58 core pmxcfs[1156]: [dcdb] notice: members: 2/1156
Dec 20 20:27:58 core pmxcfs[1156]: [status] notice: members: 2/1156
Dec 20 20:27:58 core pmxcfs[1156]: [dcdb] crit: received write while not quorate - trigger resync
Dec 20 20:27:58 core pmxcfs[1156]: [dcdb] crit: leaving CPG group
Dec 20 20:27:59 core pmxcfs[1156]: [dcdb] notice: start cluster connection
Dec 20 20:27:59 core pmxcfs[1156]: [dcdb] crit: cpg_join failed: 14
Dec 20 20:27:59 core pmxcfs[1156]: [dcdb] crit: can't initialize service
Dec 20 20:28:05 core pmxcfs[1156]: [dcdb] notice: members: 2/1156
Dec 20 20:28:05 core pmxcfs[1156]: [dcdb] notice: all data is up to date
Dec 20 20:37:22 core pmxcfs[1156]: [dcdb] notice: data verification successful
Dec 20 20:41:13 core pmxcfs[1156]: [dcdb] notice: members: 1/894, 2/1156, 3/843, 4/2503
Dec 20 20:41:13 core pmxcfs[1156]: [dcdb] notice: starting data syncronisation
Dec 20 20:41:13 core pmxcfs[1156]: [status] notice: members: 1/894, 2/1156, 3/843, 4/2503
Dec 20 20:41:13 core pmxcfs[1156]: [status] notice: starting data syncronisation
Dec 20 20:41:13 core pmxcfs[1156]: [status] notice: node has quorum
Dec 20 20:41:13 core pmxcfs[1156]: [dcdb] notice: received sync request (epoch 1/894/00000012)
Dec 20 20:41:13 core pmxcfs[1156]: [status] notice: received sync request (epoch 1/894/0000000E)
Dec 20 20:41:13 core pmxcfs[1156]: [dcdb] notice: received all states
Dec 20 20:41:13 core pmxcfs[1156]: [dcdb] notice: leader is 1/894
Dec 20 20:41:13 core pmxcfs[1156]: [dcdb] notice: synced members: 1/894, 3/843, 4/2503
Dec 20 20:41:13 core pmxcfs[1156]: [dcdb] notice: waiting for updates from leader
Dec 20 20:41:13 core pmxcfs[1156]: [status] notice: received all states
Dec 20 20:41:13 core pmxcfs[1156]: [status] notice: all data is up to date
Dec 20 20:41:13 core pmxcfs[1156]: [dcdb] notice: update complete - trying to commit (got 4 inode updates)
Dec 20 20:41:13 core pmxcfs[1156]: [dcdb] notice: all data is up to date
Dec 20 20:51:04 core pmxcfs[1156]: [status] notice: received log
Dec 20 21:06:04 core pmxcfs[1156]: [status] notice: received log
Dec 20 21:08:42 core pmxcfs[1156]: [status] notice: received log
Dec 20 21:08:42 core pmxcfs[1156]: [status] notice: received log
Dec 20 21:08:44 core pmxcfs[1156]: [status] notice: received log
Dec 20 21:09:49 core pmxcfs[1156]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 7)
Dec 20 21:09:50 core pmxcfs[1156]: [dcdb] crit: corosync-cfgtool -R failed with exit code 7#010
Dec 20 21:12:39 core pmxcfs[1156]: [confdb] crit: cmap_dispatch failed: 2
Dec 20 21:12:39 core pmxcfs[1156]: [quorum] crit: quorum_dispatch failed: 2
Dec 20 21:12:39 core pmxcfs[1156]: [status] notice: node lost quorum
Dec 20 21:12:39 core pmxcfs[1156]: [dcdb] crit: cpg_dispatch failed: 2
Dec 20 21:12:39 core pmxcfs[1156]: [dcdb] crit: cpg_leave failed: 2
Dec 20 21:12:39 core pmxcfs[1156]: [status] crit: cpg_dispatch failed: 2
Dec 20 21:12:39 core pmxcfs[1156]: [status] crit: cpg_leave failed: 2
Dec 20 21:12:39 core pmxcfs[1156]: [quorum] crit: quorum_initialize failed: 2
Dec 20 21:12:39 core pmxcfs[1156]: [quorum] crit: can't initialize service
Dec 20 21:12:39 core pmxcfs[1156]: [confdb] crit: cmap_initialize failed: 2
Dec 20 21:12:39 core pmxcfs[1156]: [confdb] crit: can't initialize service
Dec 20 21:12:39 core pmxcfs[1156]: [dcdb] notice: start cluster connection
Dec 20 21:12:39 core pmxcfs[1156]: [dcdb] crit: cpg_initialize failed: 2
Dec 20 21:12:39 core pmxcfs[1156]: [dcdb] crit: can't initialize service
Dec 20 21:12:39 core pmxcfs[1156]: [status] notice: start cluster connection
Dec 20 21:12:39 core pmxcfs[1156]: [status] crit: cpg_initialize failed: 2
Dec 20 21:12:39 core pmxcfs[1156]: [status] crit: can't initialize service
Dec 20 21:12:40 core systemd[1]: Stopping The Proxmox VE cluster filesystem...
Dec 20 21:12:40 core pmxcfs[1156]: [main] notice: teardown filesystem
Dec 20 21:12:41 core pmxcfs[1156]: [quorum] crit: quorum_finalize failed: 9
Dec 20 21:12:41 core pmxcfs[1156]: [confdb] crit: cmap_track_delete nodelist failed: 9
Dec 20 21:12:41 core pmxcfs[1156]: [confdb] crit: cmap_track_delete version failed: 9
Dec 20 21:12:41 core pmxcfs[1156]: [confdb] crit: cmap_finalize failed: 9
Dec 20 21:12:41 core pmxcfs[1156]: [main] notice: exit proxmox configuration filesystem (0)
Dec 20 21:12:41 core systemd[1]: pve-cluster.service: Succeeded.
Dec 20 21:12:41 core systemd[1]: Stopped The Proxmox VE cluster filesystem.
Dec 20 21:12:41 core systemd[1]: pve-cluster.service: Consumed 6min 29.513s CPU time.
-- Boot ddeb7fdd72f24a028bdeff71910fe3b4 --
Dec 20 21:13:30 core systemd[1]: Starting The Proxmox VE cluster filesystem...
Dec 20 21:13:30 core pmxcfs[1195]: [quorum] crit: quorum_initialize failed: 2
Dec 20 21:13:30 core pmxcfs[1195]: [quorum] crit: can't initialize service
Dec 20 21:13:30 core pmxcfs[1195]: [confdb] crit: cmap_initialize failed: 2
Dec 20 21:13:30 core pmxcfs[1195]: [confdb] crit: can't initialize service
Dec 20 21:13:30 core pmxcfs[1195]: [dcdb] crit: cpg_initialize failed: 2
Dec 20 21:13:30 core pmxcfs[1195]: [dcdb] crit: can't initialize service
Dec 20 21:13:30 core pmxcfs[1195]: [status] crit: cpg_initialize failed: 2
Dec 20 21:13:30 core pmxcfs[1195]: [status] crit: can't initialize service
Dec 20 21:13:31 core systemd[1]: Started The Proxmox VE cluster filesystem.
Dec 20 21:13:36 core pmxcfs[1195]: [status] notice: update cluster info (cluster name  rainland, version = 7)
Dec 20 21:13:36 core pmxcfs[1195]: [dcdb] notice: members: 2/1195
Dec 20 21:13:36 core pmxcfs[1195]: [dcdb] notice: all data is up to date
Dec 20 21:13:36 core pmxcfs[1195]: [status] notice: members: 2/1195
Dec 20 21:13:36 core pmxcfs[1195]: [status] notice: all data is up to date
Dec 20 21:34:12 core pmxcfs[1195]: [status] notice: node has quorum
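For reference, the quorum override mentioned above amounts to lowering the expected votes on the isolated node, something like:
Bash:
# temporarily accept a single vote as quorate so local services can start
# (only a stop-gap; revert/reboot once the cluster is healthy again)
pvecm expected 1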
 
Last edited:
Changed the switch port to a trunk one (it was in an untagged/fixed vlan)
Maybe you forgot to allow the (new) vlan(s) of corosync on all ports of your switch where your nodes are plugged in? Can you ping 10.1.3.22 from the other nodes?
The same goes for bonds/multiple corosync rings.

For example, when you move all nodes from a previously untagged network to a tagged VLAN, the changes need to be made on all hosts and on all switch ports before they can see/ping each other again.

If all is correct and ping works, reboot all nodes, not only the changed one.
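Something like this, run from every node against each of the other nodes' new corosync addresses, would confirm it:
Bash:
# basic reachability over the new VLAN
ping -c 3 10.1.3.22
# and the link status as corosync/knet itself sees it
corosync-cfgtool -s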
 
Last edited:

Thanks for your help.
The other three nodes were already on a trunk configuration. It was only this node that originally had all its services on a single VLAN, which is why it was set up like that.
Anyway, ping works between all nodes. I had to add hosts entries so they resolve each other's names; ping by IP already worked.

As you suggested, I tried rebooting another node, and it also dropped out of the cluster. Now I have two isolated nodes.
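The hosts entries were roughly like the following on each node (names and addresses taken from the logs; the remaining two nodes are analogous):
Bash:
# appended to /etc/hosts so the nodes resolve each other by name
10.1.3.22   core
10.1.3.23   betelgeuse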

Logs from corosync on node 10.1.3.23 show something:
Bash:
Dec 21 00:49:12 betelgeuse corosync[2508]:   [KNET  ] rx: Packet rejected from 10.1.3.22:5405
Dec 21 00:49:12 betelgeuse corosync[2508]:   [QUORUM] Sync members[2]: 1 4
Dec 21 00:49:12 betelgeuse corosync[2508]:   [TOTEM ] A new membership (1.b4f) was formed. Members
Dec 21 00:49:13 betelgeuse corosync[2508]:   [KNET  ] rx: Packet rejected from 10.1.3.22:5405
Dec 21 00:49:14 betelgeuse corosync[2508]:   [KNET  ] rx: Packet rejected from 10.1.3.22:5405
Dec 21 00:49:16 betelgeuse corosync[2508]:   [KNET  ] rx: Packet rejected from 10.1.3.22:5405
Dec 21 00:49:17 betelgeuse corosync[2508]:   [KNET  ] rx: Packet rejected from 10.1.3.22:5405
Dec 21 00:49:17 betelgeuse corosync[2508]:   [QUORUM] Sync members[2]: 1 4
Dec 21 00:49:17 betelgeuse corosync[2508]:   [TOTEM ] A new membership (1.b53) was formed. Members
Dec 21 00:49:18 betelgeuse corosync[2508]:   [KNET  ] rx: Packet rejected from 10.1.3.22:5405
 
Last edited:
Very strange. :oops: I played through something similar a few days ago.

I double-checked /etc/pve/corosync.conf, overrode quorum to make the needed changes possible, shut down all nodes and rebooted them one after another. All went fine.
 
Also, I'm experiencing a lot of lag whenever I restart corosync or run pvecm status.
It's strange.

I'm reading about IGMP snooping on the switch, but it's not clear to me how it has to be configured, especially the querier part.
 
Last edited:
Maybe you can find something here https://bugzilla.redhat.com/show_bug.cgi?id=1153818

If I understand https://forum.proxmox.com/threads/239-192-142-10.64973/post-293603 correctly, it doesn't matter if the switch has IGMP snooping on or off. With it off you only have more overhead/traffic on all ports, but it shouldn't interfere with corosync.

https://forum.proxmox.com/threads/cluster-node-ip-change.117359/ <- Reading this, it could mean that all the steps need to be done one more time, with config_version bumped by 1 again?

Another idea (you have backups?): Shut down all nodes. Reboot the switch to make sure all cached MACs are flushed. Boot the isolated node1, make the changes and bump config_version by 1, then reboot node1. Double-check all configs. Now boot node2 and see if they see each other in pvecm status.
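It may also be worth double-checking that the corosync config and authkey really are identical on all nodes before the big shutdown, since rejected knet packets often point at a mismatch (a sketch, assuming the standard PVE paths):
Bash:
# the local corosync copy should match the cluster-wide one
diff /etc/pve/corosync.conf /etc/corosync/corosync.conf
# the shared key must be byte-identical on every node (compare checksums)
md5sum /etc/corosync/authkey
# after fixing a mismatch, restart the stack on the affected node
systemctl restart corosync pve-cluster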
 
Last edited:
So it was finally solved by shutting down the 4th node and then rebooting the 1st node, which still hadn't been rebooted.
The cluster formed again, and after starting the 4th node I had everything running again.
 
