Change a cluster node's IP address (PVE 7.4)

eugenevdm

I had to migrate one of the nodes of my four-node cluster to another data centre, so its IP address had to change.

I've googled how to change the IP address of a node that is currently down, but I'm struggling to find a HOWTO or concrete information that doesn't overlap or contradict itself.

Are these the files to update?

#1 /etc/network/interfaces
#2 /etc/hosts
#3 /etc/corosync/corosync.conf

#1 and #2 are standard Debian sysadmin stuff.
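
For what it's worth, this is the kind of change I mean for #1 and #2 (a rough sketch; a.b.c.d and m.n.o.p stand in for the old and new addresses):

Bash:
# find everything that still references the old address
grep -n 'a.b.c.d' /etc/network/interfaces /etc/hosts

# the lines that matter are the "address" (and possibly "gateway") entries
# in /etc/network/interfaces and the "<ip> <hostname>" entry in /etc/hosts
sed -i 's/a\.b\.c\.d/m.n.o.p/g' /etc/network/interfaces /etc/hosts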

When I try to edit #3 I start getting read-only messages and then I get really scared I'm going to break my cluster.

What am I missing? Is this quite a complicated process? Is it officially supported at all? I once ended up with a broken cluster and I'm trying to reduce all risks.
 
Hi,

When I try to edit #3 I start getting read-only messages and then I get really scared I'm going to break my cluster.
This is due to a loss of quorum in your cluster or remaining node. You can try to set the expected votes to 1 with the following command:
Bash:
pvecm expected 1
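
Once the other nodes are reachable again, the expected votes should be recalculated upwards on their own; you can verify it, for example, with:
Bash:
# check that quorum and the vote counts are back to normal
pvecm status | grep -E 'Expected votes|Total votes|Quorate'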
 
Thanks @Moayad. Unfortunately I didn't see your reply, and below is how I think I fixed the problem.

Your technique is "change quorum to have one expected vote". It seems what I found was "make the file system not read-only so changes can take effect". With regards to pvecm expected 1, is there something one is supposed to do to make it expect more than 1 again?

Anyway, this is how I fixed it on the three other nodes (this might be high risk), and I still have one remaining problem:

service corosync stop
pmxcfs -l
vi /etc/pve/corosync.conf
vi /etc/corosync/corosync.conf
  • In both files I changed the old IP address to the new one (see the excerpt below).
  • I increased config_version in the totem section.
service corosync start
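
For illustration, the relevant parts of corosync.conf look roughly like this (the node name below is a placeholder; the version and address match the status output further down):

totem {
  cluster_name: cluster
  # bumped by one from the previous value
  config_version: 10
  # (other totem settings unchanged)
}

nodelist {
  node {
    # "node4" is a placeholder name
    name: node4
    nodeid: 4
    quorum_votes: 1
    # old address replaced with the new one here
    ring0_addr: m.n.o.p
  }
  # (other node entries unchanged)
}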

Somewhere I might have also tried service pve-cluster restart, perhaps on all four nodes.
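
To sanity-check that corosync really picked up the new address on each node, something like this helps (a sketch, not a guarantee):
Bash:
# show the links/addresses corosync is currently using on this node
corosync-cfgtool -s

# confirm /etc/pve is writable again, i.e. the node is back in quorum
# (the scratch filename is arbitrary)
touch /etc/pve/ip-change-test && rm /etc/pve/ip-change-test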

Anyway, for now corosync seems happy on all four nodes:

pvecm status
Cluster information
-------------------
Name: cluster
Config Version: 10
Transport: knet
Secure auth: on


Quorum information
------------------
Date: Tue Apr 30 05:43:09 2024
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000004
Ring ID: 1.b596
Quorate: Yes


Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate


Membership information
----------------------
Nodeid Votes Name
0x00000001 1 a.b.c.d
0x00000002 1 e.f.g.h
0x00000003 1 i.j.k.l
0x00000004 1 m.n.o.p (local)

Except that I then started having this repeat on the other three nodes!

Apr 30 05:44:51 host pmxcfs[3870530]: [main] crit: unable to acquire pmxcfs lock: Resource temporarily unavailable
Apr 30 05:44:51 host pmxcfs[3870530]: [main] notice: exit proxmox configuration filesystem (-1)
Apr 30 05:44:51 host pmxcfs[3870530]: [main] crit: unable to acquire pmxcfs lock: Resource temporarily unavailable
Apr 30 05:44:51 host pmxcfs[3870530]: [main] notice: exit proxmox configuration filesystem (-1)
Apr 30 05:44:51 host systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Apr 30 05:44:51 host systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Apr 30 05:44:51 host systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Apr 30 05:44:51 host systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 21826.
Apr 30 05:44:51 host systemd[1]: Stopped The Proxmox VE cluster filesystem.
Apr 30 05:44:51 host systemd[1]: Starting The Proxmox VE cluster filesystem...
Apr 30 05:44:51 host pmxcfs[3870734]: [main] notice: unable to acquire pmxcfs lock - trying again
Apr 30 05:44:51 host pmxcfs[3870734]: [main] notice: unable to acquire pmxcfs lock - trying again

I found advice to reboot. This worked on two of the nodes.

The third node is extremely hard to reboot because of the downtime it causes.

Do you by chance have any tips to fix the above lock problem without rebooting?
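
(My untested guess is that the pmxcfs -l I had started by hand was still running and holding the lock, so the service could never acquire it. Something along these lines might have cleared it without a full reboot:)
Bash:
# check whether a manually started pmxcfs (e.g. "pmxcfs -l") is still running
ps -C pmxcfs -o pid,args

# if so, stop it cleanly and let systemd bring the regular service back up
umount /etc/pve          # add -l (lazy) only if it refuses because it is busy
killall pmxcfs
systemctl restart pve-cluster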

May I request that in the future Proxmox release an official way to change the IP address of a cluster node? It seems this is not an exact science.

Edit: I found the courage to reboot the server a few days later. The server didn't want to boot and got stuck on startup:

DMAR-IR: [Firmware Bug]: ioapic 3 has no mapping iommu, interrupt remapping will be disabled

While it hung I did some urgent research on how to fix it. The motherboard is a Supermicro X10DRL-i.
It was very hard to find any information on this problem, but I found an entire section in the Proxmox documentation devoted to IOMMU:
https://pve.proxmox.com/wiki/PCI_Passthrough#Verifying_IOMMU_parameters
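
That wiki section boils down to checking what the kernel reports and which options were passed at boot, roughly:
Bash:
# what the kernel says about DMAR / IOMMU / interrupt remapping
dmesg | grep -e DMAR -e IOMMU -e remapping

# which iommu-related options are on the kernel command line
cat /proc/cmdline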

So after around 25 minutes and mild panic I decided to try booting again.
The same message appeared, but this time the system booted!

But then, after the server had booted, some of my volumes did not mount:

activating LV 'raid0/raid0' failed: Activation of logical volume raid0/raid0 is prohibited while logical volume raid0/raid0_tmeta is active

This problem is well documented and has to do with boot-time delays and LVM activation on Debian. Fortunately, using lvchange -an on both the tmeta and tdata volumes worked (roughly as sketched below), after a few tries and lots of waiting.
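
The gist, as far as I understand it, is to deactivate the stray sub-volumes first and then activate the pool again:
Bash:
# the pool's metadata/data sub-volumes were activated on their own,
# which blocks activating the pool itself, so deactivate them first
lvchange -an raid0/raid0_tmeta
lvchange -an raid0/raid0_tdata

# then activate the pool (or the whole VG) normally
lvchange -ay raid0/raid0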

Total downtime was around 1 hour, but it was 2 AM so the damage was contained.

Conclusion: rebooting (at least on 7.4) is not fun, but after 5 days my cluster is finally back in a consistent state.
 
