Thanks @Moayad. Unfortunately I didn't see your reply, and below is how I think I fixed the problem.
Your technique is "
change quorum to have one expected vote". It seems what I found what "
make file system not read-only to affect changes". With regards to
pvecm expected 1
, is there something one is supposed to do to make it expect more than 1 again?
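My own guess (just an assumption on my part, not verified) is that it can be set back the same way once all nodes are reachable again, e.g. on a 4-node cluster:

pvecm expected 1
# ...and later, back to the real node count:
pvecm expected 4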
Anyway, this is how I fixed it on the three other nodes (this might be high risk), and I still have one remaining problem:
service corosync stop
pmxcfs -l
vi /etc/pve/corosync.conf
vi /etc/corosync/corosync.conf
- I changed the old IP address to the new IP address.
- I increased config_version in the totem section.
service corosync start
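For clarity, the relevant parts of the edited file looked roughly like this (node names and addresses are placeholders and I'm only showing one node entry); the only changes were the renumbered node's ring0_addr and the bumped config_version:

totem {
  cluster_name: cluster
  # config_version incremented so the new configuration takes precedence
  config_version: 10
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}

nodelist {
  # other three node entries omitted
  node {
    name: node4
    nodeid: 4
    quorum_votes: 1
    # old IP replaced with the new one
    ring0_addr: m.n.o.p
  }
}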
Somewhere I might have also tried service pve-cluster restart, perhaps on all four nodes.
Anyway, for now corosync seems happy on all four nodes:
pvecm status
Cluster information
-------------------
Name: cluster
Config Version: 10
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Tue Apr 30 05:43:09 2024
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000004
Ring ID: 1.b596
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 a.b.c.d
0x00000002 1 e.f.g.h
0x00000003 1 i.j.k.l
0x00000004 1 m.n.o.p (local)
Except that I then started seeing this repeating on the other three nodes!
Apr 30 05:44:51 host pmxcfs[3870530]: [main] crit: unable to acquire pmxcfs lock: Resource temporarily unavailable
Apr 30 05:44:51 host pmxcfs[3870530]: [main] notice: exit proxmox configuration filesystem (-1)
Apr 30 05:44:51 host pmxcfs[3870530]: [main] crit: unable to acquire pmxcfs lock: Resource temporarily unavailable
Apr 30 05:44:51 host pmxcfs[3870530]: [main] notice: exit proxmox configuration filesystem (-1)
Apr 30 05:44:51 host systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Apr 30 05:44:51 host systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Apr 30 05:44:51 host systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Apr 30 05:44:51 host systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 21826.
Apr 30 05:44:51 host systemd[1]: Stopped The Proxmox VE cluster filesystem.
Apr 30 05:44:51 host systemd[1]: Starting The Proxmox VE cluster filesystem...
Apr 30 05:44:51 host pmxcfs[3870734]: [main] notice: unable to acquire pmxcfs lock - trying again
Apr 30 05:44:51 host pmxcfs[3870734]: [main] notice: unable to acquire pmxcfs lock - trying again
I found advice to reboot, and this worked on two of the nodes.
The third node is extremely hard to reboot because of the downtime it causes.
Do you by chance have any tips to fix the above lock problem without rebooting?
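My working theory (an assumption, not verified) is that the pmxcfs -l instance started by hand earlier may still be holding the lock, so something along these lines might free it without a reboot:

# Check whether a leftover pmxcfs process (e.g. the one started with -l) is still running
ps aux | grep '[p]mxcfs'
# Stop the service, end any leftover pmxcfs process, then start the service again
systemctl stop pve-cluster
killall pmxcfs
systemctl start pve-cluster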
May I request that Proxmox publish an official, documented procedure for changing the IP address of a cluster node in the future, as it seems this is not an exact science.
Edit: I found the courage to reboot the server a few days later. The server didn't want to boot and got stuck on startup:
DMAR-IR: [Firmware Bug]: ioapic 3 has no mapping iommu, interrupt remapping will be disabled
While it hung I did some urgent research to see how to fix it. The motherboard is a Supermicro X10DRL-i.
It was very hard to find any information on how to fix this problem, but I did find an entire section of the Proxmox documentation devoted to IOMMU:
https://pve.proxmox.com/wiki/PCI_Passthrough#Verifying_IOMMU_parameters
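That page suggests checking the IOMMU and interrupt-remapping state from a booted system, roughly like this (this only verifies the state after boot, it does not fix the firmware bug itself):

dmesg | grep -e DMAR -e IOMMU
dmesg | grep 'remapping'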
So after around 25 minutes and mild panic I decided to try booting again.
The same message appeared again, but the system booted!
But then, after the server had booted, some of my volumes did not mount:
activating LV 'raid0/raid0' failed: Activation of logical volume raid0/raid0 is prohibited while logical volume raid0/raid0_tmeta is active
This problem is well documented and has to do with Debian boot delays and LVM. Fortunately, running lvchange -an on both the tmeta and tdata volumes worked, after a few tries and lots of waiting.
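For reference, the recovery looked roughly like this (the volume group and LV names are taken from my error message above; adjust them to your own setup):

# Deactivate the thin-pool sub-LVs that were activated on their own
lvchange -an raid0/raid0_tmeta
lvchange -an raid0/raid0_tdata
# Then activate the pool itself
lvchange -ay raid0/raid0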
Total downtime was around 1 hour, but it was 2 AM, so the damage was contained.
Conclusion: Rebooting (at least on 7.4) is not fun but finally after 5 days my cluster is back to a consistent state.