Cluster not working after moving all nodes to a new data center

mehhos

Hi,
I hope someone can help me here.
We moved our Proxmox cluster (6 nodes) to a new data center, and since then the cluster is not working. I restarted the cluster, but it didn't help:
root@vsr-app1:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: vsr-app5
    nodeid: 2
    quorum_votes: 1
    ring0_addr: vsr-app5
  }

  node {
    name: vsr-app3
    nodeid: 4
    quorum_votes: 1
    ring0_addr: vsr-app3
  }

  node {
    name: vsr-app6
    nodeid: 1
    quorum_votes: 1
    ring0_addr: vsr-app6
  }

  node {
    name: vsr-app2
    nodeid: 5
    quorum_votes: 1
    ring0_addr: vsr-app2
  }

  node {
    name: vsr-app1
    nodeid: 6
    quorum_votes: 1
    ring0_addr: vsr-app1
  }

  node {
    name: vsr-app4
    nodeid: 3
    quorum_votes: 1
    ring0_addr: vsr-app4
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: vsrappcluster1
  config_version: 6
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.112.65.19
    ringnumber: 0
  }
}

root@vsr-app1:~#

root@vsr-app1:~# pvecm status
Quorum information
------------------
Date: Wed Mar 8 14:10:48 2023
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000006
Ring ID: 6/516
Quorate: No

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 1
Quorum: 4 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000006 1 10.112.65.14 (local)
root@vsr-app1:~#

root@vsr-app1:~# service pve-cluster status
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: failed (Result: exit-code) since Wed 2023-03-08 13:38:48 CET; 32min ago
Process: 39680 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
Process: 29490 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=255)
Main PID: 39678 (code=exited, status=0/SUCCESS)

Mar 08 13:38:38 vsr-app1 pmxcfs[29490]: [main] notice: unable to aquire pmxcfs lock - trying again
Mar 08 13:38:38 vsr-app1 pmxcfs[29490]: [main] notice: unable to aquire pmxcfs lock - trying again
Mar 08 13:38:48 vsr-app1 pmxcfs[29490]: [main] crit: unable to aquire pmxcfs lock: Resource temporarily unavailable
Mar 08 13:38:48 vsr-app1 pmxcfs[29490]: [main] crit: unable to aquire pmxcfs lock: Resource temporarily unavailable
Mar 08 13:38:48 vsr-app1 pmxcfs[29490]: [main] notice: exit proxmox configuration filesystem (-1)
Mar 08 13:38:48 vsr-app1 systemd[1]: pve-cluster.service: control process exited, code=exited status=255
Mar 08 13:38:48 vsr-app1 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Mar 08 13:38:48 vsr-app1 systemd[1]: Unit pve-cluster.service entered failed state.
root@vsr-app1:~#

root@vsr-app1:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: active (running) since Wed 2023-03-08 13:37:23 CET; 34min ago
Process: 29120 ExecStop=/usr/share/corosync/corosync stop (code=exited, status=0/SUCCESS)
Process: 29141 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
Main PID: 29152 (corosync)
CGroup: /system.slice/corosync.service
└─29152 corosync

Mar 08 13:40:13 vsr-app1 corosync[29152]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 08 13:40:18 vsr-app1 corosync[29152]: [TOTEM ] A new membership (10.112.65.14:508) was formed. Members
Mar 08 13:40:18 vsr-app1 corosync[29152]: [QUORUM] Members[1]: 6
Mar 08 13:40:18 vsr-app1 corosync[29152]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 08 13:40:23 vsr-app1 corosync[29152]: [TOTEM ] A new membership (10.112.65.14:512) was formed. Members
Mar 08 13:40:23 vsr-app1 corosync[29152]: [QUORUM] Members[1]: 6
Mar 08 13:40:23 vsr-app1 corosync[29152]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 08 13:40:27 vsr-app1 corosync[29152]: [TOTEM ] A new membership (10.112.65.14:516) was formed. Members
Mar 08 13:40:27 vsr-app1 corosync[29152]: [QUORUM] Members[1]: 6
Mar 08 13:40:27 vsr-app1 corosync[29152]: [MAIN ] Completed service synchronization, ready to provide service.
root@vsr-app1:~#


root@vsr-app1:~# systemctl status pveproxy
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
Active: active (running) since Wed 2023-03-08 06:25:16 CET; 7h ago
Process: 17170 ExecStop=/usr/bin/pveproxy stop (code=exited, status=0/SUCCESS)
Process: 17233 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
Main PID: 17238 (pveproxy)
CGroup: /system.slice/pveproxy.service
├─17238 pveproxy
├─36752 pveproxy worker
├─38228 pveproxy worker
└─46986 pveproxy worker

Mar 08 13:51:22 vsr-app1 pveproxy[17960]: problem with client 10.80.45.53; ssl3_read_bytes: ssl handshake failure
Mar 08 13:51:22 vsr-app1 pveproxy[17960]: Can't call method "timeout_reset" on an undefined value at /usr/share/perl5/PVE/HTTPServer.pm line 227.
Mar 08 13:51:34 vsr-app1 pveproxy[34603]: worker exit
Mar 08 13:51:34 vsr-app1 pveproxy[17238]: worker 34603 finished
Mar 08 13:51:34 vsr-app1 pveproxy[17238]: starting 1 worker(s)
Mar 08 13:51:34 vsr-app1 pveproxy[17238]: worker 38228 started
Mar 08 14:05:00 vsr-app1 pveproxy[17960]: worker exit
Mar 08 14:05:00 vsr-app1 pveproxy[17238]: worker 17960 finished
Mar 08 14:05:00 vsr-app1 pveproxy[17238]: starting 1 worker(s)
Mar 08 14:05:00 vsr-app1 pveproxy[17238]: worker 46986 started
root@vsr-app1:~#
 

Attachments

  • proxmox_1.jpg (163.2 KB)
Can you ping between the nodes on the corosync interface?
Did any IP addresses change while moving?
 
Expected votes: 6
Highest expected: 6
Total votes: 1
Quorum: 4 Activity blocked
This host is isolated; corosync was not able to find its five neighbours. You need to re-establish network connectivity, in a way that is compatible with the previous setup...
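A quick way to verify that from the shell is a plain ping plus omping (with a totem config like this, corosync 2.x typically uses UDP multicast, which ping alone does not test). The IPs below are only examples taken from the outputs in this thread; the omping command has to run on both nodes at the same time:
Code:
# on vsr-app1 (10.112.65.14), testing against e.g. vsr-app3 (10.112.65.16)
ping -c 3 10.112.65.16
# run the same omping on both nodes simultaneously;
# it tests unicast and multicast UDP between them
omping -c 60 -i 1 -q 10.112.65.14 10.112.65.16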

Good luck!
 
Thanks for the reply. All nodes are in the same VLAN, and all nodes can SSH to each other.
 
Again: "Total votes: 1 Quorum: 4 Activity blocked" tells you that the cluster network does not work. What is the output of corosync-cfgtool -s?

SSH is not sufficient for this. And in your config there are lines like "ring0_addr: vsr-app3". Do these names resolve?
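One way to check is to compare what the names resolve to locally on each node (getent uses the normal resolver order, usually /etc/hosts first, then DNS) with the corosync IPs seen above; the hostnames below are just two entries from the config as an example:
Code:
# run on every node
getent hosts vsr-app3
getent hosts vsr-app1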
 
Again: "Total votes: 1 Quorum: 4 Activity blocked" tells you that the cluster network does not work. What is the output of corosync-cfgtool -s?

SSH is not sufficient for this. And in your config there are lines like "ring0_addr: vsr-app3". Do these names resolve?
root@vso-app1:~# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
id = 10.112.14.163
status = ring 0 active with no faults
root@vso-app1:~#
 
root@vso-app1:~# nslookup vsr-app3
Server: 89.254.64.20
Address: 89.254.64.20#53

Non-authoritative answer:
Name: vsr-app3.dax.net
Address: 10.112.65.16

root@vso-app1:~#
 
This IP looks off in the output of corosync-cfgtool:
Code:
id = 10.112.14.163

The IPs from pvecm and from nslookup appear to be in a different subnet than that one (assuming /24 subnets):
Code:
10.112.65.16

The output of corosync-cfgtool also looks a bit different than on my local cluster; which versions are you running? (pveversion -v)

What do the hosts files look like? Might the IPs there be wrong? Are you trying to ping with the IP or the hostname? Are UDP packets maybe blocked?
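For the UDP question specifically, something like the following could help as a rough check (corosync normally uses UDP ports 5404/5405; the commands are only a sketch):
Code:
# show the UDP sockets corosync has open on this node
ss -uapn | grep corosync
# look for local firewall rules mentioning the corosync ports
iptables -L -n | grep -E '5404|5405'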

I would also recommend using IPs instead of hostnames in the corosync configuration.
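As a sketch of what that could look like for one node entry (the IP is taken from the nslookup output above, and config_version in the totem section has to be increased whenever the file is edited):
Code:
node {
  name: vsr-app3
  nodeid: 4
  quorum_votes: 1
  ring0_addr: 10.112.65.16
}

Keep in mind that /etc/pve/corosync.conf is read-only while the node has no quorum, so this change usually has to wait until the cluster network works again, or be done very carefully via the local /etc/corosync/corosync.conf.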

As @UdoB already pointed out, this is certainly a problem with the network configuration somewhere.


From at least two hosts, can you provide the following output?
Code:
cat /etc/network/interfaces
systemctl status networking
ip a
cat /etc/hosts
ping <other_host_ip>
ping <other_host_hostname>
pvecm status
 