Cluster not working after moving all nodes to a new data center

mehhos

Hi,
I hope someone can help me here.
We moved our Proxmox cluster (6 nodes) to a new data center, and since then the cluster is not working. I restarted the cluster, but it didn't help:
root@vsr-app1:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: vsr-app5
    nodeid: 2
    quorum_votes: 1
    ring0_addr: vsr-app5
  }

  node {
    name: vsr-app3
    nodeid: 4
    quorum_votes: 1
    ring0_addr: vsr-app3
  }

  node {
    name: vsr-app6
    nodeid: 1
    quorum_votes: 1
    ring0_addr: vsr-app6
  }

  node {
    name: vsr-app2
    nodeid: 5
    quorum_votes: 1
    ring0_addr: vsr-app2
  }

  node {
    name: vsr-app1
    nodeid: 6
    quorum_votes: 1
    ring0_addr: vsr-app1
  }

  node {
    name: vsr-app4
    nodeid: 3
    quorum_votes: 1
    ring0_addr: vsr-app4
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: vsrappcluster1
  config_version: 6
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.112.65.19
    ringnumber: 0
  }
}

root@vsr-app1:~#

root@vsr-app1:~# pvecm status
Quorum information
------------------
Date: Wed Mar 8 14:10:48 2023
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000006
Ring ID: 6/516
Quorate: No

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 1
Quorum: 4 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000006 1 10.112.65.14 (local)
root@vsr-app1:~#

root@vsr-app1:~# service pve-cluster status
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: failed (Result: exit-code) since Wed 2023-03-08 13:38:48 CET; 32min ago
Process: 39680 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
Process: 29490 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=255)
Main PID: 39678 (code=exited, status=0/SUCCESS)

Mar 08 13:38:38 vsr-app1 pmxcfs[29490]: [main] notice: unable to aquire pmxcfs lock - trying again
Mar 08 13:38:38 vsr-app1 pmxcfs[29490]: [main] notice: unable to aquire pmxcfs lock - trying again
Mar 08 13:38:48 vsr-app1 pmxcfs[29490]: [main] crit: unable to aquire pmxcfs lock: Resource temporarily unavailable
Mar 08 13:38:48 vsr-app1 pmxcfs[29490]: [main] crit: unable to aquire pmxcfs lock: Resource temporarily unavailable
Mar 08 13:38:48 vsr-app1 pmxcfs[29490]: [main] notice: exit proxmox configuration filesystem (-1)
Mar 08 13:38:48 vsr-app1 systemd[1]: pve-cluster.service: control process exited, code=exited status=255
Mar 08 13:38:48 vsr-app1 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Mar 08 13:38:48 vsr-app1 systemd[1]: Unit pve-cluster.service entered failed state.
root@vsr-app1:~#

root@vsr-app1:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: active (running) since Wed 2023-03-08 13:37:23 CET; 34min ago
Process: 29120 ExecStop=/usr/share/corosync/corosync stop (code=exited, status=0/SUCCESS)
Process: 29141 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
Main PID: 29152 (corosync)
CGroup: /system.slice/corosync.service
└─29152 corosync

Mar 08 13:40:13 vsr-app1 corosync[29152]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 08 13:40:18 vsr-app1 corosync[29152]: [TOTEM ] A new membership (10.112.65.14:508) was formed. Members
Mar 08 13:40:18 vsr-app1 corosync[29152]: [QUORUM] Members[1]: 6
Mar 08 13:40:18 vsr-app1 corosync[29152]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 08 13:40:23 vsr-app1 corosync[29152]: [TOTEM ] A new membership (10.112.65.14:512) was formed. Members
Mar 08 13:40:23 vsr-app1 corosync[29152]: [QUORUM] Members[1]: 6
Mar 08 13:40:23 vsr-app1 corosync[29152]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 08 13:40:27 vsr-app1 corosync[29152]: [TOTEM ] A new membership (10.112.65.14:516) was formed. Members
Mar 08 13:40:27 vsr-app1 corosync[29152]: [QUORUM] Members[1]: 6
Mar 08 13:40:27 vsr-app1 corosync[29152]: [MAIN ] Completed service synchronization, ready to provide service.
root@vsr-app1:~#


root@vsr-app1:~# systemctl status pveproxy
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
Active: active (running) since Wed 2023-03-08 06:25:16 CET; 7h ago
Process: 17170 ExecStop=/usr/bin/pveproxy stop (code=exited, status=0/SUCCESS)
Process: 17233 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
Main PID: 17238 (pveproxy)
CGroup: /system.slice/pveproxy.service
├─17238 pveproxy
├─36752 pveproxy worker
├─38228 pveproxy worker
└─46986 pveproxy worker

Mar 08 13:51:22 vsr-app1 pveproxy[17960]: problem with client 10.80.45.53; ssl3_read_bytes: ssl handshake failure
Mar 08 13:51:22 vsr-app1 pveproxy[17960]: Can't call method "timeout_reset" on an undefined value at /usr/share/perl5/PVE/HTTPServer.pm line 227.
Mar 08 13:51:34 vsr-app1 pveproxy[34603]: worker exit
Mar 08 13:51:34 vsr-app1 pveproxy[17238]: worker 34603 finished
Mar 08 13:51:34 vsr-app1 pveproxy[17238]: starting 1 worker(s)
Mar 08 13:51:34 vsr-app1 pveproxy[17238]: worker 38228 started
Mar 08 14:05:00 vsr-app1 pveproxy[17960]: worker exit
Mar 08 14:05:00 vsr-app1 pveproxy[17238]: worker 17960 finished
Mar 08 14:05:00 vsr-app1 pveproxy[17238]: starting 1 worker(s)
Mar 08 14:05:00 vsr-app1 pveproxy[17238]: worker 46986 started
root@vsr-app1:~#
 

Attachments

  • proxmox_1.jpg (163.2 KB)
Can you ping between the nodes on the corosync interface?
Did any IP addresses change while moving?
 
Expected votes: 6
Highest expected: 6
Total votes: 1
Quorum: 4 Activity blocked
This host is isolated; corosync was not able to find its five neighbours. You need to re-establish network connectivity, in a way that is compatible with the previous setup...
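A quick way to verify that from the shell is a plain ping plus omping (with a totem config like this, corosync 2.x typically uses UDP multicast, which ping alone does not test). The IPs below are only examples taken from the outputs in this thread; the omping command has to run on both nodes at the same time:
Code:
# on vsr-app1 (10.112.65.14), testing against e.g. vsr-app3 (10.112.65.16)
ping -c 3 10.112.65.16
# run the same omping on both nodes simultaneously;
# it tests unicast and multicast UDP between them
omping -c 60 -i 1 -q 10.112.65.14 10.112.65.16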

Good luck!
 
Thanks for the reply. All nodes are in the same VLAN, and all nodes can SSH to each other.
 
Again: "Total votes: 1 Quorum: 4 Activity blocked" tells you that the cluster network does not work. What is the output of corosync-cfgtool -s?

SSH is not sufficient for this. And in your config there are lines like "ring0_addr: vsr-app3". Do these names resolve?
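One way to check is to compare what the names resolve to locally on each node (getent uses the normal resolver order, usually /etc/hosts first, then DNS) with the corosync IPs seen above; the hostnames below are just two entries from the config as an example:
Code:
# run on every node
getent hosts vsr-app3
getent hosts vsr-app1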
 
Again: "Total votes: 1 Quorum: 4 Activity blocked" tells you that the cluster network does not work. What is the output of corosync-cfgtool -s?

SSH is not sufficient for this. And in your config there are lines like "ring0_addr: vsr-app3". Do these names resolve?
root@vso-app1:~# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
id = 10.112.14.163
status = ring 0 active with no faults
root@vso-app1:~#
 
root@vso-app1:~# nslookup vsr-app3
Server: 89.254.64.20
Address: 89.254.64.20#53

Non-authoritative answer:
Name: vsr-app3.dax.net
Address: 10.112.65.16

root@vso-app1:~#
 
This IP looks off in the output of corosync-cfgtool:
Code:
id = 10.112.14.163

The IPs from pvecm and from nslookup appear to be in a different subnet than that one (assuming /24 subnets):
Code:
10.112.65.16

The output of corosync-cfgtool also looks a bit different than on my local cluster; which versions are you running? (pveversion -v)

What do the hosts files look like? Might the IPs there be wrong? Are you trying to ping with the IP or the hostname? Are UDP packets maybe blocked?
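For the UDP question specifically, something like the following could help as a rough check (corosync normally uses UDP ports 5404/5405; the commands are only a sketch):
Code:
# show the UDP sockets corosync has open on this node
ss -uapn | grep corosync
# look for local firewall rules mentioning the corosync ports
iptables -L -n | grep -E '5404|5405'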

I would also recommend using IPs instead of hostnames in the corosync configuration.
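As a sketch of what that could look like for one node entry (the IP is taken from the nslookup output above, and config_version in the totem section has to be increased whenever the file is edited):
Code:
node {
  name: vsr-app3
  nodeid: 4
  quorum_votes: 1
  ring0_addr: 10.112.65.16
}

Keep in mind that /etc/pve/corosync.conf is read-only while the node has no quorum, so this change usually has to wait until the cluster network works again, or be done very carefully via the local /etc/corosync/corosync.conf.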

As @UdoB already pointed out, this is certainly a problem with the network configuration somewhere.


From at least two hosts, can you provide the following output?
Code:
cat /etc/network/interfaces
systemctl status networking
ip a
cat /etc/hosts
ping <other_host_ip>
ping <other_host_hostname>
pvecm status
 