[SOLVED] Cluster lost, readonly

David Calvache Casas

I am deploying a new cluster with 7 hosts. Today a power outage left the cluster broken; I have reviewed all the configuration and I cannot find the fault.

Data:

7 hosts

2 corosync rings (10.10.10.X/27, 10.10.20.X/27) over 2 dedicated 1 Gb networks connected to 2 standalone switches (IGMP forwarding/querier active)

The only error, from syslog:
Mar 7 21:25:53 NPX1 corosync[53528]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 7 21:25:59 NPX1 corosync[53528]: notice [TOTEM ] A new membership (10.10.10.1:69120) was formed. Members
Mar 7 21:25:59 NPX1 corosync[53528]: [TOTEM ] A new membership (10.10.10.1:69120) was formed. Members
Mar 7 21:25:59 NPX1 corosync[53528]: warning [CPG ] downlist left_list: 0 received
Mar 7 21:25:59 NPX1 corosync[53528]: notice [QUORUM] Members[1]: 1
Mar 7 21:25:59 NPX1 corosync[53528]: notice [MAIN ] Completed service synchronization, ready to provide service.
Mar 7 21:25:59 NPX1 corosync[53528]: [CPG ] downlist left_list: 0 received
Mar 7 21:25:59 NPX1 corosync[53528]: [QUORUM] Members[1]: 1
Mar 7 21:25:59 NPX1 corosync[53528]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 7 21:26:00 NPX1 systemd[1]: Starting Proxmox VE replication runner...
Mar 7 21:26:00 NPX1 pvesr[477933]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 7 21:26:01 NPX1 pvesr[477933]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 7 21:26:02 NPX1 pvesr[477933]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 7 21:26:03 NPX1 pvesr[477933]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 7 21:26:04 NPX1 pvesr[477933]: trying to acquire cfs lock 'file-replication_cfg' ...
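
The repeated pvesr "trying to acquire cfs lock" messages are a symptom rather than the cause: without quorum, pmxcfs keeps /etc/pve read-only, so the replication runner cannot take its lock. A quick sanity check that the services themselves are running on each node (assuming the standard Proxmox service names):

systemctl status corosync pve-cluster   # both should be active (running)
journalctl -u corosync -b | tail -n 50  # recent membership / ring messages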




/etc/pve/corosync.conf:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: NPX1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 10.10.20.1
  }
  node {
    name: NPX2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
    ring1_addr: 10.10.20.2
  }
  node {
    name: NPX3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
    ring1_addr: 10.10.20.3
  }
  node {
    name: NPX4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.10.4
    ring1_addr: 10.10.20.4
  }
  node {
    name: NPX5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.10.10.5
    ring1_addr: 10.10.20.5
  }
  node {
    name: NPX6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.10.10.6
    ring1_addr: 10.10.20.6
  }
  node {
    name: NPX7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.10.10.7
    ring1_addr: 10.10.20.7
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: ncluster
  config_version: 7
  interface {
    bindnetaddr: 10.10.10.1
    ringnumber: 0
  }
  interface {
    bindnetaddr: 10.10.20.1
    ringnumber: 1
  }
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
}
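
One thing worth ruling out, since corosync reads its local copy of this file rather than the clustered one: check on every node that /etc/corosync/corosync.conf and /etc/pve/corosync.conf are identical and carry the same config_version, for example:

grep config_version /etc/corosync/corosync.conf /etc/pve/corosync.conf
md5sum /etc/corosync/corosync.conf /etc/pve/corosync.conf   # should match, and match across nodes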

/etc/network/interfaces:

auto lo
iface lo inet loopback

auto enp175s0f0
iface enp175s0f0 inet static
    address 10.10.10.3
    netmask 255.255.255.224
#Corosync RING0

auto enp175s0f1
iface enp175s0f1 inet static
    address 10.10.20.3
    netmask 255.255.255.224
#Corosync RING1
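
With two rings and rrp_mode: passive, it is also worth checking that corosync itself reports both rings as healthy on every node, for example:

corosync-cfgtool -s   # should show ring 0 and ring 1 as "active with no faults"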

Result of pvecm status:
pvecm status
Quorum information
------------------
Date: Thu Mar 7 21:29:09 2019
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1/69252
Quorate: No

Votequorum information
----------------------
Expected votes: 7
Highest expected: 7
Total votes: 1
Quorum: 4 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.10.1 (local)
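
That output is at least self-consistent: with 7 expected votes, votequorum needs floor(7/2) + 1 = 4 votes, and this node only sees its own single vote, so activity (and writes to /etc/pve) stays blocked. If write access is needed on one node while troubleshooting, and only if the other nodes are definitely down rather than merely unreachable, the expected vote count can be lowered temporarily:

pvecm expected 1   # use with care: only safe when the other nodes are truly offline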

Result of pvecm nodes:

pvecm nodes

Membership information
----------------------
Nodeid Votes Name
1 1 10.10.10.1 (local)


Tests:
omping -c 10000 -i 0.001 -F -q 10.10.10.1 10.10.10.2 10.10.10.3 10.10.10.4 10.10.10.5 10.10.10.6 10.10.10.7
10.10.10.2 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.028/0.135/0.652/0.096
10.10.10.2 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.030/0.140/0.653/0.099
10.10.10.3 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.027/0.101/0.557/0.049
10.10.10.3 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.029/0.108/0.558/0.054
10.10.10.4 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.029/0.090/0.334/0.029
10.10.10.4 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.031/0.094/0.360/0.031
10.10.10.5 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.027/0.092/0.363/0.026
10.10.10.5 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.030/0.096/0.365/0.027
10.10.10.6 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.027/0.119/2.361/0.087
10.10.10.6 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.030/0.126/2.405/0.089
10.10.10.7 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.028/0.091/0.338/0.026
10.10.10.7 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.032/0.095/0.341/0.027

omping -c 600 -i 1 -q 10.10.10.1 10.10.10.2 10.10.10.3 10.10.10.4 10.10.10.5 10.10.10.6 10.10.10.7

10.10.10.2 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.041/0.152/0.310/0.055
10.10.10.2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.045/0.168/0.334/0.063
10.10.10.3 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.047/0.148/0.285/0.054
10.10.10.3 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.050/0.165/0.307/0.062
10.10.10.4 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.044/0.113/0.273/0.034
10.10.10.4 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.046/0.123/0.293/0.037
10.10.10.5 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.035/0.096/0.215/0.038
10.10.10.5 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.037/0.106/0.264/0.041
10.10.10.6 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.031/0.143/0.266/0.054
10.10.10.6 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.046/0.160/0.288/0.063
10.10.10.7 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.037/0.104/0.242/0.040
10.10.10.7 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.041/0.111/0.262/0.042
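
Note that both runs only exercise ring0 (10.10.10.x). Since rrp_mode is passive, it may be worth repeating the same test against the ring1 addresses as well, running it simultaneously on every node as before:

omping -c 600 -i 1 -q 10.10.20.1 10.10.20.2 10.10.20.3 10.10.20.4 10.10.20.5 10.10.20.6 10.10.20.7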


Can anyone help me? I'm out of ideas.
 
Hi,

Did you check on the switches whether multicast (IGMP snooping/querier) is still enabled?
Maybe the config was not saved, so the setting was lost when the power went out.
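
The exact commands depend on the switch vendor, but on a Cisco-style CLI the snooping and querier state can usually be checked with something along these lines (names and syntax vary per vendor):

show ip igmp snooping
show ip igmp snooping querier
show ip igmp snooping groups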
 
Personally, I would check the traffic on multiple nodes with tcpdump, to see whether the corosync traffic is coming through.
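
For example, on the ring0 interface from the config above (corosync uses UDP ports 5404/5405 by default):

tcpdump -ni enp175s0f0 'udp port 5404 or udp port 5405'

If the membership traffic from the other nodes never shows up there after the outage, that points at the switches (lost IGMP snooping/querier state) rather than at the hosts.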
 
