Problem after network failure

axion.joey

Hey Everyone,

We recently had a network failure in one of our data centers. The failure caused all of the Proxmox nodes in our cluster to fence themselves. They're back up and running, and the cluster shows all nodes as members, but we're having the following issues:
1. HA no longer works. Containers managed by HA can't be started; to start them we have to remove them from HA. (The HA checks I'm planning to run are sketched after the output below.)
2. We can't add new nodes to the cluster. When I try to add a new node I get this response:
pvecm add 10.3.16.20
root@10.3.16.20's password:
copy corosync auth key
stopping pve-cluster service
backup old database
Job for corosync.service failed. See 'systemctl status corosync.service' and 'journalctl -xn' for details.
waiting for quorum...
And then it hangs.
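For issue 1, this is a rough sketch of the HA checks I'm planning to run, assuming the standard PVE 4.x HA stack (ha-manager plus the pve-ha-lrm and pve-ha-crm services); the output will obviously vary per node:

# Overall HA view: quorum, current master, per-node LRM state, managed services
ha-manager status

# The local and cluster resource manager daemons; if the LRM never re-acquired
# its lock after the fencing event, HA-managed containers won't start
systemctl status pve-ha-lrm pve-ha-crm
journalctl -u pve-ha-lrm -b --no-pager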

Any help would be greatly appreciated.
 

axion.joey

Here's some additional info.
proxmox-ve: 4.2-48 (running kernel: 4.4.6-1-pve)
pve-manager: 4.2-2 (running version: 4.2-2/725d76f0)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.2.6-1-pve: 4.2.6-36
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-72
pve-firmware: 1.1-8
libpve-common-perl: 4.0-59
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-14
pve-container: 1.0-62
pve-firewall: 2.0-25
pve-ha-manager: 1.0-28
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie

systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: active (running) since Sat 2019-08-10 16:55:47 PDT; 4 days ago
Process: 3199 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
Main PID: 3209 (corosync)
CGroup: /system.slice/corosync.service
└─3209 corosync
Aug 11 22:50:25 proxmoxnj1 corosync[3209]: [QUORUM] Members[9]: 1 2 3 4 5 6 7 8 9
Aug 11 22:50:25 proxmoxnj1 corosync[3209]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 14 18:35:21 proxmoxnj1 corosync[3209]: [CFG ] Config reload requested by node 1
Aug 14 19:04:51 proxmoxnj1 corosync[3209]: [TOTEM ] A new membership (10.3.16.20:1316) was formed. Members left: 4
Aug 14 19:04:51 proxmoxnj1 corosync[3209]: [QUORUM] Members[8]: 1 2 3 5 6 7 8 9
Aug 14 19:04:51 proxmoxnj1 corosync[3209]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 14 19:11:22 proxmoxnj1 corosync[3209]: [TOTEM ] A new membership (10.3.16.20:1320) was formed. Members joined: 4
Aug 14 19:11:22 proxmoxnj1 corosync[3209]: [QUORUM] Members[9]: 1 2 3 4 5 6 7 8 9
Aug 14 19:11:22 proxmoxnj1 corosync[3209]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 14 19:20:54 proxmoxnj1 corosync[3209]: [CFG ] Config reload requested by node 4

cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: proxmoxnj1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: proxmoxnj1
  }
  node {
    name: ProxmoxCoreNJ2
    nodeid: 10
    quorum_votes: 1
    ring0_addr: ProxmoxCoreNJ2
  }
  node {
    name: proxmoxnj2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: proxmoxnj2
  }
  node {
    name: proxmoxnj3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: proxmoxnj3
  }
  node {
    name: ProxmoxCoreNJ1
    nodeid: 11
    quorum_votes: 1
    ring0_addr: ProxmoxCoreNJ1
  }
  node {
    name: ProxmoxNJ4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: ProxmoxNJ4
  }
  node {
    name: ProxmoxNJ6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: ProxmoxNJ6
  }
  node {
    name: ProxmoxNJ8
    nodeid: 8
    quorum_votes: 1
    ring0_addr: ProxmoxNJ8
  }
  node {
    name: ProxmoxNJ9
    nodeid: 9
    quorum_votes: 1
    ring0_addr: ProxmoxNJ9
  }
  node {
    name: ProxmoxNJ7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: ProxmoxNJ7
  }
  node {
    name: ProxmoxNJ5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: ProxmoxNJ5
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: proxmoxnj
  config_version: 15
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.3.16.20
    ringnumber: 0
  }
}
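Note that the nodelist above already contains the two nodes I tried to add today (ProxmoxCoreNJ2 / nodeid 10 and ProxmoxCoreNJ1 / nodeid 11), which matches the "Expected votes: 11" in the pvecm output below. To rule out a stale config on any member, here's a rough sketch of what I plan to check on each node, assuming the usual setup where pmxcfs syncs /etc/pve/corosync.conf out to /etc/corosync/corosync.conf:

# Confirm every node is running config_version 15 in both copies of the config
grep config_version /etc/pve/corosync.conf /etc/corosync/corosync.conf

# Ask the running corosync which members it currently sees
corosync-cmapctl | grep members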
 

axion.joey

pvecm status
Quorum information
------------------
Date: Wed Aug 14 20:29:31 2019
Quorum provider: corosync_votequorum
Nodes: 9
Node ID: 0x00000001
Ring ID: 1320
Quorate: Yes

Votequorum information
----------------------
Expected votes: 11
Highest expected: 11
Total votes: 9
Quorum: 6
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.3.16.20 (local)
0x00000002 1 10.3.16.21
0x00000003 1 10.3.16.22
0x00000004 1 10.3.16.25
0x00000005 1 10.3.16.26
0x00000006 1 10.3.16.27
0x00000007 1 10.3.16.28
0x00000008 1 10.3.16.40
0x00000009 1 10.3.16.41

The two missing nodes are the ones I tried to add to the cluster today.
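In case it's relevant, here's a sketch of what I plan to run on the two new nodes (ProxmoxCoreNJ1 and ProxmoxCoreNJ2) to see why they never joined the membership, assuming I can still get a shell on them and that the standard systemd units are in place:

# On each of the two new nodes: did corosync and the cluster filesystem come up?
systemctl status corosync pve-cluster

# Full corosync log since boot, to see whether it ever formed a membership
journalctl -u corosync -b --no-pager

# Quorum view from the new node's side (expected to fail if corosync is down)
pvecm status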
 
