Broken cluster after reboot

galphanet

Long story short, we had to reboot our Proxmox cluster (6 machines) after a switch change, and everything went south.
We had HA enabled, and at some point each server decided to start every VM, which corrupted their disks on the shared storage.

That part was our fault. Here is what we did on each node to get its VMs running again:
pvecm expected 1
pmxcfs -l
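
For context, here is our understanding of what those two commands do (a paraphrase, not an authoritative description):

Code:
# Force quorum on this single node: votequorum now treats 1 expected vote
# as enough, so the node considers itself quorate on its own.
pvecm expected 1

# Start the Proxmox cluster filesystem (pmxcfs) in local mode, so /etc/pve
# becomes writable again without any cluster communication.
pmxcfs -l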

So now each node is running its VMs locally, but the nodes no longer see each other, and we need help getting the cluster working again. Thanks for your time.

Each node only shows itself in pvecm status, so each node seems to be sitting in its own single-node partition:
Code:
# pvecm status
Quorum information
------------------
Date:             Thu Sep 13 19:56:48 2018
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000003
Ring ID:          3/56416
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           1
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 10.50.188.146 (local)

Code:
# systemctl status pve-cluster
Sep 13 19:41:37 blade6 systemd[1]: Starting The Proxmox VE cluster filesystem...
Sep 13 19:41:37 blade6 pmxcfs[64119]: [status] notice: update cluster info (cluster name qls2, version = 6)
Sep 13 19:41:37 blade6 pmxcfs[64119]: [dcdb] notice: members: 3/64119
Sep 13 19:41:37 blade6 pmxcfs[64119]: [dcdb] notice: all data is up to date
Sep 13 19:41:37 blade6 pmxcfs[64119]: [status] notice: members: 3/64119
Sep 13 19:41:37 blade6 pmxcfs[64119]: [status] notice: all data is up to date
Sep 13 19:41:38 blade6 systemd[1]: Started The Proxmox VE cluster filesystem.
Sep 13 19:50:44 blade6 pmxcfs[64119]: [status] notice: node has quorum

Code:
# systemctl status pvestatd.service
Sep 13 19:40:54 blade6 pvestatd[3404]: ipcc_send_rec[1] failed: Connection refused
Sep 13 19:40:54 blade6 pvestatd[3404]: ipcc_send_rec[2] failed: Connection refused
Sep 13 19:40:54 blade6 pvestatd[3404]: ipcc_send_rec[3] failed: Connection refused
Sep 13 19:40:54 blade6 pvestatd[3404]: ipcc_send_rec[4] failed: Connection refused
Sep 13 19:40:54 blade6 pvestatd[3404]: status update error: Connection refused
Sep 13 19:41:04 blade6 pvestatd[3404]: ipcc_send_rec[1] failed: Connection refused
Sep 13 19:41:04 blade6 pvestatd[3404]: ipcc_send_rec[2] failed: Connection refused
Sep 13 19:41:04 blade6 pvestatd[3404]: ipcc_send_rec[3] failed: Connection refused
Sep 13 19:41:04 blade6 pvestatd[3404]: ipcc_send_rec[4] failed: Connection refused
Sep 13 19:41:04 blade6 pvestatd[3404]: status update error: Connection refused

Code:
root@blade6:~# pveversion --verbose
proxmox-ve: 5.2-2 (running kernel: 4.15.18-1-pve)
pve-manager: 5.2-6 (running version: 5.2-6/bcd5f008)
pve-kernel-4.15: 5.2-4
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-1-pve: 4.15.18-17
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-1-pve: 4.13.16-46
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.10.17-2-pve: 4.10.17-20
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-37
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-24
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-29
pve-container: 2.0-24
pve-docs: 5.2-5
pve-firewall: 3.0-13
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-30
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9