Unexplained cluster crashes after upgrade from 7 to 7.1-8

ilia987

Our system was stable over the last few months, but after upgrading from 7 to 7.1-8 we have 3-4 random crashes every day.

(We had an issue two months ago with corosync stability, but after replacing the switch the cluster worked very well under high load, without any issues.)

This Monday I took advantage of a power outage (building maintenance) and upgraded the cluster (11 servers, 4 of them with Ceph).
After the upgrade and dist-upgrade I rebooted all servers. Everything seems to be working and stable, except that there are some random crashes that trigger a full cluster reboot.

I have attached the corosync logs, and I can attach any other logs needed to understand the issue.

I have checked the switch logs: no issues at all, everything running fine.

Before the crash the whole cluster and all services work well: Ceph is healthy (100% of OSDs are up, 100% of monitors are up), it connects to PBS, and all LXC containers and VMs are up and running. We are still trying to figure out what scenario causes it, but the problem is that there were a few crashes at times when there was no load at all and everything was idle.

The errors started at around 15:43:59 (second log line).
During this cluster failure the node pve-ws2 did not crash, so I attached its logs:
corosync syslog:
Code:
Dec 22 15:01:06 pve-ws2 corosync[1156]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 22 15:43:59 pve-ws2 corosync[1156]:   [KNET  ] link: host: 1 link: 0 is down
Dec 22 15:43:59 pve-ws2 corosync[1156]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 22 15:43:59 pve-ws2 corosync[1156]:   [KNET  ] host: host: 1 has no active links
Dec 22 15:44:02 pve-ws2 corosync[1156]:   [TOTEM ] Token has not been received in 6637 ms
Dec 22 15:44:04 pve-ws2 corosync[1156]:   [TOTEM ] A processor failed, forming new configuration: token timed out (8850ms), waiting 10620ms for consensus.
Dec 22 15:44:15 pve-ws2 corosync[1156]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:15 pve-ws2 corosync[1156]:   [QUORUM] Sync joined[3]: 3 4 6
Dec 22 15:44:15 pve-ws2 corosync[1156]:   [QUORUM] Sync left[5]: 1 3 4 5 6
Dec 22 15:44:15 pve-ws2 corosync[1156]:   [TOTEM ] A new membership (2.46ec) was formed. Members joined: 3 4 6 left: 1 3 4 5 6
Dec 22 15:44:15 pve-ws2 corosync[1156]:   [TOTEM ] Failed to receive the leave message. failed: 1 3 4 5 6
Dec 22 15:44:15 pve-ws2 corosync[1156]:   [TOTEM ] Retransmit List: 1
Dec 22 15:44:18 pve-ws2 corosync[1156]:   [KNET  ] rx: host: 1 link: 0 is up
Dec 22 15:44:18 pve-ws2 corosync[1156]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [QUORUM] Sync joined[2]: 3 4
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [QUORUM] Sync left[4]: 1 3 4 5
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [TOTEM ] A new membership (2.46f4) was formed. Members joined: 3 4 left: 3 4
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [TOTEM ] Failed to receive the leave message. failed: 3 4
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [QUORUM] Sync joined[8]: 3 4 6 7 8 9 10 11
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [QUORUM] Sync left[10]: 1 3 4 5 6 7 8 9 10 11
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [TOTEM ] A new membership (2.46f8) was formed. Members joined: 3 4 6 7 8 9 10 11 left: 3 4 6 7 8 9 10 11
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [TOTEM ] Failed to receive the leave message. failed: 3 4 6 7 8 9 10 11
Dec 22 15:44:37 pve-ws2 corosync[1156]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:37 pve-ws2 corosync[1156]:   [TOTEM ] A new membership (2.46fc) was formed. Members
Dec 22 15:44:47 pve-ws2 corosync[1156]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:47 pve-ws2 corosync[1156]:   [TOTEM ] A new membership (2.4700) was formed. Members
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] link: host: 10 link: 0 is down
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] link: host: 9 link: 0 is down
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] link: host: 7 link: 0 is down
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] link: host: 1 link: 0 is down
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] link: host: 5 link: 0 is down
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] host: host: 10 has no active links
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 1)
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] host: host: 9 has no active links
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] host: host: 7 has no active links
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] host: host: 1 has no active links
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] host: host: 5 has no active links
Dec 22 15:45:04 pve-ws2 corosync[1156]:   [KNET  ] link: host: 11 link: 0 is down
Dec 22 15:45:04 pve-ws2 corosync[1156]:   [KNET  ] link: host: 8 link: 0 is down
Dec 22 15:45:04 pve-ws2 corosync[1156]:   [KNET  ] link: host: 4 link: 0 is down
Dec 22 15:45:04 pve-ws2 corosync[1156]:   [KNET  ] host: host: 11 (passive) best link: 0 (pri: 1)
Dec 22 15:45:04 pve-ws2 corosync[1156]:   [KNET  ] host: host: 11 has no active links
Dec 22 15:45:04 pve-ws2 corosync[1156]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
 
Please provide the full information required to troubleshoot such an issue:
- network setup
- pveversion -v from all nodes
- corosync.conf
- logs from all nodes for the relevant timeframe
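As a rough example, something along these lines on each node would cover it (timestamps are just placeholders for the relevant window):

Code:
# package versions
pveversion -v > pveversion_$(hostname)
# cluster configuration
cat /etc/pve/corosync.conf
# network setup
cat /etc/network/interfaces
# corosync + pve-cluster journal for the relevant window (adjust the timestamps)
journalctl -u corosync -u pve-cluster --since "2021-12-22 15:00" --until "2021-12-22 16:00" > log_$(hostname)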
 
Each server has 3 interfaces:
1 Gbit used for corosync, SSH and basic network access
10/40 Gbit for the internal Ceph network
10/40 Gbit for Ceph clients
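For illustration only, a hypothetical /etc/network/interfaces along those lines (interface names, the gateway and the Ceph subnets are made up; only the corosync/management subnet matches the corosync.conf below):

Code:
auto eno1
iface eno1 inet static
    address 172.30.1.235/24    # 1 Gbit: corosync, SSH, management
    gateway 172.30.1.1

auto enp65s0f0
iface enp65s0f0 inet static
    address 10.10.10.235/24    # 10/40 Gbit: internal Ceph (cluster) network

auto enp65s0f1
iface enp65s0f1 inet static
    address 10.10.20.235/24    # 10/40 Gbit: Ceph clients (public) network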

pveversion from the servers with Ceph:
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-8 (running version: 7.1-8/5b267f33)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-11
pve-kernel-5.3: 6.1-6
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.4.157-1-pve: 5.4.157-1
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

pveversion from the first server (I think we started with Proxmox v3):
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-8 (running version: 7.1-8/5b267f33)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-6
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-4.4.134-1-pve: 4.4.134-112
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

pveversion from the rest of the servers:

Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-8 (running version: 7.1-8/5b267f33)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-6
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-4.15: 5.4-19
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.15.18-29-pve: 4.15.18-57
pve-kernel-4.15.18-28-pve: 4.15.18-56
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-9-pve: 4.15.18-30
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3


corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-blade-101
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 172.30.1.131
  }
  node {
    name: pve-blade-102
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.30.1.132
  }
  node {
    name: pve-blade-107
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.30.1.137
  }
  node {
    name: pve-blade-108
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 172.30.1.138
  }
  node {
    name: pve-blade-109
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 172.30.1.139
  }
  node {
    name: pve-srv1
    nodeid: 3
    quorum_votes: 1
    ring0_addr: pve-srv1
  }
  node {
    name: pve-srv2
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 172.30.1.235
  }
  node {
    name: pve-srv3
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 172.30.1.234
  }
  node {
    name: pve-srv4
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 172.30.1.223
  }
  node {
    name: pve-srv5
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 172.30.1.176
  }
  node {
    name: pve-ws2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: pve-ws2
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: vq-pve
  config_version: 45
  interface {
    bindnetaddr: 172.30.1.234
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

What logs exactly do you need?
 
The journal, at least for the corosync and pve-cluster units, preferably complete, from before the first link goes down anywhere up to the reboot/crash.
 
Here are the logs for some servers (one of each type of hardware configuration).
I can provide the logs for the rest, but they look the same.

pve-ws2:

Code:
Dec 22 15:16:51 pve-ws2 pmxcfs[1146]: [status] notice: received log
Dec 22 15:18:21 pve-ws2 pmxcfs[1146]: [status] notice: received log
Dec 22 15:20:56 pve-ws2 pmxcfs[1146]: [status] notice: received log
Dec 22 15:30:45 pve-ws2 pmxcfs[1146]: [status] notice: received log
Dec 22 15:31:10 pve-ws2 pmxcfs[1146]: [status] notice: received log
Dec 22 15:31:51 pve-ws2 pmxcfs[1146]: [status] notice: received log
Dec 22 15:33:21 pve-ws2 pmxcfs[1146]: [status] notice: received log
Dec 22 15:35:56 pve-ws2 pmxcfs[1146]: [status] notice: received log
Dec 22 15:44:15 pve-ws2 pmxcfs[1146]: [dcdb] notice: members: 2/1146, 7/2180, 8/3940, 9/4386, 10/4205, 11/4370
Dec 22 15:44:15 pve-ws2 pmxcfs[1146]: [dcdb] notice: starting data syncronisation
Dec 22 15:44:16 pve-ws2 pmxcfs[1146]: [dcdb] notice: cpg_send_message retry 10
Dec 22 15:44:17 pve-ws2 pmxcfs[1146]: [dcdb] notice: cpg_send_message retry 20
Dec 22 15:44:18 pve-ws2 pmxcfs[1146]: [dcdb] notice: cpg_send_message retry 30
Dec 22 15:44:19 pve-ws2 pmxcfs[1146]: [dcdb] notice: cpg_send_message retry 40
Dec 22 15:44:19 pve-ws2 pmxcfs[1146]: [status] notice: cpg_send_message retry 10
Dec 22 15:44:20 pve-ws2 pmxcfs[1146]: [dcdb] notice: cpg_send_message retry 50
Dec 22 15:44:20 pve-ws2 pmxcfs[1146]: [status] notice: cpg_send_message retry 20
Dec 22 15:44:21 pve-ws2 pmxcfs[1146]: [dcdb] notice: cpg_send_message retry 60
Dec 22 15:44:21 pve-ws2 pmxcfs[1146]: [status] notice: cpg_send_message retry 30
Dec 22 15:44:22 pve-ws2 pmxcfs[1146]: [dcdb] notice: cpg_send_message retry 70
Dec 22 15:44:22 pve-ws2 pmxcfs[1146]: [status] notice: cpg_send_message retry 40
Dec 22 15:44:23 pve-ws2 pmxcfs[1146]: [dcdb] notice: cpg_send_message retry 80
Dec 22 15:44:23 pve-ws2 pmxcfs[1146]: [status] notice: cpg_send_message retry 50
Dec 22 15:44:24 pve-ws2 pmxcfs[1146]: [dcdb] notice: cpg_send_message retry 90
Dec 22 15:44:24 pve-ws2 pmxcfs[1146]: [status] notice: cpg_send_message retry 60
Dec 22 15:44:25 pve-ws2 pmxcfs[1146]: [dcdb] notice: cpg_send_message retry 100
Dec 22 15:44:25 pve-ws2 pmxcfs[1146]: [dcdb] notice: cpg_send_message retried 100 times
Dec 22 15:44:25 pve-ws2 pmxcfs[1146]: [dcdb] crit: failed to send SYNC_START message
Dec 22 15:44:25 pve-ws2 pmxcfs[1146]: [dcdb] crit: leaving CPG group
Code:
Dec 22 15:01:06 pve-ws2 corosync[1156]:   [QUORUM] Sync members[11]: 1 2 3 4 5 6 7 8 9 10 11
Dec 22 15:01:06 pve-ws2 corosync[1156]:   [QUORUM] Sync joined[1]: 6
Dec 22 15:01:06 pve-ws2 corosync[1156]:   [TOTEM ] A new membership (1.46e4) was formed. Members joined: 6
Dec 22 15:01:06 pve-ws2 corosync[1156]:   [QUORUM] Members[11]: 1 2 3 4 5 6 7 8 9 10 11
Dec 22 15:01:06 pve-ws2 corosync[1156]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 22 15:43:59 pve-ws2 corosync[1156]:   [KNET  ] link: host: 1 link: 0 is down
Dec 22 15:43:59 pve-ws2 corosync[1156]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 22 15:43:59 pve-ws2 corosync[1156]:   [KNET  ] host: host: 1 has no active links
Dec 22 15:44:02 pve-ws2 corosync[1156]:   [TOTEM ] Token has not been received in 6637 ms
Dec 22 15:44:04 pve-ws2 corosync[1156]:   [TOTEM ] A processor failed, forming new configuration: token timed out (8850ms), waiting 10620ms for consensus.
Dec 22 15:44:15 pve-ws2 corosync[1156]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:15 pve-ws2 corosync[1156]:   [QUORUM] Sync joined[3]: 3 4 6
Dec 22 15:44:15 pve-ws2 corosync[1156]:   [QUORUM] Sync left[5]: 1 3 4 5 6
Dec 22 15:44:15 pve-ws2 corosync[1156]:   [TOTEM ] A new membership (2.46ec) was formed. Members joined: 3 4 6 left: 1 3 4 5 6
Dec 22 15:44:15 pve-ws2 corosync[1156]:   [TOTEM ] Failed to receive the leave message. failed: 1 3 4 5 6
Dec 22 15:44:15 pve-ws2 corosync[1156]:   [TOTEM ] Retransmit List: 1
Dec 22 15:44:18 pve-ws2 corosync[1156]:   [KNET  ] rx: host: 1 link: 0 is up
Dec 22 15:44:18 pve-ws2 corosync[1156]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [QUORUM] Sync joined[2]: 3 4
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [QUORUM] Sync left[4]: 1 3 4 5
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [TOTEM ] A new membership (2.46f4) was formed. Members joined: 3 4 left: 3 4
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [TOTEM ] Failed to receive the leave message. failed: 3 4
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [QUORUM] Sync joined[8]: 3 4 6 7 8 9 10 11
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [QUORUM] Sync left[10]: 1 3 4 5 6 7 8 9 10 11
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [TOTEM ] A new membership (2.46f8) was formed. Members joined: 3 4 6 7 8 9 10 11 left: 3 4 6 7 8 9 10 11
Dec 22 15:44:26 pve-ws2 corosync[1156]:   [TOTEM ] Failed to receive the leave message. failed: 3 4 6 7 8 9 10 11
Dec 22 15:44:37 pve-ws2 corosync[1156]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:37 pve-ws2 corosync[1156]:   [TOTEM ] A new membership (2.46fc) was formed. Members
Dec 22 15:44:47 pve-ws2 corosync[1156]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:47 pve-ws2 corosync[1156]:   [TOTEM ] A new membership (2.4700) was formed. Members
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] link: host: 10 link: 0 is down
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] link: host: 9 link: 0 is down
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] link: host: 7 link: 0 is down
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] link: host: 1 link: 0 is down
Dec 22 15:45:01 pve-ws2 corosync[1156]:   [KNET  ] link: host: 5 link: 0 is down

pve-srv1:
Code:
Dec 22 15:10:08 pve-srv1 pmxcfs[5489]: [status] notice: received log
Dec 22 15:10:14 pve-srv1 pmxcfs[5489]: [status] notice: received log
Dec 22 15:11:49 pve-srv1 pmxcfs[5489]: [status] notice: received log
Dec 22 15:11:53 pve-srv1 pmxcfs[5489]: [status] notice: received log
Dec 22 15:44:15 pve-srv1 pmxcfs[5489]: [dcdb] notice: members: 3/5489, 4/2131, 6/2821
Dec 22 15:44:15 pve-srv1 pmxcfs[5489]: [dcdb] notice: starting data syncronisation
Dec 22 15:44:16 pve-srv1 pmxcfs[5489]: [dcdb] notice: cpg_send_message retry 10
Dec 22 15:44:17 pve-srv1 pmxcfs[5489]: [dcdb] notice: cpg_send_message retry 20
Dec 22 15:44:18 pve-srv1 pmxcfs[5489]: [dcdb] notice: cpg_send_message retry 30
Dec 22 15:44:19 pve-srv1 pmxcfs[5489]: [status] notice: cpg_send_message retry 10
Dec 22 15:44:19 pve-srv1 pmxcfs[5489]: [dcdb] notice: cpg_send_message retry 40
Dec 22 15:44:20 pve-srv1 pmxcfs[5489]: [status] notice: cpg_send_message retry 20
Dec 22 15:44:20 pve-srv1 pmxcfs[5489]: [dcdb] notice: cpg_send_message retry 50
Dec 22 15:44:21 pve-srv1 pmxcfs[5489]: [status] notice: cpg_send_message retry 30
Dec 22 15:44:21 pve-srv1 pmxcfs[5489]: [dcdb] notice: cpg_send_message retry 60
Dec 22 15:44:22 pve-srv1 pmxcfs[5489]: [status] notice: cpg_send_message retry 40
Dec 22 15:44:22 pve-srv1 pmxcfs[5489]: [dcdb] notice: cpg_send_message retry 70
Dec 22 15:44:23 pve-srv1 pmxcfs[5489]: [status] notice: cpg_send_message retry 50
Dec 22 15:44:23 pve-srv1 pmxcfs[5489]: [dcdb] notice: cpg_send_message retry 80
Dec 22 15:44:24 pve-srv1 pmxcfs[5489]: [status] notice: cpg_send_message retry 60
Dec 22 15:44:24 pve-srv1 pmxcfs[5489]: [dcdb] notice: cpg_send_message retry 90
Dec 22 15:44:25 pve-srv1 pmxcfs[5489]: [status] notice: cpg_send_message retry 70
Dec 22 15:44:25 pve-srv1 pmxcfs[5489]: [dcdb] notice: cpg_send_message retry 100
Dec 22 15:44:25 pve-srv1 pmxcfs[5489]: [dcdb] notice: cpg_send_message retried 100 times
Dec 22 15:44:25 pve-srv1 pmxcfs[5489]: [dcdb] crit: failed to send SYNC_START message
Dec 22 15:44:25 pve-srv1 pmxcfs[5489]: [dcdb] crit: leaving CPG group
Code:
Dec 22 15:01:06 pve-srv1 corosync[5657]:   [QUORUM] Sync joined[1]: 6
Dec 22 15:01:06 pve-srv1 corosync[5657]:   [TOTEM ] A new membership (1.46e4) was formed. Members joined: 6
Dec 22 15:01:06 pve-srv1 corosync[5657]:   [QUORUM] Members[11]: 1 2 3 4 5 6 7 8 9 10 11
Dec 22 15:01:06 pve-srv1 corosync[5657]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 22 15:44:00 pve-srv1 corosync[5657]:   [KNET  ] link: host: 1 link: 0 is down
Dec 22 15:44:00 pve-srv1 corosync[5657]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 22 15:44:00 pve-srv1 corosync[5657]:   [KNET  ] host: host: 1 has no active links
Dec 22 15:44:02 pve-srv1 corosync[5657]:   [TOTEM ] Token has not been received in 6637 ms
Dec 22 15:44:04 pve-srv1 corosync[5657]:   [TOTEM ] A processor failed, forming new configuration: token timed out (8850ms), waiting 10620ms for consensus.
Dec 22 15:44:15 pve-srv1 corosync[5657]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:15 pve-srv1 corosync[5657]:   [QUORUM] Sync left[2]: 1 5
Dec 22 15:44:15 pve-srv1 corosync[5657]:   [TOTEM ] A new membership (2.46e8) was formed. Members left: 1 5
Dec 22 15:44:15 pve-srv1 corosync[5657]:   [TOTEM ] Failed to receive the leave message. failed: 1 5
Dec 22 15:44:15 pve-srv1 corosync[5657]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:15 pve-srv1 corosync[5657]:   [QUORUM] Sync joined[6]: 2 7 8 9 10 11
Dec 22 15:44:15 pve-srv1 corosync[5657]:   [QUORUM] Sync left[8]: 1 2 5 7 8 9 10 11
Dec 22 15:44:15 pve-srv1 corosync[5657]:   [TOTEM ] A new membership (2.46ec) was formed. Members joined: 2 7 8 9 10 11 left: 2 7 8 9 10 11
Dec 22 15:44:15 pve-srv1 corosync[5657]:   [TOTEM ] Failed to receive the leave message. failed: 2 7 8 9 10 11
Dec 22 15:44:26 pve-srv1 corosync[5657]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:26 pve-srv1 corosync[5657]:   [QUORUM] Sync left[2]: 1 5
Dec 22 15:44:26 pve-srv1 corosync[5657]:   [TOTEM ] A new membership (2.46f0) was formed. Members
Dec 22 15:44:26 pve-srv1 corosync[5657]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:26 pve-srv1 corosync[5657]:   [QUORUM] Sync joined[7]: 2 6 7 8 9 10 11
Dec 22 15:44:26 pve-srv1 corosync[5657]:   [QUORUM] Sync left[9]: 1 2 5 6 7 8 9 10 11
Dec 22 15:44:26 pve-srv1 corosync[5657]:   [TOTEM ] A new membership (2.46f8) was formed. Members joined: 2 6 7 8 9 10 11 left: 2 6 7 8 9 10 11
Dec 22 15:44:26 pve-srv1 corosync[5657]:   [TOTEM ] Failed to receive the leave message. failed: 2 6 7 8 9 10 11
Dec 22 15:44:26 pve-srv1 corosync[5657]:   [TOTEM ] Retransmit List: 1 2 3 4 5 6 7 8 9 a b c d
Dec 22 15:44:37 pve-srv1 corosync[5657]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:37 pve-srv1 corosync[5657]:   [TOTEM ] A new membership (2.46fc) was formed. Members
Dec 22 15:44:47 pve-srv1 corosync[5657]:   [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:47 pve-srv1 corosync[5657]:   [TOTEM ] A new membership (2.4700) was formed. Members
Dec 22 15:45:00 pve-srv1 corosync[5657]:   [KNET  ] link: host: 9 link: 0 is down
Dec 22 15:45:00 pve-srv1 corosync[5657]:   [KNET  ] link: host: 7 link: 0 is down
Dec 22 15:45:00 pve-srv1 corosync[5657]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 1)
 
pve-srv2 (Ceph):
Code:
Dec 22 15:18:21 pve-srv2 pmxcfs[3940]: [status] notice: received log
Dec 22 15:20:56 pve-srv2 pmxcfs[3940]: [status] notice: received log
Dec 22 15:30:45 pve-srv2 pmxcfs[3940]: [status] notice: received log
Dec 22 15:31:10 pve-srv2 pmxcfs[3940]: [status] notice: received log
Dec 22 15:31:51 pve-srv2 pmxcfs[3940]: [status] notice: received log
Dec 22 15:33:21 pve-srv2 pmxcfs[3940]: [status] notice: received log
Dec 22 15:35:56 pve-srv2 pmxcfs[3940]: [status] notice: received log
Dec 22 15:44:15 pve-srv2 pmxcfs[3940]: [dcdb] notice: members: 2/1146, 7/2180, 8/3940, 9/4386, 10/4205, 11/4370
Dec 22 15:44:15 pve-srv2 pmxcfs[3940]: [dcdb] notice: starting data syncronisation
Dec 22 15:44:15 pve-srv2 pmxcfs[3940]: [dcdb] notice: members: 2/1146, 3/5489, 4/2131, 6/2821, 7/2180, 8/3940, 9/4386, 10/4205, 11/4370
Dec 22 15:44:15 pve-srv2 pmxcfs[3940]: [status] notice: members: 2/1146, 7/2180, 8/3940, 9/4386, 10/4205, 11/4370
Dec 22 15:44:15 pve-srv2 pmxcfs[3940]: [status] notice: starting data syncronisation
Dec 22 15:44:15 pve-srv2 pmxcfs[3940]: [status] notice: members: 2/1146, 3/5489, 4/2131, 6/2821, 7/2180, 8/3940, 9/4386, 10/4205, 11/4370
Dec 22 15:44:19 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 10
Dec 22 15:44:20 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 20
Dec 22 15:44:21 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 30
Dec 22 15:44:22 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 40
Dec 22 15:44:23 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 50
Dec 22 15:44:24 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 60
Dec 22 15:44:25 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 70
Dec 22 15:44:26 pve-srv2 pmxcfs[3940]: [dcdb] notice: members: 6/2821, 7/2180, 8/3940, 9/4386, 10/4205, 11/4370
Dec 22 15:44:26 pve-srv2 pmxcfs[3940]: [dcdb] notice: members: 2/1146, 3/5489, 4/2131, 6/2821, 7/2180, 8/3940, 9/4386, 10/4205, 11/4370
Dec 22 15:44:26 pve-srv2 pmxcfs[3940]: [status] notice: members: 6/2821, 7/2180, 8/3940, 9/4386, 10/4205, 11/4370
Dec 22 15:44:26 pve-srv2 pmxcfs[3940]: [status] notice: members: 2/1146, 3/5489, 4/2131, 6/2821, 7/2180, 8/3940, 9/4386, 10/4205, 11/4370
Dec 22 15:44:26 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 80
Dec 22 15:44:27 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 90
Dec 22 15:44:28 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 100
Dec 22 15:44:28 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retried 100 times
Dec 22 15:44:28 pve-srv2 pmxcfs[3940]: [status] crit: cpg_send_message failed: 6
Dec 22 15:44:29 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 10
Dec 22 15:44:30 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 20
Dec 22 15:44:31 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 30
Dec 22 15:44:32 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 40
Dec 22 15:44:33 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 50
Dec 22 15:44:34 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 60
Dec 22 15:44:35 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 70
Dec 22 15:44:36 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 80
Dec 22 15:44:37 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 90
Dec 22 15:44:38 pve-srv2 pmxcfs[3940]: [status] notice: cpg_send_message retry 100

pve-blade-101:
Code:
Dec 22 15:16:51 pve-blade-101 pmxcfs[2081]: [status] notice: received log
Dec 22 15:18:21 pve-blade-101 pmxcfs[2081]: [status] notice: received log
Dec 22 15:20:56 pve-blade-101 pmxcfs[2081]: [status] notice: received log
Dec 22 15:30:45 pve-blade-101 pmxcfs[2081]: [status] notice: received log
Dec 22 15:31:10 pve-blade-101 pmxcfs[2081]: [status] notice: received log
Dec 22 15:31:51 pve-blade-101 pmxcfs[2081]: [status] notice: received log
Dec 22 15:33:21 pve-blade-101 pmxcfs[2081]: [status] notice: received log
Dec 22 15:35:56 pve-blade-101 pmxcfs[2081]: [status] notice: received log
Dec 22 15:44:15 pve-blade-101 pmxcfs[2081]: [dcdb] notice: members: 1/3223, 5/2081
Dec 22 15:44:15 pve-blade-101 pmxcfs[2081]: [dcdb] notice: starting data syncronisation
Dec 22 15:44:15 pve-blade-101 pmxcfs[2081]: [status] notice: members: 1/3223, 5/2081
Dec 22 15:44:15 pve-blade-101 pmxcfs[2081]: [status] notice: starting data syncronisation
Dec 22 15:44:15 pve-blade-101 pmxcfs[2081]: [status] notice: node lost quorum
Dec 22 15:44:23 pve-blade-101 pmxcfs[2081]: [status] notice: cpg_send_message retry 10
Dec 22 15:44:24 pve-blade-101 pmxcfs[2081]: [status] notice: cpg_send_message retry 20
Dec 22 15:44:25 pve-blade-101 pmxcfs[2081]: [status] notice: cpg_send_message retry 30
Dec 22 15:44:26 pve-blade-101 pmxcfs[2081]: [status] notice: cpg_send_message retry 40
Dec 22 15:44:27 pve-blade-101 pmxcfs[2081]: [status] notice: cpg_send_message retry 50
Dec 22 15:44:28 pve-blade-101 pmxcfs[2081]: [status] notice: cpg_send_message retry 60
Dec 22 15:44:29 pve-blade-101 pmxcfs[2081]: [status] notice: cpg_send_message retry 70
Dec 22 15:44:30 pve-blade-101 pmxcfs[2081]: [status] notice: cpg_send_message retry 80
Dec 22 15:44:31 pve-blade-101 pmxcfs[2081]: [status] notice: cpg_send_message retry 90
Dec 22 15:44:32 pve-blade-101 pmxcfs[2081]: [status] notice: cpg_send_message retry 100
Dec 22 15:44:32 pve-blade-101 pmxcfs[2081]: [status] notice: cpg_send_message retried 100 times
Dec 22 15:44:32 pve-blade-101 pmxcfs[2081]: [status] crit: cpg_send_message failed: 6
Dec 22 15:44:33 pve-blade-101 pmxcfs[2081]: [status] notice: cpg_send_message retry 10
-- Boot 02d378631d14490cab587de4a536cb7d --
Dec 22 15:51:11 pve-blade-101 systemd[1]: Starting The Proxmox VE cluster filesystem...
Dec 22 15:51:11 pve-blade-101 pmxcfs[2075]: [quorum] crit: quorum_initialize failed: 2
Dec 22 15:51:11 pve-blade-101 pmxcfs[2075]: [quorum] crit: can't initialize service
Code:
Dec 22 15:01:06 pve-blade-101 corosync[2196]:   [TOTEM ] A new membership (1.46e4) was formed. Members joined: 6
Dec 22 15:01:06 pve-blade-101 corosync[2196]:   [QUORUM] Members[11]: 1 2 3 4 5 6 7 8 9 10 11
Dec 22 15:01:06 pve-blade-101 corosync[2196]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 22 15:43:59 pve-blade-101 corosync[2196]:   [KNET  ] link: host: 1 link: 0 is down
Dec 22 15:43:59 pve-blade-101 corosync[2196]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 22 15:43:59 pve-blade-101 corosync[2196]:   [KNET  ] host: host: 1 has no active links
Dec 22 15:44:02 pve-blade-101 corosync[2196]:   [TOTEM ] Token has not been received in 6637 ms
Dec 22 15:44:12 pve-blade-101 corosync[2196]:   [KNET  ] rx: host: 1 link: 0 is up
Dec 22 15:44:12 pve-blade-101 corosync[2196]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 22 15:44:15 pve-blade-101 corosync[2196]:   [QUORUM] Sync members[2]: 1 5
Dec 22 15:44:15 pve-blade-101 corosync[2196]:   [QUORUM] Sync left[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:15 pve-blade-101 corosync[2196]:   [TOTEM ] A new membership (1.46e8) was formed. Members left: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:15 pve-blade-101 corosync[2196]:   [TOTEM ] Failed to receive the leave message. failed: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:15 pve-blade-101 corosync[2196]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec 22 15:44:15 pve-blade-101 corosync[2196]:   [QUORUM] Members[2]: 1 5
Dec 22 15:44:15 pve-blade-101 corosync[2196]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 22 15:44:15 pve-blade-101 corosync[2196]:   [QUORUM] Sync members[2]: 1 5
Dec 22 15:44:15 pve-blade-101 corosync[2196]:   [TOTEM ] A new membership (1.46ec) was formed. Members
Dec 22 15:44:26 pve-blade-101 corosync[2196]:   [QUORUM] Sync members[2]: 1 5
Dec 22 15:44:26 pve-blade-101 corosync[2196]:   [TOTEM ] A new membership (1.46f0) was formed. Members
Dec 22 15:44:26 pve-blade-101 corosync[2196]:   [QUORUM] Members[2]: 1 5
Dec 22 15:44:26 pve-blade-101 corosync[2196]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 22 15:44:26 pve-blade-101 corosync[2196]:   [QUORUM] Sync members[2]: 1 5
Dec 22 15:44:26 pve-blade-101 corosync[2196]:   [TOTEM ] A new membership (1.46f4) was formed. Members
Dec 22 15:44:26 pve-blade-101 corosync[2196]:   [QUORUM] Sync members[2]: 1 5
Dec 22 15:44:26 pve-blade-101 corosync[2196]:   [TOTEM ] A new membership (1.46f8) was formed. Members
-- Boot 02d378631d14490cab587de4a536cb7d --
Dec 22 15:51:12 pve-blade-101 systemd[1]: Starting Corosync Cluster Engine...
Dec 22 15:51:12 pve-blade-101 corosync[2186]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
 
Well, you definitely have a link down event, followed by other nodes dropping out of / not joining the consensus, so the network is not fine. If this happens frequently, the next step would be to collect debug logs (warning: they get big fast! disable debug again after collecting the logs), and then get the following (again, from ALL nodes for the timeframe surrounding the crash):

Code:
journalctl -u corosync -u pve-cluster --since "XXXXX" --until "YYYYYYY" > log_$(hostname)

Is it always the link to host 1 that is going down?
 
Looks like it, host 1 all the time.
The logs you asked for, collected yesterday with: journalctl -u corosync -u pve-cluster --since yesterday > /mnt/pve/nfs_home/pve_logs/log_$(hostname)

I see that host 1 has errors; sometimes it recovers, and when it does not, the cluster crash is initiated.
Host 1 had issues at
05:50:00 06:45:57 07:51:59 08:44:59 10:50:58 11:44:57 12:46:57 ...

How can I find out which server is host 1?

Yesterday I disabled HA in the cluster, and since then the servers have been stable without any issues (we put a lot of CPU/RAM/IOPS load on them),
but the host 1 errors still exist:
Code:
Dec 23 10:45:50 pve-blade-101 corosync[2186]:   [TOTEM ] A new membership (1.492a) was formed. Members joined: 1 2 3 4 7 8 9 10
Dec 23 10:45:50 pve-blade-101 corosync[2186]:   [QUORUM] This node is within the primary component and will provide service.
Dec 23 10:45:50 pve-blade-101 corosync[2186]:   [QUORUM] Members[11]: 1 2 3 4 5 6 7 8 9 10 11
Dec 23 10:45:50 pve-blade-101 corosync[2186]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 23 10:54:58 pve-blade-101 corosync[2186]:   [KNET  ] link: host: 1 link: 0 is down
Dec 23 10:54:58 pve-blade-101 corosync[2186]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 23 10:54:58 pve-blade-101 corosync[2186]:   [KNET  ] host: host: 1 has no active links
Dec 23 10:55:03 pve-blade-101 corosync[2186]:   [TOTEM ] Token has not been received in 6637 ms
Dec 23 10:55:15 pve-blade-101 corosync[2186]:   [KNET  ] rx: host: 1 link: 0 is up
Dec 23 10:55:15 pve-blade-101 corosync[2186]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 23 10:55:16 pve-blade-101 corosync[2186]:   [QUORUM] Sync members[10]: 2 3 4 5 6 7 8 9 10 11
Dec 23 10:55:16 pve-blade-101 corosync[2186]:   [QUORUM] Sync left[1]: 1
Dec 23 10:55:16 pve-blade-101 corosync[2186]:   [TOTEM ] A new membership (2.492e) was formed. Members left: 1
Dec 23 10:55:16 pve-blade-101 corosync[2186]:   [TOTEM ] Failed to receive the leave message. failed: 1
 

I need the logs WITH debug enabled to see what's going on in detail. The link to host 1 going down -> host 1 leaving the quorate partition is totally expected if the link is not stable. It should not take down the whole cluster (but of course it can, if the link going down is just a symptom of a larger problem like hosts being overloaded, the network not working reliably in general, ...).

corosync-cfgtool gives you the host information.
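For example, something along these lines (the node IDs can also be matched against the nodelist entries in the corosync.conf you posted):

Code:
# link status of the local node, per host/link
corosync-cfgtool -s
# nodes known to corosync with their node IDs and addresses
corosync-cfgtool -n
# Proxmox view of the membership (node IDs and IPs)
pvecm status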
 

How can I enable debugging for the logs?



I think the problematic host is pve-srv-102. I'll try to inspect the network cable and card on Sunday and replace them.
 
In corosync.conf, set debug to on instead of off.
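A minimal sketch of the change, assuming you edit it via /etc/pve/corosync.conf as usual on PVE (remember to also bump config_version so the change gets propagated):

Code:
logging {
  debug: on
  to_syslog: yes
}

The change should be picked up without restarting the cluster stack; if not, corosync-cfgtool -R asks all nodes to reload corosync.conf.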
 
I have tried changing the switch and the cable and could not see any improvement. It happens around once an hour, usually at about 50 minutes past any round hour (01:50, 02:50, etc.):
Code:
Dec 26 04:25:57 pve-blade-102 corosync[2238]:   [KNET  ] link: host: 1 link: 0 is down
Dec 26 04:50:56 pve-blade-102 corosync[2238]:   [KNET  ] link: host: 1 link: 0 is down
Dec 26 05:56:58 pve-blade-102 corosync[2238]:   [KNET  ] link: host: 1 link: 0 is down
Dec 26 06:49:59 pve-blade-102 corosync[2238]:   [KNET  ] link: host: 1 link: 0 is down
Dec 26 07:51:57 pve-blade-102 corosync[2238]:   [KNET  ] link: host: 1 link: 0 is down
Host 1 has errors.
I ran a ping test through the same interface, and no errors were reported.

Any ideas for further tests I can do?
 
Yes, please provide the logs I asked for and not something else; otherwise I won't spend any further time trying to get to the bottom of your issue.
 
Sure, I'll post it.
To be clear, you need

Code:
journalctl -u corosync -u pve-cluster --since "XXXXX" --until "YYYYYYY" > log_$(hostname)
for all servers? Do I need to add anything else?
 
For all servers, with debugging enabled for that timespan, yes.
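One way to grab them from all nodes in one go, as a rough sketch (host names are taken from your corosync.conf, the timestamps are just examples, and the nodes must be reachable via SSH):

Code:
for h in pve-ws2 pve-srv1 pve-srv2 pve-srv3 pve-srv4 pve-srv5 \
         pve-blade-101 pve-blade-102 pve-blade-107 pve-blade-108 pve-blade-109; do
  ssh root@$h 'journalctl -u corosync -u pve-cluster --since "2021-12-26 04:00" --until "2021-12-26 08:00" > /tmp/log_$(hostname)'
done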
 
