[SOLVED] Cluster Issue

killmasta93

Renowned Member
Aug 13, 2017
Hi,
I was wondering if someone could shed some light on an issue I'm having. I currently have a 7-node cluster, but one of the nodes seems to have lost corosync.
I ran the following:

Code:
root@prometheus5:~# pvecm status
Cannot initialize CMAP service

Then I checked the latest logs:

Code:
Nov 11 09:21:28 prometheus5 corosync[2249]: notice  [TOTEM ] A new membership (192.168.3.99:7260) was formed. Members joined: 2 1 5 left: 2 1 5
Nov 11 09:21:28 prometheus5 corosync[2249]: notice  [TOTEM ] Failed to receive the leave message. failed: 2 1 5
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] A new membership (192.168.3.99:7260) was formed. Members joined: 2 1 5 left: 2 1 5
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Failed to receive the leave message. failed: 2 1 5
Nov 11 09:21:28 prometheus5 corosync[2249]: notice  [TOTEM ] A new membership (192.168.3.99:7268) was formed. Members joined: 2 1 left: 2 1
Nov 11 09:21:28 prometheus5 corosync[2249]: notice  [TOTEM ] Failed to receive the leave message. failed: 2 1
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] A new membership (192.168.3.99:7268) was formed. Members joined: 2 1 left: 2 1
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Failed to receive the leave message. failed: 2 1
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=2
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=2
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=1
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=4
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=6
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=1
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=3
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=6
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=4
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=5
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=2
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=3
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=6
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=4
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=1
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=3
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=5
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=3
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=6
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=4
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=6
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=7
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=4
Nov 11 09:21:28 prometheus5 corosync[2249]: notice  [TOTEM ] A new membership (192.168.3.99:7276) was formed. Members joined: 2 left: 2
Nov 11 09:21:28 prometheus5 corosync[2249]: notice  [TOTEM ] Failed to receive the leave message. failed: 2
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=6
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=1
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=3
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=3
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=6
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=4
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=5
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=2
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=3
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=6
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=4
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=3
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=5
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=3
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=6
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=2
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=3
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=6
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=1
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=4
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=1
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=3
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=4
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=6
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=4
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=3
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=2
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=5
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=2
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=5
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=7
Nov 11 09:21:28 prometheus5 corosync[2249]: corosync: totemsrp.c:2871: orf_token_rtr: Assertion `range < QUEUE_RTR_ITEMS_SIZE_MAX' failed.
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=4
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=6
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=7
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=7
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] A new membership (192.168.3.99:7276) was formed. Members joined: 2 left: 2
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Failed to receive the leave message. failed: 2
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=2
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=3
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=6
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=1
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=4
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=1
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=3
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=6
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=4
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=3
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=2
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=5
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=2
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=5
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=7
Nov 11 09:21:28 prometheus5 corosync[2249]:  [TOTEM ] Discarding JOIN message during flush, nodeid=7
Nov 11 09:21:28 prometheus5 systemd[1]: corosync.service: Main process exited, code=killed, status=6/ABRT
Nov 11 09:21:28 prometheus5 systemd[1]: corosync.service: Unit entered failed state.
Nov 11 09:21:28 prometheus5 systemd[1]: corosync.service: Failed with result 'signal'.
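
In an excerpt this noisy, the two lines that matter most are the failed assertion in `totemsrp.c` and the systemd exit status `6/ABRT` (signal 6 is SIGABRT, which a failed `assert` raises). The following is a minimal sketch for pulling those two facts out of a journal dump, assuming the log text is available as a string. The function name `crash_summary` and the regular expressions are illustrative only and not part of any Proxmox or corosync tool.

```python
import re

# Three lines copied from the journal excerpt above.
LOG = """\
Nov 11 09:21:28 prometheus5 corosync[2249]: warning [TOTEM ] Discarding JOIN message during flush, nodeid=7
Nov 11 09:21:28 prometheus5 corosync[2249]: corosync: totemsrp.c:2871: orf_token_rtr: Assertion `range < QUEUE_RTR_ITEMS_SIZE_MAX' failed.
Nov 11 09:21:28 prometheus5 systemd[1]: corosync.service: Main process exited, code=killed, status=6/ABRT
"""

# glibc-style assert message: "<file>:<line>: <function>: Assertion `<expr>' failed."
ASSERT_RE = re.compile(
    r"(?P<file>\S+):(?P<line>\d+): (?P<func>\w+): Assertion `(?P<expr>.+)' failed"
)
# systemd message reporting how the main process ended.
EXIT_RE = re.compile(r"Main process exited, code=(?P<code>\w+), status=(?P<status>\S+)")


def crash_summary(text):
    """Return the first assertion failure and the first systemd exit record found."""
    summary = {}
    for line in text.splitlines():
        match = ASSERT_RE.search(line)
        if match and "assertion" not in summary:
            summary["assertion"] = match.groupdict()
        match = EXIT_RE.search(line)
        if match and "exit" not in summary:
            summary["exit"] = match.groupdict()
    return summary


summary = crash_summary(LOG)
print(summary["assertion"])
print(summary["exit"])
```

The same function can be fed the full journal text; it only reports the first match of each kind, which is usually enough to tell a crash from a clean stop.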


This is the corosync.conf of the bad node:

Code:
root@prometheus5:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: prometheus
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.3.150
  }
  node {
    name: prometheus11
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.3.216
  }
  node {
    name: prometheus12
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.3.186
  }
  node {
    name: prometheus2
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.3.152
  }
  node {
    name: prometheus4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.3.187
  }
  node {
    name: prometheus5
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.3.197
  }
  node {
    name: prometheus6
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.3.99
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: troy
  config_version: 7
  interface {
    bindnetaddr: 192.168.3.150
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

And this is the corosync.conf of a good node:

Code:
root@prometheus6:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: prometheus
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.3.150
  }
  node {
    name: prometheus11
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.3.216
  }
  node {
    name: prometheus12
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.3.186
  }
  node {
    name: prometheus2
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.3.152
  }
  node {
    name: prometheus4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.3.187
  }
  node {
    name: prometheus5
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.3.197
  }
  node {
    name: prometheus6
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.3.99
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: troy
  config_version: 7
  interface {
    bindnetaddr: 192.168.3.150
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Thank you
 
The configs are identical.
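
Comparing two long config dumps by eye is error-prone, so a check like this can also be done mechanically. The following is a minimal sketch using Python's standard `difflib`, assuming both files have been copied into strings. The helper name `config_diff` is made up for illustration, and the `config_version: 8` variant is a hypothetical value used only to show what a mismatch would look like.

```python
import difflib


def config_diff(a_text, b_text, a_name="bad node", b_name="good node"):
    """Return unified-diff lines between two configs, ignoring trailing whitespace."""
    a_lines = [line.rstrip() for line in a_text.splitlines()]
    b_lines = [line.rstrip() for line in b_text.splitlines()]
    return list(
        difflib.unified_diff(a_lines, b_lines, fromfile=a_name, tofile=b_name, lineterm="")
    )


# A fragment of the totem section posted above.
TOTEM = """\
totem {
  cluster_name: troy
  config_version: 7
}
"""

# Identical inputs produce an empty diff.
same = config_diff(TOTEM, TOTEM)
print("identical" if not same else "\n".join(same))

# Hypothetical mismatch, for illustration only.
changed = config_diff(TOTEM, TOTEM.replace("config_version: 7", "config_version: 8"))
print("\n".join(changed))
```

An empty result means the two files match line for line, apart from trailing whitespace.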

Do the logs on the other nodes show anything interesting regarding corosync?

Which versions are installed? pveversion -v
 
Thanks for the reply. This is the version I have installed; it is the same on every node in the cluster. What I'm afraid of is rebooting prometheus5, because the VMs might not start afterwards.
If I reboot and the VMs won't start, what consequences would I face if I run the following on prometheus5, the node with the issue?

pvecm expected 1
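
To make the question concrete: with seven nodes at one vote each, a simple majority rule needs four votes, so a node that sees only itself is not quorate, while lowering the expected votes to 1 lets a single node count as quorate on its own. The sketch below works through that arithmetic under the assumption of a plain majority rule (more than half of the expected votes). It ignores any special votequorum options, and the function names are illustrative rather than part of `pvecm`.

```python
def quorum_threshold(expected_votes):
    """Smallest number of votes that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(votes_present, expected_votes):
    """True if the votes currently visible reach the majority threshold."""
    return votes_present >= quorum_threshold(expected_votes)


# Seven nodes, one vote each, as in the nodelist posted above.
print(quorum_threshold(7))   # votes needed with all seven nodes expected
print(is_quorate(6, 7))      # the six nodes that still see each other
print(is_quorate(1, 7))      # prometheus5 on its own
print(is_quorate(1, 1))      # prometheus5 after lowering expected votes to 1
```

The last case is the one to be careful with: a node that is quorate by itself can change cluster state without the other six nodes seeing the change.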

Code:
root@prometheus5:~# pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-50
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-41
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-25
pve-cluster: 5.0-36
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-19
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-3
pve-xtermjs: 3.12.0-1
pve-zsync: 2.0-3~bpo5
qemu-server: 5.0-50
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2
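
To confirm that every node really runs the same package versions, the `pveversion -v` output from each node can be compared programmatically. This is a minimal sketch, assuming each node's output has been saved as text. The helper names are made up for illustration, and the second node's name and corosync version are hypothetical values used only to show a mismatch.

```python
def parse_pveversion(text):
    """Map package name to version string from `pveversion -v` style output."""
    versions = {}
    for line in text.splitlines():
        # Skip the shell prompt line and anything that is not "name: version".
        if line.startswith("root@") or ":" not in line:
            continue
        name, _, version = line.partition(":")
        versions[name.strip()] = version.strip()
    return versions


def version_mismatches(nodes):
    """Return {package: {node: version}} for packages that differ between nodes."""
    packages = sorted({pkg for versions in nodes.values() for pkg in versions})
    mismatches = {}
    for pkg in packages:
        seen = {node: versions.get(pkg) for node, versions in nodes.items()}
        if len(set(seen.values())) > 1:
            mismatches[pkg] = seen
    return mismatches


# Lines taken from the output above.
NODE_A = """\
root@prometheus5:~# pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
corosync: 2.4.4-pve1
pve-cluster: 5.0-36
"""

# Hypothetical second node with a different corosync version, for illustration only.
NODE_B = NODE_A.replace("corosync: 2.4.4-pve1", "corosync: 2.4.2-pve5")

nodes = {
    "prometheus5": parse_pveversion(NODE_A),
    "example-node": parse_pveversion(NODE_B),
}
mismatches = version_mismatches(nodes)
print(mismatches)
```

An empty dictionary would mean the compared nodes report the same version for every package.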

As for the logs on the other servers:

Code:
Nov 11 09:21:40 prometheus12 corosync[28672]: warning [CPG   ] downlist left_list: 0 received
Nov 11 09:21:40 prometheus12 corosync[28672]:  [CPG   ] downlist left_list: 0 received
Nov 11 09:21:40 prometheus12 corosync[28672]: notice  [QUORUM] This node is within the primary component and will provide service.
Nov 11 09:21:40 prometheus12 corosync[28672]: notice  [QUORUM] Members[6]: 2 1 5 6 4 3
Nov 11 09:21:40 prometheus12 corosync[28672]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 11 09:21:40 prometheus12 corosync[28672]:  [QUORUM] This node is within the primary component and will provide service.
Nov 11 09:21:40 prometheus12 corosync[28672]:  [QUORUM] Members[6]: 2 1 5 6 4 3
Nov 11 09:21:40 prometheus12 corosync[28672]:  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 11 09:30:06 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: 1d4f
Nov 11 09:30:06 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: 1d4f
Nov 11 10:00:16 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: 61a3
Nov 11 10:00:16 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: 61a3
Nov 11 12:01:11 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: 17317 17318 17319
Nov 11 12:01:11 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: 17317 17318 17319
Nov 11 12:30:38 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: 1b589
Nov 11 12:30:38 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: 1b589
Nov 11 14:00:27 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: 28088
Nov 11 14:00:27 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: 28088
Nov 11 14:00:28 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: 2809c 2809d 2809f
Nov 11 14:00:28 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: 2809c 2809d 2809f
Nov 11 15:00:41 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: 308a4
Nov 11 15:00:41 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: 308a4
Nov 11 18:00:37 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: 49fad 49fae
Nov 11 18:00:37 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: 49fad 49fae
Nov 12 06:29:37 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: b3f83
Nov 12 06:29:37 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: b3f83
Nov 12 06:29:37 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: b3f83
Nov 12 06:29:37 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: b3f83
Nov 12 09:00:59 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: c9663 c9664
Nov 12 09:00:59 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: c9663 c9664
Nov 12 13:23:16 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: ee885
Nov 12 13:23:16 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: ee885
Nov 12 13:23:16 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: ee885 ee886
Nov 12 13:23:16 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: ee885 ee886
Nov 12 13:23:16 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: ee885 ee886
Nov 12 13:23:16 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: ee885 ee886
Nov 12 13:23:16 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: ee886
Nov 12 13:23:16 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: ee886
Nov 12 13:23:16 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: ee886
Nov 12 13:23:16 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: ee886
Nov 12 13:23:16 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: ee886
Nov 12 13:23:16 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: ee886
Nov 12 13:28:57 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: ef55b ef55c ef55d ef55e ef55f ef560 ef561 ef562 ef563
Nov 12 13:28:57 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: ef55b ef55c ef55d ef55e ef55f ef560 ef561 ef562 ef563
Nov 12 13:28:57 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: ef55b ef55c ef55d ef55e ef55f ef560 ef561 ef562 ef563 ef564 ef565
Nov 12 13:28:57 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: ef55b ef55c ef55d ef55e ef55f ef560 ef561 ef562 ef563 ef564 ef565
Nov 12 13:28:57 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: ef564 ef565
Nov 12 13:28:57 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: ef564 ef565
Nov 12 13:28:57 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: ef564 ef565
Nov 12 13:28:57 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: ef564 ef565
Nov 12 13:28:57 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: ef564 ef565
Nov 12 13:28:57 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: ef564 ef565
Nov 12 14:01:49 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: f3fce f3fd0
Nov 12 14:01:49 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: f3fce f3fd0
Nov 12 14:18:59 prometheus12 corosync[28672]: notice  [TOTEM ] Retransmit List: f669b f669e
Nov 12 14:18:59 prometheus12 corosync[28672]:  [TOTEM ] Retransmit List: f669b f669e
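
The `Members[6]: 2 1 5 6 4 3` line shows that the remaining nodes formed a six-member quorum. Checking those IDs against the nodelist in corosync.conf shows which node is absent. The sketch below does that lookup, assuming the nodelist has been transcribed into a dictionary by hand. The function name `missing_members` is illustrative only.

```python
import re

# nodeid -> name, transcribed from the nodelist section of corosync.conf above.
NODELIST = {
    1: "prometheus",
    3: "prometheus11",
    6: "prometheus12",
    5: "prometheus2",
    4: "prometheus4",
    7: "prometheus5",
    2: "prometheus6",
}

# One QUORUM line copied from the prometheus12 log above.
QUORUM_LINE = (
    "Nov 11 09:21:40 prometheus12 corosync[28672]: notice  [QUORUM] Members[6]: 2 1 5 6 4 3"
)


def missing_members(quorum_line, nodelist):
    """Return {nodeid: name} for configured nodes not listed in a Members[...] line."""
    match = re.search(r"Members\[(\d+)\]:((?: \d+)+)", quorum_line)
    if match is None:
        raise ValueError("no Members[...] list found in line")
    present = {int(node_id) for node_id in match.group(2).split()}
    return {nid: name for nid, name in nodelist.items() if nid not in present}


missing = missing_members(QUORUM_LINE, NODELIST)
print(missing)
```

For the line above, the only configured ID not present is 7, which is prometheus5 and matches the node whose corosync process aborted.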
 
Thanks for the reply. I'm going to shut down and back up the VMs first before restarting corosync. As for the upgrade, I haven't tried it in a test environment yet, and my production cluster is too big to upgrade without testing first. I'll post back with any updates. Thank you again.