In a 3-node cluster (2 nodes with drbd9 storage and one for quorum only), I get corosync problems like those below 4-5 times a day.
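For reference, the corosync.conf is basically the one pvecm generated; the sketch below is from memory, so the cluster name, config_version and bindnetaddr are placeholders, but the node IDs and IPs match the logs further down (prox01 = .5 / id 1, prox02 = .4 / id 2, prox03 = .3 / id 3).
Code:
# /etc/pve/corosync.conf -- sketch from memory; cluster name,
# config_version and bindnetaddr are placeholders
totem {
  version: 2
  cluster_name: pvecluster
  config_version: 3
  interface {
    ringnumber: 0
    bindnetaddr: 192.168.1.0
  }
}

nodelist {
  node {
    ring0_addr: 192.168.1.5
    nodeid: 1
    quorum_votes: 1
  }
  node {
    ring0_addr: 192.168.1.4
    nodeid: 2
    quorum_votes: 1
  }
  node {
    ring0_addr: 192.168.1.3
    nodeid: 3
    quorum_votes: 1
  }
}

quorum {
  provider: corosync_votequorum
}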
Sometimes I open the web GUI and see the other nodes in red; pvecm status says everything is OK, but if I dare run a command such as "qm list" it hangs and there is no way to kill it and regain console control (OK, I use screen, but...).
The cluster becomes unmanageable and I have to reboot the nodes, but AFAIU the VMs are then killed on timeout instead of being shut down correctly.
The VMs use vmbr1 and DRBD a separate 10Gbit link, so vmbr0 is practically dedicated to the cluster.
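To make the wiring clearer, this is roughly how each node's /etc/network/interfaces looks (shown for prox01); the physical NIC names, the gateway and the DRBD subnet are placeholders, only the bridge roles matter here.
Code:
# roughly the layout on prox01 -- NIC names, gateway and DRBD
# subnet are placeholders, bridge roles are as described above
auto vmbr0
iface vmbr0 inet static
        address 192.168.1.5
        netmask 255.255.255.0
        gateway 192.168.1.1
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0

auto vmbr1
iface vmbr1 inet manual
        bridge_ports eth1
        bridge_stp off
        bridge_fd 0

# dedicated 10Gbit link for DRBD
auto eth2
iface eth2 inet static
        address 10.10.10.5
        netmask 255.255.255.0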
With these tests:
# omping -c 10000 -i 0.001 -F -q 192.168.1.3 192.168.1.4 192.168.1.5
# omping -c 600 -i 1 -q 192.168.1.3 192.168.1.4 192.168.1.5
I get 0% loss on each node.
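If it helps, the next time it happens I can also grab the output of these on each node (in addition to pvecm status):
Code:
# corosync-cfgtool -s                                  # ring status as corosync sees it
# corosync-quorumtool -s                               # quorum / membership details
# journalctl -u corosync -u pve-cluster --since today  # combined corosync + pmxcfs log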
I used the enterprise repo and kept upgrading until the temporary drbd9 license problems and Proxmox's decision to stop supporting it (BTW, the customer did not want to renew the license, since I don't dare to upgrade and switch to the separate DRBD repositories, nor can I tell them about this problem, which was very infrequent in the past... so I'm almost stuck here).
Looking at older logs, the problem seems much more frequent since I added a second Win2012R2 VM with the latest virtio drivers (they now have two 2008R2 and two 2012R2 VMs). At the moment that VM has no activity, since it has yet to be configured for its final use, so it can't be putting any special pressure on the system (BTW, all the VMs run on the same node, prox01).
Any idea?
Code:
root@prox01:~# pveversion -v
proxmox-ve: 4.3-71 (running kernel: 4.4.21-1-pve)
pve-manager: 4.3-9 (running version: 4.3-9/f7c6f0cd)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.2.8-1-pve: 4.2.8-41
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-92
pve-firmware: 1.1-10
libpve-common-perl: 4.0-79
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-68
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.3-12
pve-qemu-kvm: 2.7.0-4
pve-container: 1.0-80
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
drbdmanage: 0.97.3-1
Node 01 IP .5, daemon.log
Aug 7 11:39:54 prox01 rrdcached[2317]: flushing old values
Aug 7 11:39:54 prox01 rrdcached[2317]: rotating journals
Aug 7 11:39:54 prox01 rrdcached[2317]: started new journal /var/lib/rrdcached/journal/rrd.journal.1502098794.845477
Aug 7 11:39:54 prox01 rrdcached[2317]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1502091594.845507
Aug 7 12:09:54 prox01 smartd[2260]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Temperature_Case changed from 66 to 67
Aug 7 12:09:54 prox01 smartd[2260]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Temperature_Case changed from 66 to 67
Aug 7 12:28:19 prox01 corosync[2539]: [TOTEM ] FAILED TO RECEIVE
Aug 7 12:28:21 prox01 corosync[2539]: [TOTEM ] A new membership (192.168.1.5:23016) was formed. Members left: 3 2
Aug 7 12:28:21 prox01 corosync[2539]: [TOTEM ] Failed to receive the leave message. failed: 3 2
Aug 7 12:28:21 prox01 corosync[2539]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Aug 7 12:28:21 prox01 corosync[2539]: [TOTEM ] A new membership (192.168.1.3:23020) was formed. Members joined: 3 2
Aug 7 12:28:21 prox01 corosync[2539]: [QUORUM] Members[3]: 3 2 1
Aug 7 12:28:21 prox01 corosync[2539]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 7 12:39:54 prox01 rrdcached[2317]: flushing old values
Aug 7 12:39:54 prox01 rrdcached[2317]: rotating journals
Aug 7 12:39:54 prox01 rrdcached[2317]: started new journal /var/lib/rrdcached/journal/rrd.journal.1502102394.845514
Aug 7 12:39:54 prox01 rrdcached[2317]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1502095194.845499
Aug 7 12:39:54 prox01 pmxcfs[2401]: [dcdb] notice: data verification successful
Aug 7 13:39:54 prox01 pmxcfs[2401]: [dcdb] notice: data verification successful
Node 02 IP .4, daemon.log
Aug 7 11:31:20 prox02 rrdcached[2456]: flushing old values
Aug 7 11:31:20 prox02 rrdcached[2456]: rotating journals
Aug 7 11:31:20 prox02 rrdcached[2456]: started new journal /var/lib/rrdcached/journal/rrd.journal.1502098280.037047
Aug 7 11:31:20 prox02 rrdcached[2456]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1502091080.036995
Aug 7 11:39:54 prox02 pmxcfs[2538]: [dcdb] notice: data verification successful
Aug 7 12:01:20 prox02 smartd[2406]: Device: /dev/bus/0 [megaraid_disk_03] [SAT], SMART Usage Attribute: 190 Temperature_Case changed from 67 to 68
Aug 7 12:28:19 prox02 corosync[2643]: [TOTEM ] A new membership (192.168.1.3:23012) was formed. Members left: 1
Aug 7 12:28:19 prox02 corosync[2643]: [TOTEM ] Failed to receive the leave message. failed: 1
Aug 7 12:28:19 prox02 pmxcfs[2538]: [dcdb] notice: members: 2/2538, 3/1526
Aug 7 12:28:19 prox02 pmxcfs[2538]: [dcdb] notice: starting data syncronisation
Aug 7 12:28:19 prox02 corosync[2643]: [QUORUM] Members[2]: 3 2
Aug 7 12:28:19 prox02 corosync[2643]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 7 12:28:19 prox02 pmxcfs[2538]: [dcdb] notice: cpg_send_message retried 1 times
Aug 7 12:28:19 prox02 pmxcfs[2538]: [status] notice: members: 2/2538, 3/1526
Aug 7 12:28:19 prox02 pmxcfs[2538]: [status] notice: starting data syncronisation
Aug 7 12:28:21 prox02 corosync[2643]: [TOTEM ] A new membership (192.168.1.3:23016) was formed. Members
Aug 7 12:28:21 prox02 corosync[2643]: [QUORUM] Members[2]: 3 2
Aug 7 12:28:21 prox02 corosync[2643]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 7 12:28:21 prox02 pmxcfs[2538]: [dcdb] notice: received sync request (epoch 2/2538/00000004)
Aug 7 12:28:21 prox02 pmxcfs[2538]: [status] notice: received sync request (epoch 2/2538/00000004)
Aug 7 12:28:21 prox02 pmxcfs[2538]: [dcdb] notice: received all states
Aug 7 12:28:21 prox02 pmxcfs[2538]: [dcdb] notice: leader is 2/2538
Aug 7 12:28:21 prox02 pmxcfs[2538]: [dcdb] notice: synced members: 2/2538, 3/1526
Aug 7 12:28:21 prox02 pmxcfs[2538]: [dcdb] notice: start sending inode updates
Aug 7 12:28:21 prox02 pmxcfs[2538]: [dcdb] notice: sent all (0) updates
Aug 7 12:28:21 prox02 corosync[2643]: [TOTEM ] A new membership (192.168.1.3:23020) was formed. Members joined: 1
Aug 7 12:28:21 prox02 corosync[2643]: [QUORUM] Members[3]: 3 2 1
Aug 7 12:28:21 prox02 corosync[2643]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 7 12:28:21 prox02 pmxcfs[2538]: [dcdb] notice: cpg_send_message retried 1 times
Aug 7 12:28:21 prox02 pmxcfs[2538]: [dcdb] notice: all data is up to date
Aug 7 12:28:21 prox02 pmxcfs[2538]: [dcdb] notice: members: 1/2401, 2/2538, 3/1526
Aug 7 12:28:21 prox02 pmxcfs[2538]: [dcdb] notice: starting data syncronisation
Aug 7 12:28:21 prox02 pmxcfs[2538]: [status] notice: members: 1/2401, 2/2538, 3/1526
Aug 7 12:28:21 prox02 pmxcfs[2538]: [status] notice: queue not emtpy - resening 5 messages
Aug 7 12:31:20 prox02 rrdcached[2456]: flushing old values
Aug 7 12:31:20 prox02 rrdcached[2456]: rotating journals
Node 03 IP .3, daemon.log
Aug 7 11:22:14 prox03 rrdcached[1411]: flushing old values
Aug 7 11:22:14 prox03 rrdcached[1411]: rotating journals
Aug 7 11:22:14 prox03 rrdcached[1411]: started new journal /var/lib/rrdcached/journal/rrd.journal.1502097734.862596
Aug 7 11:22:14 prox03 rrdcached[1411]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1502090534.862603
Aug 7 11:39:54 prox03 pmxcfs[1526]: [dcdb] notice: data verification successful
Aug 7 12:22:14 prox03 rrdcached[1411]: flushing old values
Aug 7 12:22:14 prox03 rrdcached[1411]: rotating journals
Aug 7 12:22:14 prox03 rrdcached[1411]: started new journal /var/lib/rrdcached/journal/rrd.journal.1502101334.862523
Aug 7 12:22:14 prox03 rrdcached[1411]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1502094134.862528
Aug 7 12:28:16 prox03 corosync[1572]: [TOTEM ] Retransmit List: 10d55a
Aug 7 12:28:16 prox03 corosync[1572]: [TOTEM ] Retransmit List: 10d55a 10d55b
Aug 7 12:28:16 prox03 corosync[1572]: [TOTEM ] Retransmit List: 10d55a 10d55b 10d55c
[... many more TOTEM Retransmit List messages with the same timestamps ...]
Aug 7 12:28:19 prox03 corosync[1572]: [TOTEM ] Retransmit List: 10d55a 10d55b 10d55c 10d55d 10d55e 10d55f 10d560 10d561 10d562 10d563 10d564 10d565 10d566
Aug 7 12:28:19 prox03 corosync[1572]: [TOTEM ] Retransmit List: 10d55a 10d55b 10d55c 10d55d 10d55e 10d55f 10d560 10d561 10d562 10d563 10d564 10d565 10d566
Aug 7 12:28:19 prox03 corosync[1572]: [TOTEM ] A new membership (192.168.1.3:23012) was formed. Members left: 1
Aug 7 12:28:19 prox03 corosync[1572]: [TOTEM ] Failed to receive the leave message. failed: 1
Aug 7 12:28:19 prox03 pmxcfs[1526]: [dcdb] notice: members: 2/2538, 3/1526
Aug 7 12:28:19 prox03 pmxcfs[1526]: [dcdb] notice: starting data syncronisation
Aug 7 12:28:19 prox03 pmxcfs[1526]: [status] notice: members: 2/2538, 3/1526
Aug 7 12:28:19 prox03 pmxcfs[1526]: [status] notice: starting data syncronisation
Aug 7 12:28:19 prox03 corosync[1572]: [QUORUM] Members[2]: 3 2
Aug 7 12:28:19 prox03 corosync[1572]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 7 12:28:21 prox03 corosync[1572]: [TOTEM ] A new membership (192.168.1.3:23016) was formed. Members
Aug 7 12:28:21 prox03 corosync[1572]: [QUORUM] Members[2]: 3 2
Aug 7 12:28:21 prox03 corosync[1572]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 7 12:28:21 prox03 pmxcfs[1526]: [dcdb] notice: received sync request (epoch 2/2538/00000004)
Aug 7 12:28:21 prox03 pmxcfs[1526]: [status] notice: received sync request (epoch 2/2538/00000004)
Aug 7 12:28:21 prox03 pmxcfs[1526]: [dcdb] notice: received all states
Aug 7 12:28:21 prox03 pmxcfs[1526]: [dcdb] notice: leader is 2/2538
Aug 7 12:28:21 prox03 pmxcfs[1526]: [dcdb] notice: synced members: 2/2538, 3/1526
Aug 7 12:28:21 prox03 pmxcfs[1526]: [dcdb] notice: all data is up to date
Aug 7 12:28:21 prox03 corosync[1572]: [TOTEM ] A new membership (192.168.1.3:23020) was formed. Members joined: 1
Aug 7 12:28:21 prox03 pmxcfs[1526]: [dcdb] notice: members: 1/2401, 2/2538, 3/1526
Aug 7 12:28:21 prox03 pmxcfs[1526]: [dcdb] notice: starting data syncronisation
Aug 7 12:28:21 prox03 pmxcfs[1526]: [status] notice: members: 1/2401, 2/2538, 3/1526
Aug 7 12:28:21 prox03 pmxcfs[1526]: [status] notice: queue not emtpy - resening 5 messages
Aug 7 12:28:21 prox03 corosync[1572]: [QUORUM] Members[3]: 3 2 1
Aug 7 12:28:21 prox03 corosync[1572]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 7 13:22:14 prox03 rrdcached[1411]: flushing old values
Aug 7 13:22:14 prox03 rrdcached[1411]: rotating journals