Proxmox v5.1 - TOTEM retransmit

TwiX

Renowned Member
Hi,

On a 3-node cluster (with HA VMs), for a few days now I have been getting Totem retransmits every day at 06:00 (for a few seconds).

Code:
...
Aug 08 06:00:05 dc-prox-13 corosync[2378]: notice [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-13 corosync[2378]: [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-13 corosync[2378]: [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-13 corosync[2378]: notice [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-13 corosync[2378]: notice [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-13 corosync[2378]: [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-13 corosync[2378]: notice [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-13 corosync[2378]: [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-13 corosync[2378]: notice [TOTEM ] Retransmit List: 26c62fb 26c62fc
Aug 08 06:00:05 dc-prox-13 corosync[2378]: [TOTEM ] Retransmit List: 26c62fb 26c62fc
Aug 08 06:00:05 dc-prox-13 corosync[2378]: notice [TOTEM ] Retransmit List: 26c62fb 26c62fc
...

Code:
...
Aug 08 06:00:05 dc-prox-06 corosync[2278]: notice [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-06 corosync[2278]: notice [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-06 corosync[2278]: [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-06 corosync[2278]: [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-06 corosync[2278]: notice [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-06 corosync[2278]: [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-06 corosync[2278]: notice [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-06 corosync[2278]: [TOTEM ] Retransmit List: 26c62fb
Aug 08 06:00:05 dc-prox-06 corosync[2278]: notice [TOTEM ] Retransmit List: 26c62fb 26c62fc
Aug 08 06:00:05 dc-prox-06 corosync[2278]: [TOTEM ] Retransmit List: 26c62fb 26c62fc
Aug 08 06:00:05 dc-prox-06 corosync[2278]: notice [TOTEM ] Retransmit List: 26c62fb 26c62fc
...

These logs appear on 2 of the 3 nodes. dc-prox-06 and dc-prox-13 run 10 VMs each; the last one, dc-prox-07, hosts no VMs.

Code:
...
Aug 08 05:32:40 dc-prox-07 rrdcached[2014]: started new journal /var/lib/rrdcached/journal/rrd.journal.1533699160.906147
Aug 08 05:32:40 dc-prox-07 rrdcached[2014]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1533691960.906145
Aug 08 05:58:22 dc-prox-07 pmxcfs[2065]: [dcdb] notice: data verification successful
Aug 08 06:17:01 dc-prox-07 CRON[25764]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 08 06:17:01 dc-prox-07 CRON[25765]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 08 06:17:01 dc-prox-07 CRON[25764]: pam_unix(cron:session): session closed for user root
Aug 08 06:25:01 dc-prox-07 CRON[27769]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 08 06:25:01 dc-prox-07 CRON[27770]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
...
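For reference, ring membership and quorum can be sanity-checked on each node with the usual corosync tools (shown here just as a reminder, no options specific to my setup):

Code:
# show the TOTEM ring status and the local node id
corosync-cfgtool -s

# show quorum / membership information for the cluster
corosync-quorumtool -s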

There is no heavy load (CPU or network) at 06:00, and everything works as expected during the rest of the day, even under high CPU/network load.
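I also looked for anything scheduled around 06:00 on the nodes; something along these lines should list the obvious candidates (standard Debian paths, nothing custom assumed):

Code:
# system-wide cron entries
cat /etc/crontab
ls -l /etc/cron.d/ /etc/cron.daily/

# root's crontab
crontab -l

# systemd timers, sorted by next activation
systemctl list-timers --all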

I know I'm not running the latest Proxmox version, but I have to plan a BIOS update first.

All nodes are running the same packages:

root@dc-prox-06:/var/log# pveversion -v
proxmox-ve: 5.1-32 (running kernel: 4.13.13-2-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-2-pve: 4.13.13-32
pve-kernel-4.13.13-1-pve: 4.13.13-31
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.2-pve1

root@dc-prox-07:~# pveversion -v
proxmox-ve: 5.1-32 (running kernel: 4.13.13-2-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-2-pve: 4.13.13-32
pve-kernel-4.10.17-4-pve: 4.10.17-24
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.13.13-1-pve: 4.13.13-31
pve-kernel-4.10.17-3-pve: 4.10.17-23
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.2-pve1

root@dc-prox-13:~# pveversion -v
proxmox-ve: 5.1-32 (running kernel: 4.13.13-2-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-2-pve: 4.13.13-32
pve-kernel-4.10.17-4-pve: 4.10.17-24
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.13.13-1-pve: 4.13.13-31
pve-kernel-4.10.17-3-pve: 4.10.17-23
pve-kernel-4.10.17-1-pve: 4.10.17-18
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.2-pve1

Corosync uses a dedicated VLAN, but yes, it is on the same interface (an LACP bond of 2x1G) that we use for bridging VMs (vmbr). As I said, though, there is no traffic and no load at 06:00.
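If it helps, I can also run a multicast test around 06:00, roughly as described in the Proxmox cluster documentation (the hostnames below stand in for my corosync ring addresses; adjust as appropriate):

Code:
# run on all three nodes at the same time, ~10 minutes of probes
omping -c 600 -i 1 -q dc-prox-06 dc-prox-07 dc-prox-13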

Do you think the issue is located on dc-prox-07?

What can I do?

Thanks a lot !
 
Corosync uses a dedicated VLAN, but yes, it is on the same interface (an LACP bond of 2x1G) that we use for bridging VMs (vmbr). As I said, though, there is no traffic and no load at 06:00.
The traffic doesn't need to be on your interface; the switch itself can also be overloaded at that time.

When are your backups running, and from/to where?
 
Thanks for your reply.
Backups begin at 23:00 or 00:00 for all clusters and finish between 01:00 and 01:30.

On the same switch (a stack of two Netgear M4300 10G switches) I have another 6-node cluster without any issue.
This stack is connected to a Cisco 4507-E, which is the mrouter for the management VLAN used by my Proxmox nodes.
 
There should be more entries in the corosync logs before the retransmits start. Do the performance graphs of the switches show any spike in latency or dropped packets?
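For example, something like this on each node should show the full corosync context around the event (adjust the date and times to the occurrence you want to inspect):

Code:
journalctl -u corosync --since "2018-08-08 05:55" --until "2018-08-08 06:05"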
 
Nothing... and I don't see any dropped packets on the involved interfaces.
But I would expect retransmits to also appear on the other cluster if the switch were overloaded at that time.
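(For the record, this is roughly how I checked the counters; bond0 and the slave interface name are specific to my setup:)

Code:
# per-interface statistics, including RX/TX drops, on the bond
ip -s link show bond0

# NIC-level counters on each bond slave (interface name is an example)
ethtool -S eno1 | grep -iE 'drop|discard|err'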

OK, it's only about 10 seconds at 06:00, but it's still annoying...

What about changing window_size in corosync.conf to a value greater than 50, the default? If so, what is the best way to do it (considering I have a lot of HA VMs)?
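Something like this in the totem section, I suppose (just a sketch; the value is a guess on my side, not something I have tested):

Code:
totem {
  ...
  # default is 50; a larger value allows more messages per token rotation
  window_size: 300
}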

Thanks again!
 
OK, it's only about 10 seconds at 06:00, but it's still annoying...

What about changing window_size in corosync.conf to a value greater than 50, the default? If so, what is the best way to do it (considering I have a lot of HA VMs)?
From my understanding, a slow processor would be even more visible during business hours, when there is more load on the nodes; from your description, that's not the case. It's a recurring event, so there has to be a trigger, and if it isn't in the logs, set the log level higher. See 'man corosync.conf' for the logging options.
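For example, something along these lines in corosync.conf (check the man page for the exact options; debug output is verbose, so turn it off again once you have captured an occurrence):

Code:
logging {
  to_syslog: yes
  debug: on
}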
 
Thanks,

The processors are:
- 32 x Intel(R) Xeon(R) CPU E5-2640 v3 with 128 GiB RAM for both dc-prox-06 and dc-prox-07 - CPU utilization doesn't exceed 30% at peak (that's on dc-prox-06; dc-prox-07 is empty).
- 40 x Intel(R) Xeon(R) CPU E5-2630 v4 with 128 GiB RAM for dc-prox-13 - CPU utilization doesn't exceed 10% at peak.

I think I will restart the entire cluster first.

Will editing /etc/pve/corosync.conf automatically restart the corosync service?
Can it have consequences for the HA VMs?

Thanks!
 
The config file gets updated, yes, but the corosync service doesn't pick it up during runtime.
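Roughly the usual way to apply such a change (sketch only; with HA guests, make sure the cluster keeps quorum and restart corosync on one node at a time, waiting for it to rejoin before moving on):

Code:
# work on a copy, don't edit /etc/pve/corosync.conf in place
cp /etc/pve/corosync.conf /root/corosync.conf.new
# edit /root/corosync.conf.new: apply the change and increment config_version

# keep a backup, then activate the new config (it is synced to all nodes)
cp /etc/pve/corosync.conf /root/corosync.conf.bak
mv /root/corosync.conf.new /etc/pve/corosync.conf

# corosync does not reload it at runtime, so restart the service node by node
systemctl restart corosync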
 
