Proxmox - 4 corosync [TOTEM ] Retransmit List: watchdog - reboot

dimon99 · New Member · Joined Mar 27, 2015 · Ukraine, Dnepropetrovsk
Hi guys. I need some help with Proxmox 4. I have an issue with my 4-node cluster. All nodes run the same PVE version. Corosync has some transmit problems:
[TOTEM ] Retransmit List: 298b1 298b2 298b3 298b4 298b5 298b6 298b7 298b8 298b9 298ba 298bb 298bc 298bd 298be 298bf 298c0 298c1

I see this in my logs before my servers go down. They just reboot. I started to dig and found that the watchdog/softdog is the reason. I did not have any problems with version 3, but now it has become a real problem. Our servers reboot once every 5 days, or more frequently; it's possible to have them rebooted twice a day. We tried moving our servers onto a separate switch in case of network problems, but it did not help. We use HA with only one purpose - to have a common management interface for all the servers in the cluster. We do not use the other features. Is there any chance to turn off the watchdog/softdog? Or maybe another option could help us, because the reboots are driving us crazy.

dpkg --list | grep pve-

ii libpve-access-control 4.0-9 amd64 Proxmox VE access control library
ii libpve-common-perl 4.0-36 all Proxmox VE base library
ii libpve-storage-perl 4.0-29 all Proxmox VE storage management library
ii pve-cluster 4.0-24 amd64 Cluster Infrastructure for Proxmox Virtual Environment
ii pve-container 1.0-21 all Proxmox VE Container management tool
ii pve-firewall 2.0-13 amd64 Proxmox VE Firewall
ii pve-firmware 1.1-7 all Binary firmware code for the pve-kernel
ii pve-ha-manager 1.0-13 amd64 Proxmox VE HA Manager
ii pve-kernel-4.2.3-2-pve 4.2.3-22 amd64 The Proxmox PVE Kernel Image
ii pve-libspice-server1 0.12.5-2 amd64 SPICE remote display system server library
ii pve-manager 4.0-57 amd64 The Proxmox Virtual Environment
ii pve-qemu-kvm 2.4-12
ii corosync-pve 2.3.5-1 amd64 Standards-based cluster framework (daemon and modules)
ii libcorosync4-pve 2.3.5-1 amd64 Standards-based cluster framework (libraries)

Thank you in advance
 
Hi,

turning the watchdog off is only possible if you deactivate HA.
Do you have shared storage like NFS, iSCSI, ...?
If yes, is the storage network separate from the cluster network?
 

No. We tried to use GlusterFS as storage, but it's not reliable - we had some difficulties managing VMs on it. If sync breaks between the GlusterFS bricks, the VM does not start, and when power failed there was no failover between nodes. Managed manually it worked quite okay, so we wanted failover in case of power problems, but now we have this watchdog issue. We decided not to use GlusterFS anymore, but it is still on and we still have some bricks configured. We tried to unconfigure the watchdog manually - deleting modules and so on - but no luck.
We made some changes in corosync.conf - added transport udpu - and have some hope that it will help.
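For reference, the udpu change in corosync.conf looks roughly like this (the cluster name and subnet below are placeholders, not our real values):

```
totem {
  version: 2
  secauth: on
  cluster_name: mycluster      # placeholder
  config_version: 5            # must be bumped on every edit
  transport: udpu              # unicast UDP instead of multicast
  interface {
    ringnumber: 0
    bindnetaddr: 192.168.2.0   # cluster subnet
  }
}
```

Note that with udpu, corosync also needs a nodelist {} section listing every member's ring0_addr, and the edited file has to be propagated to all nodes before restarting corosync.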
 
I would say disable HA, because there is no reason for it here - for HA you need distributed/shared storage.
Disable HA and you will have no problems with the watchdog.
The watchdog is only active if HA is set.
 

So the only way to disable HA is to destroy the cluster? I will lose the ability to manage all nodes from one window. And what about the corosync TOTEM retransmits?
 
Cluster and HA are not the same thing.
The watchdog is only activated when you put a VM/CT under HA.
This is configured in the Datacenter tab, under HA.
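On the command line you can check whether anything is currently under HA management, and remove it, with the ha-manager tool (vm:100 below is an example resource ID, not a real one from this cluster):

```
# list resources currently managed by HA; if nothing is listed,
# the watchdog stays idle
ha-manager status

# remove a resource from HA management (vm:100 is an example ID)
ha-manager remove vm:100
```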
 
Yes. We do not use the watchdog if HA is not configured.

Is there any chance to find out what the problem is? We have 4 servers, each with its own UPS. There is no information in the log before the crash, only:

corosync[1319]: [TOTEM ] Retransmit List: 62 63 64
Dec 3 08:03:42 warehouse corosync[1319]: [TOTEM ] Retransmit List: 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f 80 81 82
Dec 3 08:03:42 warehouse corosync[1319]: [TOTEM ] Retransmit List: 83 84 85 86
Dec 3 08:03:42 warehouse corosync[1319]: [TOTEM ] Retransmit List: 87 88 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f a0 a1 a2 a3 a4
Dec 3 08:03:42 warehouse corosync[1319]: [TOTEM ] Retransmit List: a5 a6 a7 a8 a9 aa ab ac ad ae af b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf c0 c1 c2
Dec 3 08:03:42 warehouse corosync[1319]: [TOTEM ] Retransmit List: c3 c4 c5
Dec 3 08:03:42 warehouse corosync[1319]: [TOTEM ] A new membership (192.168.2.161:90852) was formed. Members left: 1
Dec 3 08:03:42 warehouse corosync[1319]: [TOTEM ] Failed to receive the leave message. failed: 1
Dec 3 08:03:42 warehouse pmxcfs[900]: [dcdb] notice: members: 2/1666, 3/1473, 4/900
Dec 3 08:03:42 warehouse pmxcfs[900]: [dcdb] notice: starting data syncronisation
Dec 3 08:03:42 warehouse pmxcfs[900]: [status] notice: members: 2/1666, 3/1473, 4/900
Dec 3 08:03:42 warehouse pmxcfs[900]: [status] notice: starting data syncronisation
Dec 3 08:03:42 warehouse corosync[1319]: [QUORUM] Members[3]: 3 2 4
Dec 3 08:03:42 warehouse corosync[1319]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 3 08:03:43 warehouse pmxcfs[900]: [dcdb] notice: received sync request (epoch 2/1666/00000005)
Dec 3 08:03:43 warehouse pmxcfs[900]: [status] notice: received sync request (epoch 2/1666/00000005)
Dec 3 08:03:43 warehouse pmxcfs[900]: [dcdb] notice: received all states

This repeats several times, and finally:

Dec 3 08:14:04 warehouse corosync[1319]: [TOTEM ] Retransmit List: 2008 1fe5 1fe6 1fe7 1fe8 1fe9 1fea 1feb 1fec 1fed 1fee 1fef 1ff0 1ff1 1ff2 1ff3 1ff7 1ff8 1ffd 1fff
Dec 3 08:14:04 warehouse corosync[1319]: [TOTEM ] Retransmit List: 1fe5 1fe6 1fe7 1fe8 1fe9 1fea 1feb 1fec 1fed 1fee 1fef 1ff0 1ff1 1ff2 1ff3 1ff7 1ff8 1ffd 1fff 200e
Dec 3 08:14:04 warehouse corosync[1319]: [TOTEM ] Retransmit List: 1fe5 1fe6 1fe7 1fe8 1fe9 1fea 1feb 1fec 1fed 1fee 1fef 1ff0 1ff1 1ff2 1ff3 1ff7 1ff8 1ffd 1fff
Then the server goes down with no further information.
Only the boot logs follow:

Dec 3 08:14:04 warehouse corosync[1319]: [TOTEM ] Retransmit List: 1fe5 1fe6 1fe7 1fe8 1fe9 1fea 1feb 1fec 1fed 1fee 1fef 1ff0 1ff1 1ff2 1ff3 1ff7 1ff8 1ffd 1fff
[a long run of NUL bytes ('^@') left in the log by the unclean shutdown]
Dec 3 08:15:09 warehouse rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="1082" x-info="http://www.rsyslog.com"] start
Dec 3 08:15:09 warehouse rsyslogd-2307: warning: ~ action is deprecated, consider using the 'stop' statement instead [try http://www.rsyslog.com/e/2307 ]
Dec 3 08:15:09 warehouse systemd-modules-load[248]: Module 'fuse' is builtin
Dec 3 08:15:09 warehouse systemd-modules-load[248]: Module 'loop' is builtin
Dec 3 08:15:09 warehouse systemd-modules-load[248]: Inserted module 'vhost_net'
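Growing retransmit lists like these usually point at multicast trouble on the cluster network rather than at the watchdog itself. A way to verify this is to run omping simultaneously on all nodes and watch the multicast loss figures (the hostnames here are placeholders for your four nodes):

```
# run this same command on every node at the same time;
# -c 10000 -i 0.001 sends 10000 probes at 1 ms intervals,
# -F exits after the test, -q keeps output to the summary
omping -c 10000 -i 0.001 -F -q node1 node2 node3 node4
```

If the multicast loss reported is non-zero while unicast is clean, the switch (e.g. IGMP snooping without a querier) is the likely culprit.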
 
Have you had any luck with this? I realize this post is over a year old, but I'm seeing the same issue and would love to know if you were able to find a way to address it while still preserving HA.
 
