Proxmox - 4 corosync [TOTEM ] Retransmit List: watchdog - reboot

dimon99 · New Member · Joined Mar 27, 2015 · Ukraine, Dnepropetrovsk
Hi guys. I need some help with Proxmox 4. I have an issue with my 4-node cluster. All nodes run the same PVE version. Corosync has some transmit problems:
[TOTEM ] Retransmit List: 298b1 298b2 298b3 298b4 298b5 298b6 298b7 298b8 298b9 298ba 298bb 298bc 298bd 298be 298bf 298c0 298c1

I see this in my logs before my servers go down. They just reboot. I started to dig and found that the watchdog/softdog is the reason. I did not have any problems with version 3, but now it has become a real problem. Our servers reboot once every 5 days, or more frequently; it's possible to have them rebooted twice a day. We tried moving our servers onto a separate switch in case of network problems, but it did not help. We use HA with only one purpose - to have a common management interface for all the servers in the cluster. We do not use the other features. Is there any chance to turn off the watchdog/softdog? Or maybe another option could help us, because the reboots are driving us crazy.

dpkg --list | grep pve-

ii libpve-access-control 4.0-9 amd64 Proxmox VE access control library
ii libpve-common-perl 4.0-36 all Proxmox VE base library
ii libpve-storage-perl 4.0-29 all Proxmox VE storage management library
ii pve-cluster 4.0-24 amd64 Cluster Infrastructure for Proxmox Virtual Environment
ii pve-container 1.0-21 all Proxmox VE Container management tool
ii pve-firewall 2.0-13 amd64 Proxmox VE Firewall
ii pve-firmware 1.1-7 all Binary firmware code for the pve-kernel
ii pve-ha-manager 1.0-13 amd64 Proxmox VE HA Manager
ii pve-kernel-4.2.3-2-pve 4.2.3-22 amd64 The Proxmox PVE Kernel Image
ii pve-libspice-server1 0.12.5-2 amd64 SPICE remote display system server library
ii pve-manager 4.0-57 amd64 The Proxmox Virtual Environment
ii pve-qemu-kvm 2.4-12
ii corosync-pve 2.3.5-1 amd64 Standards-based cluster framework (daemon and modules)
ii libcorosync4-pve 2.3.5-1 amd64 Standards-based cluster framework (libraries)

Thank you in advance
 
Hi,

turning the watchdog off is only possible if you deactivate HA.
Do you have shared storage like NFS, iSCSI, ...?
If yes, is the storage network separate from the cluster network?
 

No. We tried to use GlusterFS as storage, but it's not reliable - we had some difficulties managing VMs on it. If sync breaks between the GlusterFS bricks, the VM does not start, and when power failed there was no failover between nodes. Managed manually it worked quite okay, so we wanted failover in case of power problems, but now we have this watchdog issue. We decided not to use GlusterFS anymore, but it is still on and we still have some bricks configured. We tried to unconfigure the watchdog manually - deleting modules and so on - but no luck.
We made some changes in corosync.conf - added transport udpu - and have some hope that it will help.
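For reference, the udpu change in corosync.conf looks roughly like this (the cluster name and subnet below are placeholders, not our real values):

```
totem {
  version: 2
  secauth: on
  cluster_name: mycluster      # placeholder
  config_version: 5            # must be bumped on every edit
  transport: udpu              # unicast UDP instead of multicast
  interface {
    ringnumber: 0
    bindnetaddr: 192.168.2.0   # cluster subnet
  }
}
```

Note that with udpu, corosync also needs a nodelist {} section listing every member's ring0_addr, and the edited file has to be propagated to all nodes before restarting corosync.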
 
I would say disable HA, because there is no reason for it here - for HA you need distributed/shared storage.
Disable HA and you will have no problems with the watchdog.
The watchdog is only active if HA is set.
 

So the only way to disable HA is to destroy the cluster? I will lose the ability to manage all nodes from one window. And what about the corosync TOTEM retransmits?
 
Cluster and HA are not the same thing.
The watchdog is only activated when you put a VM/CT under HA.
This is configured in the Datacenter tab, under HA.
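On the command line you can check whether anything is currently under HA management, and remove it, with the ha-manager tool (vm:100 below is an example resource ID, not a real one from this cluster):

```
# list resources currently managed by HA; if nothing is listed,
# the watchdog stays idle
ha-manager status

# remove a resource from HA management (vm:100 is an example ID)
ha-manager remove vm:100
```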
 
Yes. We do not use the watchdog if HA is not configured.

Is there any chance to find out what the problem is? We have 4 servers, each with its own UPS. There is no information in the log before the crash, only:

corosync[1319]: [TOTEM ] Retransmit List: 62 63 64
Dec 3 08:03:42 warehouse corosync[1319]: [TOTEM ] Retransmit List: 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f 80 81 82
Dec 3 08:03:42 warehouse corosync[1319]: [TOTEM ] Retransmit List: 83 84 85 86
Dec 3 08:03:42 warehouse corosync[1319]: [TOTEM ] Retransmit List: 87 88 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f a0 a1 a2 a3 a4
Dec 3 08:03:42 warehouse corosync[1319]: [TOTEM ] Retransmit List: a5 a6 a7 a8 a9 aa ab ac ad ae af b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf c0 c1 c2
Dec 3 08:03:42 warehouse corosync[1319]: [TOTEM ] Retransmit List: c3 c4 c5
Dec 3 08:03:42 warehouse corosync[1319]: [TOTEM ] A new membership (192.168.2.161:90852) was formed. Members left: 1
Dec 3 08:03:42 warehouse corosync[1319]: [TOTEM ] Failed to receive the leave message. failed: 1
Dec 3 08:03:42 warehouse pmxcfs[900]: [dcdb] notice: members: 2/1666, 3/1473, 4/900
Dec 3 08:03:42 warehouse pmxcfs[900]: [dcdb] notice: starting data syncronisation
Dec 3 08:03:42 warehouse pmxcfs[900]: [status] notice: members: 2/1666, 3/1473, 4/900
Dec 3 08:03:42 warehouse pmxcfs[900]: [status] notice: starting data syncronisation
Dec 3 08:03:42 warehouse corosync[1319]: [QUORUM] Members[3]: 3 2 4
Dec 3 08:03:42 warehouse corosync[1319]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 3 08:03:43 warehouse pmxcfs[900]: [dcdb] notice: received sync request (epoch 2/1666/00000005)
Dec 3 08:03:43 warehouse pmxcfs[900]: [status] notice: received sync request (epoch 2/1666/00000005)
Dec 3 08:03:43 warehouse pmxcfs[900]: [dcdb] notice: received all states

This repeats several times, and finally:

Dec 3 08:14:04 warehouse corosync[1319]: [TOTEM ] Retransmit List: 2008 1fe5 1fe6 1fe7 1fe8 1fe9 1fea 1feb 1fec 1fed 1fee 1fef 1ff0 1ff1 1ff2 1ff3 1ff7 1ff8 1ffd 1fff
Dec 3 08:14:04 warehouse corosync[1319]: [TOTEM ] Retransmit List: 1fe5 1fe6 1fe7 1fe8 1fe9 1fea 1feb 1fec 1fed 1fee 1fef 1ff0 1ff1 1ff2 1ff3 1ff7 1ff8 1ffd 1fff 200e
Dec 3 08:14:04 warehouse corosync[1319]: [TOTEM ] Retransmit List: 1fe5 1fe6 1fe7 1fe8 1fe9 1fea 1feb 1fec 1fed 1fee 1fef 1ff0 1ff1 1ff2 1ff3 1ff7 1ff8 1ffd 1fff
Then the server goes down with no further information.
Only the boot logs follow:

Dec 3 08:14:04 warehouse corosync[1319]: [TOTEM ] Retransmit List: 1fe5 1fe6 1fe7 1fe8 1fe9 1fea 1feb 1fec 1fed 1fee 1fef 1ff0 1ff1 1ff2 1ff3 1ff7 1ff8 1ffd 1fff
[a long run of NUL bytes ('^@') left in the log by the unclean shutdown]
Dec 3 08:15:09 warehouse rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="1082" x-info="http://www.rsyslog.com"] start
Dec 3 08:15:09 warehouse rsyslogd-2307: warning: ~ action is deprecated, consider using the 'stop' statement instead [try http://www.rsyslog.com/e/2307 ]
Dec 3 08:15:09 warehouse systemd-modules-load[248]: Module 'fuse' is builtin
Dec 3 08:15:09 warehouse systemd-modules-load[248]: Module 'loop' is builtin
Dec 3 08:15:09 warehouse systemd-modules-load[248]: Inserted module 'vhost_net'
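Growing retransmit lists like these usually point at multicast trouble on the cluster network rather than at the watchdog itself. A way to verify this is to run omping simultaneously on all nodes and watch the multicast loss figures (the hostnames here are placeholders for your four nodes):

```
# run this same command on every node at the same time;
# -c 10000 -i 0.001 sends 10000 probes at 1 ms intervals,
# -F exits after the test, -q keeps output to the summary
omping -c 10000 -i 0.001 -F -q node1 node2 node3 node4
```

If the multicast loss reported is non-zero while unicast is clean, the switch (e.g. IGMP snooping without a querier) is the likely culprit.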
 
Have you had any luck with this? I realize this post is over a year old, but I'm seeing the same issue and would love to know if you were able to find a way to address it while still preserving HA.
 
