Random Node Freeze/Fence and TOTEM Retransmit List problems

Discussion in 'Proxmox VE: Installation and configuration' started by mohnewald, Dec 6, 2018.

  1. mohnewald

    mohnewald New Member
    Proxmox VE Subscriber

    Hello List,

    I have a 3-node cluster with Ceph.
    The nodes are connected to each other with unicast.
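    For reference, unicast in corosync 2.x is selected with the udpu transport in the totem section of /etc/pve/corosync.conf. A minimal sketch of that part (not my full config, ring/interface details left out):

    totem {
      version: 2
      transport: udpu
      ...
    }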

    node08: pve-manager/5.3-5/97ae681d (running kernel: 4.15.18-9-pve)
    node09: pve-manager/5.2-9/4b30e8f9 (running kernel: 4.15.18-7-pve)
    node10: pve-manager/5.2-10/6f892b40 (running kernel: 4.15.18-7-pve)

    These nodes have been "freezing" from time to time since 29.07.2018:
    29.07.2018 | node09
    07.08.2018 | node08
    21.08.2018 | node10
    22.08.2018 | node08
    07.09.2018 | node09
    11.09.2018 | node08
    13.09.2018 | node10
    19.09.2018 | node08
    26.09.2018 | node09
    28.09.2018 | node08
    16.10.2018 | node09
    05.12.2018 | node08

    I have the EXACT same 3-node hardware running Proxmox 4.4, which does not have these problems.
    Until now, I thought it was a kernel/NIC driver problem. My thread about that is here:
    https://forum.proxmox.com/threads/periodic-node-crash-freeze.46407/

    However, I seem to have quite a few corosync/TOTEM log entries like this:

    node09 corosync[5290]: [TOTEM ] Retransmit List: 18f46c9
    node09 corosync[5290]: notice [TOTEM ] Retransmit List: 17ae461
    node09 corosync[5290]: [TOTEM ] Retransmit List: 17ae461
    node09 corosync[5290]: notice [TOTEM ] Retransmit List: 168516b
    node09 corosync[5290]: [TOTEM ] Retransmit List: 168516b


    root@node08:~ # zgrep -c '\[TOTEM \] Retransmit List' /var/log/syslog*
    /var/log/syslog:14
    /var/log/syslog.1:6
    /var/log/syslog.2.gz:0
    /var/log/syslog.3.gz:2
    /var/log/syslog.4.gz:12
    /var/log/syslog.5.gz:2
    /var/log/syslog.6.gz:0
    /var/log/syslog.7.gz:26

    root@node09:~ # zgrep -c '\[TOTEM \] Retransmit List' /var/log/syslog*
    /var/log/syslog:0
    /var/log/syslog.1:48
    /var/log/syslog.2.gz:32
    /var/log/syslog.3.gz:4
    /var/log/syslog.4.gz:14
    /var/log/syslog.5.gz:2
    /var/log/syslog.6.gz:2
    /var/log/syslog.7.gz:0

    root@node10:~ # zgrep -c '\[TOTEM \] Retransmit List' /var/log/syslog*
    /var/log/syslog:2
    /var/log/syslog.1:84
    /var/log/syslog.2.gz:8
    /var/log/syslog.3.gz:70
    /var/log/syslog.4.gz:14
    /var/log/syslog.5.gz:4
    /var/log/syslog.6.gz:10
    /var/log/syslog.7.gz:16
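    To see whether the retransmit bursts line up with the freeze dates, something like this counts them per day (rough sketch, assuming the default syslog timestamp format):

    root@node09:~ # zgrep -h '\[TOTEM \] Retransmit List' /var/log/syslog* | awk '{print $1, $2}' | sort | uniq -c | sort -rn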

    My healthy cluster still running Proxmox 4.4 does not show these entries, but maybe that is just because the load is not the same?

    Here are my real questions:
    1.) Did node08 die, or did node08 freeze AFTER some fence magic happened?
    2.) Why does node08 not log any TOTEM/corosync retransmit entries?
    3.) My heartbeats run over the SAN/Ceph network. That's okay, right?
    4.) Is this some weird unicast problem? Would multicast fix it?
    5.) Can I test/benchmark my UDP unicast somehow, e.g. with omping?

    Here are the logs from my last crash-freeze of node08:
    ------------------------------------------------------------


    Node08:
    ---------
    Dec 5 15:52:31 node08 systemd[477301]: Reached target Default.
    Dec 5 15:52:31 node08 systemd[477301]: Startup finished in 32ms.
    Dec 5 15:52:31 node08 systemd[1]: Started User Manager for UID 0.
    Dec 5 15:52:31 node08 systemd[1]: Stopping User Manager for UID 0...
    Dec 5 15:52:31 node08 systemd[477301]: Stopped target Default.
    Dec 5 15:52:31 node08 systemd[477301]: Stopped target Basic System.
    Dec 5 15:52:31 node08 systemd[477301]: Stopped target Paths.
    Dec 5 15:52:31 node08 systemd[477301]: Stopped target Timers.
    Dec 5 15:52:31 node08 systemd[477301]: Stopped target Sockets.
    Dec 5 15:52:31 node08 systemd[477301]: Reached target Shutdown.
    Dec 5 15:52:31 node08 systemd[477301]: Starting Exit the Session...
    Dec 5 15:52:31 node08 systemd[477301]: Received SIGRTMIN+24 from PID 477353 (kill).
    Dec 5 15:52:31 node08 systemd[1]: Stopped User Manager for UID 0.
    Dec 5 15:52:31 node08 systemd[1]: Removed slice User Slice of root.
    ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^...



    Node09:
    ---------
    Dec 5 15:52:30 node09 systemd[1256497]: Received SIGRTMIN+24 from PID 1256530 (kill).
    Dec 5 15:52:30 node09 systemd[1]: Stopped User Manager for UID 0.
    Dec 5 15:52:30 node09 systemd[1]: Removed slice User Slice of root.
    Dec 5 15:52:41 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d10b
    Dec 5 15:52:41 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d10b
    Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d10d
    Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d10d
    Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d110
    Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d111
    Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d110
    Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d111
    Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d112
    Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d112
    Dec 5 15:52:43 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d113
    Dec 5 15:52:43 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d113
    Dec 5 15:52:44 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d115 1b7d116 1b7d117 1b7d118 1b7d119 1b7d11a 1b7d11b
    Dec 5 15:52:44 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d115 1b7d116 1b7d117 1b7d118 1b7d119 1b7d11a 1b7d11b
    Dec 5 15:52:44 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d11d
    ...SNIP...
    Dec 5 15:52:53 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d165
    Dec 5 15:52:53 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d166
    Dec 5 15:52:53 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d167
    Dec 5 15:52:53 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d167
    Dec 5 15:52:54 node09 corosync[5290]: notice [TOTEM ] A processor failed, forming new configuration.
    Dec 5 15:52:54 node09 corosync[5290]: [TOTEM ] A processor failed, forming new configuration.
    Dec 5 15:53:50 node09 pve-ha-crm[5801]: node 'node08': state changed from 'unknown' => 'fence'
    Dec 5 15:55:00 node09 pve-ha-crm[5801]: node 'node08': state changed from 'fence' => 'unknown'


    Node10:
    --------
    Dec 5 15:52:00 node10 systemd[1]: Starting Proxmox VE replication runner...
    Dec 5 15:52:01 node10 systemd[1]: Started Proxmox VE replication runner.
    Dec 5 15:52:54 node10 corosync[5295]: notice [TOTEM ] A processor failed, forming new configuration.
    Dec 5 15:52:54 node10 corosync[5295]: [TOTEM ] A processor failed, forming new configuration.
    Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584700 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6830 osd.2 since back 2018-12-05 15:52:36.458621 front 2018-12-05 15:52:36.458621 (cutoff 2018-12-05 15:52:36.584696)
    Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584717 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6817 osd.10 since back 2018-12-05 15:52:43.360369 front 2018-12-05 15:52:36.458621 (cutoff 2018-12-05 15:52:36.584696)
    Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584722 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6803 osd.11 since back 2018-12-05 15:52:47.961863 front 2018-12-05 15:52:36.458621 (cutoff 2018-12-05 15:52:36.584696)
    Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584726 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6828 osd.13 since back 2018-12-05 15:52:36.458621 front 2018-12-05 15:52:42.259818 (cutoff 2018-12-05 15:52:36.584696)
    Dec 5 15:52:56 node10 corosync[5295]: notice [TOTEM ] A new membership (10.15.15.9:10164) was formed. Members left: 1
    Dec 5 15:52:56 node10 corosync[5295]: notice [TOTEM ] Failed to receive the leave message. failed: 1
    Dec 5 15:52:56 node10 corosync[5295]: [TOTEM ] A new membership (10.15.15.9:10164) was formed. Members left: 1
    Dec 5 15:52:56 node10 corosync[5295]: [TOTEM ] Failed to receive the leave message. failed: 1
    Dec 5 15:52:56 node10 corosync[5295]: warning [CPG ] downlist left_list: 1 received

    Thanks a lot,
    Mario
     
  2. Alwin

    Alwin Proxmox Staff Member
    1. It was most likely fenced.
    2. It probably couldn't log anything anymore.
    3. Corosync (and Ceph, for that matter) needs low latency, and any other traffic interferes. Put corosync on its own physical network.
    4. Unicast produces way more traffic and eats way more resources than multicast. Check your switch load. Better use multicast.
    5. omping is able to check both multicast and unicast; see the example below.
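    For example, run something like this on all three nodes at the same time and compare the unicast and multicast loss/latency it reports (a sketch; adjust count and interval as needed and use your real node names or addresses):

    omping -c 600 -i 1 -q node08 node09 node10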
     
  3. spirit

    spirit Well-Known Member
    Proxmox VE Subscriber

    You can also try to increase the corosync token timeout (maybe your switch latency is too high for unicast and your number of nodes):

    /etc/pve/corosync.conf

    totem {
      ...
      token: 4000
    }

    for example
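    If you change it, also increase config_version in /etc/pve/corosync.conf so the new config is distributed to all nodes. Afterwards you can check which value corosync is actually running with, e.g. (a sketch; the exact key name may differ between corosync versions):

    corosync-cmapctl | grep totem.token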
     