Random Node Freeze/Fence and TOTEM Retransmit List problems

Discussion in 'Proxmox VE: Installation and configuration' started by mohnewald, Dec 6, 2018.

  1. mohnewald

    mohnewald New Member
    Proxmox VE Subscriber

    Joined:
    Aug 21, 2018
    Messages:
    15
    Likes Received:
    0
    Hello List,

    I have a 3-node cluster with Ceph.
    The nodes are connected to each other via unicast.

    node08: pve-manager/5.3-5/97ae681d (running kernel: 4.15.18-9-pve)
    node09: pve-manager/5.2-9/4b30e8f9 (running kernel: 4.15.18-7-pve)
    node10: pve-manager/5.2-10/6f892b40 (running kernel: 4.15.18-7-pve)

    These nodes have been "freezing" from time to time since 29.07.2018:
    29.07.2018 | node09
    07.08.2018 | node08
    21.08.2018 | node10
    22.08.2018 | node08
    07.09.2018 | node09
    11.09.2018 | node08
    13.09.2018 | node10
    19.09.2018 | node08
    26.09.2018 | node09
    28.09.2018 | node08
    16.10.2018 | node09
    05.12.2018 | node08

    I have the EXACT same 3-node hardware running Proxmox 4.4, which does not have these problems.
    Until now, I thought it was a kernel/NIC driver problem. My thread about that is here:
    https://forum.proxmox.com/threads/periodic-node-crash-freeze.46407/

    However, I seem to have quite a few corosync/TOTEM log entries like these:

    node09 corosync[5290]: [TOTEM ] Retransmit List: 18f46c9
    node09 corosync[5290]: notice [TOTEM ] Retransmit List: 17ae461
    node09 corosync[5290]: [TOTEM ] Retransmit List: 17ae461
    node09 corosync[5290]: notice [TOTEM ] Retransmit List: 168516b
    node09 corosync[5290]: [TOTEM ] Retransmit List: 168516b


    root@node08:~ # zgrep -c '\[TOTEM \] Retransmit List' /var/log/syslog*
    /var/log/syslog:14
    /var/log/syslog.1:6
    /var/log/syslog.2.gz:0
    /var/log/syslog.3.gz:2
    /var/log/syslog.4.gz:12
    /var/log/syslog.5.gz:2
    /var/log/syslog.6.gz:0
    /var/log/syslog.7.gz:26

    root@node09:~ # zgrep -c '\[TOTEM \] Retransmit List' /var/log/syslog*
    /var/log/syslog:0
    /var/log/syslog.1:48
    /var/log/syslog.2.gz:32
    /var/log/syslog.3.gz:4
    /var/log/syslog.4.gz:14
    /var/log/syslog.5.gz:2
    /var/log/syslog.6.gz:2
    /var/log/syslog.7.gz:0

    root@node10:~ # zgrep -c '\[TOTEM \] Retransmit List' /var/log/syslog*
    /var/log/syslog:2
    /var/log/syslog.1:84
    /var/log/syslog.2.gz:8
    /var/log/syslog.3.gz:70
    /var/log/syslog.4.gz:14
    /var/log/syslog.5.gz:4
    /var/log/syslog.6.gz:10
    /var/log/syslog.7.gz:16
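To see whether the retransmits cluster around particular times (for example during backups or heavy Ceph traffic), the per-file counts above can be broken down per hour. This is a minimal sketch, assuming the classic syslog timestamp layout seen in the excerpts and assuming that the lines without the `notice` tag are duplicates of the tagged ones. `retransmits_per_hour` is a hypothetical helper name, not part of corosync or Proxmox.

```shell
#!/bin/sh
# Tally "[TOTEM ] Retransmit List" entries per hour from syslog lines on stdin.
# Assumes timestamps look like "Dec 5 15:52:41 host ...".
# Only lines tagged "notice" are counted, so each event is counted once
# (assumption: the untagged lines are duplicates of the tagged ones).
retransmits_per_hour() {
    awk '/notice +\[TOTEM *\] Retransmit List/ {
             split($3, t, ":")                  # $3 is HH:MM:SS
             count[$1 " " $2 " " t[1] ":00"]++  # key: "Mon D HH:00"
         }
         END { for (k in count) print k, count[k] }' | sort
}
```

It could be fed with something like `zcat -f /var/log/syslog* | retransmits_per_hour`; the output has one line per hour in the form `Dec 5 15:00 <count>`.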

    My healthy cluster, still running Proxmox 4.4, does not have these problems, but maybe that is only because the load is not the same?

    Here are my actual questions:
    1.) Did node08 die on its own, or did it freeze AFTER some fencing magic happened?
    2.) Why does node08 not log any TOTEM/corosync retransmit messages?
    3.) My heartbeats run over the SAN/Ceph network. That's okay, right?
    4.) Is this some weird unicast problem? Would multicast fix it?
    5.) Can I test/benchmark my UDP unicast somehow, for example with omping?

    Here are the logs from my last crash-freeze of node08:
    ------------------------------------------------------------


    Node08:
    ---------
    Dec 5 15:52:31 node08 systemd[477301]: Reached target Default.
    Dec 5 15:52:31 node08 systemd[477301]: Startup finished in 32ms.
    Dec 5 15:52:31 node08 systemd[1]: Started User Manager for UID 0.
    Dec 5 15:52:31 node08 systemd[1]: Stopping User Manager for UID 0...
    Dec 5 15:52:31 node08 systemd[477301]: Stopped target Default.
    Dec 5 15:52:31 node08 systemd[477301]: Stopped target Basic System.
    Dec 5 15:52:31 node08 systemd[477301]: Stopped target Paths.
    Dec 5 15:52:31 node08 systemd[477301]: Stopped target Timers.
    Dec 5 15:52:31 node08 systemd[477301]: Stopped target Sockets.
    Dec 5 15:52:31 node08 systemd[477301]: Reached target Shutdown.
    Dec 5 15:52:31 node08 systemd[477301]: Starting Exit the Session...
    Dec 5 15:52:31 node08 systemd[477301]: Received SIGRTMIN+24 from PID 477353 (kill).
    Dec 5 15:52:31 node08 systemd[1]: Stopped User Manager for UID 0.
    Dec 5 15:52:31 node08 systemd[1]: Removed slice User Slice of root.
    ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^...



    Node09:
    ---------
    Dec 5 15:52:30 node09 systemd[1256497]: Received SIGRTMIN+24 from PID 1256530 (kill).
    Dec 5 15:52:30 node09 systemd[1]: Stopped User Manager for UID 0.
    Dec 5 15:52:30 node09 systemd[1]: Removed slice User Slice of root.
    Dec 5 15:52:41 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d10b
    Dec 5 15:52:41 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d10b
    Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d10d
    Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d10d
    Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d110
    Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d111
    Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d110
    Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d111
    Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d112
    Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d112
    Dec 5 15:52:43 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d113
    Dec 5 15:52:43 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d113
    Dec 5 15:52:44 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d115 1b7d116 1b7d117 1b7d118 1b7d119 1b7d11a 1b7d11b
    Dec 5 15:52:44 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d115 1b7d116 1b7d117 1b7d118 1b7d119 1b7d11a 1b7d11b
    Dec 5 15:52:44 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d11d
    ...SNIP...
    Dec 5 15:52:53 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d165
    Dec 5 15:52:53 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d166
    Dec 5 15:52:53 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d167
    Dec 5 15:52:53 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d167
    Dec 5 15:52:54 node09 corosync[5290]: notice [TOTEM ] A processor failed, forming new configuration.
    Dec 5 15:52:54 node09 corosync[5290]: [TOTEM ] A processor failed, forming new configuration.
    Dec 5 15:53:50 node09 pve-ha-crm[5801]: node 'node08': state changed from 'unknown' => 'fence'
    Dec 5 15:55:00 node09 pve-ha-crm[5801]: node 'node08': state changed from 'fence' => 'unknown'


    Node10:
    --------
    Dec 5 15:52:00 node10 systemd[1]: Starting Proxmox VE replication runner...
    Dec 5 15:52:01 node10 systemd[1]: Started Proxmox VE replication runner.
    Dec 5 15:52:54 node10 corosync[5295]: notice [TOTEM ] A processor failed, forming new configuration.
    Dec 5 15:52:54 node10 corosync[5295]: [TOTEM ] A processor failed, forming new configuration.
    Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584700 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6830 osd.2 since back 2018-12-05 15:52:36.458621 front 2018-12-05 15:52:36.458621 (cutoff 2018-12-05 15:52:36.584696)
    Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584717 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6817 osd.10 since back 2018-12-05 15:52:43.360369 front 2018-12-05 15:52:36.458621 (cutoff 2018-12-05 15:52:36.584696)
    Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584722 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6803 osd.11 since back 2018-12-05 15:52:47.961863 front 2018-12-05 15:52:36.458621 (cutoff 2018-12-05 15:52:36.584696)
    Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584726 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6828 osd.13 since back 2018-12-05 15:52:36.458621 front 2018-12-05 15:52:42.259818 (cutoff 2018-12-05 15:52:36.584696)
    Dec 5 15:52:56 node10 corosync[5295]: notice [TOTEM ] A new membership (10.15.15.9:10164) was formed. Members left: 1
    Dec 5 15:52:56 node10 corosync[5295]: notice [TOTEM ] Failed to receive the leave message. failed: 1
    Dec 5 15:52:56 node10 corosync[5295]: [TOTEM ] A new membership (10.15.15.9:10164) was formed. Members left: 1
    Dec 5 15:52:56 node10 corosync[5295]: [TOTEM ] Failed to receive the leave message. failed: 1
    Dec 5 15:52:56 node10 corosync[5295]: warning [CPG ] downlist left_list: 1 received
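In the node09 excerpt above, the first Retransmit List entry appears at 15:52:41 and the "A processor failed" entry at 15:52:54, a gap of 13 seconds. A minimal sketch for pulling that gap out of a log automatically, assuming both events fall on the same day and the syslog timestamp layout shown above. `retransmit_to_failure_secs` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the seconds between the first "Retransmit List" entry and the first
# following "A processor failed" entry in syslog lines read from stdin.
# Only HH:MM:SS is compared, so both events must be on the same day.
retransmit_to_failure_secs() {
    awk 'function secs(hms,   p) { split(hms, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
         /Retransmit List/ && start == "" { start = secs($3) }
         /A processor failed/ && start != "" { print secs($3) - start; exit }'
}
```

Usage could look like `grep corosync /var/log/syslog | retransmit_to_failure_secs`. Comparing the result with the configured token timeout is only a rough check, not a diagnosis.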

    Thanks a lot,
    Mario
     
  2. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,899
    Likes Received:
    163
    1. It was most likely fenced.
    2. It probably couldn't log anything anymore.
    3. Corosync (and Ceph, for that matter) needs low latency, and any other traffic interferes. Put it on its own physical network.
    4. Unicast produces far more traffic and uses far more resources than multicast. Check your switch load. Better to use multicast.
    5. omping can check both multicast and unicast.
     
  3. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,233
    Likes Received:
    119
    You can also try increasing the corosync token timeout (maybe your switch latency is too high for unicast with your number of nodes).

    /etc/pve/corosync.conf

    totem {
    ..
    # token timeout in milliseconds
    token: 4000
    }

    for example
     
  4. mohnewald

    Hello,

    I am still struggling with these random node freezes/crashes.
    I run a 3-node cluster with unicast (NO SWITCH) in a full-mesh network.

    [​IMG]

    I think I was able to reproduce the problem in a 3-node cluster set up under VMware by introducing some packet loss on the corosync network. Fencing kicks in and I get those @@@@@@@@ in the logs.

    I then moved corosync off the Ceph/SAN network onto its own network:

    CEPH/SAN: 10.15.15.0/24 (10Gbit dedicated just for ceph)
    COROSYNC: 10.15.16.0/24 (10Gbit dedicated just for corosync)

    I read somewhere that you should never run corosync on the SAN/Ceph network. I also changed the corosync token to 8000 (8 seconds).

    But I still get entries like this from time to time:
    corosync[1861153]: notice [TOTEM ] Retransmit List: 23a3c4 23a3c5
    No crash/fencing so far. Here are the numbers of Retransmit entries since I put corosync on a dedicated network:


    Node08:
    ---------
    /var/log/syslog: 0
    /var/log/syslog.1: 2
    /var/log/syslog.2.gz: 0

    Node09:
    --------
    /var/log/syslog: 10
    /var/log/syslog.1: 2


    Node10:
    ------------
    /var/log/syslog:0
    /var/log/syslog.1: 34
    /var/log/syslog.2.gz:0


    On node09 I got some Retransmit entries today at about 10:38:47.
    Here are the logs from 10:37 and 10:38 on all three nodes, collected with:
    grep -e 'Jan 16 10:37' -e 'Jan 16 10:38' /var/log/*.log



    root@node08:
    ----------------
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351104]: rexec line 17: Deprecated option KeyRegenerationInterval
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351104]: rexec line 18: Deprecated option ServerKeyBits
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351104]: rexec line 29: Deprecated option RSAAuthentication
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351104]: rexec line 37: Deprecated option RhostsRSAAuthentication
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351104]: reprocess config line 29: Deprecated option RSAAuthentication
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351104]: reprocess config line 37: Deprecated option RhostsRSAAuthentication
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351104]: Accepted publickey for root from 192.168.41.5 port 36761 ssh2: RSA SHA256:YjAqGC9Pj1iFQjIOvaDLz0e7T4ze0onEy6tnSYmVn/w
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351104]: pam_unix(sshd:session): session opened for user root by (uid=0)
    /var/log/auth.log:Jan 16 10:37:47 node08 systemd-logind[1403]: New session 31238 of user root.
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351104]: Received disconnect from 192.168.41.5 port 36761:11: disconnected by user
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351104]: Disconnected from 192.168.41.5 port 36761
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351104]: pam_unix(sshd:session): session closed for user root
    /var/log/auth.log:Jan 16 10:37:47 node08 systemd-logind[1403]: Removed session 31238.
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351127]: rexec line 17: Deprecated option KeyRegenerationInterval
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351127]: rexec line 18: Deprecated option ServerKeyBits
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351127]: rexec line 29: Deprecated option RSAAuthentication
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351127]: rexec line 37: Deprecated option RhostsRSAAuthentication
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351127]: reprocess config line 29: Deprecated option RSAAuthentication
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351127]: reprocess config line 37: Deprecated option RhostsRSAAuthentication
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351127]: Accepted publickey for root from 192.168.41.5 port 36765 ssh2: RSA SHA256:YjAqGC9Pj1iFQjIOvaDLz0e7T4ze0onEy6tnSYmVn/w
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351127]: pam_unix(sshd:session): session opened for user root by (uid=0)
    /var/log/auth.log:Jan 16 10:37:47 node08 systemd-logind[1403]: New session 31239 of user root.
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351127]: Received disconnect from 192.168.41.5 port 36765:11: disconnected by user
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351127]: Disconnected from 192.168.41.5 port 36765
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351127]: pam_unix(sshd:session): session closed for user root
    /var/log/auth.log:Jan 16 10:37:47 node08 systemd-logind[1403]: Removed session 31239.
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351155]: rexec line 17: Deprecated option KeyRegenerationInterval
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351155]: rexec line 18: Deprecated option ServerKeyBits
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351155]: rexec line 29: Deprecated option RSAAuthentication
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351155]: rexec line 37: Deprecated option RhostsRSAAuthentication
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351155]: reprocess config line 29: Deprecated option RSAAuthentication
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351155]: reprocess config line 37: Deprecated option RhostsRSAAuthentication
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351155]: Accepted publickey for root from 192.168.41.5 port 36768 ssh2: RSA SHA256:YjAqGC9Pj1iFQjIOvaDLz0e7T4ze0onEy6tnSYmVn/w
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351155]: pam_unix(sshd:session): session opened for user root by (uid=0)
    /var/log/auth.log:Jan 16 10:37:47 node08 systemd-logind[1403]: New session 31240 of user root.
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351155]: Received disconnect from 192.168.41.5 port 36768:11: disconnected by user
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351155]: Disconnected from 192.168.41.5 port 36768
    /var/log/auth.log:Jan 16 10:37:47 node08 sshd[2351155]: pam_unix(sshd:session): session closed for user root
    /var/log/auth.log:Jan 16 10:37:47 node08 systemd-logind[1403]: Removed session 31240.
    /var/log/auth.log:Jan 16 10:38:01 node08 CRON[2351272]: pam_unix(cron:session): session opened for user root by (uid=0)
    /var/log/auth.log:Jan 16 10:38:01 node08 CRON[2351272]: pam_unix(cron:session): session closed for user root
    /var/log/auth.log:Jan 16 10:38:54 node08 sshd[2351662]: rexec line 17: Deprecated option KeyRegenerationInterval
    /var/log/auth.log:Jan 16 10:38:54 node08 sshd[2351662]: rexec line 18: Deprecated option ServerKeyBits
    /var/log/auth.log:Jan 16 10:38:54 node08 sshd[2351662]: rexec line 29: Deprecated option RSAAuthentication
    /var/log/auth.log:Jan 16 10:38:54 node08 sshd[2351662]: rexec line 37: Deprecated option RhostsRSAAuthentication
    /var/log/auth.log:Jan 16 10:38:54 node08 sshd[2351662]: Connection closed by 192.168.41.5 port 37017 [preauth]
    /var/log/daemon.log:Jan 16 10:37:00 node08 systemd[1]: Starting Proxmox VE replication runner...
    /var/log/daemon.log:Jan 16 10:37:00 node08 systemd[1]: Started Proxmox VE replication runner.
    /var/log/daemon.log:Jan 16 10:37:47 node08 systemd[1]: Started Session 31238 of user root.
    /var/log/daemon.log:Jan 16 10:37:47 node08 systemd[1]: Started Session 31239 of user root.
    /var/log/daemon.log:Jan 16 10:37:47 node08 systemd[1]: Started Session 31240 of user root.
    /var/log/daemon.log:Jan 16 10:38:00 node08 systemd[1]: Starting Proxmox VE replication runner...
    /var/log/daemon.log:Jan 16 10:38:00 node08 systemd[1]: Started Proxmox VE replication runner.





    root@node09:
    ----------------
    /var/log/auth.log:Jan 16 10:37:49 node09 sshd[2998257]: Accepted publickey for root from 192.168.41.5 port 49708 ssh2: RSA SHA256:YjAqGC9Pj1iFQjIOvaDLz0e7T4ze0onEy6tnSYmVn/w
    /var/log/auth.log:Jan 16 10:37:49 node09 sshd[2998257]: pam_unix(sshd:session): session opened for user root by (uid=0)
    /var/log/auth.log:Jan 16 10:37:49 node09 systemd-logind[1390]: New session 51793 of user root.
    /var/log/auth.log:Jan 16 10:37:49 node09 sshd[2998257]: Received disconnect from 192.168.41.5 port 49708:11: disconnected by user
    /var/log/auth.log:Jan 16 10:37:49 node09 sshd[2998257]: Disconnected from 192.168.41.5 port 49708
    /var/log/auth.log:Jan 16 10:37:49 node09 sshd[2998257]: pam_unix(sshd:session): session closed for user root
    /var/log/auth.log:Jan 16 10:37:49 node09 systemd-logind[1390]: Removed session 51793.
    /var/log/auth.log:Jan 16 10:38:02 node09 sshd[2998364]: Accepted publickey for root from 192.168.41.5 port 49728 ssh2: RSA SHA256:YjAqGC9Pj1iFQjIOvaDLz0e7T4ze0onEy6tnSYmVn/w
    /var/log/auth.log:Jan 16 10:38:02 node09 sshd[2998364]: pam_unix(sshd:session): session opened for user root by (uid=0)
    /var/log/auth.log:Jan 16 10:38:02 node09 systemd-logind[1390]: New session 51794 of user root.
    /var/log/auth.log:Jan 16 10:38:02 node09 sshd[2998364]: Received disconnect from 192.168.41.5 port 49728:11: disconnected by user
    /var/log/auth.log:Jan 16 10:38:02 node09 sshd[2998364]: Disconnected from 192.168.41.5 port 49728
    /var/log/auth.log:Jan 16 10:38:02 node09 sshd[2998364]: pam_unix(sshd:session): session closed for user root
    /var/log/auth.log:Jan 16 10:38:02 node09 systemd-logind[1390]: Removed session 51794.
    /var/log/auth.log:Jan 16 10:38:54 node09 sshd[2998594]: Connection closed by 192.168.41.5 port 49951 [preauth]
    /var/log/daemon.log:Jan 16 10:37:00 node09 systemd[1]: Starting Proxmox VE replication runner...
    /var/log/daemon.log:Jan 16 10:37:01 node09 systemd[1]: Started Proxmox VE replication runner.
    /var/log/daemon.log:Jan 16 10:37:49 node09 systemd[1]: Started Session 51793 of user root.
    /var/log/daemon.log:Jan 16 10:38:00 node09 systemd[1]: Starting Proxmox VE replication runner...
    /var/log/daemon.log:Jan 16 10:38:01 node09 systemd[1]: Started Proxmox VE replication runner.
    /var/log/daemon.log:Jan 16 10:38:02 node09 systemd[1]: Started Session 51794 of user root.
    /var/log/daemon.log:Jan 16 10:38:47 node09 corosync[1861153]: notice [TOTEM ] Retransmit List: 23a3c4
    /var/log/daemon.log:Jan 16 10:38:47 node09 corosync[1861153]: notice [TOTEM ] Retransmit List: 23a3c4 23a3c5
    /var/log/daemon.log:Jan 16 10:38:47 node09 corosync[1861153]: [TOTEM ] Retransmit List: 23a3c4
    /var/log/daemon.log:Jan 16 10:38:47 node09 corosync[1861153]: notice [TOTEM ] Retransmit List: 23a3c4 23a3c5
    /var/log/daemon.log:Jan 16 10:38:47 node09 corosync[1861153]: notice [TOTEM ] Retransmit List: 23a3c4 23a3c5
    /var/log/daemon.log:Jan 16 10:38:47 node09 corosync[1861153]: [TOTEM ] Retransmit List: 23a3c4 23a3c5
    /var/log/daemon.log:Jan 16 10:38:47 node09 corosync[1861153]: notice [TOTEM ] Retransmit List: 23a3c4 23a3c5
    /var/log/daemon.log:Jan 16 10:38:47 node09 corosync[1861153]: [TOTEM ] Retransmit List: 23a3c4 23a3c5
    /var/log/daemon.log:Jan 16 10:38:47 node09 corosync[1861153]: [TOTEM ] Retransmit List: 23a3c4 23a3c5
    /var/log/daemon.log:Jan 16 10:38:47 node09 corosync[1861153]: [TOTEM ] Retransmit List: 23a3c4 23a3c5


    Node10:
    ---------------
    /var/log/auth.log:Jan 16 10:37:53 node10 sshd[2288027]: rexec line 17: Deprecated option KeyRegenerationInterval
    /var/log/auth.log:Jan 16 10:37:53 node10 sshd[2288027]: rexec line 18: Deprecated option ServerKeyBits
    /var/log/auth.log:Jan 16 10:37:53 node10 sshd[2288027]: rexec line 29: Deprecated option RSAAuthentication
    /var/log/auth.log:Jan 16 10:37:53 node10 sshd[2288027]: rexec line 37: Deprecated option RhostsRSAAuthentication
    /var/log/auth.log:Jan 16 10:37:53 node10 sshd[2288027]: reprocess config line 29: Deprecated option RSAAuthentication
    /var/log/auth.log:Jan 16 10:37:53 node10 sshd[2288027]: reprocess config line 37: Deprecated option RhostsRSAAuthentication
    /var/log/auth.log:Jan 16 10:37:53 node10 sshd[2288027]: Accepted publickey for root from 192.168.41.5 port 52020 ssh2: RSA SHA256:YjAqGC9Pj1iFQjIOvaDLz0e7T4ze0onEy6tnSYmVn/w
    /var/log/auth.log:Jan 16 10:37:53 node10 sshd[2288027]: pam_unix(sshd:session): session opened for user root by (uid=0)
    /var/log/auth.log:Jan 16 10:37:53 node10 systemd-logind[1432]: New session 29732 of user root.
    /var/log/auth.log:Jan 16 10:37:53 node10 sshd[2288027]: Received disconnect from 192.168.41.5 port 52020:11: disconnected by user
    /var/log/auth.log:Jan 16 10:37:53 node10 sshd[2288027]: Disconnected from 192.168.41.5 port 52020
    /var/log/auth.log:Jan 16 10:37:53 node10 sshd[2288027]: pam_unix(sshd:session): session closed for user root
    /var/log/auth.log:Jan 16 10:37:53 node10 systemd-logind[1432]: Removed session 29732.
    /var/log/auth.log:Jan 16 10:38:01 node10 CRON[2288128]: pam_unix(cron:session): session opened for user root by (uid=0)
    /var/log/auth.log:Jan 16 10:38:01 node10 CRON[2288128]: pam_unix(cron:session): session closed for user root
    /var/log/auth.log:Jan 16 10:38:28 node10 sshd[2288326]: rexec line 17: Deprecated option KeyRegenerationInterval
    /var/log/auth.log:Jan 16 10:38:28 node10 sshd[2288326]: rexec line 18: Deprecated option ServerKeyBits
    /var/log/auth.log:Jan 16 10:38:28 node10 sshd[2288326]: rexec line 29: Deprecated option RSAAuthentication
    /var/log/auth.log:Jan 16 10:38:28 node10 sshd[2288326]: rexec line 37: Deprecated option RhostsRSAAuthentication
    /var/log/auth.log:Jan 16 10:38:28 node10 sshd[2288326]: Connection closed by 192.168.41.5 port 52127 [preauth]
    /var/log/daemon.log:Jan 16 10:37:00 node10 systemd[1]: Starting Proxmox VE replication runner...
    /var/log/daemon.log:Jan 16 10:37:01 node10 systemd[1]: Started Proxmox VE replication runner.
    /var/log/daemon.log:Jan 16 10:38:00 node10 systemd[1]: Starting Proxmox VE replication runner...
    /var/log/daemon.log:Jan 16 10:38:01 node10 systemd[1]: Started Proxmox VE replication runner.


    Some more information:


    /etc/pve/corosync.conf:

    logging {
      debug: off
      to_syslog: yes
    }

    nodelist {
      node {
        name: node08
        nodeid: 1
        quorum_votes: 1
        ring0_addr: 10.15.16.8
      }

      node {
        name: node10
        nodeid: 3
        quorum_votes: 1
        ring0_addr: 10.15.16.10
      }

      node {
        name: node09
        nodeid: 2
        quorum_votes: 1
        ring0_addr: 10.15.16.9
      }
    }

    quorum {
      provider: corosync_votequorum
    }

    totem {
      cluster_name: cluster3
      config_version: 7
      ip_version: ipv4
      secauth: on
      transport: udpu
      version: 2
      token: 8000
      interface {
        bindnetaddr: 10.15.16.8
        ringnumber: 0
      }
    }


    pvecm status
    Quorum information
    ------------------
    Date: Wed Jan 16 14:14:22 2019
    Quorum provider: corosync_votequorum
    Nodes: 3
    Node ID: 0x00000003
    Ring ID: 1/10248
    Quorate: Yes

    Votequorum information
    ----------------------
    Expected votes: 3
    Highest expected: 3
    Total votes: 3
    Quorum: 2
    Flags: Quorate

    Membership information
    ----------------------
    Nodeid Votes Name
    0x00000001 1 10.15.16.8
    0x00000002 1 10.15.16.9
    0x00000003 1 10.15.16.10 (local)


    root@node10:~ # ceph -s
    cluster:
    id: 1e5c2d93-43b6-418a-a272-35308fc4a761
    health: HEALTH_OK

    services:
    mon: 3 daemons, quorum 0,1,2
    mgr: node08(active), standbys: node09, node10
    osd: 20 osds: 20 up, 20 in

    data:
    pools: 2 pools, 612 pgs
    objects: 1.82M objects, 6.86TiB
    usage: 20.6TiB used, 51.9TiB / 72.5TiB avail
    pgs: 612 active+clean

    io:
    client: 7.46MiB/s rd, 12.1MiB/s wr, 381op/s rd, 755op/s wr

    root@node10:~ # ceph mon stat
    e6: 3 mons at {0=10.15.15.8:6789/0,1=10.15.15.9:6789/0,2=10.15.15.10:6789/0}, election epoch 816, leader 0 0, quorum 0,1,2 0,1,2



    root@node08:~ # route -n
    Kernel IP routing table
    Destination Gateway Genmask Flags Metric Ref Use Iface
    0.0.0.0 192.168.41.5 0.0.0.0 UG 0 0 0 eth2
    10.15.15.0 0.0.0.0 255.255.255.0 U 0 0 0 ens4
    10.15.15.0 0.0.0.0 255.255.255.0 U 0 0 0 ens3
    10.15.15.9 0.0.0.0 255.255.255.255 UH 0 0 0 ens3
    10.15.15.10 0.0.0.0 255.255.255.255 UH 0 0 0 ens4
    10.15.16.0 0.0.0.0 255.255.255.0 U 0 0 0 eth4
    10.15.16.0 0.0.0.0 255.255.255.0 U 0 0 0 eth5
    10.15.16.9 0.0.0.0 255.255.255.255 UH 0 0 0 eth5
    10.15.16.10 0.0.0.0 255.255.255.255 UH 0 0 0 eth4
    192.168.41.0 0.0.0.0 255.255.255.0 U 0 0 0 eth2


    root@node09:~ # route -n
    Kernel IP routing table
    Destination Gateway Genmask Flags Metric Ref Use Iface
    0.0.0.0 192.168.41.5 0.0.0.0 UG 0 0 0 eth2
    10.15.15.0 0.0.0.0 255.255.255.0 U 0 0 0 ens3
    10.15.15.0 0.0.0.0 255.255.255.0 U 0 0 0 ens4
    10.15.15.8 0.0.0.0 255.255.255.255 UH 0 0 0 ens3
    10.15.15.10 0.0.0.0 255.255.255.255 UH 0 0 0 ens4
    10.15.16.0 0.0.0.0 255.255.255.0 U 0 0 0 eth4
    10.15.16.0 0.0.0.0 255.255.255.0 U 0 0 0 eth5
    10.15.16.8 0.0.0.0 255.255.255.255 UH 0 0 0 eth4
    10.15.16.10 0.0.0.0 255.255.255.255 UH 0 0 0 eth5
    192.168.41.0 0.0.0.0 255.255.255.0 U 0 0 0 eth2



    root@node10:~ # route -n
    Kernel IP routing table
    Destination Gateway Genmask Flags Metric Ref Use Iface
    0.0.0.0 192.168.41.5 0.0.0.0 UG 0 0 0 eth2
    10.15.15.0 0.0.0.0 255.255.255.0 U 0 0 0 ens3
    10.15.15.0 0.0.0.0 255.255.255.0 U 0 0 0 ens4
    10.15.15.8 0.0.0.0 255.255.255.255 UH 0 0 0 ens4
    10.15.15.9 0.0.0.0 255.255.255.255 UH 0 0 0 ens3
    10.15.16.0 0.0.0.0 255.255.255.0 U 0 0 0 eth4
    10.15.16.0 0.0.0.0 255.255.255.0 U 0 0 0 eth5
    10.15.16.8 0.0.0.0 255.255.255.255 UH 0 0 0 eth5
    10.15.16.9 0.0.0.0 255.255.255.255 UH 0 0 0 eth4
    192.168.41.0 0.0.0.0 255.255.255.0 U 0 0 0 eth2



    I was told I can test unicast with omping, but I get:

    root@node09:~ # omping -c 10000 -i 0.001 -F -q 10.15.16.9 10.15.16.10
    omping: Multiple local interfaces (eth5 and eth4) match parameters.
    => obviously, since this is a full-mesh unicast network
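The omping message indicates that more than one local interface matches the address it was given. A minimal sketch for listing addresses that are configured on several interfaces, assuming the one-line output format of `ip -o -4 addr show` (index, interface name, `inet`, address/prefix). `dup_addrs` is a hypothetical helper name, and whether the same address really is configured on both eth4 and eth5 here is an assumption based on the routing tables above.

```shell
#!/bin/sh
# Read `ip -o -4 addr show` output on stdin and print every IPv4 address
# that is configured on more than one interface, followed by those interfaces.
dup_addrs() {
    awk '$3 == "inet" {
             # $2 is the interface name, $4 is address/prefix
             if ($4 in ifs) ifs[$4] = ifs[$4] "," $2
             else           ifs[$4] = $2
             n[$4]++
         }
         END { for (a in n) if (n[a] > 1) print a, ifs[a] }' | sort
}
```

Run as `ip -o -4 addr show | dup_addrs`. Any line it prints names an address that omping cannot map to a single interface.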
     
  5. Alwin

  6. mohnewald

    Hello Alwin,

    That wiki link is new to me. But yes, I use "Method 2".
    I think there was an official Proxmox unicast how-to which described Method 2, but I can't find it anymore.

    Will Method 1 (multicast) fix my problem?

    Thanks,
    Mario
     