Random Node Freeze/Fence and TOTEM Retransmit List problems

Discussion in 'Proxmox VE: Installation and configuration' started by mohnewald, Dec 6, 2018.

  1. mohnewald

    mohnewald New Member
    Proxmox VE Subscriber

    Hello List,

    I have a 3-node cluster with Ceph.
    The nodes are connected to each other with unicast.
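    For reference, unicast in corosync 2.x is selected with the udpu transport in the totem section of /etc/pve/corosync.conf. A minimal sketch of that part (not my full config, ring/interface details left out):

    totem {
      version: 2
      transport: udpu
      ...
    }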

    node08: pve-manager/5.3-5/97ae681d (running kernel: 4.15.18-9-pve)
    node09: pve-manager/5.2-9/4b30e8f9 (running kernel: 4.15.18-7-pve)
    node10: pve-manager/5.2-10/6f892b40 (running kernel: 4.15.18-7-pve)

    These nodes have been "freezing" from time to time since 29.07.2018:
    29.07.2018 | node09
    07.08.2018 | node08
    21.08.2018 | node10
    22.08.2018 | node08
    07.09.2018 | node09
    11.09.2018 | node08
    13.09.2018 | node10
    19.09.2018 | node08
    26.09.2018 | node09
    28.09.2018 | node08
    16.10.2018 | node09
    05.12.2018 | node08

    I have the EXACT same 3-node hardware running Proxmox 4.4, which does not have these problems.
    Until now, I thought it was a kernel/NIC driver problem. My thread about that is here:
    https://forum.proxmox.com/threads/periodic-node-crash-freeze.46407/

    However, I seem to have quite a few corosync/TOTEM log entries like this:

    node09 corosync[5290]: [TOTEM ] Retransmit List: 18f46c9
    node09 corosync[5290]: notice [TOTEM ] Retransmit List: 17ae461
    node09 corosync[5290]: [TOTEM ] Retransmit List: 17ae461
    node09 corosync[5290]: notice [TOTEM ] Retransmit List: 168516b
    node09 corosync[5290]: [TOTEM ] Retransmit List: 168516b


    root@node08:~ # zgrep -c '\[TOTEM \] Retransmit List' /var/log/syslog*
    /var/log/syslog:14
    /var/log/syslog.1:6
    /var/log/syslog.2.gz:0
    /var/log/syslog.3.gz:2
    /var/log/syslog.4.gz:12
    /var/log/syslog.5.gz:2
    /var/log/syslog.6.gz:0
    /var/log/syslog.7.gz:26

    root@node09:~ # zgrep -c '\[TOTEM \] Retransmit List' /var/log/syslog*
    /var/log/syslog:0
    /var/log/syslog.1:48
    /var/log/syslog.2.gz:32
    /var/log/syslog.3.gz:4
    /var/log/syslog.4.gz:14
    /var/log/syslog.5.gz:2
    /var/log/syslog.6.gz:2
    /var/log/syslog.7.gz:0

    root@node10:~ # zgrep -c '\[TOTEM \] Retransmit List' /var/log/syslog*
    /var/log/syslog:2
    /var/log/syslog.1:84
    /var/log/syslog.2.gz:8
    /var/log/syslog.3.gz:70
    /var/log/syslog.4.gz:14
    /var/log/syslog.5.gz:4
    /var/log/syslog.6.gz:10
    /var/log/syslog.7.gz:16
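    To see whether the retransmit bursts line up with the freeze dates, something like this counts them per day (rough sketch, assuming the default syslog timestamp format):

    root@node09:~ # zgrep -h '\[TOTEM \] Retransmit List' /var/log/syslog* | awk '{print $1, $2}' | sort | uniq -c | sort -rn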

    My healthy cluster still running Proxmox 4.4 does not show these entries, but maybe that is just because the load is not the same?

    Here are my real questions:
    1.) Did node08 die, or did node08 freeze AFTER some fence magic happened?
    2.) Why does node08 not log any TOTEM/corosync retransmit entries?
    3.) My heartbeats run over the SAN/Ceph network. That's okay, right?
    4.) Is this some weird unicast problem? Would multicast fix it?
    5.) Can I test/benchmark my UDP unicast somehow, e.g. with omping?

    Here are the logs from my last crash-freeze of node08:
    ------------------------------------------------------------


    Node08:
    ---------
    Dec 5 15:52:31 node08 systemd[477301]: Reached target Default.
    Dec 5 15:52:31 node08 systemd[477301]: Startup finished in 32ms.
    Dec 5 15:52:31 node08 systemd[1]: Started User Manager for UID 0.
    Dec 5 15:52:31 node08 systemd[1]: Stopping User Manager for UID 0...
    Dec 5 15:52:31 node08 systemd[477301]: Stopped target Default.
    Dec 5 15:52:31 node08 systemd[477301]: Stopped target Basic System.
    Dec 5 15:52:31 node08 systemd[477301]: Stopped target Paths.
    Dec 5 15:52:31 node08 systemd[477301]: Stopped target Timers.
    Dec 5 15:52:31 node08 systemd[477301]: Stopped target Sockets.
    Dec 5 15:52:31 node08 systemd[477301]: Reached target Shutdown.
    Dec 5 15:52:31 node08 systemd[477301]: Starting Exit the Session...
    Dec 5 15:52:31 node08 systemd[477301]: Received SIGRTMIN+24 from PID 477353 (kill).
    Dec 5 15:52:31 node08 systemd[1]: Stopped User Manager for UID 0.
    Dec 5 15:52:31 node08 systemd[1]: Removed slice User Slice of root.
    ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^...



    Node09:
    ---------
    Dec 5 15:52:30 node09 systemd[1256497]: Received SIGRTMIN+24 from PID 1256530 (kill).
    Dec 5 15:52:30 node09 systemd[1]: Stopped User Manager for UID 0.
    Dec 5 15:52:30 node09 systemd[1]: Removed slice User Slice of root.
    Dec 5 15:52:41 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d10b
    Dec 5 15:52:41 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d10b
    Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d10d
    Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d10d
    Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d110
    Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d111
    Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d110
    Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d111
    Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d112
    Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d112
    Dec 5 15:52:43 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d113
    Dec 5 15:52:43 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d113
    Dec 5 15:52:44 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d115 1b7d116 1b7d117 1b7d118 1b7d119 1b7d11a 1b7d11b
    Dec 5 15:52:44 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d115 1b7d116 1b7d117 1b7d118 1b7d119 1b7d11a 1b7d11b
    Dec 5 15:52:44 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d11d
    ...SNIP...
    Dec 5 15:52:53 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d165
    Dec 5 15:52:53 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d166
    Dec 5 15:52:53 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d167
    Dec 5 15:52:53 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d167
    Dec 5 15:52:54 node09 corosync[5290]: notice [TOTEM ] A processor failed, forming new configuration.
    Dec 5 15:52:54 node09 corosync[5290]: [TOTEM ] A processor failed, forming new configuration.
    Dec 5 15:53:50 node09 pve-ha-crm[5801]: node 'node08': state changed from 'unknown' => 'fence'
    Dec 5 15:55:00 node09 pve-ha-crm[5801]: node 'node08': state changed from 'fence' => 'unknown'


    Node10:
    --------
    Dec 5 15:52:00 node10 systemd[1]: Starting Proxmox VE replication runner...
    Dec 5 15:52:01 node10 systemd[1]: Started Proxmox VE replication runner.
    Dec 5 15:52:54 node10 corosync[5295]: notice [TOTEM ] A processor failed, forming new configuration.
    Dec 5 15:52:54 node10 corosync[5295]: [TOTEM ] A processor failed, forming new configuration.
    Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584700 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6830 osd.2 since back 2018-12-05 15:52:36.458621 front 2018-12-05 15:52:36.458621 (cutoff 2018-12-05 15:52:36.584696)
    Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584717 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6817 osd.10 since back 2018-12-05 15:52:43.360369 front 2018-12-05 15:52:36.458621 (cutoff 2018-12-05 15:52:36.584696)
    Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584722 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6803 osd.11 since back 2018-12-05 15:52:47.961863 front 2018-12-05 15:52:36.458621 (cutoff 2018-12-05 15:52:36.584696)
    Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584726 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6828 osd.13 since back 2018-12-05 15:52:36.458621 front 2018-12-05 15:52:42.259818 (cutoff 2018-12-05 15:52:36.584696)
    Dec 5 15:52:56 node10 corosync[5295]: notice [TOTEM ] A new membership (10.15.15.9:10164) was formed. Members left: 1
    Dec 5 15:52:56 node10 corosync[5295]: notice [TOTEM ] Failed to receive the leave message. failed: 1
    Dec 5 15:52:56 node10 corosync[5295]: [TOTEM ] A new membership (10.15.15.9:10164) was formed. Members left: 1
    Dec 5 15:52:56 node10 corosync[5295]: [TOTEM ] Failed to receive the leave message. failed: 1
    Dec 5 15:52:56 node10 corosync[5295]: warning [CPG ] downlist left_list: 1 received

    Thanks a lot,
    Mario
     
  2. Alwin

    Alwin Proxmox Staff Member
    1. It was most likely fenced.
    2. It probably couldn't log anything anymore.
    3. Corosync (and Ceph, for that matter) needs low latency, and any other traffic interferes. Put corosync on its own physical network.
    4. Unicast produces way more traffic and eats way more resources than multicast. Check your switch load. Better use multicast.
    5. omping is able to check both multicast and unicast; see the example below.
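    For example, run something like this on all three nodes at the same time and compare the unicast and multicast loss/latency it reports (a sketch; adjust count and interval as needed and use your real node names or addresses):

    omping -c 600 -i 1 -q node08 node09 node10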
     
  3. spirit

    spirit Well-Known Member
    Proxmox VE Subscriber

    You can also try to increase the corosync token timeout (maybe your switch latency is too high for unicast and your number of nodes):

    /etc/pve/corosync.conf

    totem {
      ...
      token: 4000
    }

    for example
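    If you change it, also increase config_version in /etc/pve/corosync.conf so the new config is distributed to all nodes. Afterwards you can check which value corosync is actually running with, e.g. (a sketch; the exact key name may differ between corosync versions):

    corosync-cmapctl | grep totem.token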
     