Hello List,
I have a 3-node cluster with Ceph.
The nodes are connected to each other via unicast.
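The totem section of my corosync.conf looks roughly like this (a trimmed excerpt just to show the transport; "transport: udpu" is the unicast UDP transport, and the ring runs on the 10.15.15.0/24 net you'll see in the logs below):

totem {
    version: 2
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 10.15.15.0
    }
}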
node08: pve-manager/5.3-5/97ae681d (running kernel: 4.15.18-9-pve)
node09: pve-manager/5.2-9/4b30e8f9 (running kernel: 4.15.18-7-pve)
node10: pve-manager/5.2-10/6f892b40 (running kernel: 4.15.18-7-pve)
These nodes have been "freezing" from time to time since 29.07.2018:
29.07.2018 | node09
07.08.2018 | node08
21.08.2018 | node10
22.08.2018 | node08
07.09.2018 | node09
11.09.2018 | node08
13.09.2018 | node10
19.09.2018 | node08
26.09.2018 | node09
28.09.2018 | node08
16.10.2018 | node09
05.12.2018 | node08
I have the EXACT same 3-node hardware running Proxmox 4.4, and it does not have these problems.
Until now, I thought it was a kernel/NIC driver problem. My thread is here:
https://forum.proxmox.com/threads/periodic-node-crash-freeze.46407/
However, I seem to have quite a few corosync/totem log entries like this:
node09 corosync[5290]: [TOTEM ] Retransmit List: 18f46c9
node09 corosync[5290]: notice [TOTEM ] Retransmit List: 17ae461
node09 corosync[5290]: [TOTEM ] Retransmit List: 17ae461
node09 corosync[5290]: notice [TOTEM ] Retransmit List: 168516b
node09 corosync[5290]: [TOTEM ] Retransmit List: 168516b
root@node08:~ # zgrep -c '\[TOTEM \] Retransmit List' /var/log/syslog*
/var/log/syslog:14
/var/log/syslog.1:6
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:2
/var/log/syslog.4.gz:12
/var/log/syslog.5.gz:2
/var/log/syslog.6.gz:0
/var/log/syslog.7.gz:26
root@node09:~ # zgrep -c '\[TOTEM \] Retransmit List' /var/log/syslog*
/var/log/syslog:0
/var/log/syslog.1:48
/var/log/syslog.2.gz:32
/var/log/syslog.3.gz:4
/var/log/syslog.4.gz:14
/var/log/syslog.5.gz:2
/var/log/syslog.6.gz:2
/var/log/syslog.7.gz:0
root@node10:~ # zgrep -c '\[TOTEM \] Retransmit List' /var/log/syslog*
/var/log/syslog:2
/var/log/syslog.1:84
/var/log/syslog.2.gz:8
/var/log/syslog.3.gz:70
/var/log/syslog.4.gz:14
/var/log/syslog.5.gz:4
/var/log/syslog.6.gz:10
/var/log/syslog.7.gz:16
My healthy cluster still running Proxmox 4.4 does not have these retransmits. But maybe that is just because the load is not the same?
Now for my real questions:
1.) Did node08 die on its own, or did it freeze AFTER some fencing magic happened?
2.) Why does node08 not log any totem/corosync retransmit entries?
3.) My corosync heartbeats run over the SAN/Ceph network. That's okay, right?
4.) Is this some weird unicast problem? Would multicast fix it?
5.) Can I test/benchmark my UDP unicast somehow, e.g. with omping? (My attempt is sketched below.)
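For question 5, this is what I had in mind, run at the same time on all three nodes (the invocation from the Proxmox multicast notes; assuming omping is installed on each node):

root@node08:~ # omping -c 10000 -i 0.001 -F -q node08 node09 node10

As far as I can tell, omping reports packet loss separately for unicast and multicast, so one run should cover both questions 4 and 5.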
Here are the logs from the last crash/freeze of node08:
------------------------------------------------------------
Node08:
---------
Dec 5 15:52:31 node08 systemd[477301]: Reached target Default.
Dec 5 15:52:31 node08 systemd[477301]: Startup finished in 32ms.
Dec 5 15:52:31 node08 systemd[1]: Started User Manager for UID 0.
Dec 5 15:52:31 node08 systemd[1]: Stopping User Manager for UID 0...
Dec 5 15:52:31 node08 systemd[477301]: Stopped target Default.
Dec 5 15:52:31 node08 systemd[477301]: Stopped target Basic System.
Dec 5 15:52:31 node08 systemd[477301]: Stopped target Paths.
Dec 5 15:52:31 node08 systemd[477301]: Stopped target Timers.
Dec 5 15:52:31 node08 systemd[477301]: Stopped target Sockets.
Dec 5 15:52:31 node08 systemd[477301]: Reached target Shutdown.
Dec 5 15:52:31 node08 systemd[477301]: Starting Exit the Session...
Dec 5 15:52:31 node08 systemd[477301]: Received SIGRTMIN+24 from PID 477353 (kill).
Dec 5 15:52:31 node08 systemd[1]: Stopped User Manager for UID 0.
Dec 5 15:52:31 node08 systemd[1]: Removed slice User Slice of root.
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^...
Node09:
---------
Dec 5 15:52:30 node09 systemd[1256497]: Received SIGRTMIN+24 from PID 1256530 (kill).
Dec 5 15:52:30 node09 systemd[1]: Stopped User Manager for UID 0.
Dec 5 15:52:30 node09 systemd[1]: Removed slice User Slice of root.
Dec 5 15:52:41 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d10b
Dec 5 15:52:41 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d10b
Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d10d
Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d10d
Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d110
Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d111
Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d110
Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d111
Dec 5 15:52:42 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d112
Dec 5 15:52:42 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d112
Dec 5 15:52:43 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d113
Dec 5 15:52:43 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d113
Dec 5 15:52:44 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d115 1b7d116 1b7d117 1b7d118 1b7d119 1b7d11a 1b7d11b
Dec 5 15:52:44 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d115 1b7d116 1b7d117 1b7d118 1b7d119 1b7d11a 1b7d11b
Dec 5 15:52:44 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d11d
...SNIP...
Dec 5 15:52:53 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d165
Dec 5 15:52:53 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d166
Dec 5 15:52:53 node09 corosync[5290]: [TOTEM ] Retransmit List: 1b7d167
Dec 5 15:52:53 node09 corosync[5290]: notice [TOTEM ] Retransmit List: 1b7d167
Dec 5 15:52:54 node09 corosync[5290]: notice [TOTEM ] A processor failed, forming new configuration.
Dec 5 15:52:54 node09 corosync[5290]: [TOTEM ] A processor failed, forming new configuration.
Dec 5 15:53:50 node09 pve-ha-crm[5801]: node 'node08': state changed from 'unknown' => 'fence'
Dec 5 15:55:00 node09 pve-ha-crm[5801]: node 'node08': state changed from 'fence' => 'unknown'
Node10:
--------
Dec 5 15:52:00 node10 systemd[1]: Starting Proxmox VE replication runner...
Dec 5 15:52:01 node10 systemd[1]: Started Proxmox VE replication runner.
Dec 5 15:52:54 node10 corosync[5295]: notice [TOTEM ] A processor failed, forming new configuration.
Dec 5 15:52:54 node10 corosync[5295]: [TOTEM ] A processor failed, forming new configuration.
Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584700 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6830 osd.2 since back 2018-12-05 15:52:36.458621 front 2018-12-05 15:52:36.458621 (cutoff 2018-12-05 15:52:36.584696)
Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584717 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6817 osd.10 since back 2018-12-05 15:52:43.360369 front 2018-12-05 15:52:36.458621 (cutoff 2018-12-05 15:52:36.584696)
Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584722 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6803 osd.11 since back 2018-12-05 15:52:47.961863 front 2018-12-05 15:52:36.458621 (cutoff 2018-12-05 15:52:36.584696)
Dec 5 15:52:56 node10 ceph-osd[6256]: 2018-12-05 15:52:56.584726 7f9a2d368700 -1 osd.9 12233 heartbeat_check: no reply from 10.15.15.8:6828 osd.13 since back 2018-12-05 15:52:36.458621 front 2018-12-05 15:52:42.259818 (cutoff 2018-12-05 15:52:36.584696)
Dec 5 15:52:56 node10 corosync[5295]: notice [TOTEM ] A new membership (10.15.15.9:10164) was formed. Members left: 1
Dec 5 15:52:56 node10 corosync[5295]: notice [TOTEM ] Failed to receive the leave message. failed: 1
Dec 5 15:52:56 node10 corosync[5295]: [TOTEM ] A new membership (10.15.15.9:10164) was formed. Members left: 1
Dec 5 15:52:56 node10 corosync[5295]: [TOTEM ] Failed to receive the leave message. failed: 1
Dec 5 15:52:56 node10 corosync[5295]: warning [CPG ] downlist left_list: 1 received
Thanks a lot,
Mario