[SOLVED] periodic Node Crash/freeze

Aug 21, 2018
Hello,

I run a couple of 3-node clusters, and for about half a year now I have had a problem with one of them.

The cluster has the nodes "node08", "node09" and "node10".

All of them seem to have the same problem: they crash after about two weeks, with the following in syslog. Then they just sit there, frozen, not reachable by network or tty login.

:~# ssh node10 -C "pveversion"
pve-manager/5.2-5/eb24855a (running kernel: 4.15.18-1-pve)
:~# ssh node09 -C "pveversion"
pve-manager/5.2-5/eb24855a (running kernel: 4.15.18-1-pve)
:~# ssh node08 -C "pveversion"
pve-manager/5.2-5/eb24855a (running kernel: 4.15.18-1-pve)

I am up to date, too. Here is the log:

Aug 21 02:30:33 node10 systemd[1866522]: Reached target Shutdown.
Aug 21 02:30:33 node10 systemd[1866522]: Starting Exit the Session...
Aug 21 02:30:33 node10 systemd[1866522]: Received SIGRTMIN+24 from PID 1866552 (kill).
Aug 21 02:30:33 node10 systemd[1]: Stopped User Manager for UID 0.
Aug 21 02:30:33 node10 systemd[1]: Removed slice User Slice of root.
Aug 21 02:31:00 node10 systemd[1]: Starting Proxmox VE replication runner...
Aug 21 02:31:01 node10 systemd[1]: Started Proxmox VE replication runner.
Aug 21 02:31:01 node10 CRON[1866775]: (root) CMD (/usr/bin/puppet agent -vt --color false --logdest /var/log/puppet/agent.log 1>/dev/null)
Aug 21 02:31:04 node10 systemd[1]: Created slice User Slice of root.
Aug 21 02:31:04 node10 systemd[1]: Starting User Manager for UID 0...
Aug 21 02:31:04 node10 systemd[1]: Started Session 31394 of user root.
Aug 21 02:31:04 node10 systemd[1866849]: Listening on GnuPG network certificate management daemon.
Aug 21 02:31:04 node10 systemd[1866849]: Reached target Timers.
Aug 21 02:31:04 node10 systemd[1866849]: Reached target Paths.
Aug 21 02:31:04 node10 systemd[1866849]: Listening on GnuPG cryptographic agent and passphrase cache.
Aug 21 02:31:04 node10 systemd[1866849]: Listening on GnuPG cryptographic agent (access for web browsers).
Aug 21 02:31:04 node10 systemd[1866849]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Aug 21 02:31:04 node10 systemd[1866849]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Aug 21 02:31:04 node10 systemd[1866849]: Reached target Sockets.
Aug 21 02:31:04 node10 systemd[1866849]: Reached target Basic System.
Aug 21 02:31:04 node10 systemd[1866849]: Reached target Default.
Aug 21 02:31:04 node10 systemd[1866849]: Startup finished in 31ms.
Aug 21 02:31:04 node10 systemd[1]: Started User Manager for UID 0.
Aug 21 02:31:04 node10 pmxcfs[5112]: [status] notice: received log
Aug 21 02:31:04 node10 systemd[1]: Stopping User Manager for UID 0...
Aug 21 02:31:04 node10 systemd[1866849]: Stopped target Default.
Aug 21 02:31:04 node10 systemd[1866849]: Stopped target Basic System.
Aug 21 02:31:04 node10 systemd[1866849]: Stopped target Paths.
Aug 21 02:31:04 node10 systemd[1866849]: Stopped target Timers.
Aug 21 02:31:04 node10 systemd[1866849]: Stopped target Sockets.
Aug 21 02:31:04 node10 systemd[1866849]: Closed GnuPG cryptographic agent (access for web browsers).
Aug 21 02:31:04 node10 systemd[1866849]: Closed GnuPG network certificate management daemon.
Aug 21 02:31:04 node10 systemd[1866849]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Aug 21 02:31:04 node10 systemd[1866849]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Aug 21 02:31:04 node10 systemd[1866849]: Closed GnuPG cryptographic agent and passphrase cache.
Aug 21 02:31:04 node10 systemd[1866849]: Reached target Shutdown.
Aug 21 02:31:04 node10 systemd[1866849]: Starting Exit the Session...
Aug 21 02:31:04 node10 systemd[1866849]: Received SIGRTMIN+24 from PID 1866991 (kill).
Aug 21 02:31:04 node10 systemd[1]: Stopped User Manager for UID 0.
Aug 21 02:31:04 node10 systemd[1]: Removed slice User Slice of root.
Aug 21 02:31:51 node10 pmxcfs[5112]: [status] notice: received log
Aug 21 02:32:00 node10 systemd[1]: Starting Proxmox VE replication runner...
Aug 21 02:32:01 node10 CRON[1868645]: (root) CMD (/usr/bin/puppet agent -vt --color false --logdest /var/log/puppet/agent.log 1>/dev/null)
Aug 21 02:32:01 node10 systemd[1]: Started Proxmox VE replication runner.
Aug 21 02:33:00 node10 systemd[1]: Starting Proxmox VE replication runner...
Aug 21 02:33:01 node10 systemd[1]: Started Proxmox VE replication runner.
Aug 21 02:33:01 node10 CRON[1870491]: (root) CMD (/usr/bin/puppet agent -vt --color false --logdest /var/log/puppet/agent.log 1>/dev/null)
^@^@^@^@^@^@^@^@ [... the rest of the log is NUL bytes ...]
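Side note for anyone reading the log above: long runs of NUL (^@) bytes at the end of syslog usually mean the node died while the log was being written and the filesystem padded the file. To read whatever actually made it to disk, the NULs can be stripped first. A small sketch (the temp file below just stands in for /var/log/syslog):

```shell
# On an affected node you would run:  tr -d '\0' < /var/log/syslog | tail
# Demo of the same pipeline on a throwaway sample file:
f=$(mktemp)
printf 'Aug 21 02:33:01 node10 CRON[1870491]: last line before the freeze\n\0\0\0\0' > "$f"
tr -d '\0' < "$f" | tail -n 1   # prints the last real log line
rm -f "$f"
```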


Any idea about this?
 
Seems like something initiated a poweroff.

I get this a lot:

root@node09:~ # grep "Reached target Shutdown." /var/log/syslog
...
Aug 21 09:25:17 node09 systemd[1503148]: Reached target Shutdown.
Aug 21 09:25:18 node09 systemd[1503202]: Reached target Shutdown.
Aug 21 09:27:52 node09 systemd[1504300]: Reached target Shutdown.
Aug 21 09:27:53 node09 systemd[1504492]: Reached target Shutdown.
Aug 21 09:27:55 node09 systemd[1504559]: Reached target Shutdown.
Aug 21 09:29:55 node09 systemd[1505519]: Reached target Shutdown.
Aug 21 09:30:17 node09 systemd[1505784]: Reached target Shutdown.
...
Aug 21 09:53:35 node09 systemd[1516404]: Reached target Shutdown.

This happens without the nodes actually rebooting. I googled it; apparently: "The log message does not indicate that there was a system shutdown or reboot. This message can be ignored."
 

dcsapak

Proxmox Staff Member
Yes, I see it comes from your puppet cron job. Maybe there is something wrong there (since it is the last entry in the log)?
 
I removed puppet from the three nodes, and just now one node got stuck again (!!!):


Aug 22 16:11:12 node08 corosync[5428]: notice [TOTEM ] Retransmit List: 4b5fc 4b5fd 4b5fe 4b5ff 4b600 4b601 4b602 4b604
[... the same Retransmit List line repeated another 27 times ...]
Aug 22 16:11:12 node08 corosync[5428]: notice [TOTEM ] A new membership (10.15.15.8:9000) was formed. Members left: 3
Aug 22 16:11:12 node08 corosync[5428]: notice [TOTEM ] Failed to receive the leave message. failed: 3
Aug 22 16:11:12 node08 corosync[5428]: warning [CPG ] downlist left_list: 1 received
Aug 22 16:11:12 node08 corosync[5428]: warning [CPG ] downlist left_list: 1 received
Aug 22 16:11:12 node08 pmxcfs[5227]: [dcdb] notice: members: 1/5227, 2/5058
Aug 22 16:11:12 node08 pmxcfs[5227]: [dcdb] notice: starting data syncronisation
Aug 22 16:11:12 node08 corosync[5428]: notice [QUORUM] Members[2]: 1 2
Aug 22 16:11:12 node08 corosync[5428]: notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 22 16:11:12 node08 pmxcfs[5227]: [dcdb] notice: cpg_send_message retried 1 times
Aug 22 16:11:12 node08 pmxcfs[5227]: [status] notice: members: 1/5227, 2/5058
Aug 22 16:11:12 node08 pmxcfs[5227]: [status] notice: starting data syncronisation
^@^@^@^@^@^@^@^@ [... the rest of the log is NUL bytes ...]

Only a power reset helped from here.
 

dcsapak

Proxmox Staff Member
Those messages indicate a problem with multicast and/or the network. Check with omping whether multicast works correctly and that your network is not overloaded/faulty.
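For example, something along these lines (a sketch using this thread's hostnames; start it on all three nodes at roughly the same time, and omping comes from the 'omping' package):

```shell
# Short burst test (~10 seconds):
omping -c 10000 -i 0.001 -F -q node08 node09 node10
# Longer test (~10 minutes) that also catches IGMP snooping timeouts:
omping -c 600 -i 1 -q node08 node09 node10
```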
 
Hello dcsapak,

thanks for your reply.

Those messages indicate a problem with multicast and/or the network. Check with omping whether multicast works correctly and that your network is not overloaded/faulty.

I use unicast. All three nodes are connected directly to each other, set up using the guide here: https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server


/etc/network/interfaces (from node08)

auto lo
iface lo inet loopback

allow-hotplug eth2
iface eth2 inet static
    address 192.168.41.157
    netmask 255.255.255.0
    gateway 192.168.41.5
    broadcast 192.168.41.255
    network 192.168.41.0
    dns-nameservers 192.168.41.5
    dns-search cluster3
    # dns-* options are implemented by the resolvconf package, if installed

auto eth4
iface eth4 inet static
    address 10.15.15.8
    netmask 255.255.255.0
    up route add -net 10.15.15.10 netmask 255.255.255.255 dev eth4
    down route del -net 10.15.15.10 netmask 255.255.255.255 dev eth4
    mtu 9000

auto eth5
iface eth5 inet static
    address 10.15.15.8
    netmask 255.255.255.0
    up route add -net 10.15.15.9 netmask 255.255.255.255 dev eth5
    down route del -net 10.15.15.9 netmask 255.255.255.255 dev eth5
    mtu 9000



/etc/pve/corosync.conf (from node08)

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node08
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node08
  }

  node {
    name: node10
    nodeid: 3
    quorum_votes: 1
    ring0_addr: node10
  }

  node {
    name: node09
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node09
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster3
  config_version: 6
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2

  interface {
    bindnetaddr: 10.15.15.8
    ringnumber: 0
  }
}
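In case it helps: with transport: udpu set, one way to double-check that the running cluster actually uses unicast is corosync-cmapctl (a sketch; the sample output line below is what I would expect, not captured from these nodes):

```shell
# On a live node:  corosync-cmapctl -g totem.transport
# which should print a line like:  totem.transport (str) = udpu
# Parsing such a line the way a quick check script might:
sample='totem.transport (str) = udpu'
echo "${sample##*= }"   # prints: udpu
```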
 

dcsapak

Proxmox Staff Member
My point still stands: check that the network works and is not overloaded.
 
Hello dcsapak,

My network seems fine. I did a 268-hour test with a ping flood:


root@node09:~# ping -f -i 0.2 node10
PING node10.cluster3.stuttgart.local (10.15.15.10) 56(84) bytes of data.
--- node10.cluster3.stuttgart.local ping statistics ---
4826063 packets transmitted, 4826063 received, 0% packet loss, time 965906293ms
rtt min/avg/max/mdev = 0.045/0.115/2.073/0.028 ms, ipg/ewma 200.143/0.113 ms


root@node09:~# ping -f -i 0.2 node08
PING node08.cluster3.stuttgart.local (10.15.15.8) 56(84) bytes of data.
--- node08.cluster3.stuttgart.local ping statistics ---
4824948 packets transmitted, 4824948 received, 0% packet loss, time 965927920ms
rtt min/avg/max/mdev = 0.044/0.114/1.984/0.025 ms, ipg/ewma 200.194/0.113 ms


I also went back to kernel 4.4.98-6-pve (I was on 4.15.18 before). No issues since then.
 
It just crashed again:

Sep 7 14:01:03 node09 systemd[1309404]: Stopped target Paths.
Sep 7 14:01:03 node09 systemd[1309404]: Stopped target Sockets.
Sep 7 14:01:03 node09 systemd[1309404]: Reached target Shutdown.
Sep 7 14:01:03 node09 systemd[1309404]: Starting Exit the Session...
Sep 7 14:01:03 node09 systemd[1309404]: Received SIGRTMIN+24 from PID 1309416 (kill).
Sep 7 14:01:03 node09 systemd[1]: Stopped User Manager for UID 0.
Sep 7 14:01:03 node09 systemd[1]: Removed slice User Slice of root.
Sep 7 14:01:08 node09 systemd[1]: Created slice User Slice of root.
Sep 7 14:01:08 node09 systemd[1]: Starting User Manager for UID 0...
Sep 7 14:01:08 node09 systemd[1]: Started Session 32652 of user root.
Sep 7 14:01:08 node09 systemd[1309452]: Reached target Paths.
Sep 7 14:01:08 node09 systemd[1309452]: Reached target Timers.
Sep 7 14:01:08 node09 systemd[1309452]: Reached target Sockets.
Sep 7 14:01:08 node09 systemd[1309452]: Reached target Basic System.
Sep 7 14:01:08 node09 systemd[1309452]: Reached target Default.
Sep 7 14:01:08 node09 systemd[1309452]: Startup finished in 32ms.
Sep 7 14:01:08 node09 systemd[1]: Started User Manager for UID 0.
Sep 7 14:01:09 node09 systemd[1]: Stopping User Manager for UID 0...
Sep 7 14:01:09 node09 systemd[1309452]: Stopped target Default.
Sep 7 14:01:09 node09 systemd[1309452]: Stopped target Basic System.
Sep 7 14:01:09 node09 systemd[1309452]: Stopped target Timers.
Sep 7 14:01:09 node09 systemd[1309452]: Stopped target Sockets.
Sep 7 14:01:09 node09 systemd[1309452]: Stopped target Paths.
Sep 7 14:01:09 node09 systemd[1309452]: Reached target Shutdown.
Sep 7 14:01:09 node09 systemd[1309452]: Starting Exit the Session...
Sep 7 14:01:09 node09 systemd[1309452]: Received SIGRTMIN+24 from PID 1309477 (kill).
Sep 7 14:01:09 node09 systemd[1]: Stopped User Manager for UID 0.
Sep 7 14:01:09 node09 systemd[1]: Removed slice User Slice of root.
Sep 7 14:02:00 node09 systemd[1]: Starting Proxmox VE replication runner...
Sep 7 14:02:00 node09 systemd[1]: Started Proxmox VE replication runner.
Sep 7 14:03:00 node09 systemd[1]: Starting Proxmox VE replication runner...
Sep 7 14:03:00 node09 systemd[1]: Started Proxmox VE replication runner.
Sep 7 14:04:00 node09 systemd[1]: Starting Proxmox VE replication runner...
Sep 7 14:04:01 node09 systemd[1]: Started Proxmox VE replication runner.
^@^@^@^@^@^@^@^@ [... the rest of the log is NUL bytes ...]
 
Hello dcsapak,

Thanks a lot for your help. It turned out to be a kernel Intel NIC driver problem:
I once saw an error indicating a network driver issue, and since I switched to some old Mellanox ConnectX-3 NICs the problems have disappeared.

The Supermicro hardware is quite common, with its onboard dual 10GBit Intel NICs:

03:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
Subsystem: Super Micro Computer Inc Ethernet Controller 10-Gigabit X540-AT2
Physical Slot: 0
Flags: bus master, fast devsel, latency 0, IRQ 71, NUMA node 0
Memory at c5a00000 (64-bit, prefetchable) [size=2M]
I/O ports at 5020
Memory at c5c04000 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
Capabilities: [a0] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
Capabilities: [1d0] Access Control Services
Kernel driver in use: ixgbe
Kernel modules: ixgbe
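For anyone with the same NIC: before the freeze, the ixgbe driver will often log "Detected Tx Unit Hang" to the kernel log, so that is worth grepping for (a sketch; the sample line below is fabricated for illustration):

```shell
# On an affected node:  dmesg -T | grep -i 'tx unit hang'
# Demo of the same grep on a fabricated sample line:
sample='ixgbe 0000:03:00.0 eth4: Detected Tx Unit Hang'
printf '%s\n' "$sample" | grep -ci 'tx unit hang'   # prints: 1
```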

I wonder if anyone else has had the same issues?

How can I mark this thread as "Solved"?
 

zorg

New Member
Hello, I've got the same problem: random reboots of my 4 nodes (looks like I'm not the only one).

There is nothing special in the logs, but I also have this NIC: Ethernet controller: Intel Corporation 82598EB 10-Gigabit AF Dual Port Network Connection.

So I wonder if someone has a clue.

thanks


proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-5 (running version: 5.3-5/97ae681d)
pve-kernel-4.15: 5.2-12
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.17-3-pve: 4.15.17-14
ceph: 12.2.10-1~bpo90+1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-33
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-5
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-31
pve-container: 2.0-31
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-43
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
 
