[SOLVED] DL360 Reboot every once in a while

b345

Feb 6, 2024
I have a cluster of three nodes: two DL360p servers and one Dell OptiPlex 9020 PC.
Every now and then, probably about twice a day, one of the DL360s just reboots. I'm not sure why, and I don't see anything useful in the logs. The drives and power supplies all look fine. Has anyone experienced this issue?

P.S. backupserver1 is my Proxmox Backup Server, which I don't leave turned on all the time; I only power it on when I'm doing a backup.

See the attached logs.
 


One of the nodes rebooted again. Here are the logs from just before the host initiated the reboot:



Feb 06 05:09:39 server2 corosync[1689]: [KNET ] link: host: 3 link: 0 is down

Feb 06 05:09:39 server2 corosync[1689]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)

Feb 06 05:09:39 server2 corosync[1689]: [KNET ] host: host: 3 has no active links

Feb 06 05:09:39 server2 corosync[1689]: [KNET ] link: Resetting MTU for link 0 because host 3 joined

Feb 06 05:09:39 server2 corosync[1689]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)

Feb 06 05:09:39 server2 corosync[1689]: [KNET ] pmtud: Global data MTU changed to: 1397

Feb 06 05:17:01 server2 CRON[363953]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)

Feb 06 05:17:01 server2 CRON[363954]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)

Feb 06 05:17:01 server2 CRON[363953]: pam_unix(cron:session): session closed for user root

Feb 06 05:18:46 server2 corosync[1689]: [KNET ] link: host: 3 link: 0 is down

Feb 06 05:18:46 server2 corosync[1689]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)

Feb 06 05:18:46 server2 corosync[1689]: [KNET ] host: host: 3 has no active links

Feb 06 05:18:48 server2 corosync[1689]: [KNET ] rx: host: 3 link: 0 is up

Feb 06 05:18:48 server2 corosync[1689]: [KNET ] link: Resetting MTU for link 0 because host 3 joined

Feb 06 05:18:48 server2 corosync[1689]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)

Feb 06 05:18:48 server2 corosync[1689]: [KNET ] pmtud: Global data MTU changed to: 1397

Feb 06 05:24:43 server2 pmxcfs[1606]: [dcdb] notice: data verification successful

Feb 06 05:27:56 server2 corosync[1689]: [TOTEM ] Retransmit List: 2849d

Feb 06 05:28:22 server2 corosync[1689]: [TOTEM ] Retransmit List: 2855f

Feb 06 05:31:40 server2 pmxcfs[1606]: [status] notice: received log

Feb 06 05:31:46 server2 pmxcfs[1606]: [status] notice: received log

Feb 06 05:31:46 server2 pmxcfs[1606]: [status] notice: received log

Feb 06 05:31:50 server2 pmxcfs[1606]: [status] notice: received log

Feb 06 05:31:50 server2 pmxcfs[1606]: [status] notice: received log

Feb 06 05:31:50 server2 pmxcfs[1606]: [status] notice: received log

Feb 06 05:31:51 server2 sshd[376951]: Accepted publickey for root from 10.10.12.2 port 36916 ssh2: RSA SHA256:xxxxxxxxxxxxxxxx

Feb 06 05:31:51 server2 sshd[376951]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)

Feb 06 05:31:51 server2 systemd-logind[1206]: New session 43 of user root.

Feb 06 05:31:51 server2 systemd[1]: Created slice user-0.slice - User Slice of UID 0.

Feb 06 05:31:51 server2 systemd[1]: Starting user-runtime-dir@0.service - User Runtime Directory /run/user/0...

Feb 06 05:31:51 server2 systemd[1]: Finished user-runtime-dir@0.service - User Runtime Directory /run/user/0.

Feb 06 05:31:51 server2 systemd[1]: Starting user@0.service - User Manager for UID 0...

Feb 06 05:31:51 server2 (systemd)[376955]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0)

Feb 06 05:31:51 server2 systemd[376955]: Queued start job for default target default.target.

Feb 06 05:31:51 server2 systemd[376955]: Created slice app.slice - User Application Slice.

Feb 06 05:31:51 server2 systemd[376955]: Reached target paths.target - Paths.

Feb 06 05:31:51 server2 systemd[376955]: Reached target timers.target - Timers.

Feb 06 05:31:51 server2 systemd[376955]: Listening on dirmngr.socket - GnuPG network certificate management daemon.

Feb 06 05:31:51 server2 systemd[376955]: Listening on gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).

Feb 06 05:31:51 server2 systemd[376955]: Listening on gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).

Feb 06 05:31:51 server2 systemd[376955]: Listening on gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).

Feb 06 05:31:51 server2 systemd[376955]: Listening on gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.

Feb 06 05:31:51 server2 systemd[376955]: Reached target sockets.target - Sockets.

Feb 06 05:31:51 server2 systemd[376955]: Reached target basic.target - Basic System.

Feb 06 05:31:51 server2 systemd[376955]: Reached target default.target - Main User Target.

Feb 06 05:31:51 server2 systemd[376955]: Startup finished in 432ms.

Feb 06 05:31:51 server2 systemd[1]: Started user@0.service - User Manager for UID 0.

Feb 06 05:31:51 server2 systemd[1]: Started session-43.scope - Session 43 of User root.

Feb 06 05:31:51 server2 sshd[376951]: pam_env(sshd:session): deprecated reading of user environment enabled

Feb 06 05:31:51 server2 login[376975]: pam_unix(login:session): session opened for user root(uid=0) by root(uid=0)

Feb 06 05:31:51 server2 login[376980]: ROOT LOGIN on '/dev/pts/0' from '10.10.12.2'

Feb 06 05:32:05 server2 sshd[376951]: Received disconnect from 10.10.12.2 port 36916:11: disconnected by user

Feb 06 05:32:05 server2 sshd[376951]: Disconnected from user root 10.10.12.2 port 36916

Feb 06 05:32:05 server2 sshd[376951]: pam_unix(sshd:session): session closed for user root

Feb 06 05:32:05 server2 systemd-logind[1206]: Session 43 logged out. Waiting for processes to exit.

Feb 06 05:32:05 server2 systemd[1]: session-43.scope: Deactivated successfully.

Feb 06 05:32:05 server2 systemd-logind[1206]: Removed session 43.

Feb 06 05:32:05 server2 pmxcfs[1606]: [status] notice: received log

Feb 06 05:32:05 server2 pmxcfs[1606]: [status] notice: received log

Feb 06 05:32:05 server2 pmxcfs[1606]: [status] notice: received log

Feb 06 05:32:09 server2 pmxcfs[1606]: [status] notice: received log

Feb 06 05:32:15 server2 systemd[1]: Stopping user@0.service - User Manager for UID 0...

Feb 06 05:32:15 server2 systemd[376955]: Activating special unit exit.target...

Feb 06 05:32:15 server2 systemd[376955]: Stopped target default.target - Main User Target.

Feb 06 05:32:15 server2 systemd[376955]: Stopped target basic.target - Basic System.

Feb 06 05:32:15 server2 systemd[376955]: Stopped target paths.target - Paths.

Feb 06 05:32:15 server2 systemd[376955]: Stopped target sockets.target - Sockets.

Feb 06 05:32:15 server2 systemd[376955]: Stopped target timers.target - Timers.

Feb 06 05:32:15 server2 systemd[376955]: Closed dirmngr.socket - GnuPG network certificate management daemon.

Feb 06 05:32:15 server2 systemd[376955]: Closed gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).

Feb 06 05:32:15 server2 systemd[376955]: Closed gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).

Feb 06 05:32:15 server2 systemd[376955]: Closed gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).

Feb 06 05:32:15 server2 systemd[376955]: Closed gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.

Feb 06 05:32:15 server2 systemd[376955]: Removed slice app.slice - User Application Slice.

Feb 06 05:32:15 server2 systemd[376955]: Reached target shutdown.target - Shutdown.

Feb 06 05:32:15 server2 systemd[376955]: Finished systemd-exit.service - Exit the Session.

Feb 06 05:32:15 server2 systemd[376955]: Reached target exit.target - Exit the Session.

Feb 06 05:32:15 server2 systemd[1]: user@0.service: Deactivated successfully.

Feb 06 05:32:15 server2 systemd[1]: Stopped user@0.service - User Manager for UID 0.

Feb 06 05:32:15 server2 systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...

Feb 06 05:32:15 server2 systemd[1]: run-user-0.mount: Deactivated successfully.

Feb 06 05:32:15 server2 systemd[1]: user-runtime-dir@0.service: Deactivated successfully.

Feb 06 05:32:15 server2 systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.

Feb 06 05:32:15 server2 systemd[1]: Removed slice user-0.slice - User Slice of UID 0.

Feb 06 05:33:07 server2 postfix/qmgr[1673]: 8D8AC247A8: from=<root@server1.test.com>, size=32993, nrcpt=1 (queue active)

Feb 06 05:33:08 server2 postfix/smtp[378042]: 8D8AC247A8: replace: header From: vzdump backup tool <root@server1.test.com>: From: server2 info@test.com

Feb 06 05:33:08 server2 postfix/smtp[378042]: 8D8AC247A8: to=<info@test.com>, relay=smtp.gmail.com[142.250.105.108]:587, delay=4692, delays=4691/0.05/0.79/0.83, dsn=2.0.0, status=sent (250 2.0.0 OK 1707215588 l191-20020a8157c8000000b006040a5496adsm234439ywb.145 - gsmtp)

Feb 06 05:33:08 server2 postfix/qmgr[1673]: 8D8AC247A8: removed

Feb 06 05:34:01 server2 corosync[1689]: [TOTEM ] Retransmit List: 28e31

Feb 06 05:38:29 server2 kernel: perf: interrupt took too long (5026 > 4992), lowering kernel.perf_event_max_sample_rate to 39750

Feb 06 05:40:51 server2 corosync[1689]: [TOTEM ] Retransmit List: 298ce

Feb 06 05:46:06 server2 pmxcfs[1606]: [status] notice: received log

Feb 06 05:55:02 server2 corosync[1689]: [TOTEM ] Retransmit List: 2aede

Feb 06 05:59:14 server2 corosync[1689]: [TOTEM ] Retransmit List: 2b56e

Feb 06 06:01:06 server2 pmxcfs[1606]: [status] notice: received log

Feb 06 06:03:47 server2 corosync[1689]: [TOTEM ] Retransmit List: 2bc87

Feb 06 06:07:36 server2 corosync[1689]: [TOTEM ] Retransmit List: 2c28e

Feb 06 06:09:26 server2 corosync[1689]: [TOTEM ] Retransmit List: 2c56f

-- Reboot --
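For anyone triaging a similar log, here is a rough sketch of how the corosync noise above could be tallied instead of eyeballed. It only counts the three message types that appear in this log (KNET link down, KNET link up, TOTEM retransmit); it does not diagnose anything. The function name `summarize_corosync` and the `SAMPLE_LOG` constant are made up for this example, and the regexes are written against the exact wording shown above, so they may need adjusting for other corosync versions.

```python
import re
from collections import Counter

# Patterns matched against the corosync wording seen in the log above.
# These are assumptions based on that log only, not an official format.
PATTERNS = {
    "link_down": re.compile(r"corosync\[\d+\]:\s+\[KNET\s*\] link: host: \d+ link: \d+ is down"),
    "link_up": re.compile(r"corosync\[\d+\]:\s+\[KNET\s*\] rx: host: \d+ link: \d+ is up"),
    "retransmit": re.compile(r"corosync\[\d+\]:\s+\[TOTEM\s*\] Retransmit List:"),
}


def summarize_corosync(lines):
    """Count link-down, link-up and retransmit messages in journal lines."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break  # each line is counted under at most one category
    return counts


# A few lines copied from the log above, used as sample input.
SAMPLE_LOG = """\
Feb 06 05:18:46 server2 corosync[1689]: [KNET ] link: host: 3 link: 0 is down
Feb 06 05:18:46 server2 corosync[1689]: [KNET ] host: host: 3 has no active links
Feb 06 05:18:48 server2 corosync[1689]: [KNET ] rx: host: 3 link: 0 is up
Feb 06 05:27:56 server2 corosync[1689]: [TOTEM ] Retransmit List: 2849d
Feb 06 05:28:22 server2 corosync[1689]: [TOTEM ] Retransmit List: 2855f
"""

if __name__ == "__main__":
    for name, count in summarize_corosync(SAMPLE_LOG.splitlines()).items():
        print(f"{name}: {count}")
```

On a real node the input would more likely come from something like `journalctl -b -1 -u corosync` (previous boot, which needs a persistent journal) piped into the script, rather than a pasted string.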
 
It seems the issue was related to Ceph. I'm not sure what the actual fix was, but once I reverted to plain ZFS storage without Ceph, I have not had a single reboot for weeks.