Proxmox cluster node02 unexpected reboot

quickmo

Jan 30, 2024

Hello everyone,

maybe someone here can help me: I have a Proxmox cluster with 3 nodes, and one of them (node02) has now had an unexpected reboot for the second time in 3 months:

Code:
Jan 30 03:01:21 node02 sshd[2651334]: Accepted publickey for root from 192.168.100.10 port 59442 ssh2: RSA SHA256:htNllc9TyDIY0JCn3OAsmE7vIsgw/hwb1vD3DQGSMM4
Jan 30 03:01:21 node02 sshd[2651334]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jan 30 03:01:21 node02 systemd-logind[1809]: New session 42818 of user root.
Jan 30 03:01:21 node02 systemd[1]: Started session-42818.scope - Session 42818 of User root.
Jan 30 03:01:21 node02 sshd[2651334]: pam_env(sshd:session): deprecated reading of user environment enabled
Jan 30 03:01:23 node02 sshd[2651334]: Received disconnect from 192.168.100.10 port 59442:11: disconnected by user
Jan 30 03:01:23 node02 sshd[2651334]: Disconnected from user root 192.168.100.10 port 59442
Jan 30 03:01:23 node02 sshd[2651334]: pam_unix(sshd:session): session closed for user root
Jan 30 03:01:23 node02 systemd[1]: session-42818.scope: Deactivated successfully.
Jan 30 03:01:23 node02 systemd-logind[1809]: Session 42818 logged out. Waiting for processes to exit.
Jan 30 03:01:23 node02 systemd-logind[1809]: Removed session 42818.
Jan 30 03:01:23 node02 sshd[2651394]: Accepted publickey for root from 192.168.100.10 port 59458 ssh2: RSA SHA256:htNllc9TyDIY0JCn3OAsmE7vIsgw/hwb1vD3DQGSMM4
Jan 30 03:01:23 node02 sshd[2651394]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jan 30 03:01:23 node02 systemd-logind[1809]: New session 42819 of user root.
Jan 30 03:01:23 node02 systemd[1]: Started session-42819.scope - Session 42819 of User root.
Jan 30 03:01:23 node02 sshd[2651394]: pam_env(sshd:session): deprecated reading of user environment enabled
Jan 30 03:01:25 node02 sshd[2651394]: Received disconnect from 192.168.100.10 port 59458:11: disconnected by user
Jan 30 03:01:25 node02 sshd[2651394]: Disconnected from user root 192.168.100.10 port 59458
Jan 30 03:01:25 node02 sshd[2651394]: pam_unix(sshd:session): session closed for user root
Jan 30 03:01:25 node02 systemd[1]: session-42819.scope: Deactivated successfully.
Jan 30 03:01:25 node02 systemd[1]: session-42819.scope: Consumed 1.189s CPU time.
Jan 30 03:01:25 node02 systemd-logind[1809]: Session 42819 logged out. Waiting for processes to exit.
Jan 30 03:01:25 node02 systemd-logind[1809]: Removed session 42819.
Jan 30 03:01:35 node02 systemd[1]: Stopping user@0.service - User Manager for UID 0...
Jan 30 03:01:35 node02 systemd[2651017]: Activating special unit exit.target...
Jan 30 03:01:35 node02 systemd[2651017]: Stopped target default.target - Main User Target.
Jan 30 03:01:35 node02 systemd[2651017]: Stopped target basic.target - Basic System.
Jan 30 03:01:35 node02 systemd[2651017]: Stopped target paths.target - Paths.
Jan 30 03:01:35 node02 systemd[2651017]: Stopped target sockets.target - Sockets.
Jan 30 03:01:35 node02 systemd[2651017]: Stopped target timers.target - Timers.
Jan 30 03:01:35 node02 systemd[2651017]: Closed dirmngr.socket - GnuPG network certificate management daemon.
Jan 30 03:01:35 node02 systemd[2651017]: Closed gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jan 30 03:01:35 node02 systemd[2651017]: Closed gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Jan 30 03:01:35 node02 systemd[2651017]: Closed gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Jan 30 03:01:35 node02 systemd[2651017]: Closed gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Jan 30 03:01:35 node02 systemd[2651017]: Removed slice app.slice - User Application Slice.
Jan 30 03:01:35 node02 systemd[2651017]: Reached target shutdown.target - Shutdown.
Jan 30 03:01:35 node02 systemd[2651017]: Finished systemd-exit.service - Exit the Session.
Jan 30 03:01:35 node02 systemd[2651017]: Reached target exit.target - Exit the Session.
Jan 30 03:01:35 node02 systemd[1]: user@0.service: Deactivated successfully.
Jan 30 03:01:35 node02 systemd[1]: Stopped user@0.service - User Manager for UID 0.
Jan 30 03:01:35 node02 systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Jan 30 03:01:35 node02 systemd[1]: run-user-0.mount: Deactivated successfully.
Jan 30 03:01:35 node02 systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
Jan 30 03:01:35 node02 systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Jan 30 03:01:35 node02 systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
Jan 30 03:01:35 node02 systemd[1]: user-0.slice: Consumed 11.111s CPU time.
Jan 30 03:02:02 node02 kernel: ixgbe 0000:05:00.1 enp5s0f1: NIC Link is Down
Jan 30 03:02:02 node02 kernel: vmbr1: port 1(enp5s0f1) entered disabled state
Jan 30 03:02:03 node02 corosync[2263]:   [KNET  ] link: host: 3 link: 0 is down
Jan 30 03:02:03 node02 corosync[2263]:   [KNET  ] link: host: 1 link: 0 is down
Jan 30 03:02:03 node02 corosync[2263]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 30 03:02:03 node02 corosync[2263]:   [KNET  ] host: host: 3 has no active links
Jan 30 03:02:03 node02 corosync[2263]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 30 03:02:03 node02 corosync[2263]:   [KNET  ] host: host: 1 has no active links
Jan 30 03:02:04 node02 corosync[2263]:   [TOTEM ] Token has not been received in 2737 ms
Jan 30 03:02:05 node02 corosync[2263]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Jan 30 03:02:09 node02 kernel: ixgbe 0000:05:00.1 enp5s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jan 30 03:02:09 node02 kernel: vmbr1: port 1(enp5s0f1) entered blocking state
Jan 30 03:02:09 node02 kernel: vmbr1: port 1(enp5s0f1) entered forwarding state
Jan 30 03:02:10 node02 corosync[2263]:   [QUORUM] Sync members[1]: 2
Jan 30 03:02:10 node02 corosync[2263]:   [QUORUM] Sync left[2]: 1 3
Jan 30 03:02:10 node02 corosync[2263]:   [TOTEM ] A new membership (2.166) was formed. Members left: 1 3
Jan 30 03:02:10 node02 corosync[2263]:   [TOTEM ] Failed to receive the leave message. failed: 1 3
Jan 30 03:02:10 node02 corosync[2263]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jan 30 03:02:10 node02 corosync[2263]:   [QUORUM] Members[1]: 2
Jan 30 03:02:10 node02 pmxcfs[2167]: [dcdb] notice: members: 2/2167
Jan 30 03:02:10 node02 corosync[2263]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 30 03:02:10 node02 pmxcfs[2167]: [status] notice: node lost quorum
Jan 30 03:02:10 node02 pmxcfs[2167]: [status] notice: members: 2/2167
Jan 30 03:02:10 node02 pmxcfs[2167]: [dcdb] crit: received write while not quorate - trigger resync
Jan 30 03:02:10 node02 pmxcfs[2167]: [dcdb] crit: leaving CPG group
Jan 30 03:02:10 node02 pve-ha-lrm[2330]: lost lock 'ha_agent_node02_lock - cfs lock update failed - Operation not permitted
Jan 30 03:02:10 node02 pve-ha-lrm[2330]: status change active => lost_agent_lock
Jan 30 03:02:10 node02 pmxcfs[2167]: [dcdb] notice: start cluster connection
Jan 30 03:02:10 node02 pmxcfs[2167]: [dcdb] crit: cpg_join failed: 14
Jan 30 03:02:10 node02 pmxcfs[2167]: [dcdb] crit: can't initialize service
Jan 30 03:02:10 node02 pve-ha-crm[2318]: lost lock 'ha_manager_lock - cfs lock update failed - Device or resource busy
Jan 30 03:02:10 node02 pve-ha-crm[2318]: status change master => lost_manager_lock
Jan 30 03:02:10 node02 pve-ha-crm[2318]: watchdog closed (disabled)
Jan 30 03:02:10 node02 pve-ha-crm[2318]: status change lost_manager_lock => wait_for_quorum
Jan 30 03:02:16 node02 pmxcfs[2167]: [dcdb] notice: members: 2/2167
Jan 30 03:02:16 node02 pmxcfs[2167]: [dcdb] notice: all data is up to date
Jan 30 03:02:17 node02 kernel: ixgbe 0000:05:00.1 enp5s0f1: NIC Link is Down
Jan 30 03:02:17 node02 kernel: vmbr1: port 1(enp5s0f1) entered disabled state
Jan 30 03:02:30 node02 pvestatd[2283]: pbs01-node02: error fetching datastores - 500 Can't connect to +:8007 (Temporary failure in name resolution)
Jan 30 03:02:31 node02 pvestatd[2283]: status update time (20.278 seconds)
Jan 30 03:02:42 node02 kernel: ixgbe 0000:05:00.1 enp5s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jan 30 03:02:42 node02 kernel: vmbr1: port 1(enp5s0f1) entered blocking state
Jan 30 03:02:42 node02 kernel: vmbr1: port 1(enp5s0f1) entered forwarding state
Jan 30 03:02:51 node02 pvestatd[2283]: pbs01-node02: error fetching datastores - 500 Can't connect to +:8007 (Temporary failure in name resolution)
Jan 30 03:02:51 node02 pvestatd[2283]: status update time (20.282 seconds)
Jan 30 03:02:51 node02 corosync[2263]:   [KNET  ] rx: host: 3 link: 0 is up
Jan 30 03:02:51 node02 corosync[2263]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Jan 30 03:02:51 node02 corosync[2263]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 30 03:02:51 node02 corosync[2263]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jan 30 03:02:51 node02 corosync[2263]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jan 30 03:02:51 node02 corosync[2263]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 30 03:02:51 node02 corosync[2263]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jan 30 03:02:51 node02 corosync[2263]:   [QUORUM] Sync members[3]: 1 2 3
Jan 30 03:02:51 node02 corosync[2263]:   [QUORUM] Sync joined[2]: 1 3
Jan 30 03:02:51 node02 corosync[2263]:   [TOTEM ] A new membership (1.16e) was formed. Members joined: 1 3
Jan 30 03:02:51 node02 pmxcfs[2167]: [dcdb] notice: members: 1/2502, 2/2167, 3/931
Jan 30 03:02:51 node02 pmxcfs[2167]: [dcdb] notice: starting data syncronisation
Jan 30 03:02:51 node02 pmxcfs[2167]: [status] notice: members: 1/2502, 2/2167, 3/931
Jan 30 03:02:51 node02 pmxcfs[2167]: [status] notice: starting data syncronisation
Jan 30 03:02:51 node02 corosync[2263]:   [QUORUM] This node is within the primary component and will provide service.
Jan 30 03:02:51 node02 corosync[2263]:   [QUORUM] Members[3]: 1 2 3
Jan 30 03:02:51 node02 pmxcfs[2167]: [status] notice: node has quorum
Jan 30 03:02:51 node02 corosync[2263]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 30 03:02:52 node02 pmxcfs[2167]: [dcdb] notice: received sync request (epoch 1/2502/00000006)
Jan 30 03:02:52 node02 pmxcfs[2167]: [status] notice: received sync request (epoch 1/2502/00000004)
Jan 30 03:02:52 node02 pmxcfs[2167]: [dcdb] notice: received all states
Jan 30 03:02:52 node02 pmxcfs[2167]: [dcdb] notice: leader is 1/2502
Jan 30 03:02:52 node02 pmxcfs[2167]: [dcdb] notice: synced members: 1/2502, 3/931
Jan 30 03:02:52 node02 pmxcfs[2167]: [dcdb] notice: waiting for updates from leader
Jan 30 03:02:52 node02 pmxcfs[2167]: [dcdb] notice: dfsm_deliver_queue: queue length 4
Jan 30 03:02:52 node02 pmxcfs[2167]: [status] notice: received all states
Jan 30 03:02:52 node02 pmxcfs[2167]: [status] notice: all data is up to date
Jan 30 03:02:52 node02 pmxcfs[2167]: [dcdb] notice: update complete - trying to commit (got 4 inode updates)
Jan 30 03:02:52 node02 pmxcfs[2167]: [dcdb] notice: all data is up to date
Jan 30 03:02:52 node02 pmxcfs[2167]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 4
Jan 30 03:02:53 node02 watchdog-mux[1810]: client watchdog expired - disable watchdog updates
Jan 30 03:02:55 node02 pve-ha-lrm[2330]: successfully acquired lock 'ha_agent_node02_lock'
Jan 30 03:02:55 node02 pve-ha-lrm[2330]: status change lost_agent_lock => active
Jan 30 03:02:55 node02 watchdog-mux[1810]: exit watchdog-mux with active connections
Jan 30 03:02:55 node02 systemd[1]: watchdog-mux.service: Deactivated successfully.
Jan 30 03:02:55 node02 systemd-journald[622]: Received client request to sync journal.
Jan 30 03:02:55 node02 kernel: watchdog: watchdog0: watchdog did not stop!
-- Reboot --

Does anyone have an idea what could be causing this?
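
For reference, the tail of the log is the usual PVE HA self-fencing sequence: both knet links go down, node02 drops out of quorum, pve-ha-lrm loses 'ha_agent_node02_lock', and once the watchdog-mux client watchdog expires the hardware watchdog reboots the node ("watchdog did not stop!"). A quick way to inspect link and quorum state around such an incident, using standard corosync/PVE tools (the time window below is just taken from this log):

Code:
# per-link knet connectivity as corosync currently sees it
corosync-cfgtool -s

# cluster membership and quorum state
pvecm status

# corosync and watchdog messages from the previous boot, around the incident
journalctl -b -1 -u corosync -u watchdog-mux --since "2024-01-30 03:01" --until "2024-01-30 03:05"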
 
I think I've found the cause: the core switch updated itself tonight at 3:06.
 
the core switch updated itself tonight at 3:06.
Whoa, I would disable that. Letting a core switch update unattended strikes me as dangerous.

But yes, the timing could roughly fit, depending on the update routine.
 
Whoa, I would disable that. Letting a core switch update unattended strikes me as dangerous.

But yes, the timing could roughly fit, depending on the update routine.
I disabled it right away; I must have overlooked that setting :-P
 
You have two sources of error there, though. I don't know of any core switches that run automatic updates, and your corosync network is not redundant. If the corosync network were redundant, you wouldn't have had a reboot.
 
You have two sources of error there, though. I don't know of any core switches that run automatic updates, and your corosync network is not redundant. If the corosync network were redundant, you wouldn't have had a reboot.
Yes, so far I only have one 10G NIC and one 10G switch; I'll add hardware there! That should then provide more redundancy.
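
Redundancy here means a second corosync link rather than just a bonded NIC: corosync/knet can fail over between independent links on its own. A minimal sketch of what the relevant parts of /etc/pve/corosync.conf could look like with a second ring; all names, addresses and priorities below are made up for illustration, and config_version must be increased on every edit:

Code:
# /etc/pve/corosync.conf (excerpt, hypothetical values)
totem {
  cluster_name: pve-cluster      # hypothetical
  config_version: 5              # bump on every change
  ip_version: ipv4-6
  link_mode: passive             # use one link at a time, fail over on loss
  interface {
    linknumber: 0
    knet_link_priority: 10       # preferred: existing 10G network
  }
  interface {
    linknumber: 1
    knet_link_priority: 5        # fallback: separate NIC and switch
  }
  secauth: on
  version: 2
}

nodelist {
  node {
    name: node02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.100.12   # hypothetical address on link 0
    ring1_addr: 10.10.10.12      # hypothetical address on link 1
  }
  # node01 and node03 get ring0_addr/ring1_addr entries accordingly
}

With link_mode: passive, corosync keeps using the higher-priority link 0 and only switches to link 1 when link 0 goes down, which is exactly the scenario of a core switch rebooting for an update.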
 
