For some reason most of the cluster crashed (the servers rebooted). It became stable after the reboot, but there was a short downtime.
I tried to find the reason in the logs, but I could not understand what caused it.
Here are the cluster logs from one of the nodes that was not rebooted (on the rebooted nodes I can't see any logs from before the reboot).
The beginning of the crash (more lines are in the attached file):
Code:
Nov 05 09:05:00 pve-srv2 systemd[1]: Starting Proxmox VE replication runner...
Nov 05 09:05:01 pve-srv2 systemd[1]: pvesr.service: Succeeded.
Nov 05 09:05:01 pve-srv2 systemd[1]: Finished Proxmox VE replication runner.
Nov 05 09:05:52 pve-srv2 corosync[4973]: [TOTEM ] Token has not been received in 6637 ms
Nov 05 09:05:58 pve-srv2 corosync[4973]: [QUORUM] Sync members[11]: 1 2 3 4 5 6 7 8 9 10 11
Nov 05 09:05:58 pve-srv2 corosync[4973]: [TOTEM ] A new membership (1.357c) was formed. Members
Nov 05 09:06:00 pve-srv2 systemd[1]: Starting Proxmox VE replication runner...
Nov 05 09:06:03 pve-srv2 pmxcfs[4609]: [status] notice: cpg_send_message retry 10
Nov 05 09:06:04 pve-srv2 pmxcfs[4609]: [status] notice: cpg_send_message retry 20
Nov 05 09:06:05 pve-srv2 corosync[4973]: [TOTEM ] Token has not been received in 6637 ms
Nov 05 09:06:05 pve-srv2 pmxcfs[4609]: [status] notice: cpg_send_message retry 30
Nov 05 09:06:06 pve-srv2 pmxcfs[4609]: [status] notice: cpg_send_message retry 40
Nov 05 09:06:07 pve-srv2 corosync[4973]: [QUORUM] Members[11]: 1 2 3 4 5 6 7 8 9 10 11
Nov 05 09:06:07 pve-srv2 corosync[4973]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 05 09:06:07 pve-srv2 pmxcfs[4609]: [status] notice: cpg_send_message retried 41 times
Nov 05 09:06:09 pve-srv2 pvesr[3942067]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 05 09:06:10 pve-srv2 pvesr[3942067]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 05 09:06:18 pve-srv2 corosync[4973]: [TOTEM ] Token has not been received in 6637 ms
Nov 05 09:06:19 pve-srv2 pvesr[3942067]: cfs-lock 'file-replication_cfg' error: got lock request timeout
Nov 05 09:06:19 pve-srv2 systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Nov 05 09:06:19 pve-srv2 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Nov 05 09:06:19 pve-srv2 systemd[1]: Failed to start Proxmox VE replication runner.
Nov 05 09:06:22 pve-srv2 corosync[4973]: [KNET ] link: host: 2 link: 0 is down
Nov 05 09:06:22 pve-srv2 corosync[4973]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 05 09:06:22 pve-srv2 corosync[4973]: [KNET ] host: host: 2 has no active links
Nov 05 09:06:27 pve-srv2 corosync[4973]: [KNET ] rx: host: 2 link: 0 is up
Nov 05 09:06:27 pve-srv2 corosync[4973]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 05 09:06:28 pve-srv2 corosync[4973]: [TOTEM ] Token has not been received in 6637 ms
Nov 05 09:06:34 pve-srv2 corosync[4973]: [KNET ] link: host: 2 link: 0 is down
Nov 05 09:06:34 pve-srv2 corosync[4973]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 05 09:06:34 pve-srv2 corosync[4973]: [KNET ] host: host: 2 has no active links
Nov 05 09:06:39 pve-srv2 corosync[4973]: [KNET ] rx: host: 2 link: 0 is up
Nov 05 09:06:39 pve-srv2 corosync[4973]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 05 09:06:50 pve-srv2 corosync[4973]: [QUORUM] Sync members[10]: 2 3 4 5 6 7 8 9 10 11
Nov 05 09:06:50 pve-srv2 corosync[4973]: [QUORUM] Sync left[1]: 1
Nov 05 09:06:50 pve-srv2 corosync[4973]: [TOTEM ] A new membership (2.3584) was formed. Members left: 1
Nov 05 09:06:50 pve-srv2 corosync[4973]: [TOTEM ] Failed to receive the leave message. failed: 1
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: members: 2/1068, 3/5510, 4/1933, 5/2000, 6/2873, 7/2076, 8/4609, 9/4036, 10/3760, 11/4185
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: starting data syncronisation
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [status] notice: members: 2/1068, 3/5510, 4/1933, 5/2000, 6/2873, 7/2076, 8/4609, 9/4036, 10/3760, 11/4185
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [status] notice: starting data syncronisation
Nov 05 09:06:50 pve-srv2 corosync[4973]: [QUORUM] Members[10]: 2 3 4 5 6 7 8 9 10 11
Nov 05 09:06:50 pve-srv2 corosync[4973]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: received sync request (epoch 2/1068/00000052)
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [status] notice: received sync request (epoch 2/1068/00000046)
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: received all states
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: leader is 2/1068
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: synced members: 2/1068, 3/5510, 4/1933, 5/2000, 6/2873, 7/2076, 8/4609, 9/4036, 10/3760, 11/4185
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: all data is up to date
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: dfsm_deliver_queue: queue length 28
Nov 05 09:06:50 pve-srv2 pve-ha-crm[6362]: loop take too long (31 seconds)
Nov 05 09:06:55 pve-srv2 pve-ha-lrm[6758]: loop take too long (31 seconds)
Nov 05 09:07:00 pve-srv2 systemd[1]: Starting Proxmox VE replication runner...
Nov 05 09:07:11 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: 7 8 b d e f 11
Nov 05 09:07:11 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32
Nov 05 09:07:13 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: 39 3a 3b 3c 40 41 42 43 44 45 5b 5c 5d
Nov 05 09:07:14 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: 50 51 52 53 54 55 88 89 8a 8b
Nov 05 09:07:17 pve-srv2 corosync[4973]: [KNET ] link: host: 1 link: 0 is down
Nov 05 09:07:17 pve-srv2 corosync[4973]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 05 09:07:17 pve-srv2 corosync[4973]: [KNET ] host: host: 1 has no active links
Nov 05 09:07:18 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: 68 69 6a 6b 6c 6d 6e 6f 70 b5 b6 b7 b8
Nov 05 09:07:25 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: 6a 6b 6c 6d 6e 6f 70 71 d4 d5 d6 d7
Nov 05 09:07:25 pve-srv2 corosync[4973]: [KNET ] rx: host: 1 link: 0 is up
Nov 05 09:07:25 pve-srv2 corosync[4973]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 05 09:07:29 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: 85 87 e4 e5 e6 e7
Nov 05 09:07:29 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: 9f a3 a4 a5 a6
Nov 05 09:07:31 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: a7 a8 a9 aa ab ac
Nov 05 09:07:33 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: 106 107 108
Nov 05 09:07:36 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: cf d1 110 111
Nov 05 09:07:36 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: 11d 11e
Nov 05 09:07:38 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: 12c 12d
Nov 05 09:07:44 pve-srv2 corosync[4973]: [KNET ] link: host: 1 link: 0 is down
Nov 05 09:07:44 pve-srv2 corosync[4973]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 05 09:07:44 pve-srv2 corosync[4973]: [KNET ] host: host: 1 has no active links
Nov 05 09:07:49 pve-srv2 corosync[4973]: [KNET ] rx: host: 1 link: 0 is up
Nov 05 09:07:49 pve-srv2 corosync[4973]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 05 09:07:55 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: 1ff 200
Nov 05 09:07:55 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: 22d 22e 22f 230 231
Nov 05 09:07:55 pve-srv2 corosync[4973]: [TOTEM ] Retransmit List: 236 237
Nov 05 09:07:58 pve-srv2 pmxcfs[4609]: [status] notice: received all states
Nov 05 09:07:58 pve-srv2 corosync[4973]: [QUORUM] Sync members[10]: 2 3 4 5 6 7 8 9 10 11
Nov 05 09:07:58 pve-srv2 corosync[4973]: [QUORUM] Sync joined[5]: 2 3 4 5 6
Nov 05 09:07:58 pve-srv2 corosync[4973]: [QUORUM] Sync left[5]: 2 3 4 5 6