cluster crash with unknown reason. logs attached

ilia987 · Nov 5, 2021

For some reason most of the cluster crashed (the servers rebooted). It became stable after the reboot, but there was a short downtime.
I tried to find the reason in the logs, but I could not understand what caused it.
Here are the cluster logs from one of the nodes that was not rebooted (on the rebooted nodes I can't see any logs from before the reboot).

The beginning of the crash (more lines in the attached file):

Code:

Nov 05 09:05:00 pve-srv2 systemd[1]: Starting Proxmox VE replication runner...
Nov 05 09:05:01 pve-srv2 systemd[1]: pvesr.service: Succeeded.
Nov 05 09:05:01 pve-srv2 systemd[1]: Finished Proxmox VE replication runner.
Nov 05 09:05:52 pve-srv2 corosync[4973]:   [TOTEM ] Token has not been received in 6637 ms
Nov 05 09:05:58 pve-srv2 corosync[4973]:   [QUORUM] Sync members[11]: 1 2 3 4 5 6 7 8 9 10 11
Nov 05 09:05:58 pve-srv2 corosync[4973]:   [TOTEM ] A new membership (1.357c) was formed. Members
Nov 05 09:06:00 pve-srv2 systemd[1]: Starting Proxmox VE replication runner...
Nov 05 09:06:03 pve-srv2 pmxcfs[4609]: [status] notice: cpg_send_message retry 10
Nov 05 09:06:04 pve-srv2 pmxcfs[4609]: [status] notice: cpg_send_message retry 20
Nov 05 09:06:05 pve-srv2 corosync[4973]:   [TOTEM ] Token has not been received in 6637 ms
Nov 05 09:06:05 pve-srv2 pmxcfs[4609]: [status] notice: cpg_send_message retry 30
Nov 05 09:06:06 pve-srv2 pmxcfs[4609]: [status] notice: cpg_send_message retry 40
Nov 05 09:06:07 pve-srv2 corosync[4973]:   [QUORUM] Members[11]: 1 2 3 4 5 6 7 8 9 10 11
Nov 05 09:06:07 pve-srv2 corosync[4973]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 05 09:06:07 pve-srv2 pmxcfs[4609]: [status] notice: cpg_send_message retried 41 times
Nov 05 09:06:09 pve-srv2 pvesr[3942067]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 05 09:06:10 pve-srv2 pvesr[3942067]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 05 09:06:18 pve-srv2 corosync[4973]:   [TOTEM ] Token has not been received in 6637 ms
Nov 05 09:06:19 pve-srv2 pvesr[3942067]: cfs-lock 'file-replication_cfg' error: got lock request timeout
Nov 05 09:06:19 pve-srv2 systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Nov 05 09:06:19 pve-srv2 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Nov 05 09:06:19 pve-srv2 systemd[1]: Failed to start Proxmox VE replication runner.
Nov 05 09:06:22 pve-srv2 corosync[4973]:   [KNET  ] link: host: 2 link: 0 is down
Nov 05 09:06:22 pve-srv2 corosync[4973]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 05 09:06:22 pve-srv2 corosync[4973]:   [KNET  ] host: host: 2 has no active links
Nov 05 09:06:27 pve-srv2 corosync[4973]:   [KNET  ] rx: host: 2 link: 0 is up
Nov 05 09:06:27 pve-srv2 corosync[4973]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 05 09:06:28 pve-srv2 corosync[4973]:   [TOTEM ] Token has not been received in 6637 ms
Nov 05 09:06:34 pve-srv2 corosync[4973]:   [KNET  ] link: host: 2 link: 0 is down
Nov 05 09:06:34 pve-srv2 corosync[4973]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 05 09:06:34 pve-srv2 corosync[4973]:   [KNET  ] host: host: 2 has no active links
Nov 05 09:06:39 pve-srv2 corosync[4973]:   [KNET  ] rx: host: 2 link: 0 is up
Nov 05 09:06:39 pve-srv2 corosync[4973]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 05 09:06:50 pve-srv2 corosync[4973]:   [QUORUM] Sync members[10]: 2 3 4 5 6 7 8 9 10 11
Nov 05 09:06:50 pve-srv2 corosync[4973]:   [QUORUM] Sync left[1]: 1
Nov 05 09:06:50 pve-srv2 corosync[4973]:   [TOTEM ] A new membership (2.3584) was formed. Members left: 1
Nov 05 09:06:50 pve-srv2 corosync[4973]:   [TOTEM ] Failed to receive the leave message. failed: 1
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: members: 2/1068, 3/5510, 4/1933, 5/2000, 6/2873, 7/2076, 8/4609, 9/4036, 10/3760, 11/4185
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: starting data syncronisation
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [status] notice: members: 2/1068, 3/5510, 4/1933, 5/2000, 6/2873, 7/2076, 8/4609, 9/4036, 10/3760, 11/4185
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [status] notice: starting data syncronisation
Nov 05 09:06:50 pve-srv2 corosync[4973]:   [QUORUM] Members[10]: 2 3 4 5 6 7 8 9 10 11
Nov 05 09:06:50 pve-srv2 corosync[4973]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: received sync request (epoch 2/1068/00000052)
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [status] notice: received sync request (epoch 2/1068/00000046)
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: received all states
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: leader is 2/1068
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: synced members: 2/1068, 3/5510, 4/1933, 5/2000, 6/2873, 7/2076, 8/4609, 9/4036, 10/3760, 11/4185
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: all data is up to date
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: dfsm_deliver_queue: queue length 28
Nov 05 09:06:50 pve-srv2 pve-ha-crm[6362]: loop take too long (31 seconds)
Nov 05 09:06:55 pve-srv2 pve-ha-lrm[6758]: loop take too long (31 seconds)
Nov 05 09:07:00 pve-srv2 systemd[1]: Starting Proxmox VE replication runner...
Nov 05 09:07:11 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 7 8 b d e f 11
Nov 05 09:07:11 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32
Nov 05 09:07:13 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 39 3a 3b 3c 40 41 42 43 44 45 5b 5c 5d
Nov 05 09:07:14 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 50 51 52 53 54 55 88 89 8a 8b
Nov 05 09:07:17 pve-srv2 corosync[4973]:   [KNET  ] link: host: 1 link: 0 is down
Nov 05 09:07:17 pve-srv2 corosync[4973]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 05 09:07:17 pve-srv2 corosync[4973]:   [KNET  ] host: host: 1 has no active links
Nov 05 09:07:18 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 68 69 6a 6b 6c 6d 6e 6f 70 b5 b6 b7 b8
Nov 05 09:07:25 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 6a 6b 6c 6d 6e 6f 70 71 d4 d5 d6 d7
Nov 05 09:07:25 pve-srv2 corosync[4973]:   [KNET  ] rx: host: 1 link: 0 is up
Nov 05 09:07:25 pve-srv2 corosync[4973]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 05 09:07:29 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 85 87 e4 e5 e6 e7
Nov 05 09:07:29 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 9f a3 a4 a5 a6
Nov 05 09:07:31 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: a7 a8 a9 aa ab ac
Nov 05 09:07:33 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 106 107 108
Nov 05 09:07:36 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: cf d1 110 111
Nov 05 09:07:36 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 11d 11e
Nov 05 09:07:38 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 12c 12d
Nov 05 09:07:44 pve-srv2 corosync[4973]:   [KNET  ] link: host: 1 link: 0 is down
Nov 05 09:07:44 pve-srv2 corosync[4973]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 05 09:07:44 pve-srv2 corosync[4973]:   [KNET  ] host: host: 1 has no active links
Nov 05 09:07:49 pve-srv2 corosync[4973]:   [KNET  ] rx: host: 1 link: 0 is up
Nov 05 09:07:49 pve-srv2 corosync[4973]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 05 09:07:55 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 1ff 200
Nov 05 09:07:55 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 22d 22e 22f 230 231
Nov 05 09:07:55 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 236 237
Nov 05 09:07:58 pve-srv2 pmxcfs[4609]: [status] notice: received all states
Nov 05 09:07:58 pve-srv2 corosync[4973]:   [QUORUM] Sync members[10]: 2 3 4 5 6 7 8 9 10 11
Nov 05 09:07:58 pve-srv2 corosync[4973]:   [QUORUM] Sync joined[5]: 2 3 4 5 6
Nov 05 09:07:58 pve-srv2 corosync[4973]:   [QUORUM] Sync left[5]: 2 3 4 5 6

Pifouney · Nov 7, 2021

Hey,

Regarding your attached logs, there is a lot of information in this line:

Nov 05 09:08:30 pve-srv2 pvesr[3949098]: cfs-lock 'file-replication_cfg' error: no quorum!

For a production cluster you need at least three nodes. If you have three nodes, check your network connectivity.
If you have only two nodes, try running "pvecm expected 1" (a temporary setting, forgotten after reboot).
If you have only one node, it's a filesystem problem.
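
The node-count advice above comes down to majority arithmetic. A minimal sketch, assuming every node carries exactly one vote (the function names are made up for illustration and are not part of any Proxmox or corosync API):

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def tolerated_failures(total_votes: int) -> int:
    """How many single-vote nodes can drop out before quorum is lost."""
    return total_votes - votes_needed(total_votes)


# Assumption: one vote per node.
for nodes in (1, 2, 3, 11):
    print(f"{nodes:2d} nodes: need {votes_needed(nodes)} votes, "
          f"can lose {tolerated_failures(nodes)}")
#  1 nodes: need 1 votes, can lose 0
#  2 nodes: need 2 votes, can lose 0
#  3 nodes: need 2 votes, can lose 1
# 11 nodes: need 6 votes, can lose 5
```

With two nodes, losing either one leaves the survivor below the two votes it needs, which is why temporarily lowering the expected votes is suggested for that case.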

But this kind of log entry makes me think you have a problem with your network connectivity:

Nov 05 09:08:31 pve-srv2 lsass[4696]: [lsass] The cached machine account password was rejected by the DC.

ilia987 · Nov 7, 2021

I have checked the switch for errors and there were none (all nodes are connected to the same switch).

The only thing I can think of is that there was some load or a freeze on the switch (but nothing special was logged). I am waiting for some hardware in order to migrate the corosync ring to a dedicated network. Hopefully that will improve stability.

Pifouney · Nov 7, 2021

Have you checked your hosts and resolv.conf files, and whether FQDN resolution is reliable?

ilia987 · Nov 7, 2021

We don't have anything special. Our DC is online without issues and was up when the error occurred, and it had been running for around a month (I upgraded v6.4 -> v7 and rebooted each host).

We faced something like this a few months ago after a power failure. It took a few days for the system to stabilize (we had random crashes during the first few days). I don't know what caused it; it resolved by itself.

fabian · Nov 8, 2021

the logs (which are not very complete) indicate that you had network problems (see the link down events). if you have HA enabled, losing quorum for long enough will cause nodes to be fenced.
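
The link down events mentioned above can be counted per host with a short script. This is only a rough sketch: the regular expressions are assumptions derived from the few lines quoted in this thread (copied here as sample input), not from a corosync format specification, and the function name is illustrative.

```python
import re
from collections import Counter

# Patterns guessed from the log lines posted in this thread.
LINK_RE = re.compile(
    r"\[KNET\s*\] (?:link|rx): host: (\d+) link: (\d+) is (up|down)"
)
TOKEN_RE = re.compile(r"\[TOTEM\s*\] Token has not been received in (\d+) ms")


def summarize(lines):
    """Return (link-down count per host id, number of token warnings)."""
    link_down = Counter()
    token_warnings = 0
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            if match.group(3) == "down":
                link_down[int(match.group(1))] += 1
        elif TOKEN_RE.search(line):
            token_warnings += 1
    return link_down, token_warnings


# Sample input: lines taken verbatim from the first post.
SAMPLE = """\
Nov 05 09:05:52 pve-srv2 corosync[4973]:   [TOTEM ] Token has not been received in 6637 ms
Nov 05 09:06:22 pve-srv2 corosync[4973]:   [KNET  ] link: host: 2 link: 0 is down
Nov 05 09:06:27 pve-srv2 corosync[4973]:   [KNET  ] rx: host: 2 link: 0 is up
Nov 05 09:06:34 pve-srv2 corosync[4973]:   [KNET  ] link: host: 2 link: 0 is down
Nov 05 09:07:17 pve-srv2 corosync[4973]:   [KNET  ] link: host: 1 link: 0 is down
""".splitlines()

link_down, token_warnings = summarize(SAMPLE)
for host, count in sorted(link_down.items()):
    print(f"host {host}: link went down {count} time(s)")
print(f"token warnings: {token_warnings}")
```

Running the same function over the journal of every node that stayed up would show whether the drops cluster around particular hosts or hit everyone at once, which is what a shared switch problem would look like.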

ilia987 · Nov 13, 2021

I tried to provide all the logs without exposing some of our internal info (IP/MAC). This just occurred again today; I'll try to investigate it further.

What I don't understand is why a network issue (if that is the case) triggers a server reboot, and moreover a reboot of most of the cluster.

fabian · Nov 15, 2021

if you have HA enabled, nodes that lose quorum (are not part of the majority of the cluster anymore) will fence themselves - it's the only way that it is safe for the remaining majority to take over the HA-enabled resources (like VMs).

see https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_pvecm and https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_ha_manager
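
To illustrate the rule above, here is a small model of which nodes would fence themselves for a given network split. It is a sketch under stated assumptions: one vote per node, only nodes with active HA resources fence, and the split lasts long enough for fencing to trigger. The partitions below are hypothetical scenarios for an 11-node cluster, not something read from the posted logs, and the function names are invented for the example.

```python
def is_quorate(partition_size: int, total_nodes: int) -> bool:
    """A partition is quorate only if it holds a strict majority."""
    return partition_size > total_nodes // 2


def nodes_that_fence(partitions, total_nodes, ha_active):
    """Nodes with active HA resources that sit in a non-quorate partition."""
    fenced = set()
    for partition in partitions:
        if not is_quorate(len(partition), total_nodes):
            fenced |= set(partition) & set(ha_active)
    return fenced


TOTAL = 11
ALL_NODES = set(range(1, TOTAL + 1))  # assumption: HA is active on every node

# Hypothetical scenario A: only node 1 is cut off; the other ten keep quorum.
print(sorted(nodes_that_fence([{1}, ALL_NODES - {1}], TOTAL, ALL_NODES)))
# [1]

# Hypothetical scenario B: the network splits 5/3/3, so no group reaches 6 votes.
split = [{1, 2, 3, 4, 5}, {6, 7, 8}, {9, 10, 11}]
print(sorted(nodes_that_fence(split, TOTAL, ALL_NODES)))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Scenario B shows how a fault on a single shared switch could, in principle, take down most or all of a cluster at once: if no group of nodes can form a majority, every HA-active node is on the losing side.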

ilia987 · Dec 22, 2021

I have read it already.
I don't understand why the entire cluster reboots.

For now I'll try to disable HA to isolate the issue, but I don't think that is the cause.
