Cluster crash for unknown reason, logs attached

ilia987

Member
Sep 9, 2019
236
10
23
35
For some reason most of the cluster crashed (the servers rebooted). It became stable after the reboot, but there was a short downtime.
I tried to find the reason in the logs, but I could not work out what caused it.
Here are the logs from one of the nodes that was not rebooted (on the rebooted nodes I cannot see any logs from before the reboot).

The beginning of the crash (more lines are in the attached file):
Code:
Nov 05 09:05:00 pve-srv2 systemd[1]: Starting Proxmox VE replication runner...
Nov 05 09:05:01 pve-srv2 systemd[1]: pvesr.service: Succeeded.
Nov 05 09:05:01 pve-srv2 systemd[1]: Finished Proxmox VE replication runner.
Nov 05 09:05:52 pve-srv2 corosync[4973]:   [TOTEM ] Token has not been received in 6637 ms
Nov 05 09:05:58 pve-srv2 corosync[4973]:   [QUORUM] Sync members[11]: 1 2 3 4 5 6 7 8 9 10 11
Nov 05 09:05:58 pve-srv2 corosync[4973]:   [TOTEM ] A new membership (1.357c) was formed. Members
Nov 05 09:06:00 pve-srv2 systemd[1]: Starting Proxmox VE replication runner...
Nov 05 09:06:03 pve-srv2 pmxcfs[4609]: [status] notice: cpg_send_message retry 10
Nov 05 09:06:04 pve-srv2 pmxcfs[4609]: [status] notice: cpg_send_message retry 20
Nov 05 09:06:05 pve-srv2 corosync[4973]:   [TOTEM ] Token has not been received in 6637 ms
Nov 05 09:06:05 pve-srv2 pmxcfs[4609]: [status] notice: cpg_send_message retry 30
Nov 05 09:06:06 pve-srv2 pmxcfs[4609]: [status] notice: cpg_send_message retry 40
Nov 05 09:06:07 pve-srv2 corosync[4973]:   [QUORUM] Members[11]: 1 2 3 4 5 6 7 8 9 10 11
Nov 05 09:06:07 pve-srv2 corosync[4973]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 05 09:06:07 pve-srv2 pmxcfs[4609]: [status] notice: cpg_send_message retried 41 times
Nov 05 09:06:09 pve-srv2 pvesr[3942067]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 05 09:06:10 pve-srv2 pvesr[3942067]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 05 09:06:18 pve-srv2 corosync[4973]:   [TOTEM ] Token has not been received in 6637 ms
Nov 05 09:06:19 pve-srv2 pvesr[3942067]: cfs-lock 'file-replication_cfg' error: got lock request timeout
Nov 05 09:06:19 pve-srv2 systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Nov 05 09:06:19 pve-srv2 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Nov 05 09:06:19 pve-srv2 systemd[1]: Failed to start Proxmox VE replication runner.
Nov 05 09:06:22 pve-srv2 corosync[4973]:   [KNET  ] link: host: 2 link: 0 is down
Nov 05 09:06:22 pve-srv2 corosync[4973]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 05 09:06:22 pve-srv2 corosync[4973]:   [KNET  ] host: host: 2 has no active links
Nov 05 09:06:27 pve-srv2 corosync[4973]:   [KNET  ] rx: host: 2 link: 0 is up
Nov 05 09:06:27 pve-srv2 corosync[4973]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 05 09:06:28 pve-srv2 corosync[4973]:   [TOTEM ] Token has not been received in 6637 ms
Nov 05 09:06:34 pve-srv2 corosync[4973]:   [KNET  ] link: host: 2 link: 0 is down
Nov 05 09:06:34 pve-srv2 corosync[4973]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 05 09:06:34 pve-srv2 corosync[4973]:   [KNET  ] host: host: 2 has no active links
Nov 05 09:06:39 pve-srv2 corosync[4973]:   [KNET  ] rx: host: 2 link: 0 is up
Nov 05 09:06:39 pve-srv2 corosync[4973]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 05 09:06:50 pve-srv2 corosync[4973]:   [QUORUM] Sync members[10]: 2 3 4 5 6 7 8 9 10 11
Nov 05 09:06:50 pve-srv2 corosync[4973]:   [QUORUM] Sync left[1]: 1
Nov 05 09:06:50 pve-srv2 corosync[4973]:   [TOTEM ] A new membership (2.3584) was formed. Members left: 1
Nov 05 09:06:50 pve-srv2 corosync[4973]:   [TOTEM ] Failed to receive the leave message. failed: 1
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: members: 2/1068, 3/5510, 4/1933, 5/2000, 6/2873, 7/2076, 8/4609, 9/4036, 10/3760, 11/4185
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: starting data syncronisation
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [status] notice: members: 2/1068, 3/5510, 4/1933, 5/2000, 6/2873, 7/2076, 8/4609, 9/4036, 10/3760, 11/4185
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [status] notice: starting data syncronisation
Nov 05 09:06:50 pve-srv2 corosync[4973]:   [QUORUM] Members[10]: 2 3 4 5 6 7 8 9 10 11
Nov 05 09:06:50 pve-srv2 corosync[4973]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: received sync request (epoch 2/1068/00000052)
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [status] notice: received sync request (epoch 2/1068/00000046)
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: received all states
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: leader is 2/1068
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: synced members: 2/1068, 3/5510, 4/1933, 5/2000, 6/2873, 7/2076, 8/4609, 9/4036, 10/3760, 11/4185
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: all data is up to date
Nov 05 09:06:50 pve-srv2 pmxcfs[4609]: [dcdb] notice: dfsm_deliver_queue: queue length 28
Nov 05 09:06:50 pve-srv2 pve-ha-crm[6362]: loop take too long (31 seconds)
Nov 05 09:06:55 pve-srv2 pve-ha-lrm[6758]: loop take too long (31 seconds)
Nov 05 09:07:00 pve-srv2 systemd[1]: Starting Proxmox VE replication runner...
Nov 05 09:07:11 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 7 8 b d e f 11
Nov 05 09:07:11 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32
Nov 05 09:07:13 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 39 3a 3b 3c 40 41 42 43 44 45 5b 5c 5d
Nov 05 09:07:14 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 50 51 52 53 54 55 88 89 8a 8b
Nov 05 09:07:17 pve-srv2 corosync[4973]:   [KNET  ] link: host: 1 link: 0 is down
Nov 05 09:07:17 pve-srv2 corosync[4973]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 05 09:07:17 pve-srv2 corosync[4973]:   [KNET  ] host: host: 1 has no active links
Nov 05 09:07:18 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 68 69 6a 6b 6c 6d 6e 6f 70 b5 b6 b7 b8
Nov 05 09:07:25 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 6a 6b 6c 6d 6e 6f 70 71 d4 d5 d6 d7
Nov 05 09:07:25 pve-srv2 corosync[4973]:   [KNET  ] rx: host: 1 link: 0 is up
Nov 05 09:07:25 pve-srv2 corosync[4973]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 05 09:07:29 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 85 87 e4 e5 e6 e7
Nov 05 09:07:29 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 9f a3 a4 a5 a6
Nov 05 09:07:31 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: a7 a8 a9 aa ab ac
Nov 05 09:07:33 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 106 107 108
Nov 05 09:07:36 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: cf d1 110 111
Nov 05 09:07:36 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 11d 11e
Nov 05 09:07:38 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 12c 12d
Nov 05 09:07:44 pve-srv2 corosync[4973]:   [KNET  ] link: host: 1 link: 0 is down
Nov 05 09:07:44 pve-srv2 corosync[4973]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 05 09:07:44 pve-srv2 corosync[4973]:   [KNET  ] host: host: 1 has no active links
Nov 05 09:07:49 pve-srv2 corosync[4973]:   [KNET  ] rx: host: 1 link: 0 is up
Nov 05 09:07:49 pve-srv2 corosync[4973]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 05 09:07:55 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 1ff 200
Nov 05 09:07:55 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 22d 22e 22f 230 231
Nov 05 09:07:55 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 236 237
Nov 05 09:07:58 pve-srv2 pmxcfs[4609]: [status] notice: received all states
Nov 05 09:07:58 pve-srv2 corosync[4973]:   [QUORUM] Sync members[10]: 2 3 4 5 6 7 8 9 10 11
Nov 05 09:07:58 pve-srv2 corosync[4973]:   [QUORUM] Sync joined[5]: 2 3 4 5 6
Nov 05 09:07:58 pve-srv2 corosync[4973]:   [QUORUM] Sync left[5]: 2 3 4 5 6
 

Attachments

  • crashlog.txt
    20.5 KB · Views: 4

Pifouney

Member
Oct 17, 2021
91
5
8
33
Hey,

Regarding your attached logs, there is a lot of information in this line:
Nov 05 09:08:30 pve-srv2 pvesr[3949098]: cfs-lock 'file-replication_cfg' error: no quorum!
For a production cluster you need at least three nodes. If you have three nodes, check your network connectivity.
If you have only two nodes, try running "pvecm expected 1" (a temporary setting that is lost after a reboot).
If you have only one node, it is a filesystem problem.

But this kind of log entry makes me think that you have a problem with your network connectivity:
Nov 05 09:08:31 pve-srv2 lsass[4696]: [lsass] The cached machine account password was rejected by the DC.
 

ilia987

Member
Sep 9, 2019
236
10
23
35
I checked the switch for errors and there were none (all nodes are connected to the same switch).

The only thing I can think of is that there was some load or a freeze on the switch (but nothing special was logged). I am waiting for some hardware in order to migrate the corosync ring to a dedicated network. Hopefully that will improve stability.
 

Pifouney

Member
Oct 17, 2021
91
5
8
33
Have you checked your hosts and resolv.conf files, and the reliability of your FQDN resolution?
 

ilia987

Member
Sep 9, 2019
236
10
23
35
We don't have anything special. Our DC is online without issues and was up when the error occurred, and it had been running for around a month (I upgraded v6.4 -> v7 and rebooted each host).

We faced something like this a few months ago after a power failure. It took a few days for the system to stabilize (we had random crashes during the first few days). I don't know what caused it; it resolved by itself.
 

fabian

Proxmox Staff Member
Staff member
Jan 7, 2016
7,483
1,396
164
the logs (which are not very complete) indicate that you had network problems (see the link down events). if you have HA enabled, losing quorum for long enough will cause nodes to be fenced.
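
To get an overview of the link down events mentioned above, a rough Python sketch like the following could count them per host. It assumes only the journal line format quoted in this thread; `summarize`, `LINE_RE`, `LINK_RE` and `SAMPLE` are hypothetical names chosen for illustration, and the regular expressions may need adjusting for other corosync versions or log formats.

```python
import re
from collections import Counter

# Matches journal lines in the format quoted in this thread, e.g.
# "Nov 05 09:06:22 pve-srv2 corosync[4973]:   [KNET  ] link: host: 2 link: 0 is down"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ corosync\[\d+\]:\s+"
    r"\[(?P<subsys>\w+)\s*\] (?P<msg>.*)$"
)
# Matches only the KNET link state changes ("link: ... is down" / "rx: ... is up").
LINK_RE = re.compile(
    r"^(?:link|rx): host: (?P<host>\d+) link: (?P<link>\d+) is (?P<state>up|down)$"
)


def summarize(lines):
    """Count token timeouts, retransmit lists and per-host link flaps."""
    summary = {
        "token_timeouts": 0,
        "retransmits": 0,
        "link_down": Counter(),
        "link_up": Counter(),
    }
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a corosync line in the assumed format
        msg = m.group("msg")
        if msg.startswith("Token has not been received"):
            summary["token_timeouts"] += 1
        elif msg.startswith("Retransmit List:"):
            summary["retransmits"] += 1
        else:
            link = LINK_RE.match(msg)
            if link:
                key = "link_" + link.group("state")
                summary[key][int(link.group("host"))] += 1
    return summary


# A few lines copied from the log excerpt in the first post.
SAMPLE = """\
Nov 05 09:06:18 pve-srv2 corosync[4973]:   [TOTEM ] Token has not been received in 6637 ms
Nov 05 09:06:22 pve-srv2 corosync[4973]:   [KNET  ] link: host: 2 link: 0 is down
Nov 05 09:06:27 pve-srv2 corosync[4973]:   [KNET  ] rx: host: 2 link: 0 is up
Nov 05 09:07:11 pve-srv2 corosync[4973]:   [TOTEM ] Retransmit List: 7 8 b d e f 11
Nov 05 09:07:17 pve-srv2 corosync[4973]:   [KNET  ] link: host: 1 link: 0 is down
"""

if __name__ == "__main__":
    print(summarize(SAMPLE.splitlines()))
```

The same function could be fed the full attachment or saved output of `journalctl -u corosync` from each node; comparing the per-host counts across nodes may help narrow down whether one link or the whole switch was affected.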
 

ilia987

Member
Sep 9, 2019
236
10
23
35
I tried to provide all the logs without exposing some of our internal info (IP\MAC). This just occurred again today; I'll try to investigate it further.

What I don't understand is why a network issue (if that is the case) triggers a server reboot, and moreover a reboot of most of the cluster.
 

fabian

Proxmox Staff Member
Staff member
Jan 7, 2016
7,483
1,396
164
if you have HA enabled, nodes that lose quorum (are not part of the majority of the cluster anymore) will fence themselves - it's the only way that it is safe for the remaining majority to take over the HA-enabled resources (like VMs).

see https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_pvecm and https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_ha_manager
 

ilia987

Member
Sep 9, 2019
236
10
23
35
I have read it already.
I don't understand why the entire cluster reboots.

For now I'll try to disable HA to isolate the issue, but I don't think that is the cause.
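
The majority rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: one vote per node, the 11 node IDs seen in the logs, and a strict-majority threshold. The function names and the second partition layout are hypothetical and are not taken from the attached logs.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def partitions_with_quorum(partitions, total_votes):
    """Return the partitions (lists of node IDs) that still hold a majority.

    Assumes every node carries exactly one vote (an assumption for
    illustration; real clusters can be configured differently).
    """
    need = quorum_threshold(total_votes)
    return [p for p in partitions if len(p) >= need]


if __name__ == "__main__":
    total = 11  # node IDs 1..11 appear in the quoted membership lines

    # Case seen in the logs: node 1 drops out, the other ten stay together.
    one_node_lost = [[1], list(range(2, 12))]
    print(partitions_with_quorum(one_node_lost, total))

    # Hypothetical case: the network splits so that no group reaches 6 votes.
    # Then no partition is quorate, and every HA-active node is at risk of
    # fencing itself, which would look like "the entire cluster reboots".
    no_majority = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11]]
    print(partitions_with_quorum(no_majority, total))
```

With 11 votes the threshold is 6, so a flapping network that briefly leaves the nodes in groups of five or fewer would leave no quorate partition at all. Whether that is what happened here cannot be told from the excerpt alone.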
 
