Server Randomly Rebooting

cwilling · Sep 21, 2023

Hi, I have a cluster of 3 servers, and the servers randomly reboot. Each server has 4 NVMe SSDs: 2 with ZFS for Proxmox, and 2 passed through via PCIe passthrough to a VM on that server. The VMs on the three servers together form a Ceph cluster. This is done because the Ceph cluster is also used for Kubernetes, which also runs on the servers.

Code:

CPU(s) 32 x AMD Ryzen 9 7950X3D 16-Core Processor (1 Socket)
Kernel Version Linux 6.2.16-12-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-12 (2023-09-04T13:21Z)
PVE Manager Version pve-manager/8.0.4

The servers are AX102 from Hetzner with 128 GB DDR memory and a 10 Gb Ethernet card.

I tried adding

Code:

pci=assign-busses apicmaintimer idle=poll reboot=cold,hard

to /etc/default/grub and /etc/kernel/cmdline, but this doesn't change anything, or maybe even makes it worse.

Running journalctl --list-boots on one server shows:

Code:

IDX BOOT ID                          FIRST ENTRY                  LAST ENTRY                
-34 8ba28233a65a4108be6853da8cc9fc18 Fri 2023-07-07 14:43:47 CEST Fri 2023-07-07 14:47:41 CEST
-33 46eb05c3a68048718c03929a27b9f715 Fri 2023-07-07 14:49:50 CEST Fri 2023-07-07 14:49:50 CEST
-32 2c7af644e33548c4ac4f59e419cf303a Fri 2023-07-07 14:50:37 CEST Fri 2023-07-07 14:50:43 CEST
-31 e77ad97bd0a44429a5cc4bf4f007866c Fri 2023-07-07 14:52:57 CEST Fri 2023-07-07 14:56:20 CEST
-30 92daa39bb06b4fda9eff10891bd84814 Fri 2023-07-07 14:57:18 CEST Fri 2023-07-07 15:16:03 CEST
-29 08dd71a1f2104a1eb990902d8ba91dcd Fri 2023-07-07 15:17:03 CEST Fri 2023-07-07 22:56:42 CEST
-28 e9d14dc2c4384311be565de68ecd3389 Fri 2023-07-07 22:57:49 CEST Wed 2023-07-12 14:46:52 CEST
-27 22f44e886921468096f7b2c14543e7dd Wed 2023-07-12 14:53:45 CEST Wed 2023-07-12 15:44:38 CEST
-26 8be4f551524a4160bda27bc41d2a0464 Wed 2023-07-12 15:48:32 CEST Thu 2023-07-13 20:33:41 CEST
-25 e1654a9d17ee47c6b56edd8db4b0de1f Thu 2023-07-13 20:59:00 CEST Fri 2023-07-14 11:47:39 CEST
-24 27748f1ac92b4f6b96ec30a4fb499b79 Fri 2023-07-14 11:57:27 CEST Wed 2023-07-26 23:31:40 CEST
-23 dc7cc3e7f2354d71b9b06930835f0038 Wed 2023-07-26 23:34:17 CEST Thu 2023-07-27 00:22:18 CEST
-22 fd2c91512b014d3994ae491f6634d591 Thu 2023-07-27 00:23:25 CEST Thu 2023-07-27 15:52:23 CEST
-21 716bec032a454e58bd45f96c4fa96654 Thu 2023-07-27 15:53:30 CEST Tue 2023-08-01 17:40:44 CEST
-20 5fe5bff403a147afbdf7a446a9fece7f Tue 2023-08-01 17:43:23 CEST Sun 2023-08-06 16:11:45 CEST
-19 a18faa2303d84c85aed8c5efa1315a41 Sun 2023-08-06 16:12:50 CEST Sun 2023-08-06 18:04:00 CEST
-18 a662b63ae9844cd2a92af589dc2a636c Sun 2023-08-06 18:05:06 CEST Tue 2023-09-05 15:45:11 CEST
-17 fbce1f6973524bb88001bd102d732cfb Tue 2023-09-05 15:48:00 CEST Wed 2023-09-13 11:19:50 CEST
-16 cf73f89b000941c9b9546660913da440 Wed 2023-09-13 11:23:12 CEST Tue 2023-09-19 14:58:30 CEST
-15 3dc642b2a1b841c8a9f35fe1c7012524 Tue 2023-09-19 15:00:01 CEST Tue 2023-09-19 15:16:42 CEST
-14 ccb7671420254fa98737a6c3fe515625 Tue 2023-09-19 15:18:08 CEST Tue 2023-09-19 17:44:29 CEST
-13 9b4b6dba847e4986a0332dce6fcd74f5 Tue 2023-09-19 17:45:58 CEST Tue 2023-09-19 18:56:44 CEST
-12 88240c120f8145f5ac8132cd3180ccc3 Tue 2023-09-19 18:58:08 CEST Tue 2023-09-19 19:35:05 CEST
-11 9c138c7edce94200830f1a255b9f81c6 Tue 2023-09-19 19:36:22 CEST Tue 2023-09-19 20:15:30 CEST
-10 05c2ea91bdd54ae4ba11e4e9f82f9e22 Tue 2023-09-19 20:17:08 CEST Wed 2023-09-20 08:28:01 CEST
 -9 f2df48cec75842e689ed7cae689012eb Wed 2023-09-20 08:29:37 CEST Wed 2023-09-20 08:34:55 CEST
 -8 50fbb4eb3ba44bd78f8b83d0e0181b28 Wed 2023-09-20 08:36:51 CEST Wed 2023-09-20 09:41:43 CEST
 -7 cb3d8c59270b45e0bf7d0f8304fb5310 Wed 2023-09-20 09:43:13 CEST Wed 2023-09-20 09:49:56 CEST
 -6 41b868ba4dac4f1f97b117774d190c21 Wed 2023-09-20 09:51:27 CEST Wed 2023-09-20 11:21:32 CEST
 -5 8d4cd516ff5b4f81b6f03e178aa8c6bf Wed 2023-09-20 11:22:45 CEST Wed 2023-09-20 11:31:31 CEST
 -4 45ba1c8903b74465b1d90aeefcdd28a2 Wed 2023-09-20 11:32:53 CEST Wed 2023-09-20 11:40:56 CEST
 -3 f4cf8631e26a47d2a65b784bcb658c61 Wed 2023-09-20 11:42:10 CEST Wed 2023-09-20 13:46:28 CEST
 -2 8efcf864fd114553912c92885327c1ed Wed 2023-09-20 13:47:39 CEST Wed 2023-09-20 15:36:08 CEST
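To see how long each boot lasted, the listing above can be reduced to one uptime per boot. The following is a minimal sketch, not part of the original post: the function names `parse_boots` and `uptimes` are made up for illustration, and it assumes the ten-column layout shown above. The time zone column is ignored, so a boot that spans a daylight saving change would be off by an hour.

```python
from datetime import datetime

def parse_boots(text):
    """Return (index, first_entry, last_entry) for each row of `journalctl --list-boots`."""
    boots = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have 10 fields: IDX, BOOT ID, then weekday/date/time/zone twice.
        # The header row splits into fewer fields and is skipped here.
        if len(parts) != 10 or not parts[0].lstrip("-").isdigit():
            continue
        first = datetime.strptime(f"{parts[3]} {parts[4]}", "%Y-%m-%d %H:%M:%S")
        last = datetime.strptime(f"{parts[7]} {parts[8]}", "%Y-%m-%d %H:%M:%S")
        boots.append((int(parts[0]), first, last))
    return boots

def uptimes(boots):
    """Map boot index to the time between its first and last journal entry."""
    return {idx: last - first for idx, first, last in boots}

# Three rows copied from the listing above.
sample = """\
IDX BOOT ID                          FIRST ENTRY                  LAST ENTRY
 -4 45ba1c8903b74465b1d90aeefcdd28a2 Wed 2023-09-20 11:32:53 CEST Wed 2023-09-20 11:40:56 CEST
 -3 f4cf8631e26a47d2a65b784bcb658c61 Wed 2023-09-20 11:42:10 CEST Wed 2023-09-20 13:46:28 CEST
 -2 8efcf864fd114553912c92885327c1ed Wed 2023-09-20 13:47:39 CEST Wed 2023-09-20 15:36:08 CEST
"""

for idx, up in uptimes(parse_boots(sample)).items():
    print(idx, up)
```

Running the same script on all three nodes and comparing the last-entry times would show whether the nodes really go down together.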

An example syslog from before and after a reboot is attached.

It also seems that multiple nodes fail at the same time or very shortly after one another.

Sometimes they run for a day, but mostly they fail every 2 to 4 hours. I have one VM with HA that sometimes doesn't restart after a node fails. It could be that multiple nodes failed, or that Ceph is down when 2 servers fail.

shanreich · Sep 22, 2023

Do you have HA activated? It seems like your network is having issues, and because of that Corosync is losing quorum. On nodes with HA resources, nodes that are in the non-quorate partition fence themselves [1]. You need to make sure the corosync network is stable or configure a second fallback link for your corosync network [2].

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing
[2] https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_redundancy

cwilling · Sep 23, 2023

Yes, there is one VM running with HA enabled.

The servers are all connected via their public IPv4 addresses. There is also a private VLAN for the VMs, but it runs over the same network card (this is the normal way on Hetzner servers without paying for an extra switch), so adding a second link via the VLAN would probably not help. Is there a way to see if the watchdog was triggered?

cwilling · Sep 26, 2023

We now have a separate 1 Gbit switch for Corosync, which doesn't have any problems, and the public IPs of the servers as the second link. Now the log looks like this:

Code:

Sep 26 10:17:02 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:17:02 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:17:24 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:17:32 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:17:45 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:17:59 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:18:02 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:18:27 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:18:27 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:19:00 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:19:27 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:19:38 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:19:40 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:19:55 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:20:16 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:20:16 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:20:32 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:20:58 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:20:59 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:21:42 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:21:59 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:21:59 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:22:14 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:22:19 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:22:34 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:22:34 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:22:46 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:22:47 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:23:07 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:23:12 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:23:16 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:23:20 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:23:26 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:23:35 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:23:38 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:23:54 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:23:54 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:24:09 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:24:50 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:24:50 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:25:20 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:25:32 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:25:47 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:25:49 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:25:59 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:26:07 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:26:10 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:26:21 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:26:21 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:26:34 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:26:44 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:26:46 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:26:55 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:27:12 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:27:12 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:27:36 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:27:38 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:27:50 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:28:18 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:28:18 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:28:42 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:28:48 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:28:51 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:29:02 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:29:06 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:29:21 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:29:39 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:29:39 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:30:02 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:30:26 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:30:26 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:30:43 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:30:53 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:30:53 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:31:03 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:31:15 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:31:18 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:31:35 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:31:35 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:32:04 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:32:07 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:32:16 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:32:16 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:32:31 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:32:31 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:32:40 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:32:44 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:32:49 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:33:08 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:33:11 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:33:17 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:33:23 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:33:32 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:33:43 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:34:02 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:34:14 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:34:15 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:34:32 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:34:36 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:34:56 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:35:00 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:35:19 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:35:40 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:35:49 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:35:53 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:36:15 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:36:33 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:36:35 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:36:47 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:36:59 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:37:44 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:37:48 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:37:56 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:37:58 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:38:15 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:38:15 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:38:59 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:39:09 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:39:15 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:39:25 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:39:25 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:39:34 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:39:48 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:40:04 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:40:05 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:40:14 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:40:14 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:40:30 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:40:30 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:40:40 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:40:50 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:40:50 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:41:04 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:41:06 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:41:15 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:41:15 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:41:24 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:41:24 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
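A log like the one above is easier to judge once the events are counted per host and link. This is a rough sketch under the assumption that every relevant line has exactly the `[KNET  ] link: host: N link: M is down` shape shown above; the helper name `count_link_down` is invented for this example.

```python
import re
from collections import Counter

# Matches the KNET "link down" lines in the format shown in the log above.
LINK_DOWN = re.compile(r"\[KNET\s*\] link: host: (\d+) link: (\d+) is down")

def count_link_down(log_text):
    """Count 'link is down' events per (host, link) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LINK_DOWN.search(line)
        if m:
            counts[(int(m.group(1)), int(m.group(2)))] += 1
    return counts

# First three lines of the log above.
sample = """\
Sep 26 10:17:02 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
Sep 26 10:17:02 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 26 10:17:24 server1 corosync[548844]:   [KNET  ] link: host: 3 link: 1 is down
"""

for (host, link), n in sorted(count_link_down(sample).items()):
    print(f"host {host} link {link}: {n} down events")
```

In the excerpt above every event is on link 1, so link 0 (the dedicated switch) does not appear at all.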

Is there any way to debug what's wrong with the Hetzner network, or maybe with my configuration? The solution with the extra switch works for now, but it becomes a problem if we want to scale to more servers, because they have to be physically next to the switch, so I want to fix this.

shanreich · Sep 26, 2023

Putting the Corosync network on a dedicated Switch with low latency is a good step, this should make the cluster stable hopefully. It probably still would be a good idea to add a second, local, network for redundancy - since running Corosync via a public network even as a backup link is not a good idea for several reasons.

As for your corosync logs: What does your corosync configuration look like? Which network is ring0 / which one is ring1?

Judging from your post it seems like you have the public network as link 1 - is that correct?
Based on that assumption and the logs, it says that link 1 is down. This would make sense, since you were having similar issues before with the same network (hence why using the public network as corosync link is not really a good idea). This is not necessarily due to the Hetzner network or your configuration having issues - Corosync just needs very low latency to work. If those latency requirements are not met, then Corosync will mark the link as down, even though you have a connection on that link.

Can you try pinging your nodes via the public IP and post the output? I suspect that the latency simply is too high.

cwilling · Sep 27, 2023

The single point of failure is now the switch. Link 1 is not a public network; it's a VLAN network that Hetzner provides across the whole data center, but it runs over the public NIC. It is still a private network.

A ping run during which some errors occurred looks like this.

The errors:

Code:

Sep 27 11:59:50 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 27 12:00:03 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 27 12:00:17 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 27 12:00:37 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 27 12:00:48 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 27 12:00:59 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 27 12:01:18 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 27 12:01:31 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 27 12:01:49 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 27 12:02:15 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down
Sep 27 12:02:50 server1 corosync[548844]:   [KNET  ] link: host: 2 link: 1 is down

The shortened ping result (pinged every 0.2 s):

Code:

....
[1695808511.869721] 64 bytes from 10.0.2.2: icmp_seq=885 ttl=64 time=0.776 ms
[1695808512.078984] 64 bytes from 10.0.2.2: icmp_seq=886 ttl=64 time=2.05 ms
[1695808512.277993] 64 bytes from 10.0.2.2: icmp_seq=887 ttl=64 time=0.785 ms
....
[1695808900.573937] 64 bytes from 10.0.2.2: icmp_seq=2785 ttl=64 time=0.998 ms
[1695808900.774050] 64 bytes from 10.0.2.2: icmp_seq=2786 ttl=64 time=0.860 ms
[1695808900.978019] 64 bytes from 10.0.2.2: icmp_seq=2787 ttl=64 time=1.08 ms
[1695808901.178270] 64 bytes from 10.0.2.2: icmp_seq=2788 ttl=64 time=1.01 ms
[1695808901.378352] 64 bytes from 10.0.2.2: icmp_seq=2789 ttl=64 time=0.931 ms
....
--- 10.0.2.2 ping statistics ---
3363 packets transmitted, 3280 received, 2.46803% packet loss, time 688093ms
rtt min/avg/max/mdev = 0.125/0.226/2.048/0.175 ms

So the network isn't perfect, but 2 ms max over more than 30 min isn't bad either, and it is far below the recommended maximum of 5 ms.
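Latency alone does not show whether the lost packets were spread out or came in bursts. One way to check is to look for gaps in `icmp_seq`. The sketch below assumes the full, unabridged ping output is available (the excerpt above is shortened with `....`, which would show up as false gaps). The function name `loss_bursts` and the sample lines are made up for illustration and are not taken from the actual run.

```python
import re

SEQ = re.compile(r"icmp_seq=(\d+)")

def loss_bursts(ping_output):
    """Return (first_missing_seq, missing_count) for every gap in icmp_seq."""
    bursts = []
    prev = None
    for line in ping_output.splitlines():
        m = SEQ.search(line)
        if not m:
            continue  # skip statistics and other non-reply lines
        seq = int(m.group(1))
        if prev is not None and seq > prev + 1:
            bursts.append((prev + 1, seq - prev - 1))
        prev = seq
    return bursts

# Invented sample: replies 3 and 4 are missing.
sample = """\
64 bytes from 10.0.2.2: icmp_seq=1 ttl=64 time=0.8 ms
64 bytes from 10.0.2.2: icmp_seq=2 ttl=64 time=0.9 ms
64 bytes from 10.0.2.2: icmp_seq=5 ttl=64 time=1.0 ms
64 bytes from 10.0.2.2: icmp_seq=6 ttl=64 time=0.9 ms
"""

for start, count in loss_bursts(sample):
    print(f"{count} packets lost starting at icmp_seq={start}")
```

Many short gaps point to steady background loss, while a few long gaps would line up with the "link is down" messages.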

VictorSTS · Sep 27, 2023

That 2.4% packet loss is more than enough for Hetzner to check it. Anything over 0% packet loss is bad and anything over 0.5% is definitely noticeable. There's either something misbehaving in their network or the servers' NICs, or you are overloading that NIC.

If that 2.4% packet loss happens in 83 consecutive packets, HA will trigger a fence for sure.
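The figure of 83 packets follows directly from the ping statistics posted earlier (3363 transmitted, 3280 received, one ping every 0.2 s). A small worked calculation, with `loss_summary` as a made-up helper name:

```python
def loss_summary(transmitted, received, interval_s):
    """Return (lost packets, loss in percent, outage in seconds if all losses were consecutive)."""
    lost = transmitted - received
    loss_pct = 100 * lost / transmitted
    worst_case_outage_s = lost * interval_s
    return lost, loss_pct, worst_case_outage_s

# Values taken from the ping statistics in the thread.
lost, loss_pct, outage = loss_summary(transmitted=3363, received=3280, interval_s=0.2)
print(f"lost={lost} loss={loss_pct:.3f}% worst-case outage={outage:.1f}s")
```

So if all 83 losses happened back to back, the link would have been silent for about 16.6 seconds.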

cwilling · Sep 27, 2023

Is there a way to test if the NIC is overloaded? Hetzner said that they can't be sure whether the problem is on their side because they don't officially support Proxmox, and it didn't help that I pointed out that they support Debian and Proxmox is based on that.

VictorSTS · Sep 28, 2023

Well, packet loss is OS independent.

I mean, monitor the network for packet loss with something like smokeping or similar tools so you can be sure whether or not there is packet loss. You can use influx+grafana+Proxmox metric server to easily monitor your NIC's capacity (among other things). Then report back to Hetzner if needed.

If there's packet loss, you have only one corosync link, and you are using HA, your cluster will be unstable. In fact, two corosync links are a must if you require HA, to avoid a global fencing due to a network issue.

cwilling · Sep 28, 2023

So shouldn't ip -s link be a good indicator?

Code:

root@server1:~# ip -s link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    RX:  bytes packets errors dropped  missed   mcast
      27023229   87512      0       0       0       0
    TX:  bytes packets errors dropped carrier collsns
      27023229   87512      0       0       0       0
2: enp9s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 9c:6b:00:0c:00:e2 brd ff:ff:ff:ff:ff:ff
    RX:  bytes  packets errors dropped  missed   mcast
    4781075016 23098037      0       0       0       0
    TX:  bytes  packets errors dropped carrier collsns
    4719863838 23609317      0       0       0       0
3: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether 6c:b3:11:09:c2:40 brd ff:ff:ff:ff:ff:ff
    RX:    bytes   packets errors dropped  missed   mcast
    113515514983 187444638      0       0       0    1717
    TX:    bytes   packets errors dropped carrier collsns
    119364215450 192485301      0       0       0       0
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 6c:b3:11:09:c2:40 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
     993112216  870313      0       0       0       2
    TX:  bytes packets errors dropped carrier collsns
     197705968  470845      0       0       0       0
5: enp2s0.4000@enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master vmbr1 state UP mode DEFAULT group default qlen 1000
    link/ether 6c:b3:11:09:c2:40 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
       4853987   72752      0       0       0     133
    TX:  bytes packets errors dropped carrier collsns
       1287428   10298      0       0       0       0
6: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 6c:b3:11:09:c2:40 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
       1104912   21405      0       0       0    2122
    TX:  bytes packets errors dropped carrier collsns
          6286      85      0       0       0       0
7: enp2s0.4001@enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master vmbr2 state UP mode DEFAULT group default qlen 1000
    link/ether 6c:b3:11:09:c2:40 brd ff:ff:ff:ff:ff:ff
    RX:    bytes   packets errors dropped  missed   mcast
    107165458126 135805723      0       0       0    1431
    TX:    bytes   packets errors dropped carrier collsns
    115568855226 138832540      0       0       0       0
8: vmbr2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 6c:b3:11:09:c2:40 brd ff:ff:ff:ff:ff:ff
    RX:   bytes  packets errors dropped  missed   mcast
     9400070074 13615130      0       0       0  240414
    TX:   bytes  packets errors dropped carrier collsns
    52117438224 13608105      0       0       0       0
18: enp2s0.4002@enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 6c:b3:11:09:c2:40 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
     141001566 1101918      0       0       0     151
    TX:  bytes packets errors dropped carrier collsns
     157315844 1100864      0       0       0       0
40: tap102i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1400 qdisc pfifo_fast master vmbr2 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether da:b0:1b:b5:c9:82 brd ff:ff:ff:ff:ff:ff
    RX:   bytes  packets errors dropped  missed   mcast
    36485624700 44907855      0       0       0       0
    TX:   bytes  packets errors dropped carrier collsns
    54745571734 44489345      0       0       0       0
41: tap103i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1400 qdisc pfifo_fast master vmbr2 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 46:97:05:2d:a3:30 brd ff:ff:ff:ff:ff:ff
    RX:   bytes  packets errors dropped  missed   mcast
    13637358998 38990749      0       0       0       0
    TX:   bytes  packets errors dropped carrier collsns
    13311808865 40459963      0       0       0       0
42: tap104i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1400 qdisc pfifo_fast master vmbr2 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 9e:a6:e5:f3:f5:ac brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
    4051868748 6924183      0       0       0       0
    TX:  bytes packets errors dropped carrier collsns
    5695620759 5059091      0       0       0       0
43: tap100i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1400 qdisc pfifo_fast master vmbr2 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether de:e9:6e:66:ce:d4 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
     491178337 1475443      0       0       0       0
    TX:  bytes packets errors dropped carrier collsns
    2645969829 2100567      0       0       0       0
44: tap110i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1400 qdisc pfifo_fast master vmbr2 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 4a:da:46:62:5e:a5 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
     146535972 2139473      0       0       0       0
    TX:  bytes packets errors dropped carrier collsns
      74232118 1009472      0       0       0       0
45: tap110i1: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1400 qdisc pfifo_fast master fwbr110i1 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 4e:43:c0:af:ca:b0 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
        682839    4607      0       0       0       0
    TX:  bytes packets errors dropped carrier collsns
       1756443   17275      0       0       0       0
46: fwbr110i1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 3e:1e:86:55:bc:ee brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
        601056   13039      0       0       0     154
    TX:  bytes packets errors dropped carrier collsns
             0       0      0       0       0       0
47: fwpr110p1@fwln110i1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master vmbr1 state UP mode DEFAULT group default qlen 1000
    link/ether 46:24:fa:57:7b:ad brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
        682489    4604      0       0       0       0
    TX:  bytes packets errors dropped carrier collsns
       3724801   50276      0       0       0       0
48: fwln110i1@fwpr110p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master fwbr110i1 state UP mode DEFAULT group default qlen 1000
    link/ether c6:ed:c5:ec:51:77 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
       3724801   50276      0       0       0       0
    TX:  bytes packets errors dropped carrier collsns
        682489    4604      0       0       0       0

There isn't a single dropped packet on any interface since the last boot of the system (if I read this right), yet all the pings that showed packet loss were run after the last boot. So does that mean the packets got lost in the network outside of the servers?
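Checking the counters by eye across many interfaces is error-prone, so the output can be parsed instead. The following is a sketch that assumes the text layout of `ip -s link` shown above (an interface line, then `RX:`/`TX:` header lines each followed by a line of numbers); `parse_ip_s_link` and `interfaces_with_drops` are names invented for this example. Note that these counters only cover what the local kernel and NIC saw, so packets lost further along the path never appear here.

```python
import re

# Interface lines look like "2: enp9s0: <BROADCAST,...> mtu 1500 ..."
IFACE = re.compile(r"^\d+:\s+([^:\s]+):")

def parse_ip_s_link(text):
    """Parse `ip -s link` text output into {iface: {"rx_bytes": ..., "tx_dropped": ...}}."""
    stats, iface, header = {}, None, None
    for line in text.splitlines():
        m = IFACE.match(line)
        if m:
            iface, header = m.group(1), None
            stats[iface] = {}
            continue
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("RX:", "TX:"):
            header = fields  # column names for the next line of numbers
        elif header and iface and fields[0].isdigit():
            direction = header[0].rstrip(":").lower()
            for name, value in zip(header[1:], fields):
                stats[iface][f"{direction}_{name}"] = int(value)
            header = None
    return stats

def interfaces_with_drops(stats):
    """Return only interfaces that have a non-zero error, drop or missed counter."""
    keys = ("rx_errors", "rx_dropped", "rx_missed", "tx_errors", "tx_dropped")
    return {name: {k: s[k] for k in keys if s.get(k)}
            for name, s in stats.items() if any(s.get(k) for k in keys)}

# One interface copied from the output above.
sample = """\
2: enp9s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 9c:6b:00:0c:00:e2 brd ff:ff:ff:ff:ff:ff
    RX:  bytes  packets errors dropped  missed   mcast
    4781075016 23098037      0       0       0       0
    TX:  bytes  packets errors dropped carrier collsns
    4719863838 23609317      0       0       0       0
"""

print(interfaces_with_drops(parse_ip_s_link(sample)))
```

An empty result on every node, combined with ping loss between the nodes, is consistent with the loss happening somewhere between the servers rather than on them, although it does not prove it.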

shanreich · Sep 28, 2023

cwilling said:
is there a way to test if the nic is overloaded?

You could use iftop to check the current utilization of network interfaces.

It's quite likely there is something wrong with the network in between, as @VictorSTS mentioned. 2.5% packet loss in a local network is not acceptable at all.
