Hi, I have a cluster of 3 servers, and the servers randomly reboot. Each server has 4 NVMe SSDs: 2 with ZFS for Proxmox, and 2 passed through via PCIe passthrough to a VM on that server. Together these VMs build a Ceph cluster. It is set up this way because the Ceph cluster is also used for Kubernetes, which also runs on the servers.
Code:
CPU(s) 32 x AMD Ryzen 9 7950X3D 16-Core Processor (1 Socket)
Kernel Version Linux 6.2.16-12-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-12 (2023-09-04T13:21Z)
PVE Manager Version pve-manager/8.0.4
The servers are AX102 machines from Hetzner with 128 GB of DDR memory and a 10 Gb Ethernet card.
I tried adding
Code:
pci=assign-busses apicmaintimer idle=poll reboot=cold,hard
to /etc/default/grub and /etc/kernel/cmdline, but this doesn't seem to work: it doesn't change anything, or maybe even makes it worse.
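One way to confirm whether these parameters actually reached the kernel is to compare them with the running kernel's command line after a reboot. The following is only a minimal sketch, assuming a Linux host where /proc/cmdline is readable; the helper name missing_params is made up for illustration.

```python
from pathlib import Path

# Parameters from the post that should appear on the kernel command line.
WANTED = ["pci=assign-busses", "apicmaintimer", "idle=poll", "reboot=cold,hard"]


def missing_params(cmdline: str, wanted: list[str]) -> list[str]:
    """Return the wanted parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [p for p in wanted if p not in present]


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    if proc.exists():
        # /proc/cmdline holds the command line of the currently running kernel.
        running = proc.read_text()
        print("missing:", missing_params(running, WANTED) or "none")
```

If anything is reported as missing, the edited file is probably not the one the active bootloader reads, or the change was not applied yet. On Proxmox, changes to /etc/default/grub are applied with update-grub, and changes to /etc/kernel/cmdline with proxmox-boot-tool refresh.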
Running journalctl --list-boots on one server shows:
Code:
IDX BOOT ID FIRST ENTRY LAST ENTRY
-34 8ba28233a65a4108be6853da8cc9fc18 Fri 2023-07-07 14:43:47 CEST Fri 2023-07-07 14:47:41 CEST
-33 46eb05c3a68048718c03929a27b9f715 Fri 2023-07-07 14:49:50 CEST Fri 2023-07-07 14:49:50 CEST
-32 2c7af644e33548c4ac4f59e419cf303a Fri 2023-07-07 14:50:37 CEST Fri 2023-07-07 14:50:43 CEST
-31 e77ad97bd0a44429a5cc4bf4f007866c Fri 2023-07-07 14:52:57 CEST Fri 2023-07-07 14:56:20 CEST
-30 92daa39bb06b4fda9eff10891bd84814 Fri 2023-07-07 14:57:18 CEST Fri 2023-07-07 15:16:03 CEST
-29 08dd71a1f2104a1eb990902d8ba91dcd Fri 2023-07-07 15:17:03 CEST Fri 2023-07-07 22:56:42 CEST
-28 e9d14dc2c4384311be565de68ecd3389 Fri 2023-07-07 22:57:49 CEST Wed 2023-07-12 14:46:52 CEST
-27 22f44e886921468096f7b2c14543e7dd Wed 2023-07-12 14:53:45 CEST Wed 2023-07-12 15:44:38 CEST
-26 8be4f551524a4160bda27bc41d2a0464 Wed 2023-07-12 15:48:32 CEST Thu 2023-07-13 20:33:41 CEST
-25 e1654a9d17ee47c6b56edd8db4b0de1f Thu 2023-07-13 20:59:00 CEST Fri 2023-07-14 11:47:39 CEST
-24 27748f1ac92b4f6b96ec30a4fb499b79 Fri 2023-07-14 11:57:27 CEST Wed 2023-07-26 23:31:40 CEST
-23 dc7cc3e7f2354d71b9b06930835f0038 Wed 2023-07-26 23:34:17 CEST Thu 2023-07-27 00:22:18 CEST
-22 fd2c91512b014d3994ae491f6634d591 Thu 2023-07-27 00:23:25 CEST Thu 2023-07-27 15:52:23 CEST
-21 716bec032a454e58bd45f96c4fa96654 Thu 2023-07-27 15:53:30 CEST Tue 2023-08-01 17:40:44 CEST
-20 5fe5bff403a147afbdf7a446a9fece7f Tue 2023-08-01 17:43:23 CEST Sun 2023-08-06 16:11:45 CEST
-19 a18faa2303d84c85aed8c5efa1315a41 Sun 2023-08-06 16:12:50 CEST Sun 2023-08-06 18:04:00 CEST
-18 a662b63ae9844cd2a92af589dc2a636c Sun 2023-08-06 18:05:06 CEST Tue 2023-09-05 15:45:11 CEST
-17 fbce1f6973524bb88001bd102d732cfb Tue 2023-09-05 15:48:00 CEST Wed 2023-09-13 11:19:50 CEST
-16 cf73f89b000941c9b9546660913da440 Wed 2023-09-13 11:23:12 CEST Tue 2023-09-19 14:58:30 CEST
-15 3dc642b2a1b841c8a9f35fe1c7012524 Tue 2023-09-19 15:00:01 CEST Tue 2023-09-19 15:16:42 CEST
-14 ccb7671420254fa98737a6c3fe515625 Tue 2023-09-19 15:18:08 CEST Tue 2023-09-19 17:44:29 CEST
-13 9b4b6dba847e4986a0332dce6fcd74f5 Tue 2023-09-19 17:45:58 CEST Tue 2023-09-19 18:56:44 CEST
-12 88240c120f8145f5ac8132cd3180ccc3 Tue 2023-09-19 18:58:08 CEST Tue 2023-09-19 19:35:05 CEST
-11 9c138c7edce94200830f1a255b9f81c6 Tue 2023-09-19 19:36:22 CEST Tue 2023-09-19 20:15:30 CEST
-10 05c2ea91bdd54ae4ba11e4e9f82f9e22 Tue 2023-09-19 20:17:08 CEST Wed 2023-09-20 08:28:01 CEST
-9 f2df48cec75842e689ed7cae689012eb Wed 2023-09-20 08:29:37 CEST Wed 2023-09-20 08:34:55 CEST
-8 50fbb4eb3ba44bd78f8b83d0e0181b28 Wed 2023-09-20 08:36:51 CEST Wed 2023-09-20 09:41:43 CEST
-7 cb3d8c59270b45e0bf7d0f8304fb5310 Wed 2023-09-20 09:43:13 CEST Wed 2023-09-20 09:49:56 CEST
-6 41b868ba4dac4f1f97b117774d190c21 Wed 2023-09-20 09:51:27 CEST Wed 2023-09-20 11:21:32 CEST
-5 8d4cd516ff5b4f81b6f03e178aa8c6bf Wed 2023-09-20 11:22:45 CEST Wed 2023-09-20 11:31:31 CEST
-4 45ba1c8903b74465b1d90aeefcdd28a2 Wed 2023-09-20 11:32:53 CEST Wed 2023-09-20 11:40:56 CEST
-3 f4cf8631e26a47d2a65b784bcb658c61 Wed 2023-09-20 11:42:10 CEST Wed 2023-09-20 13:46:28 CEST
-2 8efcf864fd114553912c92885327c1ed Wed 2023-09-20 13:47:39 CEST Wed 2023-09-20 15:36:08 CEST
An example syslog from before and after a reboot is added as an attachment.
It also seems that multiple nodes fail at the same time, or very shortly after one another.
Sometimes they run for a day, but mostly they fail every 2 to 4 hours. I have one VM with HA that sometimes doesn't restart after a node fails. This could be because multiple nodes failed, or because Ceph is down when 2 servers fail.
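To put numbers on how long each boot lasted, the journalctl --list-boots output above can be turned into uptimes per boot. This is a rough sketch, assuming the column layout shown above (index, boot ID, then weekday, date, time and zone for the first and last entry) and that both timestamps of a row are in the same time zone; the function name boot_uptimes is hypothetical. Note that a short uptime can also come from a deliberate reboot, so the result only narrows things down.

```python
from datetime import datetime

# A few rows copied from the journalctl --list-boots output above.
SAMPLE = """\
-4 45ba1c8903b74465b1d90aeefcdd28a2 Wed 2023-09-20 11:32:53 CEST Wed 2023-09-20 11:40:56 CEST
-3 f4cf8631e26a47d2a65b784bcb658c61 Wed 2023-09-20 11:42:10 CEST Wed 2023-09-20 13:46:28 CEST
-2 8efcf864fd114553912c92885327c1ed Wed 2023-09-20 13:47:39 CEST Wed 2023-09-20 15:36:08 CEST
"""

FMT = "%Y-%m-%d %H:%M:%S"


def boot_uptimes(text: str) -> list[tuple[int, float]]:
    """Return (boot index, uptime in minutes) for each data row."""
    result = []
    for line in text.splitlines():
        parts = line.split()
        # Expected: IDX, BOOT ID, then weekday/date/time/zone twice (10 fields).
        if len(parts) != 10 or not parts[0].lstrip("-").isdigit():
            continue  # skips the header row and anything unexpected
        # The zone name is ignored (assumption: same zone within one row).
        first = datetime.strptime(f"{parts[3]} {parts[4]}", FMT)
        last = datetime.strptime(f"{parts[7]} {parts[8]}", FMT)
        result.append((int(parts[0]), (last - first).total_seconds() / 60))
    return result


if __name__ == "__main__":
    for idx, minutes in boot_uptimes(SAMPLE):
        print(f"boot {idx}: up for {minutes:.0f} min")
```

Running the same function on the output of all three nodes and lining up the LAST ENTRY times would show whether the nodes really go down together.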