My cluster has 38 nodes running Ceph. Yesterday I added the 39th node and the whole cluster died!
I don't have HA enabled, so no node should have rebooted, but two of my nodes rebooted anyway, and the cluster split into separate quorum partitions, e.g. nodes 1, 3, 5, 7 on one side and nodes 2, 4, 6, 8 on the other. A real disaster!
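For context, corosync only keeps a partition quorate while it holds a strict majority of the cluster's votes (floor(n/2) + 1, assuming the default of one vote per node, no qdevice and no weighted votes), so a split into small groups like 1/3/5/7 vs 2/4/6/8 leaves every side without quorum. A quick sanity check of the numbers:

```shell
# Majority needed for corosync quorum, assuming one vote per node
# (no qdevice, no weighted votes).
nodes=39
quorum=$(( nodes / 2 + 1 ))
echo "nodes=${nodes} quorum=${quorum}"   # nodes=39 quorum=20
```

A partition of only a handful of nodes can never reach 20 votes, which matches the behaviour you saw.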
How I resolved it:
1. Rebooting all nodes did not work.
2. Shutting down all 39 nodes and then starting three nodes one by one did work!
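The cold-start recovery above can be sketched roughly as follows. This is a hedged sketch, not an exact transcript of what was run; it assumes a Proxmox VE 6.x cluster and only standard commands. It is meant to be run on the cluster nodes themselves, so no expected output is shown.

```shell
# 1. On every node: stop the cluster stack cleanly before powering off.
#    (The HA services are only relevant if HA is in use.)
systemctl stop pve-ha-lrm pve-ha-crm
systemctl stop pve-cluster corosync

# 2. Power all nodes off, then bring a few nodes up one at a time,
#    waiting for each to join before starting the next. After each node
#    boots, verify membership and quorum state:
pvecm status

# 3. Only if you must administer a single node while the rest stay down,
#    expected votes can be lowered temporarily (use with care; it is
#    reset when corosync restarts with the full membership):
# pvecm expected 1
```

Starting nodes one by one keeps corosync membership changes small, which avoids the flood of join/leave events that a simultaneous mass boot can trigger.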
Log from one of the rebooted nodes (last shows the reboot happening around 00:11-00:17):
Code:
root@g8kvm02:/var/log# last | grep -i boot
reboot system boot 5.4.55-1-pve Mon Sep 7 00:17 still running
reboot system boot 5.4.55-1-pve Mon Aug 31 10:43 still running
reboot system boot 5.4.34-1-pve Wed Aug 19 20:27 - 10:40 (11+14:12)
reboot system boot 5.4.34-1-pve Wed Aug 19 17:45 - 10:40 (11+16:54)
reboot system boot 5.4.34-1-pve Tue Aug 18 12:37 - 10:40 (12+22:02)
reboot system boot 5.4.34-1-pve Tue Aug 18 12:13 - 10:40 (12+22:26)
reboot system boot 5.4.34-1-pve Tue Aug 18 09:56 - 12:05 (02:09)
reboot system boot 5.4.34-1-pve Tue Aug 18 00:37 - 12:05 (11:27)
reboot system boot 5.4.34-1-pve Mon Aug 17 19:32 - 12:05 (16:32)
reboot system boot 5.4.34-1-pve Mon Aug 17 19:14 - 19:29 (00:14)
reboot system boot 5.4.34-1-pve Mon Aug 17 19:00 - 19:29 (00:28)
reboot system boot 5.4.34-1-pve Fri Aug 7 14:56 - 19:29 (10+04:33)
reboot system boot 5.4.34-1-pve Wed Jun 24 23:54 - 14:51 (43+14:57)
reboot system boot 5.4.34-1-pve Thu Jun 25 07:10 - 23:49 (-7:20)
reboot log
Code:
Sep 7 00:12:10 g8kvm02 ceph-osd[2292]: 2020-09-07 00:12:10.885 7fa29b6fa700 -1 osd.8 96815 heartbeat_check: no reply from 10.0.141.36:6851 osd.302 since back 2020-09-07 00:10:45.448456 front 2020-09-07 00:10:52.950887 (oldest deadline 2020-09-07 00:11:07.747487)
Sep 7 00:12:10 g8kvm02 ceph-osd[2292]: 2020-09-07 00:12:10.885 7fa29b6fa700 -1 osd.8 96815 heartbeat_check: no reply from 10.0.141.36:6838 osd.303 since back 2020-09-07 00:10:52.950433 front 2020-09-07 00:10:52.950764 (oldest deadline 2020-09-07 00:11:15.249640)
Sep 7 00:12:10 g8kvm02 ceph-osd[2292]: 2020-09-07 00:12:10.885 7fa29b6fa700 -1 osd.8 96815 heartbeat_check: no reply from 10.0.141.37:6850 osd.305 since back 2020-09-07 00:10:52.950589 front 2020-09-07 00:10:52.950805 (oldest deadline 2020-09-07 00:11:15.249640)
Sep 7 00:12:10 g8kvm02 ceph-osd[2292]: 2020-09-07 00:12:10.885 7fa29b6fa700 -1 osd.8 96815 heartbeat_check: no reply from 10.0.141.37:6810 osd.307 since back 2020-09-07 00:10:52.950811 front 2020-09-07 00:10:52.950563 (oldest deadline 2020-09-07 00:11:15.249640)
Sep 7 00:12:10 g8kvm02 ceph-osd[2292]: 2020-09-07 00:12:10.885 7fa29b6fa700 -1 osd.8 96815 heartbeat_check: no reply from 10.0.141.38:6835 osd.313 since back 2020-09-07 00:10:39.646821 front 2020-09-07 00:10:39.647357 (oldest deadline 2020-09-07 00:11:04.346043)
Sep 7 00:12:10 g8kvm02 ceph-osd[2292]: 2020-09-07 00:12:10.885 7fa29b6fa700 -1 osd.8 96815 heartbeat_check: no reply from 10.0.141.38:6868 osd.314 since back 2020-09-07 00:10:39.647466 front 2020-09-07 00:10:39.646843 (oldest deadline 2020-09-07 00:11:04.346043)
Sep 7 00:12:10 g8kvm02 ceph-osd[2292]: 2020-09-07 00:12:10.885 7fa29b6fa700 -1 osd.8 96815 heartbeat_check: no reply from 10.0.141.38:6814 osd.319 since back 2020-09-07 00:10:39.647415 front 2020-09-07 00:10:39.647523 (oldest deadline 2020-09-07 00:11:04.346043)
Sep 7 00:12:10 g8kvm02 ceph-osd[2292]: 2020-09-07 00:12:10.885 7fa29b6fa700 -1 osd.8 96815 heartbeat_check: no reply from 10.0.141.38:6854 osd.320 since back 2020-09-07 00:10:39.647401 front 2020-09-07 00:10:39.647537 (oldest deadline 2020-09-07 00:11:04.346043)
Sep 7 00:17:12 g8kvm02 dmeventd[755]: dmeventd ready for processing.
Sep 7 00:17:12 g8kvm02 lvm[748]: 1 logical volume(s) in volume group "ceph-e6fd3e3c-8853-4dd2-9e0e-6399af0ba30b" monitored
Sep 7 00:17:12 g8kvm02 kernel: [ 0.000000] Linux version 5.4.55-1-pve (root@nora) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.55-1 (Mon, 10 Aug 2020 10:26:27 +0200) ()
Sep 7 00:17:12 g8kvm02 lvm[748]: 1 logical volume(s) in volume group "ceph-d07a5580-6e25-4c92-8a4a-76e949751b87" monitored
Sep 7 00:17:12 g8kvm02 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.55-1-pve root=/dev/mapper/pve-root ro quiet nmi_watchdog=0
Sep 7 00:17:12 g8kvm02 lvm[748]: 1 logical volume(s) in volume group "ceph-c732bd53-0f25-40c7-b726-fb67866d8176" monitored
Sep 7 00:17:12 g8kvm02 kernel: [ 0.000000] KERNEL supported cpus:
Sep 7 00:17:12 g8kvm02 kernel: [ 0.000000] Intel GenuineIntel
Sep 7 00:17:12 g8kvm02 kernel: [ 0.000000] AMD AuthenticAMD
Sep 7 00:17:12 g8kvm02 lvm[748]: 1 logical volume(s) in volume group "ceph-9e42c888-294f-4c48-a575-285e72ee4114" monitored
Sep 7 00:17:12 g8kvm02 kernel: [ 0.000000] Hygon HygonGenuine
Sep 7 00:17:12 g8kvm02 kernel: [ 0.000000] Centaur CentaurHauls
Sep 7 00:17:12 g8kvm02 kernel: [ 0.000000] zhaoxin Shanghai
Sep 7 00:17:12 g8kvm02 lvm[748]: 1 logical volume(s) in volume group "ceph-bfc41a84-297e-42f0-8170-0b2dc3d9f2cf" monitored
Sep 7 00:17:12 g8kvm02 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Sep 7 00:17:12 g8kvm02 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Sep 7 00:17:12 g8kvm02 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Sep 7 00:17:12 g8kvm02 kernel: [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Sep 7 00:17:12 g8kvm02 lvm[748]: 1 logical volume(s) in volume group "ceph-e908d34c-5cae-4dfa-9e1c-33a5c3925c73" monitored