I have a brand new, completely clean 6.2 installation that was completed on Friday. No VMs yet, 4 identical machines (hhcs000[1-4]). On two of the three monitors, hhcs0002 and hhcs0003, I am getting a lot of these in the logs, while hhcs0001 seems fine:
Code:
mon.hhsc0002@1(peon) e5 get_health_metrics reporting 4 slow ops, oldest is mon_command({"prefix":"pg dump","dumpcontents":["osds"],"format":"json"} v 0)
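I think the pending ops themselves can be dumped from the affected monitor's admin socket, something along these lines on the node in question (daemon name adjusted to whichever local monitor is reporting):
Code:
# dump the in-flight/slow ops on the local monitor
ceph daemon mon.hhcs0002 ops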
If I look at the health, it shows that the monitors are crashing:
Code:
# ceph health detail
HEALTH_WARN 5 daemons have recently crashed
RECENT_CRASH 5 daemons have recently crashed
mon.hhcs0003 crashed on host hhcs0003 at 2020-09-25 23:58:24.642998Z
mon.hhsc0002 crashed on host hhsc0002 at 2020-09-25 23:58:39.260149Z
mon.hhsc0002 crashed on host hhsc0002 at 2020-09-26 00:17:37.328757Z
mon.hhcs0003 crashed on host hhcs0003 at 2020-09-26 00:18:18.605781Z
mon.hhcs0003 crashed on host hhcs0003 at 2020-09-26 00:38:00.457194Z
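For reference, I believe the individual crash reports can be inspected with the crash module, roughly like this (where <crash-id> is whatever ID `ceph crash ls` returns, not a real value here):
Code:
# list the recorded crashes, then pull the full metadata/backtrace for one
ceph crash ls
ceph crash info <crash-id>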
I originally had four monitors, but after reading these forums I destroyed one on Friday during troubleshooting (hhcs0004 still shows up in the GUI and likely needs to be addressed; see the cleanup sketch after the tree below). The OSD tree only displays sporadically in the web GUI, and that doesn't appear to be related to which machine I logged in to or am using to display it at the time. The OSD tree looks good as far as I can tell, and the command returns immediately on all 4 systems. The DBs are stored on the OSD disks.
Code:
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 1.74640 root default
-3 0.43660 host hhcs0001
0 ssd 0.43660 osd.0 up 1.00000 1.00000
-7 0.43660 host hhcs0003
2 ssd 0.43660 osd.2 up 1.00000 1.00000
-9 0.43660 host hhcs0004
3 ssd 0.43660 osd.3 up 1.00000 1.00000
-5 0.43660 host hhsc0002
1 ssd 0.43660 osd.1 up 1.00000 1.00000
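For what it's worth, my plan for the leftover hhcs0004 monitor entry, once the crashing is understood, is roughly this (not run yet, and I'm not sure which of the two the GUI remnant actually needs):
Code:
# remove the stale monitor from the monmap
ceph mon remove hhcs0004
# and/or have Proxmox clean up its side of it
pveceph mon destroy hhcs0004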
Digging deeper into the logs shows: https://pastebin.com/sUnQuZX
The hardware and configuration are identical for each machine:
- CPU: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
- Memory: 256 GB
- HDD:
- 2x Intel SSD DC 3510 120GB in ZFS RAID 1 (via the installer) for the OS
- 1x Intel® SSD D3-S4510 480GB SSD used for OSD
- 2x Intel X710 10Gb network cards
- bond0 802.3ad for frontend network, used for vmbr0
- bond1 802.3ad for backend network, Ceph/Corosync (will likely split Corosync out to a shared port with the BMC); rough bond config below the list
- Baseboard: Supermicro X11DPT-PS
- /etc/pve/ceph.conf (I did notice, just now, that the public network and the cluster network are the same in ceph.conf, and that's not accurate, but I don't think it would cause what I'm seeing. I will, however, wait to change that until I get this sorted or it's identified as a possible cause; a sketch of the split I expect to end up with is below.)
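When I do split them, I'd expect the relevant part of ceph.conf to end up roughly like this; the subnets below are examples rather than my actual ranges:
Code:
[global]
     cluster_network = 10.10.10.0/24
     public_network = 192.168.1.0/24
And, in case it's relevant, the bonds are plain 802.3ad in /etc/network/interfaces, roughly like this (interface names and the address are generalized, not copied verbatim from my nodes):
Code:
auto bond1
iface bond1 inet static
        address 10.10.10.2/24
        bond-slaves enp94s0f0 enp94s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
#backend network for Ceph/Corosync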