Ceph OSD down intermittently since PVE6/Nautilus upgrade

SourCheeks

Sep 25, 2019
I have a small 3-node PVE cluster running Ceph that I recently upgraded to PVE 6 and Ceph Nautilus. Since the upgrade, one OSD on one of the nodes keeps going down. The node it runs on is also experiencing intermittent kernel panics (not sure if related).

Code:
pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-3-pve)
pve-manager: 6.0-9 (running version: 6.0-9/508dcee0)
pve-kernel-5.0: 6.0-9
pve-kernel-helper: 6.0-9
pve-kernel-4.15: 5.4-8
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.4-pve1
ceph-fuse: 14.2.4-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-8
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-7
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.1-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-9
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve1

Could someone give me some direction on troubleshooting this? Would very much appreciate any help.
 
"The node that it is on is also experiencing intermittent kernel panics (not sure if related)."
What are the panics? And what do the journal/syslog and the Ceph logs show?
 
Okay, bear with me if this sounds extremely basic, but how do I interpret the kernel panic screen dump? Also, how can I capture the text? When that node has a kernel panic, the machine locks up and needs to be reset, so I'm not sure how to save the text to analyze later.

Examining the syslog and Ceph logs, I found the entries related to the OSD that is down. The export is attached. The OSD is 'osd10' on node 'pve6700'.
 
Attachment: syslog.txt (33.3 KB)

Would it be safe to destroy and recreate the OSD in question? If so, could you point me to a procedure to do so?
 
On the CLI, first mark the OSD out, if it isn't out already:
Code:
ceph osd out osd.<id>

Then stop the service and destroy the OSD:
Code:
systemctl stop ceph-osd@<id>.service
pveceph osd destroy <id>

After that, create the OSD again. But check first that the disk is still healthy.
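
The sequence above can be sketched as a small helper that only prints the commands for review instead of running them, since destroying an OSD is not reversible. This is a rough sketch under assumptions: the function name `osd_replace_plan`, the OSD id `10`, and the device `/dev/sdX` are placeholders, `smartctl -H` stands in for the disk health check, and `pveceph osd create` is assumed to be the PVE 6 command for creating the OSD again.

```shell
#!/bin/sh
# Print (do not run) the replacement sequence for one OSD.
# Arguments are placeholders: $1 = numeric OSD id, $2 = backing disk.
osd_replace_plan() {
    id="$1"
    dev="$2"
    cat <<EOF
ceph osd out osd.$id
systemctl stop ceph-osd@$id.service
pveceph osd destroy $id
smartctl -H $dev
pveceph osd create $dev
EOF
}

# Example: review the plan for osd.10 on a hypothetical device.
osd_replace_plan 10 /dev/sdX
```

Copy the printed lines into the shell one at a time, and only continue to the create step if the health check looks clean.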

"Okay bear with me if this sounds extremely basic, but how do I interpret the kernel panic screen dump?"
Try the link below. Also check the journal and kernel.log to see whether the panic was already written there.
https://help.ubuntu.com/lts/serverguide/kernel-crash-dump.html
 
Thanks Alwin. Disk SMART data suggests it's healthy, so I've recreated the OSD and it's currently rebuilding. So far so good.

Meanwhile, I had another lockup while the OSD was rebuilding. I'm beginning to suspect that the frequent system lockups caused the OSD corruption in the first place. I looked at the syslog, but it only shows @^@^@^@^ around the time of the crash, so I followed your link and installed kdump-tools.

I think I configured it correctly, but can you confirm whether I need nmi_watchdog=1 on my GRUB kernel command line?

Will report back when it crashes again.
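
One way to sanity-check a kdump setup like the one mentioned above, before waiting for the next crash, is to confirm that the running kernel was booted with a `crashkernel=` reservation and that a crash kernel is loaded. The sketch below is only an assumption about what is worth checking: the helper name `has_crashkernel` is made up for illustration, and it says nothing about whether `nmi_watchdog=1` is needed.

```shell
#!/bin/sh
# Return 0 if the given kernel command line contains a crashkernel= parameter.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_crashkernel "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "crashkernel= is set on the running kernel"
else
    echo "crashkernel= is missing on the running kernel"
fi

# Prints 1 when a crash kernel is loaded; the file only exists on kernels
# built with kexec support, so a missing file is ignored here.
cat /sys/kernel/kexec_crash_loaded 2>/dev/null || true
```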
 
Okay, I just had two crashes back to back; attached are excerpts from both crash logs. The issue appears to be a "general protection fault". I suspect you'll tell me to update my BIOS as the next step, so I'll do that tonight. But please let me know if there is anything else I should do based on the crash dump.
 
Attachment: kdump.txt (8.4 KB)

Okay, I updated the BIOS last night but had another crash this morning, and it looks like the same issue. Any ideas?
 
Attachment: kdump after bios update.txt (4.2 KB)

This sounds like memory failures. The 'MSI MS-7976/Z170A GAMING M7' doesn't seem to have ECC support. I strongly recommend using hardware with ECC on long-running systems.
 
Thanks Alwin. Do you recommend I run the memtest on boot that comes with PVE or some other utility to diagnose the memory?
 
This is an old thread, but you may simply be running out of memory for your OSD processes. The memory limit defaults were changed in this release, so perhaps try limiting the individual OSDs by setting the following in /etc/pve/ceph.conf:

The following is from an old node with limited memory:
Code:
[osd.50]
         osd_memory_target = 536870912

[osd.51]
         osd_memory_target = 536870912

[osd.52]
         osd_memory_target = 536870912
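
The value 536870912 in the fragment above is 512 MiB expressed in bytes (512 × 1024 × 1024). For comparison, the Nautilus default for osd_memory_target is, to my knowledge, 4 GiB per OSD. A minimal sketch for generating such sections for other sizes follows; the helper names `mib_to_bytes` and `osd_memory_section` are made up for illustration, and the output is meant to be reviewed and pasted into ceph.conf by hand rather than written there automatically.

```shell
#!/bin/sh
# Convert a size in MiB to bytes, the unit osd_memory_target expects.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print a ceph.conf section for one OSD.
# $1 = OSD id, $2 = memory target in MiB.
osd_memory_section() {
    printf '[osd.%s]\n         osd_memory_target = %s\n' "$1" "$(mib_to_bytes "$2")"
}

# Example: the three OSDs from the fragment above, 512 MiB each.
for id in 50 51 52; do
    osd_memory_section "$id" 512
    echo
done
```

The OSD daemons most likely need a restart before a value set in ceph.conf takes effect.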
 
