Ceph OSD down intermittently since PVE6/Nautilus upgrade

SourCheeks · Nov 5, 2019

I have a small 3-node PVE cluster running Ceph that I recently upgraded to PVE6 and Ceph Nautilus. However, since upgrading, one OSD on one of the nodes keeps going down. The node that it is on is also experiencing intermittent kernel panics (not sure if related).

Code:

pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-3-pve)
pve-manager: 6.0-9 (running version: 6.0-9/508dcee0)
pve-kernel-5.0: 6.0-9
pve-kernel-helper: 6.0-9
pve-kernel-4.15: 5.4-8
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.4-pve1
ceph-fuse: 14.2.4-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-8
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-7
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.1-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-9
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve1

Could someone give me some direction on troubleshooting this? Would very much appreciate any help.

Alwin · Nov 5, 2019

SourCheeks said:
The node that it is on is also experiencing intermittent kernel panics (not sure if related).

What do the panics say? And what do the journal/syslog and the Ceph logs show?
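
As a starting point, the logs mentioned here can be pulled together roughly as follows. This is only a sketch: it assumes the default cluster name "ceph" (so OSD logs sit under /var/log/ceph/), the standard ceph-osd@<id> systemd unit, and a persistent journal for reading the previous boot. The helper names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of Ceph or PVE.

# systemd unit for an OSD id, e.g. 10 -> ceph-osd@10.service
osd_unit() {
    printf 'ceph-osd@%s.service\n' "$1"
}

# Default OSD log path, assuming the cluster is named "ceph".
osd_log_path() {
    printf '/var/log/ceph/ceph-osd.%s.log\n' "$1"
}

# Print the most relevant log sources for one OSD.
show_osd_logs() {
    id="$1"
    ceph health detail                                    # cluster view of what is down
    ceph osd tree                                         # which host carries the OSD
    journalctl -u "$(osd_unit "$id")" -n 200 --no-pager   # daemon starts, stops, crashes
    tail -n 200 "$(osd_log_path "$id")"                   # the OSD's own log
    # Kernel messages from the previous boot; needs a persistent journal.
    journalctl -k -b -1 -n 200 --no-pager
}

# Usage on the affected node (as root), e.g. for osd.10:
# show_osd_logs 10
```

The kernel messages from the previous boot are usually the interesting part after a panic, provided they reached the disk before the machine locked up.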

SourCheeks · Nov 5, 2019

Okay bear with me if this sounds extremely basic, but how do I interpret the kernel panic screen dump? Additionally, how can I capture the text? When I experience a kernel panic on that node, the machine locks up and needs to be reset, so I'm not sure how to save that text to analyze later.

In examining the syslog and ceph logs, I found this related to the OSD that is down. Export is attached. OSD is 'osd10' on node 'pve6700'.

SourCheeks · Nov 6, 2019

Would it be safe to destroy and recreate the OSD in question? If so, could you point me to a procedure for doing so?

Alwin · Nov 7, 2019

On the CLI you can run the following:

Code:

ceph osd out osd.<id>

If it isn't out already.

Code:

systemctl stop ceph-osd@<id>.service
pveceph osd destroy <id>

And then create the OSD again. But check first that the disk is still healthy.
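
The steps above can be strung together as one reviewable sequence. This is a sketch under assumptions, not an official procedure: the device path is a placeholder, `recreate_osd` and `run` are made-up helper names, and the `ceph osd safe-to-destroy` check is an extra precaution that was not part of the advice above. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default here) prints each command instead of
# running it, so the sequence can be reviewed first.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# recreate_osd <id> <device>; the device path is a placeholder you must supply.
recreate_osd() {
    id="$1"
    dev="$2"
    run smartctl -H "$dev" || return 1                    # is the disk still healthy?
    run ceph osd out "osd.$id" || return 1                # harmless if already out
    run ceph osd safe-to-destroy "osd.$id" || return 1    # added precaution (assumption)
    run systemctl stop "ceph-osd@$id.service" || return 1
    run pveceph osd destroy "$id" || return 1
    run pveceph osd create "$dev" || return 1
}

# Example (prints the plan): recreate_osd 10 /dev/sdX
```

Setting DRY_RUN=0 executes the commands and stops at the first one that fails. If the safe-to-destroy check fails because data is still being moved off the OSD, wait for the cluster to settle and run it again.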

SourCheeks said:
Okay bear with me if this sounds extremely basic, but how do I interpret the kernel panic screen dump?

Try the link below. Also check the journal and /var/log/kern.log; the panic may already have been written there.
https://help.ubuntu.com/lts/serverguide/kernel-crash-dump.html

SourCheeks · Nov 8, 2019

Thanks Alwin. Disk SMART data suggests it's healthy, so I've recreated the OSD and it's currently rebuilding, so far so good.

Meanwhile, I had another lockup since the OSD started rebuilding. I'm beginning to suspect that the frequent system lockups caused the OSD corruption in the first place. I looked at the syslog, but it only shows @^@^@^@^ around the time of the crash, so I followed your link and installed kdump-tools.

I think I configured it correctly, but can you confirm whether I need nmi_watchdog=1 on my GRUB kernel command line?

Will report back when it crashes again.
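
For reference, a rough way to check that kdump is actually armed before the next crash. This is a sketch with made-up helper names; it assumes kdump-tools is installed and only reports the current NMI watchdog state, since whether nmi_watchdog=1 is needed is not settled in this thread.

```shell
#!/bin/sh
# Hypothetical helper names; the files and commands they read are standard.

# Succeeds if a kernel command line reserves memory for a crash kernel.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

check_kdump() {
    if has_crashkernel "$(cat /proc/cmdline)"; then
        echo "crashkernel= is set on the running kernel"
    else
        echo "crashkernel= is missing: update GRUB and reboot"
    fi
    # 1 means a crash kernel is loaded and kdump is armed.
    echo "kexec_crash_loaded: $(cat /sys/kernel/kexec_crash_loaded)"
    # Current NMI watchdog state (1 = enabled); it targets hard lockups.
    echo "nmi_watchdog: $(cat /proc/sys/kernel/nmi_watchdog)"
    # Summary from kdump-tools, including where dumps will be written.
    kdump-config show
}

# Usage (as root): check_kdump
```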

SourCheeks · Nov 9, 2019

Okay, I just had two crashes back to back; attached are excerpts from both crash logs. The issue appears to be a "general protection fault". I suspect you'll tell me to update my BIOS as the next step, so I'll do that tonight. But please let me know if there is anything else I should do based on the crash dump.

SourCheeks · Nov 9, 2019

Okay, I updated the BIOS last night but had another crash this morning, and it looks like the same issue. Any ideas?

Alwin · Nov 11, 2019

This sounds like memory failures. The 'MSI MS-7976/Z170A GAMING M7' doesn't seem to have ECC support. I strongly recommend using hardware with ECC on long-running systems.
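
To see what the firmware itself reports about ECC, something like the following can help. It is a sketch with made-up helper names, assuming dmidecode is installed and run as root. Keep in mind that without ECC, memory errors generally leave no log entry of their own, so the kernel log search mostly turns up the resulting faults.

```shell
#!/bin/sh
# Hypothetical helper names; dmidecode and journalctl are the real tools.

# Read `dmidecode -t memory` output on stdin and print the first
# "Error Correction Type" value (e.g. "None" or "Multi-bit ECC").
ecc_type() {
    sed -n 's/^[[:space:]]*Error Correction Type:[[:space:]]*//p' | head -n 1
}

# Search the kernel log of the current boot for machine-check events
# and general protection faults.
find_kernel_faults() {
    journalctl -k --no-pager | grep -i -E 'mce|machine check|general protection'
}

# Usage (as root):
# dmidecode -t memory | ecc_type
# find_kernel_faults
```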

SourCheeks · Nov 12, 2019

Thanks Alwin. Do you recommend I run the memtest on boot that comes with PVE or some other utility to diagnose the memory?

Alwin · Nov 18, 2019

SourCheeks said:
Thanks Alwin. Do you recommend I run the memtest on boot that comes with PVE or some other utility to diagnose the memory?

Either way.

David Herselman · Dec 14, 2019

This is an old thread, but you may simply be running out of memory on your OSD processes. Memory limit defaults were changed in this release; perhaps try limiting the individual OSDs by setting osd_memory_target in /etc/pve/ceph.conf.

The following is from an old node with limited memory:

Code:

[osd.50]
         osd_memory_target = 536870912

[osd.51]
         osd_memory_target = 536870912

[osd.52]
         osd_memory_target = 536870912
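
The value is in bytes; 536870912 is 512 MiB. A small sketch for generating such stanzas, with made-up helper names, might look like this. The runtime alternative in the closing comment assumes the centralized config store available in Nautilus.

```shell
#!/bin/sh
# Hypothetical helpers for building osd_memory_target settings.

# Convert MiB to the byte value osd_memory_target expects.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print a ceph.conf stanza for each OSD id given after the size.
# osd_memory_stanzas <MiB> <id>...
osd_memory_stanzas() {
    bytes="$(mib_to_bytes "$1")"
    shift
    for id in "$@"; do
        printf '[osd.%s]\n         osd_memory_target = %s\n\n' "$id" "$bytes"
    done
}

# Example: osd_memory_stanzas 512 50 51 52
# Runtime alternative (assumes Nautilus' centralized config store):
#   ceph config set osd.50 osd_memory_target "$(mib_to_bytes 512)"
```

Ceph treats osd_memory_target as a best-effort target rather than a hard limit, and some releases may reject values this small, so check the documentation for your version before relying on it.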
