Ceph OSD down intermittently since PVE6/Nautilus upgrade

SourCheeks

New Member
Sep 25, 2019
10
0
1
30
I have a small three-node PVE cluster running Ceph that I recently upgraded to PVE 6 and Ceph Nautilus. Since the upgrade, however, one OSD on one of the nodes keeps going down. The node that it is on is also experiencing intermittent kernel panics (not sure if related).

Code:
pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-3-pve)
pve-manager: 6.0-9 (running version: 6.0-9/508dcee0)
pve-kernel-5.0: 6.0-9
pve-kernel-helper: 6.0-9
pve-kernel-4.15: 5.4-8
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.4-pve1
ceph-fuse: 14.2.4-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-8
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-7
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.1-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-9
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve1
Could someone give me some direction on troubleshooting this? Would very much appreciate any help.
 

Alwin

Proxmox Staff Member
Staff member
Aug 1, 2017
3,020
261
88
SourCheeks said: "The node that it is on is also experiencing intermittent kernel panics (not sure if related)."
What are they? And what do the journal/syslog and the Ceph logs show?
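
A minimal sketch of how the requested logs might be narrowed down on the affected node. The helper name `osd_log_filter` and its keyword list are illustrative assumptions, not something from this thread, and reading the previous boot with `journalctl -b -1` assumes the journal is stored persistently.

```shell
#!/bin/sh
# Hypothetical helper: read log text on stdin and keep only lines that mention
# the given OSD or common kernel-crash keywords. The keyword list is an
# assumption; extend it as needed.
osd_log_filter() {
    osd_id="$1"
    grep -E -i "osd\.${osd_id}([^0-9]|\$)|general protection fault|kernel panic|call trace|i/o error"
}

# Example invocations (run as root on the affected node, replace <id>):
#   journalctl -u ceph-osd@<id>.service --no-pager     # unit log, unfiltered
#   journalctl -k -b -1 --no-pager | osd_log_filter <id>
#   osd_log_filter <id> < /var/log/syslog
#   osd_log_filter <id> < /var/log/ceph/ceph-osd.<id>.log
```

The filter is only a convenience for long logs; reading the unfiltered output around the time the OSD went down is usually still worthwhile.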
 

SourCheeks

Okay bear with me if this sounds extremely basic, but how do I interpret the kernel panic screen dump? Additionally, how can I capture the text? When I experience a kernel panic on that node, the machine locks up and needs to be reset, so I'm not sure how to save that text to analyze later.

While examining the syslog and Ceph logs, I found entries related to the OSD that is down; an export is attached. The OSD is 'osd10' on node 'pve6700'.
 

SourCheeks

Would it be safe to destroy and recreate the OSD in question? If so, could you point me to a procedure for doing so?
 

Alwin

On the CLI you can run the following:
Code:
ceph osd out osd.<id>
If it isn't out already.

Code:
systemctl stop ceph-osd@<id>.service
pveceph osd destroy <id>
And then create the OSD again. But check first that the disk is still healthy.
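
The sequence above, together with the health check and the re-creation step, could be sketched as a dry run that only prints the commands for review. The helper name `osd_recreate_plan` is hypothetical, the device path is a placeholder you must supply, and `smartctl -H` and `pveceph osd create` are assumptions about how the "check the disk" and "create the OSD again" steps might be done on PVE 6.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the commands, does not execute them.
osd_recreate_plan() {
    id="$1"
    dev="$2"
    # Refuse anything that is not a plain numeric OSD id.
    case "$id" in
        ''|*[!0-9]*) echo "usage: osd_recreate_plan <numeric-osd-id> <device>" >&2; return 1 ;;
    esac
    [ -n "$dev" ] || { echo "missing device path" >&2; return 1; }
    cat <<EOF
ceph osd out osd.${id}
systemctl stop ceph-osd@${id}.service
pveceph osd destroy ${id}
smartctl -H ${dev}
pveceph osd create ${dev}
EOF
}

# Example: osd_recreate_plan <id> /dev/sdX
```

Printing rather than executing keeps the destructive steps under manual control; each line can be run by hand once the cluster state and the disk have been checked.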

SourCheeks said: "Okay bear with me if this sounds extremely basic, but how do I interpret the kernel panic screen dump?"
Try the link below. Also check the journal/kernel.log, in case the panic was already written there.
https://help.ubuntu.com/lts/serverguide/kernel-crash-dump.html
 

SourCheeks

Thanks Alwin. Disk SMART data suggests it's healthy, so I've recreated the OSD and it's currently rebuilding, so far so good.

Meanwhile, I had another lockup while the OSD was rebuilding. I'm beginning to suspect that the frequent system lockups caused the OSD corruption in the first place. I looked at the syslog, but it only shows @^@^@^@^ at the time of the crash, so I followed your link and installed kdump-tools.

I think I configured it correctly, but can you confirm if I need nmi_watchdog=1 in my grub?

Will report back when it crashes again.
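
One way to sanity-check a kdump-tools setup like the one described above is to confirm that the running kernel was booted with a `crashkernel=` reservation. The sketch below is an assumption-laden illustration: `has_crashkernel` is a hypothetical helper, and it does not settle the `nmi_watchdog=1` question.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kernel command line reserves memory for a
# crash kernel (i.e. contains a crashkernel= parameter).
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (on the node, after rebooting with the new GRUB settings):
#   has_crashkernel "$(cat /proc/cmdline)" && echo "crash kernel memory reserved"
#   cat /sys/kernel/kexec_crash_loaded   # 1 means a crash kernel is loaded
#   kdump-config show                    # kdump-tools' own status summary
```

If the parameter is missing from `/proc/cmdline`, the GRUB configuration was probably not regenerated or the node has not been rebooted since the change.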
 

SourCheeks

Okay, I just had two crashes back to back; attached are excerpts from both crash logs. It appears the issue is "general protection fault". I suspect you'll tell me to update my BIOS as the next step, so I'll do that tonight. But please let me know if there is anything else I should do based on the crash dump.
 

Alwin

This sounds like memory failures. The 'MSI MS-7976/Z170A GAMING M7' doesn't seem to have ECC support. I strongly recommend using hardware with ECC on long-running systems.
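
As a rough way to see what the firmware reports about ECC on a given node, the "Error Correction Type" field of `dmidecode -t memory` can be extracted. This is a sketch only: `ecc_type` is a hypothetical helper, and the firmware-reported value is not guaranteed to be accurate.

```shell
#!/bin/sh
# Hypothetical helper: extract the first "Error Correction Type" field from
# dmidecode memory output supplied on stdin.
ecc_type() {
    sed -n 's/^[[:space:]]*Error Correction Type:[[:space:]]*//p' | head -n 1
}

# Example (as root): dmidecode -t memory | ecc_type
```

A result of `None` would be consistent with a board that has no ECC support; any other value would still need to be checked against the board and DIMM specifications.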
 

SourCheeks

Thanks Alwin. Do you recommend I run the memtest on boot that comes with PVE or some other utility to diagnose the memory?
 
