[SOLVED] OSDs fail on on one node / cannot re-create

lucentwolf · Jul 6, 2020

Hi all

My cluster consists of 6 nodes with 3 OSDs each (18 OSDs total), pve 6.2-6 and ceph 14.2.9. BTW, it's been up and running fine for 7 months now and went through all updates flawlessly so far.

However, after rebooting the nodes one after the other upon updating to 6.2-6, the 3 OSDs on one nodes didn't come up again. After ceph was back to clean, and the 3 OSDs being "out", I decided to destroy them; and waitet for the clean-state again. Then (on the respective node) I tried

ceph-volume lvm zap /dev/sda --destroy

which failed, returning:

Code:

Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 83, in <module>
    __import__('pkg_resources.extern.packaging.specifiers')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/extern/__init__.py", line 43, in load_module
    __import__(extant)
ValueError: bad marshal data (unknown type code)

The attempt to add the OSD anyway using

pveceph osd create /dev/sda

also failed, returning:

Code:

wipe disk/partition: /dev/sda
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.0918 s, 192 MB/s
Traceback (most recent call last):
  File "/sbin/ceph-volume", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 83, in <module>
    __import__('pkg_resources.extern.packaging.specifiers')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/extern/__init__.py", line 43, in load_module
    __import__(extant)
ValueError: bad marshal data (unknown type code)
command 'ceph-volume lvm create --cluster-fsid a8d6705a-74c4-4904-9000-0db5742043fc --data /dev/sda' failed: exit code 1

The same happens with the other two HDDs in that node (/dev/sdb and /dev/sdc) So, I'm kind'a stuck - and I'd appreciate any hint & help on this

Kind regards
lucentwolf

Alwin · Jul 6, 2020

You may have leftover LVs. They and the VGs containing them need to be removed first.

Pravednik · Jul 6, 2020

Hello,

Sorry in advance, to avoid new similar topic I`ll write here.
I have issue with ceph-volume lvm zap command too.

I use PVE 6.0.1 with Ceph for testing purpose. Now I have to migrate this cluster to production. I made clear install (PVE + upgrade + Ceph) on each of 4 nodes, create cluster and face with problem that I can`t add OSD to new Ceph cluster. All disk are in use (No disk unused from GUI).
lsblk shows that disk has old Ceph volumes:

Code:

root@pve-01:~# lsblk
NAME                                                                                                  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                                                                                     8:0    0 232.9G  0 disk
|-sda1                                                                                                  8:1    0  1007K  0 part
|-sda2                                                                                                  8:2    0   512M  0 part
`-sda3                                                                                                  8:3    0 232.4G  0 part
  |-pve-swap                                                                                          253:5    0     8G  0 lvm  [SWAP]
  |-pve-root                                                                                          253:6    0    58G  0 lvm  /
  |-pve-data_tmeta                                                                                    253:7    0   1.5G  0 lvm 
  | `-pve-data                                                                                        253:9    0 147.4G  0 lvm 
  `-pve-data_tdata                                                                                    253:8    0 147.4G  0 lvm 
    `-pve-data                                                                                        253:9    0 147.4G  0 lvm 
sdb                                                                                                     8:16   0 931.5G  0 disk
`-ceph--c8f6fde5--3a68--418b--b3ba--2aaf8b4b75c5-osd--block--b7726b15--fd54--45d9--8a07--2729bda9c414 253:4    0 931.5G  0 lvm 
sdc                                                                                                     8:32   0 931.5G  0 disk
`-ceph--23342ddd--b606--40a7--8a3a--6400485cc7a2-osd--block--a12e2660--d544--4c94--af59--e91cfce06eb7 253:3    0 931.5G  0 lvm 
sdd                                                                                                     8:48   0 931.5G  0 disk
`-ceph--8108d3bd--9ef7--4c1b--b5ae--1912b56dbbff-osd--block--01cb0cdc--a0e6--43fb--b6df--f5598e099899 253:2    0 931.5G  0 lvm 
sde                                                                                                     8:64   0 931.5G  0 disk
`-ceph--67670145--0dff--4a95--9b55--7b48b6af7d0f-osd--block--ae4bb4f4--e187--42f5--a2b3--3e843536d18d 253:1    0 931.5G  0 lvm 
sdf                                                                                                     8:80   0 931.5G  0 disk
`-ceph--0bf050a7--1961--4e49--8e42--c2c5907bbc21-osd--block--2294241e--78ba--483c--8610--ad5c031c1750 253:0    0 931.5G  0 lvm

I tried to prepare disk with ceph-volume lvm zap

Code:

root@pve-01:~# ceph-volume lvm zap /dev/sdb --destroy
--> Zapping: /dev/sdb
 stderr: wipefs: error: /dev/sdb: probing initialization failed: Device or resource busy
--> failed to wipefs device, will try again to workaround probable race condition

Also I already try sgdisk -Z /dev/sdb, wipefs -af /dev/sdb/, dd if=/dev/zero of=/dev/sdb bs=500M count=2048. No luck...

How to destroy old ceph data\id`s whatever to create new OSD`s?

lucentwolf · Jul 7, 2020

Hi all, esp Alwin

LV means logical volume, right? Checking with

ceph-volume lvm list

returns basically the same error:

Code:

Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 83, in <module>
    __import__('pkg_resources.extern.packaging.specifiers')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/extern/__init__.py", line 43, in load_module
    __import__(extant)
ValueError: bad marshal data (unknown type code)

To me it looks like a flaw in the python setup...

lucentwolf

Pravednik · Jul 7, 2020

Hi lucentwolf, yes, ceph-volume lvm list also didn`t work. I found workaround for my issue (format disk from LiveCD).
Also I found something looks like our issue https://github.com/ceph/ceph/pull/23532
Hope this would be helpfull.

lucentwolf · Jul 7, 2020

Hi Pravednik, I tried
- creating a new GPT on each drive; then
- removed all Partition tables from the drives
...same issue.

Following up on Alwin's hint, vgdisplay shows one VG named 'pve' (no others) - is that the one I need to remove?

Pravednik · Jul 7, 2020

lucentwolf, do you reboot node after removing partitions? I tried to mount disks after removing partitions and got same error. Only after reboot node all disks are available for Ceph.

lucentwolf · Jul 7, 2020

Pravednik, sure I rebooted numerous times :-(
[EDIT]: I also rebooted the other nodes - just in case it would make a difference...

Alwin · Jul 7, 2020

lucentwolf said:
To me it looks like a flaw in the python setup...

Did you upgrade all Ceph related packages?

lucentwolf said:
Following up on Alwin's hint, vgdisplay shows one VG named 'pve' (no others) - is that the one I need to remove?

No, otherwise the OS and lvm-local will be lost.

What's the output of pveceph osd create /dev/<disk>?

lucentwolf · Jul 7, 2020

Hi Alwin

...appreciate your involvment; all packages report to be up-do-date (no update action after 'apt update' respectively 'apt dist-upgrade'). PVE and Ceph versions are identical to the remaining nodes (see initial post), python --version gives 2.7.16 on all nodes.

pveceph osd create /dev/sda

reports

Code:

create OSD on /dev/sda (bluestore)
wipe disk/partition: /dev/sda
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.14345 s, 183 MB/s
Traceback (most recent call last):
  File "/sbin/ceph-volume", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 83, in <module>
    __import__('pkg_resources.extern.packaging.specifiers')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/extern/__init__.py", line 43, in load_module
    __import__(extant)
ValueError: bad marshal data (unknown type code)
command 'ceph-volume lvm create --cluster-fsid a8d6705a-74c4-4904-9000-0db5742043fc --data /dev/sda' failed: exit code 1

Alwin · Jul 7, 2020

lucentwolf said:
ValueError: bad marshal data (unknown type code) command 'ceph-volume lvm create --cluster-fsid a8d6705a-74c4-4904-9000-0db5742043fc --data /dev/sda' failed: exit code 1

ceph-volume is in the ceph-osd package, you may reinstall the package.

lucentwolf · Jul 7, 2020

OK, tried apt remove ceph-osd; then apt install ceph-osd, rebooted - same error :-(

Alwin · Jul 7, 2020

lucentwolf said:
OK, tried apt remove ceph-osd; then apt install ceph-osd, rebooted - same error :-(

What's the output of pveversion -v?

lucentwolf · Jul 7, 2020

That's at first glance identical to the other nodes:

Code:

proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.3.13-2-pve: 5.3.13-2
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-8
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

lucentwolf · Jul 7, 2020

oops: Just noticed: All nodes report
ceph: 14.2.9-pve1
whereas the affected node has no such line in pveversion -v
[EDIT]
However, in the UI/Ceph/OSD the node shows Version 14.2.9 (as all others)

Alwin · Jul 7, 2020

lucentwolf said:
oops: Just noticed: All nodes report
ceph: 14.2.9-pve1
whereas the affected node has no such line in pveversion -v

Then Ceph is not installed on that node or at least not the meta-package. Run a pveceph install to get it installed.

lucentwolf · Jul 7, 2020

Well, I did, and it said "1 new installed" (ie ceph) - however: Still same issue

Alwin · Jul 7, 2020

lucentwolf said:
Well, I did, and it said "1 new installed" (ie ceph) - however: Still same issue

But pveversion -v should now include the ceph package.

lucentwolf said:
ValueError: bad marshal data (unknown type code)

Maybe this suggestion helps.
https://stackoverflow.com/questions...-valueerrorbad-marshal-data/30861494#30861494

lucentwolf · Jul 7, 2020

Sry, need to stay naggin' you about this...

I indeed stumbled over the linked stackoverflow thread - however I don't quite understand how to fix it. Recap:

already a simple 'ceph-volume' (without arguments) results in the same "ValueError..." whereas on the other nodes I get the "Available subcommands"-help displayed
That "ValueError: bad marshal data (unknown type code)" happens if python 2.7 .pyc is loaded in python 3.5 (and there seems to be a regression in 3.7)
Looking at the stackoverflow thread you mentioned it says "...reinstall the python application" or "...remove the .pyc"; but
Ceph & ceph-osd is re-installed (purge, autoremove, pveceph install) - so the potentially included .pyc should be fine.

So, honestly, I'm stuck. Would a re-installation of the python packages be a viable option?

Alwin · Jul 8, 2020

lucentwolf said:
Looking at the stackoverflow thread you mentioned it says "...reinstall the python application" or "...remove the .pyc"; but

Try to manually remove the .pyc, they may not have been cleared with the package re-installation. And is the node on the latest available packages (missing updates)?

[SOLVED] OSDs fail on on one node / cannot re-create

Active Member

Proxmox Retired Staff

Active Member

Active Member

Active Member

Active Member

Active Member

Active Member

Proxmox Retired Staff

Active Member

Proxmox Retired Staff

Active Member

Proxmox Retired Staff

Active Member

Active Member

Proxmox Retired Staff

Active Member

Proxmox Retired Staff

Active Member

Proxmox Retired Staff

We value your privacy