[SOLVED] OSDs fail on one node / cannot re-create

lucentwolf

Active Member
Dec 19, 2019
Hi all

My cluster consists of 6 nodes with 3 OSDs each (18 OSDs total), running PVE 6.2-6 and Ceph 14.2.9. BTW, it's been up and running fine for 7 months now and has gone through all updates flawlessly so far.

However, after rebooting the nodes one after the other while updating to 6.2-6, the 3 OSDs on one node didn't come up again. After Ceph was back to a clean state, with the 3 OSDs "out", I decided to destroy them and waited for the clean state again. Then (on the respective node) I tried
  • ceph-volume lvm zap /dev/sda --destroy
which failed, returning:

Code:
Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 83, in <module>
    __import__('pkg_resources.extern.packaging.specifiers')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/extern/__init__.py", line 43, in load_module
    __import__(extant)
ValueError: bad marshal data (unknown type code)

The attempt to add the OSD anyway using

  • pveceph osd create /dev/sda

also failed, returning:

Code:
wipe disk/partition: /dev/sda
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.0918 s, 192 MB/s
Traceback (most recent call last):
  File "/sbin/ceph-volume", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 83, in <module>
    __import__('pkg_resources.extern.packaging.specifiers')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/extern/__init__.py", line 43, in load_module
    __import__(extant)
ValueError: bad marshal data (unknown type code)
command 'ceph-volume lvm create --cluster-fsid a8d6705a-74c4-4904-9000-0db5742043fc --data /dev/sda' failed: exit code 1

The same happens with the other two HDDs in that node (/dev/sdb and /dev/sdc). So, I'm kind'a stuck - and I'd appreciate any hint & help on this :)

Kind regards
lucentwolf
 
Last edited:
You may have leftover LVs. They and the VGs containing them need to be removed first.
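Roughly like this, for example (the VG name below is a placeholder - check with lvs/vgs first, and never touch the 'pve' VG):

Code:
# list logical volumes and volume groups; leftover Ceph VGs are named "ceph-<uuid>"
lvs -o lv_name,vg_name,lv_tags
vgs

# remove a leftover Ceph VG (this also removes the LVs it contains), then the PV label
# replace "ceph-xxxxxxxx" and /dev/sdX with the actual names shown by vgs/pvs
vgremove ceph-xxxxxxxx
pvremove /dev/sdX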
 
Hello,

Sorry in advance - to avoid opening a new, similar topic I'll write here.
I have an issue with the ceph-volume lvm zap command too.

I use PVE 6.0.1 with Ceph for testing purposes. Now I have to migrate this cluster to production. I did a clean install (PVE + upgrade + Ceph) on each of the 4 nodes, created the cluster, and ran into the problem that I can't add OSDs to the new Ceph cluster. All disks are shown as in use (no unused disks in the GUI).
lsblk shows that the disks still carry the old Ceph volumes:
Code:
root@pve-01:~# lsblk
NAME                                                                                                  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                                                                                     8:0    0 232.9G  0 disk
|-sda1                                                                                                  8:1    0  1007K  0 part
|-sda2                                                                                                  8:2    0   512M  0 part
`-sda3                                                                                                  8:3    0 232.4G  0 part
  |-pve-swap                                                                                          253:5    0     8G  0 lvm  [SWAP]
  |-pve-root                                                                                          253:6    0    58G  0 lvm  /
  |-pve-data_tmeta                                                                                    253:7    0   1.5G  0 lvm 
  | `-pve-data                                                                                        253:9    0 147.4G  0 lvm 
  `-pve-data_tdata                                                                                    253:8    0 147.4G  0 lvm 
    `-pve-data                                                                                        253:9    0 147.4G  0 lvm 
sdb                                                                                                     8:16   0 931.5G  0 disk
`-ceph--c8f6fde5--3a68--418b--b3ba--2aaf8b4b75c5-osd--block--b7726b15--fd54--45d9--8a07--2729bda9c414 253:4    0 931.5G  0 lvm 
sdc                                                                                                     8:32   0 931.5G  0 disk
`-ceph--23342ddd--b606--40a7--8a3a--6400485cc7a2-osd--block--a12e2660--d544--4c94--af59--e91cfce06eb7 253:3    0 931.5G  0 lvm 
sdd                                                                                                     8:48   0 931.5G  0 disk
`-ceph--8108d3bd--9ef7--4c1b--b5ae--1912b56dbbff-osd--block--01cb0cdc--a0e6--43fb--b6df--f5598e099899 253:2    0 931.5G  0 lvm 
sde                                                                                                     8:64   0 931.5G  0 disk
`-ceph--67670145--0dff--4a95--9b55--7b48b6af7d0f-osd--block--ae4bb4f4--e187--42f5--a2b3--3e843536d18d 253:1    0 931.5G  0 lvm 
sdf                                                                                                     8:80   0 931.5G  0 disk
`-ceph--0bf050a7--1961--4e49--8e42--c2c5907bbc21-osd--block--2294241e--78ba--483c--8610--ad5c031c1750 253:0    0 931.5G  0 lvm

I tried to prepare the disk with ceph-volume lvm zap:

Code:
root@pve-01:~# ceph-volume lvm zap /dev/sdb --destroy
--> Zapping: /dev/sdb
 stderr: wipefs: error: /dev/sdb: probing initialization failed: Device or resource busy
--> failed to wipefs device, will try again to workaround probable race condition

I also already tried sgdisk -Z /dev/sdb, wipefs -af /dev/sdb, and dd if=/dev/zero of=/dev/sdb bs=500M count=2048. No luck...

How can I destroy the old Ceph data/IDs so that I can create new OSDs?
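I guess the leftover LV is what keeps the device busy - would something along these lines (VG name taken from the lsblk output above) be the right way?

Code:
# deactivate and remove the leftover Ceph VG that still holds /dev/sdb
vgchange -an ceph-c8f6fde5-3a68-418b-b3ba-2aaf8b4b75c5
vgremove ceph-c8f6fde5-3a68-418b-b3ba-2aaf8b4b75c5
# afterwards the zap should go through
ceph-volume lvm zap /dev/sdb --destroy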
 
Hi all, esp Alwin

LV means logical volume, right? Checking with
  • ceph-volume lvm list
returns basically the same error:
Code:
Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 83, in <module>
    __import__('pkg_resources.extern.packaging.specifiers')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/extern/__init__.py", line 43, in load_module
    __import__(extant)
ValueError: bad marshal data (unknown type code)

To me it looks like a flaw in the python setup...
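A quick check that should take Ceph out of the picture entirely (just importing pkg_resources directly):

Code:
# fails with the same ValueError if pkg_resources itself is broken
python2.7 -c "import pkg_resources"
# compare the byte-compiled files with those on a healthy node
ls -l /usr/lib/python2.7/dist-packages/pkg_resources/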

lucentwolf
 
Hi Pravednik, I tried
- creating a new GPT on each drive; then
- removing all partition tables from the drives
...same issue.

Following up on Alwin's hint, vgdisplay shows one VG named 'pve' (no others) - is that the one I need to remove?
 
lucentwolf, did you reboot the node after removing the partitions? I tried to mount the disks after removing the partitions and got the same error. Only after rebooting the node were all disks available for Ceph.
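(Maybe the reboot can be avoided by asking the kernel to re-read the partition table, though I haven't verified that in this exact situation:)

Code:
# re-read the partition table without rebooting
partprobe /dev/sdb
# or, alternatively
blockdev --rereadpt /dev/sdb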
 
Pravednik, sure, I rebooted numerous times :-(
[EDIT]: I also rebooted the other nodes - just in case it would make a difference...
 
Last edited:
To me it looks like a flaw in the python setup...
Did you upgrade all Ceph-related packages?

Following up on Alwin's hint, vgdisplay shows one VG named 'pve' (no others) - is that the one I need to remove?
No, otherwise the OS and lvm-local will be lost. ;)
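Leftover OSD VGs are easy to tell apart - they are named "ceph-<uuid>", for example:

Code:
# "pve" is the OS / local-lvm VG; anything named "ceph-<uuid>" belongs to an (old) OSD
vgs
pvs -o pv_name,vg_name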

What's the output of pveceph osd create /dev/<disk>?
 
Hi Alwin

...appreciate your involvement; all packages report as up-to-date (no action after 'apt update' and 'apt dist-upgrade', respectively). PVE and Ceph versions are identical to the remaining nodes (see initial post), and python --version gives 2.7.16 on all nodes.

  • pveceph osd create /dev/sda
reports
Code:
create OSD on /dev/sda (bluestore)
wipe disk/partition: /dev/sda
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.14345 s, 183 MB/s
Traceback (most recent call last):
  File "/sbin/ceph-volume", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 83, in <module>
    __import__('pkg_resources.extern.packaging.specifiers')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/extern/__init__.py", line 43, in load_module
    __import__(extant)
ValueError: bad marshal data (unknown type code)
command 'ceph-volume lvm create --cluster-fsid a8d6705a-74c4-4904-9000-0db5742043fc --data /dev/sda' failed: exit code 1
 
ValueError: bad marshal data (unknown type code) command 'ceph-volume lvm create --cluster-fsid a8d6705a-74c4-4904-9000-0db5742043fc --data /dev/sda' failed: exit code 1
ceph-volume is part of the ceph-osd package; you may reinstall that package.
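For example (ceph-volume is the /usr/sbin/ceph-volume script from the traceback; the failing pkg_resources module should come from python-pkg-resources):

Code:
# re-install the package that ships ceph-volume
apt install --reinstall ceph-osd
# the pkg_resources module itself may also be worth re-installing
apt install --reinstall python-pkg-resources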
 
That's at first glance identical to the other nodes:
Code:
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.3.13-2-pve: 5.3.13-2
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-8
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 
Oops, just noticed: all nodes report
ceph: 14.2.9-pve1
whereas the affected node has no such line in pveversion -v
[EDIT]
However, in the GUI under Ceph/OSD the node shows version 14.2.9 (as all the others do).
 
Oops, just noticed: all nodes report
ceph: 14.2.9-pve1
whereas the affected node has no such line in pveversion -v
Then Ceph is not installed on that node, or at least the meta-package is not. Run pveceph install to get it installed.
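That would be along the lines of:

Code:
# install the Ceph (meta-)packages via the PVE tooling
pveceph install
# afterwards "ceph: 14.2.x-pve1" should show up again
pveversion -v | grep ceph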
 
Sry, need to keep naggin' you about this...

I indeed stumbled over the linked Stack Overflow thread - however, I don't quite understand how to fix it. Recap:
  • already a simple 'ceph-volume' (without arguments) results in the same "ValueError...", whereas on the other nodes I get the "Available subcommands" help displayed
  • That "ValueError: bad marshal data (unknown type code)" happens if a Python 2.7 .pyc is loaded by Python 3.5 (and there seems to be a regression in 3.7)
  • Looking at the Stack Overflow thread you mentioned, it says "...reinstall the python application" or "...remove the .pyc"; but
  • Ceph & ceph-osd have been re-installed (purge, autoremove, pveceph install) - so the potentially included .pyc should be fine.
So, honestly, I'm stuck. Would a re-installation of the Python packages be a viable option?
 
  • Looking at the Stack Overflow thread you mentioned, it says "...reinstall the python application" or "...remove the .pyc"; but
Try to manually remove the .pyc files; they may not have been cleared by the package re-installation. And is the node on the latest available packages (no missing updates)?
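Roughly like this (paths as seen in the traceback; the .pyc files are regenerated on the next import):

Code:
# remove the stale byte-compiled files under pkg_resources
find /usr/lib/python2.7/dist-packages/pkg_resources -name '*.pyc' -delete
# then re-test
ceph-volume lvm list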
 