[SOLVED] OSDs fail on one node / cannot re-create

lucentwolf

Hi all

My cluster consists of 6 nodes with 3 OSDs each (18 OSDs total), running PVE 6.2-6 and Ceph 14.2.9. It has been up and running fine for 7 months now and has gone through all updates flawlessly so far.

However, after rebooting the nodes one after the other while updating to 6.2-6, the 3 OSDs on one node didn't come up again. Once Ceph was back to clean with the 3 OSDs marked "out", I decided to destroy them and waited for the clean state again. Then (on the respective node) I tried
  • ceph-volume lvm zap /dev/sda --destroy
which failed, returning:

Code:
Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 83, in <module>
    __import__('pkg_resources.extern.packaging.specifiers')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/extern/__init__.py", line 43, in load_module
    __import__(extant)
ValueError: bad marshal data (unknown type code)

The attempt to add the OSD anyway using

  • pveceph osd create /dev/sda

also failed, returning:

Code:
wipe disk/partition: /dev/sda
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.0918 s, 192 MB/s
Traceback (most recent call last):
  File "/sbin/ceph-volume", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 83, in <module>
    __import__('pkg_resources.extern.packaging.specifiers')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/extern/__init__.py", line 43, in load_module
    __import__(extant)
ValueError: bad marshal data (unknown type code)
command 'ceph-volume lvm create --cluster-fsid a8d6705a-74c4-4904-9000-0db5742043fc --data /dev/sda' failed: exit code 1

The same happens with the other two HDDs in that node (/dev/sdb and /dev/sdc). So I'm kind of stuck, and I'd appreciate any hints and help on this :)

Kind regards
lucentwolf
 

Alwin

Proxmox Staff Member
You may have leftover LVs. They and the VGs containing them need to be removed first.
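
As a rough sketch of how such leftovers could be found: this assumes the leftover VGs follow the ceph-<uuid> naming that ceph-volume normally uses. The helper names filter_ceph_vgs and list_ceph_vgs are made up for illustration.

```shell
#!/bin/sh
# Keep only VG names that start with "ceph-" (assumed ceph-volume naming).
# Expects one VG name per line on stdin, e.g. from `vgs --noheadings -o vg_name`.
filter_ceph_vgs() {
    awk '$1 ~ /^ceph-/ { print $1 }'
}

# Hypothetical wrapper: list leftover Ceph VGs on this node (read-only).
list_ceph_vgs() {
    vgs --noheadings -o vg_name | filter_ceph_vgs
}

# Removal is destructive, so it is only shown here as comments:
#   vgremove <vg-name>      # asks before removing the LVs inside the VG
#   pvremove /dev/<disk>    # then clears the LVM label from the disk
```

Check every name before removing anything. A VG that does not match the ceph- pattern is most likely not a Ceph leftover and should be left alone.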
 

Pravednik

Hello,

Sorry in advance; to avoid opening a new, similar topic I'll write here.
I have an issue with the ceph-volume lvm zap command too.

I used PVE 6.0.1 with Ceph for testing purposes. Now I have to migrate this cluster to production. I did a clean install (PVE + upgrade + Ceph) on each of the 4 nodes and created the cluster, and now face the problem that I can't add OSDs to the new Ceph cluster. All disks are shown as in use (no unused disks in the GUI).
lsblk shows that the disks still hold old Ceph volumes:
Code:
root@pve-01:~# lsblk
NAME                                                                                                  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                                                                                     8:0    0 232.9G  0 disk
|-sda1                                                                                                  8:1    0  1007K  0 part
|-sda2                                                                                                  8:2    0   512M  0 part
`-sda3                                                                                                  8:3    0 232.4G  0 part
  |-pve-swap                                                                                          253:5    0     8G  0 lvm  [SWAP]
  |-pve-root                                                                                          253:6    0    58G  0 lvm  /
  |-pve-data_tmeta                                                                                    253:7    0   1.5G  0 lvm 
  | `-pve-data                                                                                        253:9    0 147.4G  0 lvm 
  `-pve-data_tdata                                                                                    253:8    0 147.4G  0 lvm 
    `-pve-data                                                                                        253:9    0 147.4G  0 lvm 
sdb                                                                                                     8:16   0 931.5G  0 disk
`-ceph--c8f6fde5--3a68--418b--b3ba--2aaf8b4b75c5-osd--block--b7726b15--fd54--45d9--8a07--2729bda9c414 253:4    0 931.5G  0 lvm 
sdc                                                                                                     8:32   0 931.5G  0 disk
`-ceph--23342ddd--b606--40a7--8a3a--6400485cc7a2-osd--block--a12e2660--d544--4c94--af59--e91cfce06eb7 253:3    0 931.5G  0 lvm 
sdd                                                                                                     8:48   0 931.5G  0 disk
`-ceph--8108d3bd--9ef7--4c1b--b5ae--1912b56dbbff-osd--block--01cb0cdc--a0e6--43fb--b6df--f5598e099899 253:2    0 931.5G  0 lvm 
sde                                                                                                     8:64   0 931.5G  0 disk
`-ceph--67670145--0dff--4a95--9b55--7b48b6af7d0f-osd--block--ae4bb4f4--e187--42f5--a2b3--3e843536d18d 253:1    0 931.5G  0 lvm 
sdf                                                                                                     8:80   0 931.5G  0 disk
`-ceph--0bf050a7--1961--4e49--8e42--c2c5907bbc21-osd--block--2294241e--78ba--483c--8610--ad5c031c1750 253:0    0 931.5G  0 lvm

I tried to prepare the disk with ceph-volume lvm zap:

Code:
root@pve-01:~# ceph-volume lvm zap /dev/sdb --destroy
--> Zapping: /dev/sdb
 stderr: wipefs: error: /dev/sdb: probing initialization failed: Device or resource busy
--> failed to wipefs device, will try again to workaround probable race condition

I have also already tried sgdisk -Z /dev/sdb, wipefs -af /dev/sdb, and dd if=/dev/zero of=/dev/sdb bs=500M count=2048. No luck...

How can I destroy the old Ceph data, IDs, or whatever else is left, so that I can create new OSDs?
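
For reference, the long names in the lsblk output are device-mapper names. They join VG and LV with a single hyphen and double every hyphen inside the names themselves. A minimal sketch for turning such a name back into VG/LV form follows; dm_to_vg_lv is a made-up helper, and it assumes '@' never occurs in LVM names (it is not in LVM's allowed character set).

```shell
#!/bin/sh
# Convert a device-mapper name (as shown by lsblk) into "VG/LV" form.
dm_to_vg_lv() {
    printf '%s\n' "$1" | awk '{
        gsub(/--/, "@")   # protect the doubled (escaped) hyphens
        sub(/-/, "/")     # the remaining single hyphen separates VG and LV
        gsub(/@/, "-")    # restore the real hyphens
        print
    }'
}

# Name of the LV on /dev/sdb, copied from the lsblk output.
dm_to_vg_lv 'ceph--c8f6fde5--3a68--418b--b3ba--2aaf8b4b75c5-osd--block--b7726b15--fd54--45d9--8a07--2729bda9c414'
```

While such an LV is active, the device-mapper keeps the disk open, which is one possible reason for the "Device or resource busy" message. Deactivating the VG with vgchange -an <vg-name>, or removing it with vgremove <vg-name>, before zapping may help. That is an assumption, not something confirmed in this thread.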
 

lucentwolf

Hi all, esp Alwin

LV means logical volume, right? Checking with
  • ceph-volume lvm list
returns basically the same error:
Code:
Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 83, in <module>
    __import__('pkg_resources.extern.packaging.specifiers')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/extern/__init__.py", line 43, in load_module
    __import__(extant)
ValueError: bad marshal data (unknown type code)

To me it looks like a flaw in the python setup...

lucentwolf
 

lucentwolf

Hi Pravednik, I tried
- creating a new GPT on each drive; then
- removing all partition tables from the drives
...same issue.

Following up on Alwin's hint, vgdisplay shows one VG named 'pve' (no others) - is that the one I need to remove?
 

Pravednik

lucentwolf, did you reboot the node after removing the partitions? I tried to mount the disks after removing the partitions and got the same error. Only after rebooting the node were all disks available for Ceph.
 

lucentwolf

Pravednik, sure I rebooted numerous times :-(
[EDIT]: I also rebooted the other nodes - just in case it would make a difference...
 

Alwin

Proxmox Staff Member
lucentwolf said:
  To me it looks like a flaw in the python setup...
Did you upgrade all Ceph-related packages?

lucentwolf said:
  Following up on Alwin's hint, vgdisplay shows one VG named 'pve' (no others) - is that the one I need to remove?
No, otherwise the OS and the local-lvm storage will be lost. ;)

What's the output of pveceph osd create /dev/<disk>?
 

lucentwolf

Hi Alwin

...I appreciate your involvement. All packages report as up to date (no update action after 'apt update' and 'apt dist-upgrade'). PVE and Ceph versions are identical to the remaining nodes (see initial post), and python --version gives 2.7.16 on all nodes.

  • pveceph osd create /dev/sda
reports
Code:
create OSD on /dev/sda (bluestore)
wipe disk/partition: /dev/sda
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.14345 s, 183 MB/s
Traceback (most recent call last):
  File "/sbin/ceph-volume", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 83, in <module>
    __import__('pkg_resources.extern.packaging.specifiers')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/extern/__init__.py", line 43, in load_module
    __import__(extant)
ValueError: bad marshal data (unknown type code)
command 'ceph-volume lvm create --cluster-fsid a8d6705a-74c4-4904-9000-0db5742043fc --data /dev/sda' failed: exit code 1
 

Alwin

Proxmox Staff Member
lucentwolf said:
  ValueError: bad marshal data (unknown type code)
  command 'ceph-volume lvm create --cluster-fsid a8d6705a-74c4-4904-9000-0db5742043fc --data /dev/sda' failed: exit code 1
ceph-volume is part of the ceph-osd package; you could try reinstalling that package.
 

lucentwolf

That's at first glance identical to the other nodes:
Code:
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.3.13-2-pve: 5.3.13-2
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-8
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 

lucentwolf

Oops, just noticed: all other nodes report
ceph: 14.2.9-pve1
whereas the affected node has no such line in pveversion -v.
[EDIT]
However, in the UI (Ceph/OSD) the node shows version 14.2.9 (like all the others).
 

Alwin

Proxmox Staff Member
lucentwolf said:
  oops: Just noticed: All nodes report
  ceph: 14.2.9-pve1
  whereas the affected node has no such line in pveversion -v
Then Ceph is not installed on that node, or at least the meta-package is missing. Run pveceph install to get it installed.
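
A quick way to compare nodes on this point is sketched below. It assumes the meta-package shows up as a line starting with "ceph: " in pveversion -v, as it does on the healthy nodes. has_ceph_metapackage is a made-up helper name.

```shell
#!/bin/sh
# Succeeds if `pveversion -v` style output on stdin has a "ceph: <version>" line.
# Lines such as "ceph-fuse: ..." do not match.
has_ceph_metapackage() {
    grep -q '^ceph: '
}

# On a node (not run here):
#   pveversion -v | has_ceph_metapackage || echo "ceph meta-package missing"
#   pveceph install    # installs the Ceph packages if they are missing
```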
 

Alwin

Proxmox Staff Member

lucentwolf

Sry, need to stay naggin' you about this...

I indeed stumbled over the linked stackoverflow thread - however I don't quite understand how to fix it. Recap:
  • Even a plain 'ceph-volume' (without arguments) results in the same "ValueError...", whereas on the other nodes I get the "Available subcommands" help displayed.
  • That "ValueError: bad marshal data (unknown type code)" happens if a Python 2.7 .pyc is loaded by Python 3.5 (and there seems to be a regression in 3.7).
  • Looking at the stackoverflow thread you mentioned, it says "...reinstall the python application" or "...remove the .pyc"; but
  • Ceph and ceph-osd have been re-installed (purge, autoremove, pveceph install), so any .pyc files they include should be fine.
So, honestly, I'm stuck. Would a re-installation of the python packages be a viable option?
 

Alwin

Proxmox Staff Member
lucentwolf said:
  Looking at the stackoverflow thread you mentioned it says "...reinstall the python application" or "...remove the .pyc"; but
Try removing the .pyc files manually; they may not have been cleared by the package reinstallation. Also, is the node on the latest available packages, or are updates missing?
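
A minimal sketch of that cleanup, assuming the stale files sit under the pkg_resources directory named in the traceback. list_pyc and purge_pyc are made-up helper names, and -delete is a GNU find extension (available on Debian).

```shell
#!/bin/sh
# List compiled bytecode files below a directory.
list_pyc() {
    find "$1" -type f -name '*.pyc'
}

# Delete them. The .py sources are left untouched.
purge_pyc() {
    find "$1" -type f -name '*.pyc' -delete
}

# On the affected node (path taken from the traceback, not run here):
#   list_pyc  /usr/lib/python2.7/dist-packages/pkg_resources
#   purge_pyc /usr/lib/python2.7/dist-packages/pkg_resources
```

Python 2 compiles a module from its .py source on the next import when no .pyc is present, so removing these files should be harmless.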
 
