[SOLVED] Destroy OSD wiped journal/db drive partition table

flamozzle

New Member
Mar 13, 2014
This happened on Proxmox 5.3-6, in a cluster of 3 servers with Ceph OSDs on two of them. I was destroying a bluestore OSD in order to re-create it as a filestore OSD; the system had a mixture of filestore and bluestore OSDs.

I did everything through the web UI: I stopped the OSD, marked it out, and then destroyed it.

The *really big* problem is that it wiped the partition table of the journal/db drive in the process.

In the output below, sdc is the OSD disk, and sdf is the journal/db drive (an SSD).

Here's what happened:
destroy OSD osd.6
Remove osd.6 from the CRUSH map
Remove the osd.6 authentication key.
Remove OSD osd.6
Unmount OSD osd.6 from /var/lib/ceph/osd/ceph-6
remove partition /dev/sdc1 (disk '/dev/sdc', partnum 1)
The operation has completed successfully.
remove partition /dev/sdc2 (disk '/dev/sdc', partnum 2)
The operation has completed successfully.
remove partition /dev/sdf14 (disk '/dev/sdf', partnum 14)
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.
wipe disk: /dev/sdf
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 2.06369 s, 102 MB/s
wipe disk: /dev/sdc
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 3.04237 s, 68.9 MB/s
TASK OK
After this, the primary GPT on sdf was gone:
root@pm0:~# sfdisk -d /dev/sdf
sfdisk: /dev/sdf: does not contain a recognized partition table
Luckily, I was paying close attention and noticed the problem right away.

I was able to recover from the backup GPT.

I used gdisk to verify the backup GPT looked good:
root@pm0:~# gdisk -l /dev/sdf
GPT fdisk (gdisk) version 1.0.1

Caution: invalid main GPT header, but valid backup; regenerating main header
from backup!

Caution! After loading partitions, the CRC doesn't check out!
Warning! Main partition table CRC mismatch! Loaded backup partition table
instead of main partition table!

Warning! One or more CRCs don't match. You should repair the disk!

Partition table scan:
MBR: not present
BSD: not present
APM: not present
GPT: damaged

Found invalid MBR and corrupt GPT. What do you want to do? (Using the
GPT MAY permit recovery of GPT data.)
1 - Use current GPT
2 - Create blank GPT

Your answer: 1
Disk /dev/sdf: 234441648 sectors, 111.8 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): B56DBFD0-6FC3-48D8-9095-A66F94512F70
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 234441614
Partitions will be aligned on 2048-sector boundaries
Total free space is 171527021 sectors (81.8 GiB)

Number Start (sector) End (sector) Size Code Name
3 20973568 31459327 5.0 GiB 8300
4 31459328 41945087 5.0 GiB 8300
10 94373888 104859647 5.0 GiB F802 ceph journal
12 115345408 125831167 5.0 GiB F802 ceph journal
13 125831168 136316927 5.0 GiB F802 ceph journal
15 138414080 148899839 5.0 GiB F802 ceph journal
root@pm0:~#

and was then able to use gdisk to recover from the backup GPT:
root@pm0:~# gdisk /dev/sdf
GPT fdisk (gdisk) version 1.0.1

Caution: invalid main GPT header, but valid backup; regenerating main header
from backup!

Caution! After loading partitions, the CRC doesn't check out!
Warning! Main partition table CRC mismatch! Loaded backup partition table
instead of main partition table!

Warning! One or more CRCs don't match. You should repair the disk!

Partition table scan:
MBR: not present
BSD: not present
APM: not present
GPT: damaged

Found invalid MBR and corrupt GPT. What do you want to do? (Using the
GPT MAY permit recovery of GPT data.)
1 - Use current GPT
2 - Create blank GPT

Your answer: 1

Command (? for help): v

Problem: The CRC for the main partition table is invalid. This table may be
corrupt. Consider loading the backup partition table ('c' on the recovery &
transformation menu). This report may be a false alarm if you've already
corrected other problems.

Identified 1 problems!

Command (? for help): r

Recovery/transformation command (? for help): c
Warning! This will probably do weird things if you've converted an MBR to
GPT form and haven't yet saved the GPT! Proceed? (Y/N): y

Recovery/transformation command (? for help): v

No problems found. 171527021 free sectors (81.8 GiB) available in 5
segments, the largest of which is 85541775 (40.8 GiB) in size.

Recovery/transformation command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/sdf.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.
root@pm0:~# sfdisk -d /dev/sdf
label: gpt
label-id: B56DBFD0-6FC3-48D8-9095-A66F94512F70
device: /dev/sdf
unit: sectors
first-lba: 34
last-lba: 234441614

/dev/sdf3 : start= 20973568, size= 10485760, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4, uuid=2A40747F-1AA6-4A5E-A734-C16294DA01B8
/dev/sdf4 : start= 31459328, size= 10485760, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4, uuid=E83ED104-8726-487E-AB94-E88E4D7F9474
/dev/sdf10 : start= 94373888, size= 10485760, type=45B0969E-9B03-4F30-B4C6-B4B80CEFF106, uuid=27266775-967F-444E-8E3A-A9A5CA372692, name="ceph journal"
/dev/sdf12 : start= 115345408, size= 10485760, type=45B0969E-9B03-4F30-B4C6-B4B80CEFF106, uuid=89FB93BD-43DD-4FC2-BAE2-89746272FAFE, name="ceph journal"
/dev/sdf13 : start= 125831168, size= 10485760, type=45B0969E-9B03-4F30-B4C6-B4B80CEFF106, uuid=44ED149F-CEB1-48B6-96D8-DC48A83D71B7, name="ceph journal"
/dev/sdf15 : start= 138414080, size= 10485760, type=45B0969E-9B03-4F30-B4C6-B4B80CEFF106, uuid=3E75D1C3-45F0-4B3E-9A08-760C1FA467CF, name="ceph journal"
root@pm0:~#

I am not in a hurry to see if I can reproduce this problem. Experiencing it once was enough excitement for today.
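In hindsight, it may be worth snapshotting the journal disk's partition table before destroying an OSD at all. `sgdisk --backup=FILE /dev/sdX` (restored later with `--load-backup`) is the purpose-built tool; as a lower-level sketch, plain dd can save the protective MBR plus the primary GPT, which for a standard 128-entry table on a 512-byte-sector disk occupies LBA 0-33. The demo below runs against a scratch file so it is safe to try as-is; on a live system you would point `$disk` at the real device (e.g. /dev/sdf):

```shell
# Demo against a scratch file so this is safe to run as-is; on a real
# system, set disk=/dev/sdf (or whichever journal device) instead.
disk=$(mktemp) && backup=$(mktemp)

# Fabricate 1 MiB of recognizable "disk" contents.
dd if=/dev/urandom of="$disk" bs=512 count=2048 status=none
orig_sum=$(sha256sum "$disk" | awk '{print $1}')

# Back up LBA 0..33: protective MBR + primary GPT header + 128
# partition entries = 34 sectors * 512 bytes (assumes 512-byte sectors).
dd if="$disk" of="$backup" bs=512 count=34 status=none

# Simulate the destroy task clobbering the start of the disk...
dd if=/dev/zero of="$disk" bs=512 count=34 conv=notrunc status=none

# ...then put the saved sectors back (same, unresized disk only; on a
# real device, follow up with partprobe so the kernel re-reads the table).
dd if="$backup" of="$disk" bs=512 count=34 conv=notrunc status=none
new_sum=$(sha256sum "$disk" | awk '{print $1}')
echo "restore intact: $([ "$orig_sum" = "$new_sum" ] && echo yes || echo no)"
```

Note that sgdisk's backup files are the more robust route, since they also capture the backup header at the end of the disk, and that restoring the table does not bring back any partition data that fell inside the 200 MiB wipe.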
 

flamozzle

I have since realized that the damage was more extensive than I originally thought.

Because the disk wipe writes 200 MiB at the beginning of the disk, it would also corrupt the first partition on the journal/db disk, not just the partition table.

In my case, I was incredibly fortunate to have an old unused partition as the first one on that disk.
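For anyone wanting to check their own layout: the wipe writes 209715200 bytes, i.e. the first 409600 512-byte sectors, so any partition whose start sector is below 409600 has its data overwritten. A small sketch (the here-doc stands in for live `sfdisk -d /dev/sdX` output; the partition at sector 2048 is a made-up example of a casualty):

```shell
# The destroy task wipes 200 MiB starting at sector 0:
# 209715200 bytes / 512 bytes-per-sector = 409600 sectors.
wipe_sectors=$((209715200 / 512))

# Flag partitions that start inside the wiped span. On a live system,
# pipe `sfdisk -d /dev/sdX` in; the here-doc below is a stand-in, with
# one partition deliberately placed at sector 2048 to show a hit.
report=$(awk -v limit="$wipe_sectors" '/start=/ {
    start = $4 + 0                        # 4th field is the start sector
    verdict = (start < limit) ? "DATA WIPED" : "untouched"
    print $1, "start", start, "->", verdict
}' <<'EOF'
/dev/sdf1 : start=     2048, size= 10483712, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4
/dev/sdf3 : start= 20973568, size= 10485760, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4
EOF
)
printf '%s\n' "$report"
```

A partition flagged here has lost data even if its table entry is later recovered from the backup GPT.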
 

flamozzle

I have now reproduced this on a different server in the same (proxmox and ceph) cluster.

Here is the second example, following the same procedure as before, but on a different server and with a different OSD (though it was again a bluestore OSD). In this case /dev/sde is the OSD and /dev/sdd is the journal/db disk.

destroy OSD osd.0
Remove osd.0 from the CRUSH map
Remove the osd.0 authentication key.
Remove OSD osd.0
Unmount OSD osd.0 from /var/lib/ceph/osd/ceph-0
remove partition /dev/sde1 (disk '/dev/sde', partnum 1)
The operation has completed successfully.
remove partition /dev/sde2 (disk '/dev/sde', partnum 2)
The operation has completed successfully.
remove partition /dev/sdd7 (disk '/dev/sdd', partnum 7)
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.
wipe disk: /dev/sde
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.2406 s, 169 MB/s
wipe disk: /dev/sdd
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.966619 s, 217 MB/s
TASK OK
I used the same method to recover from it as before.
 

tobby

New Member
Feb 21, 2017
Experienced the same. Luckily it was a test VM, and I was able to restore a snapshot. It would have taken hours to recreate one node of my cluster if this had happened to a "real" server.
 
May 31, 2015
California
I can confirm this behavior.
We used the same process we have followed in the past to replace a failed disk via the GUI:
-set osd out, wait for healthy cluster.
-stop osd.
-destroy osd and partition.
The cluster immediately began re-balancing, which I didn't understand until I looked at the log for the destroy task and saw that it had wiped my journal disk. I ended up destroying the other four OSDs on the node and re-creating them. I also had to use cfdisk to initialize GPT on my journal disk, since "Initialize Disk with GPT" from the GUI resulted in the following:

Caution: invalid main GPT header, but valid backup; regenerating main header
from backup!

Caution! After loading partitions, the CRC doesn't check out!
Warning! Main partition table CRC mismatch! Loaded backup partition table
instead of main partition table!

Warning! One or more CRCs don't match. You should repair the disk!

Invalid partition data!
TASK ERROR: command '/sbin/sgdisk /dev/sde -U R' failed: exit code 2


The good news is that this is Ceph: even with a third of the disks down and then rebuilding, nobody felt a thing!
The bad news is that this was a nasty surprise on a fully up-to-date cluster (5.3-9/ba817b29). What is the recommended process to follow until this is fixed?
 

Alwin

Proxmox Staff Member
Aug 1, 2017
You can find the package containing the fix (pve-manager >= 5.3-10) in the 'pvetest' repository.
 
