Ceph journal disk with multiple partitions on NVMe - nightmare

Gerhard W. Recher

Well-Known Member
Mar 10, 2017
Munich
Hi Folks,

Just got my new cluster machines, all NVMe.

I have one fast NVMe journal disk with 1.6 TB (1.5 TiB); you cannot partition it in the GUI, so I did it on the CLI:

parted /dev/nvme0n1
mkpart journal01 1 250G
mkpart journal02 250G 500G
mkpart journal03 500G 750G
mkpart journal04 750G 1000G
mkpart journal05 1000G 1250G
mkpart journal06 1250G 1500G
print
quit

Disk /dev/nvme0n1: 1.5 TiB, 1600321314816 bytes, 3125627568 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 956A2C38-D7FF-4B03-9B18-76903797A0E6

Device Start End Sectors Size Type
/dev/nvme0n1p1 2048 488282111 488280064 232.9G Linux filesystem
/dev/nvme0n1p2 488282112 976562175 488280064 232.9G Linux filesystem
/dev/nvme0n1p3 976562176 1464844287 488282112 232.9G Linux filesystem
/dev/nvme0n1p4 1464844288 1953124351 488280064 232.9G Linux filesystem
/dev/nvme0n1p5 1953124352 2441406463 488282112 232.9G Linux filesystem
/dev/nvme0n1p6 2441406464 2929686527 488280064 232.9G Linux filesystem
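(Incidentally, the 232.9G partition sizes are not an error: parted interprets `250G` as decimal gigabytes, i.e. 10^9 bytes, while fdisk reports sizes in binary GiB, i.e. 2^30 bytes. A quick check of the unit conversion:)

```python
# parted's "250G" means 250 * 10^9 bytes; fdisk prints sizes in GiB (2^30 bytes)
part_bytes = 250 * 10**9
size_gib = part_bytes / 2**30
print(f"{size_gib:.1f} GiB")  # prints "232.8 GiB", matching the ~232.9G fdisk shows
```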

and six data NVMe drives with 1.8 TB each.
I was only able to create OSDs via the CLI; the GUI does not offer any of the six partitions on the journal disk!

pveceph createosd /dev/nvme1n1 -journal_dev /dev/nvme0n1p1

create OSD on /dev/nvme1n1 (xfs)
using device '/dev/nvme0n1p1' for journal
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.

****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
****************************************************************************
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
prepare_device: OSD will not be hot-swappable if journal is not the same device as the osd data
prepare_device: Journal /dev/nvme0n1p1 was not prepared with ceph-disk. Symlinking directly.
Setting name!
partNum is 0
REALLY setting name!
The operation has completed successfully.
meta-data=/dev/nvme1n1p1 isize=2048 agcount=4, agsize=122094597 blks
= sectsz=512 attr=2, projid32bit=1
= crc=0 finobt=0
data = bsize=4096 blocks=488378385, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal log bsize=4096 blocks=238466, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
The operation has completed successfully.



But the OSD does not show up and is not registered in /var/lib/ceph/osd!

So I'm stuck!

Any help is appreciated!

regards

Gerhard
 
you don't need to manually partition your journal device - just select the disk itself, and the journal partition(s) will be created automatically
 
Fabian, thanks, but how will Ceph decide this?
I now have a likely orphaned OSD; destroying the OSD does not work because there are no entries in /var/lib/ceph/osd/* ...

How do I recover from this?

Regards, Gerhard

Will this work?

pveceph createosd /dev/nvme1n1 -journal_dev /dev/nvme0n1
pveceph createosd /dev/nvme2n1 -journal_dev /dev/nvme0n1
pveceph createosd /dev/nvme3n1 -journal_dev /dev/nvme0n1
pveceph createosd /dev/nvme4n1 -journal_dev /dev/nvme0n1
pveceph createosd /dev/nvme5n1 -journal_dev /dev/nvme0n1
pveceph createosd /dev/nvme6n1 -journal_dev /dev/nvme0n1
 
Fabian, thanks, but how will Ceph decide this?

decide what? you tell it to put the journal on a block device, and it will attempt to create a partition of the configured size to use as a journal..

I now have a likely orphaned OSD; destroying the OSD does not work because there are no entries in /var/lib/ceph/osd/* ...

How do I recover from this?

does the ceph cluster know about the OSD? if not, simply format / zap the OSD disk and you are okay.

will this work ?

pveceph createosd /dev/nvme1n1 -journal_dev /dev/nvme0n1
pveceph createosd /dev/nvme2n1 -journal_dev /dev/nvme0n1
pveceph createosd /dev/nvme3n1 -journal_dev /dev/nvme0n1
pveceph createosd /dev/nvme4n1 -journal_dev /dev/nvme0n1
pveceph createosd /dev/nvme5n1 -journal_dev /dev/nvme0n1
pveceph createosd /dev/nvme6n1 -journal_dev /dev/nvme0n1

yes, that would be the correct syntax (note that both your journal and OSD device should initially be blank and have a GPT partition table).
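(For six identical invocations, a small shell loop avoids copy-paste mistakes. Shown here as a dry run that only echoes the commands, using the device names from the post above - drop the `echo` to actually run them:)

```shell
# dry run: print the pveceph command for each of the six OSD disks
for i in 1 2 3 4 5 6; do
  echo pveceph createosd "/dev/nvme${i}n1" -journal_dev /dev/nvme0n1
done
```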

but why do you want to use an external journal when your OSDs are already on NVMe devices? an external journal only makes sense if your journal device has vastly better sync performance than your OSD devices. otherwise, you are introducing a single point of failure for no added benefit.
 
Fabian,

Thank you for the suggestions. We thought a faster and more durable journal disk would be better.
24 OSD Disks: on 4 nodes:

2000 GB Intel SSD DC P3520, 2.5", PCIe 3.0 x4, bulk
NVMe 2.5" in PCIe 3.0, 20 nm, MLC, Sequential Read: 1700 MB/s,
Sequential Write: 1350 MB/s, Random Read (100% Span): 260000 IOPS,
MTBF: 2.0 million hours, #SSDPE2MX020T7, 5-year manufacturer warranty

Journaldisk:

1.6 TB Intel SSD DC P3700, 2.5", U.2 PCIe 3.0
HET 2.5" MLC (NVMe) Solid State Drive SSDPE2MD016T4
MLC, Sequential Read: 2800 MB/s, Sequential Write: 1900 MB/s,
Random Read (100% Span): 450000 IOPS, MTBF: 2.0 million hours,
#SSDPE2MD016T4, 5-year manufacturer warranty

We also interconnect with 40 GbE Mellanox cards and a Mellanox switch.

Is it possible to change the hard-coded journal size from 5 GB to more? 5 GB is just one second at this wire speed ...
by simply editing:

root@pve01:/etc/pve# cat ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 192.168.100.0/24
filestore xattr use omap = true
fsid = f81bbf2e-887e-4e49-8d28-511de4539b09
keyring = /etc/pve/priv/$cluster.$name.keyring
osd journal size = 5120 <<change to 250000 ???
osd pool default min size = 1
public network = 192.168.100.0/24



As we now have a strange status on the first node:

how do we remove our first attempt at creating an OSD with the pre-partitioned journal disk?

crushmap displays :
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 device1 <<<these come from our first attempt i guess :(
device 2 device2
device 3 device3
device 4 device4
device 5 device5
device 6 device6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
device 28 osd.28

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host pve01 {
id -2 # do not change unnecessarily
# weight 9.092
alg straw
hash 0 # rjenkins1
item osd.0 weight 1.818
item osd.7 weight 1.818
item osd.8 weight 1.818
item osd.9 weight 1.818
item osd.10 weight 1.818
}
host pve02 {
id -3 # do not change unnecessarily
# weight 10.911
alg straw
hash 0 # rjenkins1
item osd.11 weight 1.818
item osd.12 weight 1.818
item osd.14 weight 1.818
item osd.17 weight 1.818
item osd.20 weight 1.818
item osd.24 weight 1.818
}
host pve03 {
id -4 # do not change unnecessarily
# weight 10.911
alg straw
hash 0 # rjenkins1
item osd.13 weight 1.818
item osd.15 weight 1.818
item osd.18 weight 1.818
item osd.21 weight 1.818
item osd.23 weight 1.818
item osd.26 weight 1.818
}
host pve04 {
id -5 # do not change unnecessarily
# weight 10.911
alg straw
hash 0 # rjenkins1
item osd.16 weight 1.818
item osd.19 weight 1.818
item osd.22 weight 1.818
item osd.25 weight 1.818
item osd.27 weight 1.818
item osd.28 weight 1.818
}
root default {
id -1 # do not change unnecessarily
# weight 41.825
alg straw
hash 0 # rjenkins1
item pve01 weight 9.092
item pve02 weight 10.911
item pve03 weight 10.911
item pve04 weight 10.911
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map
 
Fabian,

Thank you for the suggestions. We thought a faster and more durable journal disk would be better.

if you share a journal device among multiple OSDs which are only slightly less fast and durable, the outcome is actually worse than just using the OSDs themselves

24 OSD Disks: on 4 nodes:

2000 GB Intel SSD DC P3520, 2.5", PCIe 3.0 x4, bulk
NVMe 2.5" in PCIe 3.0, 20 nm, MLC, Sequential Read: 1700 MB/s,
Sequential Write: 1350 MB/s, Random Read (100% Span): 260000 IOPS,
MTBF: 2.0 million hours, #SSDPE2MX020T7, 5-year manufacturer warranty

Journaldisk:

1.6 TB Intel SSD DC P3700, 2.5", U.2 PCIe 3.0
HET 2.5" MLC (NVMe) Solid State Drive SSDPE2MD016T4
MLC, Sequential Read: 2800 MB/s, Sequential Write: 1900 MB/s,
Random Read (100% Span): 450000 IOPS, MTBF: 2.0 million hours,
#SSDPE2MD016T4, 5-year manufacturer warranty

like I said, this is not a situation where an external journal device is helpful

We also interconnect with 40 GbE Mellanox cards and a Mellanox switch.

Is it possible to change the hard-coded journal size from 5 GB to more? 5 GB is just one second at this wire speed ...

while that is true, 5GB is more than one second of writes to the journal device (or journals on OSDs), and you also have to factor in a bit of overhead caused by ceph itself. also keep in mind that the journal is per OSD, so if you have 6 OSDs per node, that actually means you have an average of ~6 seconds to fill each journal in parallel at full wire speed (note - very much a back of the envelope calculation :p).
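(Fabian's back-of-the-envelope figure can be checked in a few lines, assuming the full 40 Gbit/s is sustained and splits evenly across the 6 OSD journals on a node:)

```python
wire_bytes_per_s = 40e9 / 8       # 40 Gbit/s link ~ 5 GB/s aggregate
journal_bytes = 5 * 2**30         # default 5 GiB journal per OSD
osds = 6                          # journals fill in parallel per node
fill_seconds = journal_bytes / (wire_bytes_per_s / osds)
print(f"{fill_seconds:.1f} s")    # prints "6.4 s" per journal
```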

I doubt that the journal size is a bottleneck here, but I'd advise you to test yourself (your NVME devices are big enough to sacrifice a few more GB if you see even a bit of a benefit there).

by simply editing:

root@pve01:/etc/pve# cat ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 192.168.100.0/24
filestore xattr use omap = true
fsid = f81bbf2e-887e-4e49-8d28-511de4539b09
keyring = /etc/pve/priv/$cluster.$name.keyring
osd journal size = 5120 <<change to 250000 ???
osd pool default min size = 1
public network = 192.168.100.0/24

yes, but naturally this will only affect new OSDs. I would change the min size to 2 as well, min size 1 is not recommended for production use.
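(As a sketch, the relevant `/etc/pve/ceph.conf` lines might then look like this - the 20480 value is just an example, pick a size you have benchmarked yourself, and remember it only applies to OSDs created afterwards:)

```
[global]
osd journal size = 20480          # 20 GiB per journal, example value only
osd pool default min size = 2     # as recommended above
```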

As we now have a strange status on the first node:

how do we remove our first attempt at creating an OSD with the pre-partitioned journal disk?

crushmap displays :
# devices (excerpt; full crushmap quoted in the previous post)
device 0 osd.0
device 1 device1 <<<these come from our first attempt i guess :(
device 2 device2
device 3 device3
device 4 device4
device 5 device5
device 6 device6
device 7 osd.7
...

AFAIK, you will need to manually edit your crush map and remove those entries: http://docs.ceph.com/docs/master/rados/operations/crush-map/#editing-a-crush-map
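(The workflow from that page, roughly - an untested sketch, where `map.bin` and `map.txt` are arbitrary file names: decompile the map, delete the `device N deviceN` placeholder lines in the text file, then recompile and inject it:)

```
ceph osd getcrushmap -o map.bin       # export the compiled crush map
crushtool -d map.bin -o map.txt       # decompile to editable text
# edit map.txt: remove the "device 1 device1" .. "device 6 device6" lines
crushtool -c map.txt -o map-new.bin   # recompile
ceph osd setcrushmap -i map-new.bin   # inject the fixed map
```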

if this cluster is already used in production, please be careful ;) (note: while performance is pretty bad, you can test a lot of Ceph operations using a virtualized Ceph cluster. but for larger-scale production use, it is probably even better to have a small, similar-to-production physical cluster to play around with).
 
