Sorry to hear about your experience with ZFS, but I think you are underestimating it a bit. It sounds like you lost HDDs due to bad cables, which may have pushed you past the permissible number of HDD failures and cost you the array. The same thing would have happened with hardware RAID. ZFS is resilient enough to be considered an enterprise-grade, mission-critical storage system. Before we moved to Ceph we used ZFS for a long time and never had an issue.
Actually incorrect, at least for software RAID: the mount drops to read-only the moment you've lost one drive too many.
This gives you the opportunity to repair it without damage to the data; after getting the drives back online, a resync will happen.
ZFS does not do this; it will happily continue writing garbage to the array.
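To make that distinction concrete, here is a minimal sketch of how you could spot a degraded md array from /proc/mdstat. It assumes a Linux host running mdadm software RAID; the function name and output messages are just illustrative:

```python
#!/usr/bin/env python3
"""Minimal sketch: report md arrays that have lost a member.

/proc/mdstat shows each array's member status as e.g. [UU_],
where '_' marks a failed or missing disk.
"""
import re

def degraded_arrays(mdstat_path="/proc/mdstat"):
    """Return the names of md arrays with a failed/missing member."""
    degraded, current = [], None
    with open(mdstat_path) as f:
        for line in f:
            m = re.match(r"(md\d+)\s*:", line)
            if m:
                current = m.group(1)
                continue
            # Status lines look like: "104320 blocks [2/1] [U_]"
            s = re.search(r"\[([U_]+)\]", line)
            if current and s and "_" in s.group(1):
                degraded.append(current)
    return degraded

if __name__ == "__main__":
    bad = degraded_arrays()
    if bad:
        print("Degraded arrays:", ", ".join(bad))
    else:
        print("All md arrays have their full member count.")
```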
One network & DC operator CEO told me the same story, though for him it was just a testing node without anything important on it, and that was the end of the story. This actually happened close by: the neighbouring rooms were his DCs.
Yes, our target market is completely different from yours. For us, data safety and redundancy come above everything. Given the size of our cluster, the nature of our customer data, and the need to keep historical data, replica 3 is very much acceptable. We also have a third ZFS+Gluster setup for cold data storage which is completely offsite. As you can tell from my signature, we have a cloud business; anything we use goes through months of tests before we put it into production.
There are several ZFS experts on this forum who can give you even greater detail on ZFS mechanics. Mir is one of them that I know.
ZFS, unfortunately, is the end of the story for me. Even if the idiosyncrasies were fixed, the design is faulty (activating all disks for a single I/O), and for the bulk of my needs it is not suitable, only for backups. But because of the aforementioned issues, I think I'd be happier just doing plain old software RAID + ext4; I know I will sleep better at least. Too many nights ZFS has ruined my sleep. I remember very well a two-week sprint, 24/7, trying to recover from ZFS-caused issues ...
If data safety doesn't matter at all, then I think I should go with Gluster or ZFS+Gluster (see the sketch below): very low initial cost, and it just works. You already have experience with ZFS, so you already know.
Due to the performance issues mentioned, I would not put our customer data on ZFS, for that reason alone.
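For anyone curious what the Gluster side of that looks like, here is a minimal sketch of creating a replica-3 volume. It assumes three peers are already probed and each exports a brick; the host names, brick paths, and volume name are placeholders:

```python
#!/usr/bin/env python3
"""Minimal sketch: create and start a replica-3 GlusterFS volume."""
import subprocess

# Placeholder hosts and brick path -- adjust to your pool
bricks = [f"gluster{i}:/data/brick1" for i in (1, 2, 3)]

# Every file is stored on all three bricks
subprocess.run(
    ["gluster", "volume", "create", "vol0", "replica", "3", *bricks],
    check=True,
)
subprocess.run(["gluster", "volume", "start", "vol0"], check=True)
```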
OK, good, thank you for clarifying that; it has been a big question for me.

Yes, if the Ceph cluster goes down all at once (or within a few minutes across the nodes) and the nodes are rebooted, Ceph is able to do its own checks and bring the cluster back to a healthy status. That only applies in the case of a total power outage.
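A rough sketch of what waiting out that self-check can look like from a script; it assumes the standard ceph CLI is available with a working client keyring, and the polling interval and attempt count are arbitrary:

```python
#!/usr/bin/env python3
"""Minimal sketch: wait for a Ceph cluster to report healthy
after a full-outage restart."""
import subprocess
import time

def wait_for_health(interval=30, attempts=60):
    for _ in range(attempts):
        out = subprocess.run(
            ["ceph", "health"], capture_output=True, text=True
        ).stdout.strip()
        print(out)
        if out.startswith("HEALTH_OK"):
            return True
        time.sleep(interval)  # peering/recovery can take a while
    return False

if __name__ == "__main__":
    if wait_for_health():
        print("Cluster recovered on its own.")
    else:
        print("Still unhealthy -- check 'ceph -s' for stuck PGs.")
```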
About the UPS: the way you describe your customers, you can get away without any protection at all, including a UPS. If uptime is not important, just let all the nodes shut down. Of course, you will not be able to gracefully shut down your servers, which could be bad. You can also modify a cheap UPS and add some batteries to it, to give you just enough time to shut everything down properly.
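For illustration, a minimal sketch of the shutdown-on-battery idea using Network UPS Tools; in practice NUT's own upsmon daemon handles this, and the UPS name here is a placeholder:

```python
#!/usr/bin/env python3
"""Minimal sketch: shut a node down cleanly when the UPS goes
on battery. Assumes NUT is configured and 'upsc' can query the UPS."""
import subprocess
import time

UPS = "myups@localhost"  # placeholder name, adjust to your NUT config

def on_battery():
    out = subprocess.run(
        ["upsc", UPS, "ups.status"], capture_output=True, text=True
    ).stdout
    return "OB" in out  # "OB" = on battery, "OL" = on line power

while True:
    if on_battery():
        # Graceful shutdown while the battery still has charge
        subprocess.run(["shutdown", "-h", "+1", "UPS on battery"])
        break
    time.sleep(10)
```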
I have already been considering this, but then the maximum power output usually becomes an issue. In any case, when we invest, we will probably go for a rack-sized unit sourced from China. Surprisingly cheap, and it has been confirmed to meet high quality standards! Cheap enough for me not to consider a DIY solution.
With IPoIB you will never get full bandwidth. With enough tweaking you can push close to 20 Gbps; the gap is mainly due to IPoIB overhead. But that's 20 Gbps at much lower cost than 10 Gbps Ethernet.
We use 36-port Mellanox IB switches and dual-port Mellanox ConnectX-3 cards.
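For reference, the two IPoIB tweaks that tend to matter most are connected mode and a large MTU; a minimal sketch, assuming the interface is named ib0 and the standard Linux sysfs paths:

```python
#!/usr/bin/env python3
"""Minimal sketch: common IPoIB throughput tweaks (needs root)."""
import subprocess

IFACE = "ib0"  # adjust to your IPoIB interface name

# Datagram mode caps the MTU at 4092; connected mode allows up to 65520
with open(f"/sys/class/net/{IFACE}/mode", "w") as f:
    f.write("connected\n")

# A large MTU cuts per-packet overhead, which is where IPoIB loses
# bandwidth compared to native IB verbs
subprocess.run(["ip", "link", "set", IFACE, "mtu", "65520"], check=True)
```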
Oh, I would totally have expected it to be able to push around the 32 Gbps mark! Good to know!