Error creating an SSD OSD (latest Ceph 0.94-7 on latest Proxmox 4.0)

pmteam

New Member
Oct 27, 2015
Italy
Hello folks,
we have been running a Proxmox 3.4 cluster with Ceph Hammer for some time without problems.
The cluster has four nodes (each server is a Dell R620: 2x Xeon 2620, 2x12 threads, 128 GB RAM, 2x1Gb + 2x10Gb Intel 720 NICs, 4x2TB spinning disks on HW RAID5 used for the OS and NAS/NFS, 2x2TB spinning disks for OSDs, 1x800GB Intel SSD for the Ceph cache, 1x80GB Intel SSD for journaling).
When we tried to set up a more complex VLAN configuration we ran into a communication problem between VMs on different nodes (we found a thread about a VLAN problem with the Debian 2.6 kernel).
No amount of troubleshooting resolved it until we upgraded Proxmox to 4.0 (and therefore the kernel): VMs on different VLANs now talk only to the expected targets.
We then worked on optimizing Ceph performance with the help of the SSDs (the 80 GB SSD was added after upgrading Proxmox/Ceph).
Before the upgrade we got good performance, for example close to 800 MB/s read and close to 150 MB/s write from within a Win2k8R2 VM using CrystalBench and similar benchmarking software.
After the upgrade performance is much lower, for example 100-120 MB/s read and write.
While troubleshooting this degradation we decided to update Ceph to the latest release, hoping to solve the problem.
Finally, we ran into a new problem when creating an OSD on the SSD.
The following SSH session shows what happens:
----------------------
root@chiopve1:~# pveceph createosd /dev/sdc
create OSD on /dev/sdc (xfs)
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.

****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
****************************************************************************
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
Setting name!
partNum is 1
REALLY setting name!
The operation has completed successfully.
Setting name!
partNum is 0
REALLY setting name!
The operation has completed successfully.
existing superblock read failed: Input/output error
mkfs.xfs: pwrite64 failed: Input/output error
meta-data=/dev/sdc1 isize=2048 agcount=4, agsize=48510517 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=0 finobt=0
data = bsize=4096 blocks=194042065, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal log bsize=4096 blocks=94747, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
ceph-disk: Error: Command '['/sbin/mkfs', '-t', 'xfs', '-f', '-i', 'size=2048', '--', '/dev/sdc1']' returned non-zero exit status 1
command 'ceph-disk prepare --zap-disk --fs-type xfs --cluster ceph --cluster-uuid 307c09a3-1643-4422-b483-d0205d36d90d /dev/sdc' failed: exit code 1
root@chiopve1:~# parted
GNU Parted 3.2
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) select /dev/sdc
Warning: Error fsyncing/closing /dev/sdc1: Input/output error
Retry/Ignore?
Warning: Error fsyncing/closing /dev/sdc2: Input/output error
Retry/Ignore?
----------------------

We don't think there is a hardware problem with the SSD drive, because the same problem happens on every node.
Does anyone have an idea how to troubleshoot this?
Is anyone running the latest Proxmox with the latest Ceph (0.94-7) with everything working fine?
Because the web interface doesn't work well for the Ceph configuration, we suspect an incompatibility between Proxmox and Ceph.

Thanks for help !

Paul
 
I am a bit confused: do you run Ceph on Proxmox VE 4.0 or on 3.4?

What is version 0.94-7? The latest stable Ceph Hammer is 0.94.5, from https://download.ceph.com/debian-hammer/

I can create OSDs without problems (short test with just a clean new install on a single server).
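To rule out a packaging mix-up, it may help to post the exact version installed on one of the nodes, for example:

Code:
ceph --version
dpkg -l | grep -i ceph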
 
To be clearer: we installed and configured Proxmox 3.4 and Ceph 0.94.3 successfully in August. We used that as a testbed until the end of September. As stated, performance was very good, especially with the SSD as a cache (800 MB/s read and up to 180-220 MB/s write), using either replication or erasure coding.
Then we ran into the network configuration problem described above, which was finally resolved by upgrading Proxmox to 4.0 in October.
Afterwards we wanted to retest Ceph after upgrading it to 0.94.5 (sorry for the earlier typo, you are right!). Performance fell to 100-120 MB/s for both reads and writes.
We then tried to recreate the configuration to solve the performance problem (deleted pools, OSDs, and so on).
Now we hit the problem with the 800 GB SSD, and we only solved it by creating smaller partitions on that disk (two of 400 GB); see the sketch below.
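For reference, the workaround was along these lines (a sketch, not the exact commands; /dev/sdc and the cluster fsid are taken from the log above, and the exact ceph-disk invocation may differ):

Code:
# split the 800 GB SSD into two ~400 GB GPT partitions
parted -s /dev/sdc mklabel gpt
parted -s /dev/sdc mkpart primary 0% 50%
parted -s /dev/sdc mkpart primary 50% 100%
# prepare an OSD on each partition instead of on the whole disk
ceph-disk prepare --fs-type xfs --cluster ceph --cluster-uuid 307c09a3-1643-4422-b483-d0205d36d90d /dev/sdc1
ceph-disk prepare --fs-type xfs --cluster ceph --cluster-uuid 307c09a3-1643-4422-b483-d0205d36d90d /dev/sdc2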
We also tested the single SSD with dd and got 380 MB/s read and 330 MB/s write, so the SSD itself doesn't have any specific issue.
Because we have to go into production, we decided not to use the SSDs (sigh!) until we resolve the problem.
However, the read performance is inconsistent with the August/September results, because we now get at most 100-120 MB/s read/write (always measured with rados bench).
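For completeness, the numbers above come from commands of this kind (a sketch; the pool name 'rbd' and the sizes are placeholders, not the exact invocations):

Code:
# raw-device test with dd (the write destroys data on the target device!)
dd if=/dev/sdc of=/dev/null bs=1M count=4096 iflag=direct
dd if=/dev/zero of=/dev/sdc bs=1M count=4096 oflag=direct
# cluster-level test: write first and keep the objects, then sequential read
rados bench -p rbd 60 write --no-cleanup
rados bench -p rbd 60 seq
rados -p rbd cleanup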
Concluding, we faced several problems:
networking -> solved with Proxmox 4.0
performance -> ?
OSD on SSD -> worked around with smaller partitions
Our next step is to remove the 800 GB SSD, install it in another testbed, and recreate the entire installation from the ground up.
Suggestions are welcome!
Best

Paul

btw: we have a community support subscription
 
Another quick but important question: how do we upgrade to Infernalis? The keyword 'infernalis' doesn't exist in the current Proxmox Ceph installation tool (pveceph).
Furthermore, the Ceph repository has changed to download.ceph.com.
Do you have a recommended procedure?
Thanks again.
Paul
 

you can change the repository in /etc/apt/sources.list.d/ceph.list

(https://download.ceph.com/debian-infernalis/)

Please read the upgrade procedure:
http://ceph.com/releases/v9-2-0-infernalis-released/

because there are some file permission changes with Infernalis (files were previously owned by root:root, and now the ceph user is used).
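For reference, the line in /etc/apt/sources.list.d/ceph.list would then look roughly like this (assuming Proxmox VE 4.0 on Debian jessie):

Code:
deb https://download.ceph.com/debian-infernalis/ jessie main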
 
Thanks for the response, but we are already aware of that information.
What we need is a correct procedure using the Proxmox tools: we didn't use ceph-deploy for the initial installation, we used 'pveceph install' and then updated from the repository to the latest 0.94.5.
When updating we got package dependency errors that we weren't able to solve, and pveceph doesn't know 'infernalis', only hammer and earlier releases.
Any suggestions on this issue?
Thanks again.
Paul
 
Until infernalis support is officially added, adarumier's method works well and is simple to deploy. I used the following procedure on each node:
  • Begin by doing a proper installation using hammer (on my cluster it was already installed, but I'm listing it here anyway)
  • edit ceph.list to point to debian-infernalis
  • add to /etc/pve/ceph.conf under [global]
setuser match path = /var/lib/ceph/$type/$cluster-$id
  • apt-get update && apt-get dist-upgrade
  • service ceph restart all
Naturally, you probably don't want to use this on a production system until it's officially supported.
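Put together, the per-node steps look roughly like this (a sketch; adjust the sed pattern to whatever your ceph.list currently contains, and edit /etc/pve/ceph.conf by hand if you prefer):

Code:
# point the repository at infernalis instead of hammer
sed -i 's/debian-hammer/debian-infernalis/' /etc/apt/sources.list.d/ceph.list
# let the infernalis daemons keep using the existing root-owned directories:
# add under [global] in /etc/pve/ceph.conf:
#   setuser match path = /var/lib/ceph/$type/$cluster-$id
apt-get update && apt-get dist-upgrade
service ceph restart all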
 
We use a custom "crush hook" to split SSDs from HDDs. Since we also use custom bucket types, we create one named "storageType" with buckets "SSD" and "HDD" and move those under the root.

Code:
# add two new storageType buckets:
ceph osd crush add-bucket HDD storageType
ceph osd crush add-bucket SSD storageType

# move the new storageType buckets under the default root:
ceph osd crush move HDD root=default
ceph osd crush move SSD root=default

Then, whenever you create an OSD via pveceph or ceph, the custom crush hook automatically gets executed and splits your SSD OSDs from your HDD OSDs. You can then do with them whatever you want, as you typically would (replicated, EC, cache tiering, etc.). A minimal example of such a hook is sketched below.

ps.: google "Wido den Hollander" + "crush location"; there is loads of well-explained material there.
pps: That is a production-ready way using a pveceph install of hammer.
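To illustrate only (this is a sketch, not the hook we actually run): Ceph calls a configured location hook with --cluster, --id and --type and reads the OSD's CRUSH location from its stdout, so a minimal hook could pick the SSD or HDD bucket from the kernel's "rotational" flag. The script path and the "osd crush location hook" setting under [osd] are assumptions here, and the "storageType" bucket type must already exist in the CRUSH map:

Code:
#!/bin/sh
# hypothetical path: /usr/local/bin/crush-location-hook.sh
# referenced from ceph.conf, e.g. under [osd]:
#   osd crush location hook = /usr/local/bin/crush-location-hook.sh

# parse the arguments Ceph passes in (--cluster NAME --id ID --type TYPE)
while [ $# -gt 0 ]; do
  case "$1" in
    --cluster) CLUSTER="$2"; shift 2 ;;
    --id)      ID="$2";      shift 2 ;;
    --type)    TYPE="$2";    shift 2 ;;
    *)         shift ;;
  esac
done

# find the block device backing this OSD's data directory
DEV=$(df --output=source "/var/lib/ceph/osd/${CLUSTER}-${ID}" | tail -n 1)
DISK=$(lsblk -no pkname "$DEV" 2>/dev/null | head -n 1)
[ -z "$DISK" ] && DISK=$(basename "$DEV")

# the kernel reports 0 for non-rotational devices (SSD) and 1 for rotational (HDD)
if [ "$(cat /sys/block/${DISK}/queue/rotational 2>/dev/null)" = "0" ]; then
    echo "storageType=SSD root=default"
else
    echo "storageType=HDD root=default"
fi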
 
