Proxmox VE Ceph Server released (beta)

I tried with fio on a Linux guest: same speed. This is also the maximum speed I get in rados bench for writes; with reads I can do a lot more.
For one VM I am really happy about the 500 MB/s write and 500 MB/s read, but for the whole cluster it's not enough. When I run write tests in 2 VMs at once, each one only gets half that speed...
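For reference, a typical way to run these two benchmarks looks like the following sketch (the pool name `rbd`, file path, and test sizes are assumptions, not taken from the post):

```shell
# Write benchmark against a Ceph pool for 60 seconds;
# --no-cleanup keeps the objects so a read test can follow
rados bench -p rbd 60 write --no-cleanup

# Sequential read benchmark against the objects written above
rados bench -p rbd 60 seq

# Inside the VM: sequential write test with fio, bypassing the
# page cache (name, block size and file size are arbitrary)
fio --name=seqwrite --rw=write --bs=4M --size=2G \
    --direct=1 --filename=/tmp/fio.test
```

Running the fio job in one VM and then in two VMs simultaneously makes the per-VM halving described above easy to reproduce.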
 
Hi,
as mir already wrote: Mellanox 40Gb cards (without a switch - it's a direct connection for DRBD). For Ceph I use 10Gb Ethernet.

I'm not sure what this cable is called... QSFP?!

In this case the servers are test servers (to track down an issue with a DRBD connection): two AMD boxes, one FX-8350 and one 965.

Normally the cards are in dual-Opteron servers.

Udo

Thank you very much for sharing. I don't mean to dwell on this thread, but since we are on the topic of the need for speed...

I am confused. You ran 10Gb Ethernet on the Ceph nodes. How were you able to achieve the 15.7 GBytes indicated in your iperf output?

How do you connect the 40Gb cards without going through a switch?

Thanks again for putting up with my newbie questions.
 
Thank you very much for sharing. I don't mean to dwell on this thread, but since we are on the topic of the need for speed...

I am confused. You ran 10Gb Ethernet on the Ceph nodes. How were you able to achieve the 15.7 GBytes indicated in your iperf output?
Hi,
the 15.7 GBytes is the amount of data transferred during the iperf test - the speed was 13.4 Gbit/s.
iperf shows the maximum bandwidth between the iperf nodes.

The post was an answer to spirit's remark "Hi, I don't think you can reach 20gbits with IP over infiband. (I think maybe 4 gigabits max)" and has nothing to do with Ceph - as I wrote before, the InfiniBand connection is used for DRBD.
How do you connect the 40Gb cards without going through a switch?
Simply with a cable between the InfiniBand cards.

Udo
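To illustrate the difference between the transfer total and the bandwidth figure, a point-to-point iperf run looks like this (the address is a placeholder):

```shell
# On the receiving node:
iperf -s

# On the sending node (10.0.0.2 is a placeholder address).
# iperf reports both the total data moved ("Transfer") and the
# rate ("Bandwidth") - e.g. 15.7 GBytes transferred at
# 13.4 Gbits/sec over an interval of roughly ten seconds.
iperf -c 10.0.0.2 -t 10
```

So the GBytes figure grows with test duration; only the Gbit/s figure describes the link speed.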
 
Simply with a cable between the InfiniBand cards.

Udo

Thank you. So to connect 3 Ceph nodes I would need dual-port 40Gb cards in each server and daisy-chain them in sequence (server A -> server B -> server C)?

I am still confused about the switch-less technology. I tried looking up Mellanox's Virtual Protocol Interconnect (VPI) and their ConnectX-2 VPI cards on their website, but still couldn't quite understand it fully. Any help would be greatly appreciated. Thanks.
 
Hello,

The boot drive of one of my Ceph nodes (1 out of 3) just died.

I understand the failed node will need to be removed entirely. However, I can't seem to remove the node. All the documentation I have seen so far only gives examples for replacing failed OSDs. But what about replacing the boot drive (the Proxmox host itself)?

In other words, what is the proper way to remove that failed system entirely from the Ceph nodes? Any help or suggestion is greatly appreciated.
 
You should install Proxmox as usual on a new disk. Name it something other than what it was, then simply join it to the Proxmox cluster. The cluster should pick up the OSDs automatically.
I have replaced many Ceph-only nodes without issue. I've never had to replace a Proxmox+Ceph node yet, but it should work.
 
You should install Proxmox as usual on a new disk. Name it something other than what it was, then simply join it to the Proxmox cluster. The cluster should pick up the OSDs automatically.
I have replaced many Ceph-only nodes without issue. I've never had to replace a Proxmox+Ceph node yet, but it should work.

Thank you so much for the quick reply. I have no problem setting up the new Ceph node. It's the old Ceph node that I am having a problem with. The other Ceph nodes still see it as missing.

It was set up as one of the monitors. Its quorum status is now indicated as "No".

I've tried stopping or removing this monitor. It came back with "Connection error 595: No route to host."

All the OSDs on that node are showing a status of "down/out".

I would like a clean removal of the old Ceph node, which no longer exists. What's the best way? Thank you.
 
Ceph states its software is "the end of RAID".

Many of our existing Dell servers have a built-in RAID controller. We are not using the RAID feature as such; instead we set up each hard drive individually as a single-disk RAID 0. With this config, Ceph is able to see each drive as a separate OSD.

The battery on some of these RAID controllers is causing an issue. It seems the battery is only needed for the "write back cache" feature; with it removed, the controller defaults to "write through".

Dell states that hard drives without the write back cache feature enabled will suffer in performance. Other than that, no problem.

Since Ceph does not need RAID controllers to operate, the golden questions are:

Is it better to let the hard drives run in write back cache mode or not? Is there any advantage to using a RAID controller and running it with this feature? How would Ceph take advantage of it? If we go to write through, does Ceph have its own write back cache mechanism (or similar) to increase performance?

***NOTE*** Most hard drives already come with their own cache, but I do not know how Ceph will utilize those caches.
 
What's the output of
# ceph osd tree

root@c2:~# ceph osd tree
# id   weight  type name        up/down  reweight
-1     21.72   root default
-2     7.24      host h1
0      1.81        osd.0        up       1
1      1.81        osd.1        up       1
2      1.81        osd.2        up       1
3      1.81        osd.3        up       1
-3     7.24      host h2
4      1.81        osd.4        up       1
5      1.81        osd.5        up       1
6      1.81        osd.6        up       1
7      1.81        osd.7        up       1
-4     7.24      host h3
8      1.81        osd.8        down     0
9      1.81        osd.9        down     0
10     1.81        osd.10       down     0
11     1.81        osd.11       down     0
root@c2:~#
 
root@c2:~# ceph osd tree
# id   weight  type name        up/down  reweight
-1     21.72   root default
-2     7.24      host h1
0      1.81        osd.0        up       1
1      1.81        osd.1        up       1
2      1.81        osd.2        up       1
3      1.81        osd.3        up       1
-3     7.24      host h2
4      1.81        osd.4        up       1
5      1.81        osd.5        up       1
6      1.81        osd.6        up       1
7      1.81        osd.7        up       1
-4     7.24      host h3
8      1.81        osd.8        down     0
9      1.81        osd.9        down     0
10     1.81        osd.10       down     0
11     1.81        osd.11       down     0
root@c2:~#

Is host h3 the new node you set up, or is this the dead one?
 
OK. I am assuming your new host is h4, you have already set up Proxmox+Ceph on it, and the node is already part of the Proxmox cluster. If the HDDs are still in the same bays, try to move the OSDs from the old node to the new node using the command below:
# ceph osd crush set osd.9 1.81 root=default host=h4

Try with this one OSD first and see if the cluster picks up osd.9 and starts rebalancing. The command above manipulates the CRUSH map by moving OSDs.
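A quick way to verify the move took effect and to watch the rebalance (host and OSD names follow the example above):

```shell
# Confirm osd.9 now appears under host h4 in the CRUSH tree
ceph osd tree

# One-shot cluster status; HEALTH_OK with all PGs
# active+clean means recovery has finished
ceph -s

# Streaming view of recovery progress (Ctrl-C to stop)
ceph -w
```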
 
Ceph states its software is "the end of RAID".

Many of our existing Dell servers have a built-in RAID controller. We are not using the RAID feature as such; instead we set up each hard drive individually as a single-disk RAID 0. With this config, Ceph is able to see each drive as a separate OSD.

The battery on some of these RAID controllers is causing an issue. It seems the battery is only needed for the "write back cache" feature; with it removed, the controller defaults to "write through".

Dell states that hard drives without the write back cache feature enabled will suffer in performance. Other than that, no problem.

Since Ceph does not need RAID controllers to operate, the golden questions are:

Is it better to let the hard drives run in write back cache mode or not? Is there any advantage to using a RAID controller and running it with this feature? How would Ceph take advantage of it? If we go to write through, does Ceph have its own write back cache mechanism (or similar) to increase performance?

***NOTE*** Most hard drives already come with their own cache, but I do not know how Ceph will utilize those caches.

Which write back/write through cache are you talking about - Proxmox's or the RAID controller's?

I also suggest that you open a new thread for the problem you are having with the dead Ceph node.
 
Is it better to let the hard drives run in write back cache mode or not? Is there any advantage to using a RAID controller and running it with this feature?



You may want to look at this benchmark: http://ceph.com/community/ceph-bobtail-performance-io-scheduler-comparison/

While it is actually trying to determine which I/O scheduler is best for several setups, you can also see that the 8xRAID0 setup (meaning: 8 disks as 8 single-disk RAID 0 arrays) beats JBOD mode in almost all cases, thanks to the controller's write cache.
 
OK. I am assuming your new host is h4, you have already set up Proxmox+Ceph on it, and the node is already part of the Proxmox cluster. If the HDDs are still in the same bays, try to move the OSDs from the old node to the new node using the command below:
# ceph osd crush set osd.9 1.81 root=default host=h4

Try with this one OSD first and see if the cluster picks up osd.9 and starts rebalancing. The command above manipulates the CRUSH map by moving OSDs.

You are awesome! It worked. I moved the OSDs using the suggested command, then rebooted the server, and the OSDs are now working well on the new server h4.

How do I remove the dead Ceph node? Trying to stop or remove the monitor came back with "Connection error 595: No route to host".

Should I just remove it from the cluster using "pvecm delnode h3"? Is it possible to later rename h4 to h3 and change its IP address to match the previous one? My boss wants to keep the same sequence, but I am not sure if that's even possible.

Thanks for your suggestion about a new thread. I thought this post was created as a new thread. Did I not do it correctly?

Thanks for all of your help.
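For what it's worth, a possible cleanup sequence for a node that is gone for good - a sketch only, and the monitor name "h3" is an assumption; check the actual ID in the monmap first:

```shell
# List the monitors and find the dead one's name
ceph mon dump

# Remove the dead monitor from the monmap
# (substitute the real name from the dump above)
ceph mon remove h3

# Remove the dead node from the Proxmox cluster
# (run on a surviving node)
pvecm delnode h3
```

The old node's mon entry may also need to be deleted from /etc/pve/ceph.conf so clients stop trying to reach it.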
 
Which write back/write through cache are you talking about - Proxmox's or the RAID controller's?

I also suggest that you open a new thread for the problem you are having with the dead Ceph node.

I was referring to the RAID controller itself. Is it better to run it with write back cache enabled, or does Ceph have its own method to handle write back caching?
 
You may want to look at this benchmark: http://ceph.com/community/ceph-bobtail-performance-io-scheduler-comparison/

While it is actually trying to determine which I/O scheduler is best for several setups, you can also see that the 8xRAID0 setup (meaning: 8 disks as 8 single-disk RAID 0 arrays) beats JBOD mode in almost all cases, thanks to the controller's write cache.

Thank you very much. I just read through the link you provided.

So in summary, it is better to run the drives as RAID 0 and let the RAID controller use its own write back cache?

Ceph states its software spells "the end of RAID". However, when it comes to hard drive performance, we still need the RAID controller to do the write back caching, as opposed to depending on Ceph to handle this.

Any feedback is greatly appreciated. Thank you.
 
You don't exactly NEED RAID controllers for Ceph to function.

- You need extra disk controllers if you want more disks in your system than the mainboard's controller offers.
- You can use a RAID controller to benefit from its battery-backed read/write caches. Do note that regular hard drives already have their own memory cache, just on a much smaller scale than what controllers have. These caches really only exist because non-SSD drives are extremely slow...

Also, while we tend to call them RAID controllers, they really aren't. They are actually just additional disk controllers that happen to implement some RAID levels (which Ceph neither needs nor wants). With Ceph you don't use the RAID functionality in any way, shape or form - the single-disk RAID 0 is really just a trick to present individual disks to Ceph while still making use of the controller's cache (which JBOD mode typically doesn't allow for).
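To see whether a drive's own volatile write cache is enabled (independent of the controller cache), hdparm can be used on the node - the device name below is a placeholder:

```shell
# Show whether the drive's own write cache is on
# (/dev/sda is a placeholder - pick the actual OSD disk)
hdparm -W /dev/sda

# Enable or disable it explicitly (1 = on, 0 = off);
# off is safer against power loss but usually slower
hdparm -W1 /dev/sda
hdparm -W0 /dev/sda
```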
 
Have any of you had issues with Ceph and live migrations? I'm not sure whether it's Ceph or something else, but as soon as the live migration succeeds and the VM is resumed, it takes up 100% CPU and is unreachable.
I've checked the logs and there is no qemu/kvm error message in them. The VM itself doesn't log anything, as if it lost access to its disks or just died right away.
 
