Proxmox cluster + ceph recommendations

Nov 15, 2015
Hi good people. My customer wants to build an HA virtualization cluster from their hardware, but I need some advice because I have a few planning questions.

Hardware:
- 3x Supermicro servers: 12x 3.5" 8TB 7.2k SATA HDDs, 4x 800GB Intel enterprise SATA SSDs, 2x 8-core/16-thread Xeon CPUs (16 cores / 32 threads total), 128 GB RAM
- 10 Gbps Ethernet between the servers over a MikroTik switch (only one switch for now, but in the future there will be two, for network redundancy)

I want to use Ceph as storage, but reading the Ceph manual I see that for 96TB of raw HDD capacity per server I would need about 100GB of RAM for Ceph alone, so there would be no free RAM left for VMs.

So the question: can I reduce Ceph's RAM usage without a penalty to performance or cluster stability? (I can give about 32GB of RAM to the cluster FS, to keep about 90GB of RAM for virtual machines.)

As an alternative, I am planning to build an LVM RAID 10 from the HDDs, convert that RAID 10 to a thin pool, and run DRBD9 as the cluster storage layer on top of it. LVM needs very little RAM, and I think DRBD9 is not RAM hungry either. So I can make 2x thin LVM RAID 10 pools from 6 HDDs each, plus 1x thin LVM pool from the 4 SSDs, then put DRBD9 on top of these 3 pools and sync them with a 3-data-copy scheme.
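Roughly, I imagine the DRBD9 resource definition on top of one of those backing pools would look something like the sketch below (hostnames, addresses and the backing LV name are made up, and I have not tested this yet):

resource vmdata {
    device      /dev/drbd0;
    meta-disk   internal;
    on pve1 {
        disk     /dev/r10/drbd_backing;   # thin LV carved from the local RAID 10 pool
        address  10.0.0.1:7789;
        node-id  0;
    }
    on pve2 {
        disk     /dev/r10/drbd_backing;
        address  10.0.0.2:7789;
        node-id  1;
    }
    on pve3 {
        disk     /dev/r10/drbd_backing;
        address  10.0.0.3:7789;
        node-id  2;
    }
    connection-mesh {
        hosts pve1 pve2 pve3;
    }
}

With all three hosts in the mesh, every write should land on all three nodes, which is the 3-copy scheme I want.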

Maybe there are some other options? (ZFS + DRBD9 is not an option, ZFS is RAM hungry too, but maybe LVM + GlusterFS, or something else?)

Thanks.
 
To get the best performance out of Ceph you want to use KRBD with your VMs; however, you can't use KRBD on the same kernel as a server that is running OSDs. On the RAM side you can limit RAM usage with the newer BlueStore versions, but it's only a target and usage can still climb above that limit at times. I'd really suggest that with that many disks of that size you'll need, and want, the whole server's resources for Ceph operations alone.

So you will either need to find some extra servers to use as VM hosts, or, as you said, move to something other than Ceph; however, you will lose out on some of the benefits and simplicity of Ceph versus other multi-layered setups.
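For reference, the BlueStore memory cap I mean is the osd_memory_target option; something along these lines sets it for all OSDs (the 4 GiB value is purely illustrative, not a sizing recommendation, and as said it's a target rather than a hard limit):

ceph config set osd osd_memory_target 4294967296

Or set it per node in ceph.conf under the [osd] section: osd_memory_target = 4294967296.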
 
Hi a4t3rburn3r

DRBD and LINSTOR are a good alternative to Ceph; we are going to be testing this solution soon.

There are a few other posters on here testing DRBD as well.

There are modules available for DRBD that hook into Proxmox and allow for HA failover as well, so it's a good alternative with less resource overhead.
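From what I've read so far, once the linstor-proxmox plugin is installed the storage definition is just an entry in /etc/pve/storage.cfg along these lines (the controller IP and resource group name are placeholders, so treat this as a sketch rather than gospel):

drbd: drbdstorage
    content images,rootdir
    controller 10.11.12.13
    resourcegroup defaultpool

After that Proxmox treats it like any other storage backend.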

Just food for thought.

""Cheers
G
 
Where did you read that KRBD cannot be used on the same machine that is running OSDs? I'm having trouble finding information about that.
 

Hi bro! I'm testing DRBD right now as the main solution. But my question is not about DRBD, it's about LVM. DRBD can use thin LVM or ZFS as a backend. ZFS is RAM hungry; LVM is fast old school. Thin LVM can do everything we need: speed, clones, snapshots. With the new LVM RAID feature we can get a good, reliable solution. But at the moment I can't restore a thin LVM RAID 10 pool. I build a thin LVM RAID 10 pool and use it as storage. Everything is fine, but when I remove a disk from this thin LVM pool, the pool keeps working in a degraded state and I can't replace the disk without destroying all data or the thin pool. All I can do now is back up the VMs, destroy the pool, create a new pool with the new disk in the VG, and restore the VMs. And I don't know yet how DRBD will behave if the thin LVM pool is in a degraded state. I will simulate the situation where a thin LVM RAID 10 pool is degraded in a cluster with DRBD, and check whether I can destroy the pool without losing data while migrating the VMs to another working node.
 
Where did you read that KRBD cannot be used on the same machine that is running OSDs? I'm having trouble finding information about that.

It has been mentioned many times on the Ceph mailing list; it causes a kernel deadlock/crash. It's fine if you don't use KRBD, since then the RBD mount lives outside of kernel space.
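For what it's worth, whether Proxmox mounts an RBD storage through the kernel or through librbd is just the krbd flag on the storage definition in /etc/pve/storage.cfg; a rough example (storage ID, pool name and monitor addresses are made up):

rbd: ceph-vms
    content images
    pool vm-pool
    monhost 10.0.0.1 10.0.0.2 10.0.0.3
    username admin
    krbd 0

With krbd 0 the VM disks go through librbd in userspace, which sidesteps the kernel-space issue described above.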
 
Hi bro! I'm testing DRBD right now as the main solution. But my question is not about DRBD, it's about LVM. DRBD can use thin LVM or ZFS as a backend. ZFS is RAM hungry; LVM is fast old school. Thin LVM can do everything we need: speed, clones, snapshots. With the new LVM RAID feature we can get a good, reliable solution.

Hi a4t3rburn3r

You wouldn't use ZFS on top of DRBD; there will be a performance hit from ZFS, and yes, you are correct, there is extra memory overhead to consider, plus more to manage.

But at the moment I can't restore a thin LVM RAID 10 pool. I build a thin LVM RAID 10 pool and use it as storage. Everything is fine, but when I remove a disk from this thin LVM pool, the pool keeps working in a degraded state and I can't replace the disk without destroying all data or the thin pool. All I can do now is back up the VMs, destroy the pool, create a new pool with the new disk in the VG, and restore the VMs.

Did you use software RAID or hardware RAID?
With hardware RAID you just replace the drive and it should auto-rebuild with no downtime, as long as you don't lose the mirrored drive in the RAID 10 pair.

And I don't know yet how DRBD will behave if the thin LVM pool is in a degraded state. I will simulate the situation where a thin LVM RAID 10 pool is degraded in a cluster with DRBD, and check whether I can destroy the pool without losing data while migrating the VMs to another working node.

From my understanding, because DRBD is purely 1:1 or 1:N (many) replication, it will just keep operating on one host (or many) until the other host(s) is fixed/rebuilt and reconfigured to rejoin the cluster.

There shouldn't be any data loss if all the VMs are replicated between the 2 or more hosts.
If 1 host goes down that's OK, as the 1:1 replication factor should mean all the VMs keep operating on the surviving host.

I'm still learning myself, but I have asked the team at LINBIT a lot of questions; they are super helpful and look like a good fit for our future DRBD project.

Hope the above helps.

""Cheers
G
 
Did you use software RAID or hardware RAID?
I will use software LVM RAID, the newer LVM implementation that uses md without the mdadm layer. There is no way to create an LVM RAID thin pool directly, but I can create LVM RAID volumes and then convert them to a thin LVM RAID pool. It works and it has redundancy. But none of the thin LVM or LVM RAID examples in the man pages work when this LVM RAID thin pool is in a degraded state.
I use these commands on a lab machine:

for n in {b,c,d,e}; do sgdisk -N 1 /dev/sd$n; done
pvcreate /dev/sd{b1,c1,d1,e1}
vgcreate r10 /dev/sd{b1,c1,d1,e1}
lvcreate --type raid10 --mirrors 1 --stripes 4 -l 70%FREE --name r10_thinpool r10 /dev/sd{b1,c1,d1,e1}
lvcreate --type raid10 --mirrors 1 --stripes 4 -l 50%FREE --name r10_thinmeta r10 /dev/sd{b1,c1,d1,e1}
lvconvert --thinpool r10/r10_thinpool --poolmetadata r10/r10_thinmeta

I got an LVM thin RAID pool and Proxmox uses it without problems. No data loss after a single disk failure.
But I can't reassemble it after replacing the failed disk.
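For reference, the generic device-replacement flow for a plain LVM RAID LV from the LVM man pages is roughly the following (the new disk name /dev/sdf is just an example): add the replacement disk to the VG, let lvconvert --repair swap the failed image(s) onto the new PV, then drop the missing PV from the VG. What I can't figure out is how to apply this once the RAID LV has been converted into the thin pool's hidden data sub-LV (r10_thinpool_tdata):

sgdisk -N 1 /dev/sdf
pvcreate /dev/sdf1
vgextend r10 /dev/sdf1
lvconvert --repair r10/<raid_lv> /dev/sdf1
vgreduce --removemissing r10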
 
From my understanding, because DRBD is purely 1:1 or 1:N (many) replication
Yep, something like that. But my problem is with using LVM thin together with DRBD. If I can't reassemble the thin LVM RAID, I need to move all the VMs to another node, or move the VM disks to another thin pool on this node, then remove the degraded thin pool and recreate it with the new disk, and then move the VMs or VM disks back. So the question is: do I have to destroy the DRBD pools on all nodes if I destroy the thin pool on one node? Needs some experimenting.
 
Hi a4t3rburn3r

Yes, software RAID in my opinion is pretty useless and needs refinement.

I would stick with a hardware RAID option using SSDs.

Eject the faulty drive > insert the replacement > it rebuilds hands-free > job done.

ZFS RAID is better, but it has too much overhead if you're looking for performance.

If you're using NVMe drives, if I remember correctly they can't be hardware RAIDed, only software, and then you're back to the same issue you have now.

Thoughts?
 

I believe this clearly explains why your LVM is failing: it's not actually RAID, it's an extension layer for increasing space across physical drives.

LVM: no redundancy,
i.e. 10 + 10 = 20

RAID 1: mirror replication,
10 + 10 = 10

RAID 5: parity, 1 drive for fault tolerance,
10 + 10 + 10 = 20

Etc.

http://www.cyberphoton.com/questions/question/what-is-the-difference-between-lvm-and-raid

Maybe I’ve misinterpreted what you are referring to but I’m basing my points on the terminology that you have used.

Hope this helps.

""Cheers
G
 
Yes, software RAID in my opinion is pretty useless and needs refinement.
There are always trade-offs: yes, hardware RAID is easy to maintain, but what if the RAID controller dies?
Software RAID on ZFS is actually not hard to maintain, one simple command and go. But the overhead is huge, first of all the RAM consumption.
I can put software RAID disks into any other hardware and recover the data. The speed penalty is not that big if you have a fast CPU.

If you're using NVMe drives, if I remember correctly they can't be hardware RAIDed, only software, and then you're back to the same issue you have now.
NVMe SSDs can be hardware RAIDed.

I believe this clearly explains why your LVM is failing: it's not actually RAID, it's an extension layer for increasing space across physical drives.
You are wrong. Look here:
https://access.redhat.com/documenta...guring_and_managing_logical_volumes-en-US.pdf
Page 50. I am talking about this LVM RAID.
 
Hi a4t3rburn3r

Thanks for pointing out what you were referring to regarding LVM RAID, and I agree with some of your points.

In relation to hardware RAID for NVMe: every vendor we have spoken with for quotes recently (in the last 3 weeks), Dell, HPE and Supermicro, has told us they do not offer a hardware RAID option for NVMe and that it's not possible at this point in time.

I'm wondering why this is the case?

I did some googling and came across some obscure vendors offering hardware RAID for NVMe, and even Intel as long as you are running an Intel CPU, but again it's not supported by the big 3 and I'm at a bit of a loss as to why.

I guess my question is: why are none of the 3 biggest players offering a hardware RAID option for NVMe?

It's probably a little off topic; I was just trying to cover potential reasons for the RAID group failing by looping in NVMe.

So I guess my question is: why is your software LVM RAID failing, if this is an acceptable solution?

Like you, I'm in a similar predicament, looking to minimise the number of items that could potentially cause problems in the system design, and removing a hardware RAID controller is just another link in the chain that could be removed.

What RAID level are you running: 1, 5, 6?
What size and types of drives?
I can ask one of our team who is more familiar with Linux RAID than I am to provide some insight.

""Cheers
G
 
In relation to hardware RAID for NVMe: every vendor we have spoken with for quotes recently (in the last 3 weeks), Dell, HPE and Supermicro, has told us they do not offer a hardware RAID option for NVMe and that it's not possible at this point in time.
I found this regarding hardware NVMe RAID:
http://www.highpoint-tech.com/USA_new/series-ssd7120-specification.htm
I don't know how well it works, but it exists.

the 3 biggest players
I think there are bigger fish, maybe LSI/Avago/Broadcom, because a lot of SM/HP/Dell RAID controllers are simply OEM LSI. But who knows.

So I guess my question is: why is your software LVM RAID failing, if this is an acceptable solution?
My English is far from perfect =) Maybe I wasn't clear enough: I don't have problems with the stability, performance or usage of LVM RAID. I am trying to simulate RAID degradation in case of a disk failure. There must be some procedure to replace the disk and recover the degraded array, but I can't find it. Every time I try to recover the RAID using commands from the man pages, I either lose all data or simply can't recover it. Googling doesn't help; I have found nothing.

What RAID level are you running: 1, 5, 6?
What size and types of drives?
I can ask one of our team who is more familiar with Linux RAID than I am to provide some insight.
I'd be very grateful if you could ask. Everything I do I have already written down:

# for n in {b,c,d,e}; do sgdisk -N 1 /dev/sd$n; done
# pvcreate /dev/sd{b1,c1,d1,e1}
# vgcreate r10 /dev/sd{b1,c1,d1,e1}
# lvcreate --type raid10 --mirrors 1 --stripes 2 -l 70%FREE --name r10_thinpool r10 /dev/sd{b1,c1,d1,e1}
# lvcreate --type raid10 --mirrors 1 --stripes 2 -l 50%FREE --name r10_thinmeta r10 /dev/sd{b1,c1,d1,e1}
# lvconvert --thinpool r10/r10_thinpool --poolmetadata r10/r10_thinmeta

Here I made a RAID 10 thin pool. These commands are from the lvmthin man page. But if I remove, for example, disk /dev/sdb and replace it with a new disk, I can't rebuild this thin pool array with the new disk. I don't understand which commands I need to run, and in what order. Before I start rebuilding and recovering, all the data is still on the degraded thin pool array, but after my manipulations all the data is lost and the thin pool array is still not rebuilt.
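In case it helps anyone reproduce this, here is roughly how I check the state of the degraded pool before trying anything: lvs with -a shows the hidden RAID and thin sub-LVs with their segment type, sync progress, health and backing devices, and pvs shows the failed disk as an [unknown] device (just standard LVM reporting, nothing special):

# lvs -a -o name,segtype,sync_percent,lv_health_status,devices r10
# pvs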
 

Hi a4t3rburn3r

I've done some digging on a few different fronts, and I also have some questions before we can determine what's causing the LVM rebuild issue.

I found this regarding hardware NVMe RAID:
http://www.highpoint-tech.com/USA_new/series-ssd7120-specification.htm
I don't know how well it works, but it exists.

In relation to hardware RAID for NVMe, it can be done, but at the cost of bus speed: having to traverse the PCI bus means you won't get the 64K I/O queues or the NUMA parallel processing, which is what makes NVMe faster than SATA and SCSI.

Using a hardware RAID device creates the same bottleneck you currently get with SAS/SATA devices, hence why no vendors are offering true NVMe RAID for production environments that want to use the full capabilities of NVMe. Creating this bottleneck makes using NVMe pointless, so you may as well use SATA or SAS with a RAID controller, as the performance will be similar.

That's what our Supermicro rep explained to us today :)

# for n in {b,c,d,e}; do sgdisk -N 1 /dev/sd$n; done
# pvcreate /dev/sd{b1,c1,d1,e1}
# vgcreate r10 /dev/sd{b1,c1,d1,e1}
# lvcreate --type raid10 --mirrors 1 --stripes 4 -l 70%FREE --name r10_thinpool r10 /dev/sd{b1,c1,d1,e1}
# lvcreate --type raid10 --mirrors 1 --stripes 4 -l 50%FREE --name r10_thinmeta r10 /dev/sd{b1,c1,d1,e1}
# lvconvert --thinpool r10/r10_thinpool --poolmetadata r10/r10_thinmeta

In relation to the RAID issue: is the OS drive separate from the data drives?
Are you running 2 LVM RAID volumes, or is this all living in the same RAID 10 LVM group? (Related to the previous question.)

I found a few good articles on building and restoring for Debian below that may be helpful until I can check with our internal resources.

Rebuild LVM RAID
https://serverfault.com/questions/9...rive-failure-on-lvm-software-raid-10-in-linux

Create LVM RAID on Ubuntu
https://help.ubuntu.com/lts/serverguide/advanced-installation.html

Let me know how you go with those links and I'll ask internally tomorrow.

""Cheers
G
 
In relation to the RAID issue: is the OS drive separate from the data drives?
Are you running 2 LVM RAID volumes, or is this all living in the same RAID 10 LVM group? (Related to the previous question.)

Hi. Thanks for your links, I will read them.
1. Yes, the OS drive is separate: it is a ZFS mirror of 2x 500GB SSDs. For data I use separate disks. I fixed my commands to build the LVM RAID thin pool:
# for n in {b,c,d,e}; do sgdisk -N 1 /dev/sd$n; done
# pvcreate /dev/sd{b1,c1,d1,e1}
# vgcreate r10 /dev/sd{b1,c1,d1,e1}
# lvcreate --type raid10 --mirrors 1 --stripes 2 -l 70%FREE --name r10_thinpool r10 /dev/sd{b1,c1,d1,e1}
# lvcreate --type raid10 --mirrors 1 --stripes 2 -l 50%FREE --name r10_thinmeta r10 /dev/sd{b1,c1,d1,e1}
# lvconvert --thinpool r10/r10_thinpool --poolmetadata r10/r10_thinmeta
Stripes is not 4 but 2: there were 8 HDDs and the original commands were for 8 HDDs, but I corrected them for 4 HDDs.
2. One ZFS mirror for the OS from 2x 500GB SSDs, and one LVM RAID thin pool for images and containers from 4x 8TB HDDs. But I plan to try 12x 8TB HDDs in one pool.
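For the 12-disk pool I assume the same recipe just scales the stripe count (6 stripes x 2 mirrors = 12 disks); a sketch, assuming the HDDs show up as /dev/sdb through /dev/sdm:
# for n in {b..m}; do sgdisk -N 1 /dev/sd$n; done
# pvcreate /dev/sd{b..m}1
# vgcreate r10 /dev/sd{b..m}1
# lvcreate --type raid10 --mirrors 1 --stripes 6 -l 70%FREE --name r10_thinpool r10
# lvcreate --type raid10 --mirrors 1 --stripes 6 -l 50%FREE --name r10_thinmeta r10
# lvconvert --thinpool r10/r10_thinpool --poolmetadata r10/r10_thinmeta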
 
OK, I have already seen these articles. The Ubuntu help is not useful: it covers only basic operations, and the links point to the very old LVM HOWTO and other outdated info. And yes, I tried the serverfault tip, but without success.
 
I'm wondering if this article can explain why it's not working?
https://pve.proxmox.com/wiki/Software_RAID

I've opened a ticket with Proxmox about software RAID support and what options we have, as the above link shows it's apparently not supported :(

Let's see what they come back with.

""Cheers
G
 
You don't understand. It works. LVM RAID works as expected, and Proxmox uses it without problems. This is not a question for the Proxmox team. The question is about how LVM RAID behaves once it has been converted to a thin LVM pool: how do you rebuild such a converted-to-thin-pool LVM RAID array if a disk failure occurs?
 