New cluster hardware details, suggestions for edits to the wiki

andydills

Active Member
Jan 5, 2013
We're in the process of setting up a Proxmox cluster, and I wanted to share my thoughts about the hardware we chose, and suggest a few improvements to the wiki.

For our cluster, in the short term, we've decided to do three nodes, with one of those being a simple FreeNAS quorum and backup storage server. The two KVM nodes will do DRBD. In the long term, we'll move to a NetApp-served iSCSI implementation.

As far as hardware goes, for each of the two KVM servers, we went with:

The Supermicro chassis has eight drive slots, and the cable supplied with the Adaptec card was an easy replacement on the top row of the backplane. The flash module was an easy install.

The motherboard has an Intel i350 quad-port gigabit NIC: one port is connected to a protected public network, two ports are bonded and wired directly between the two KVM servers for drbd and live migration, and one port is connected to the internal VLAN for the quorum server. The dedicated Realtek IPMI interface is connected to that private VLAN as well.
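For reference, here's a rough sketch of how that port layout could look in /etc/network/interfaces on each KVM node (interface names and addresses are placeholders, and the drbd bond is left out here):
Code:
# eth0: public port, bridged so the VMs can reach the protected public network
auto vmbr0
iface vmbr0 inet static
        address  192.0.2.10
        netmask  255.255.255.0
        gateway  192.0.2.1
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0

# eth1 + eth2: wired directly between the two KVM servers and bonded for drbd
# and live migration (bond config omitted here)

# eth3: internal VLAN towards the quorum server; the IPMI port sits on the same VLAN
auto eth3
iface eth3 inet static
        address  10.2.2.1
        netmask  255.255.255.0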

The i350 has a feature called Virtual Machine Device Queues, which I'm curious about. Here's a snippet from Intel:
Virtual Machine Device Queues (VMDq) is a technology designed to offload some of the switching done in the VMM (Virtual Machine Monitor) to networking hardware specifically designed for this function. VMDq drastically reduces overhead associated with I/O switching in the VMM which greatly improves throughput and overall system performance
The official Intel brief I linked to above mentions that the feature is supported by VMware and Microsoft... I'm curious whether it's supported in Proxmox?
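For anyone who wants to see what their own box exposes, the i350 is driven by the standard Linux igb driver under Proxmox, so something like this shows what that driver offers (whether any of it maps to VMDq depends on the driver/kernel version, so treat it as a starting point rather than a confirmation):
Code:
# list the module parameters the igb driver offers; look for anything queue/VMDq related
modinfo igb | grep -i -E 'vmdq|queue|vfs'

# see what the driver reported about each port when it loaded
dmesg | grep -i igb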

We set up RAID-1 on the system drives with mdadm based on these instructions, which worked very well. This leads me to my other big point: suggestions for wiki improvements. I agree with the decision not to support software RAID for image hosting, but I think there should at least be a wiki page on setting up system-drive redundancy when you have shared storage or dedicated hardware RAID for images.
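A couple of sanity checks that would be worth including on such a page, once the mirror is built (assuming the root array ended up as /dev/md0 and the system disks are sda/sdb):
Code:
# confirm both halves of the system-drive mirror are active
cat /proc/mdstat
mdadm --detail /dev/md0

# make sure the bootloader is on both disks, so the node still boots if one dies
grub-install /dev/sda
grub-install /dev/sdb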

On the DRBD page, you should mention in the "Disk for DRBD" section that fdisk doesn't work on partitions larger than 2 TB; if you need to go larger, you need to use gparted.
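On a headless node, parted can do the same job as gparted from the shell; a minimal sketch, assuming the drbd disk is /dev/sdb:
Code:
# MBR (what fdisk writes) tops out at 2 TB, so use a GPT label instead
parted /dev/sdb mklabel gpt
parted -a optimal /dev/sdb mkpart primary 0% 100%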

On the Proxmox VE 2.0 Cluster page, you say "Changing the hostname and IP is not possible after cluster creation." However, I found this not to be the case... you may not be able to change the hostname, but you can change the IP by editing /etc/hosts and running ssh-copy-id. In my case, I had to move the cluster to a different IP configuration and triggered this issue, which I had missed when reading the wiki the first time. I was relieved to find out I didn't have to start over, which is what it was starting to look like.

Also, I had mistakenly set up the cluster over the public interface by using the public IP with 'pvecm add'. You may want to make it clear that if you have a private connection for DRBD, you should specify the IP of that interface when you set up the cluster, so that live migration can benefit from the improved bandwidth.
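For anyone hitting the same thing, the IP change boiled down to roughly this on each node (hostnames and addresses are placeholders):
Code:
# point the node names at the new addresses in /etc/hosts on every node
#   10.1.1.1   node1
#   10.1.1.2   node2

# re-exchange the root ssh keys so the nodes still trust each other
ssh-copy-id root@node2     # run on node1
ssh-copy-id root@node1     # run on node2

# then check that the cluster comes back clean
pvecm status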

That's all I've got so far. Next week I'll set up the quorum and fencing, then finish implementing the HA configuration, and finally I'll need to implement backups. I'll update with my thoughts on those projects.


 
The wiki is open for everyone, so feel free to add your comments and improvements; just register and we will enable your wiki account.

See also http://pve.proxmox.com/wiki/Software_RAID (as we do not want to force mdraid, we just link to external pages there).
 
Just a piece of advice: iSCSI on FreeBSD is broken to the point of being useless. Under heavy load the server will become catatonic. Confirmed for both FreeBSD 8.x and 9.x. Search for "FreeBSD iSCSI heavy load".

you can change the IP by editing /etc/hosts and running ssh-copy-id.

I guess this means you had to recreate the private key? If not, I see no reason to copy the public key again.
 
The wiki is open for everyone, so feel free to add your comments and improvements; just register and we will enable your wiki account.

Thanks, I guess I assumed only developers would have that ability.

Any comment on the VMDq support?
 
Ok, I finally have an opportunity to update with my progress.

I've learned quite a bit about drbd this week, and I have some knowledge to share.

First and most important thing to understand, and I'm not sure why this isn't presented in BIG BOLD LETTERS somewhere in the wiki: drbd with protocol C (fully synchronous replication, which is what a primary/primary setup requires) is constrained by the bandwidth connecting the two nodes. So, regardless of how fast your RAID array can write, if you only have one (or even two) gigabits of Ethernet bandwidth, you will be limited at the network layer even before your powerful RAID gets involved. In any serious implementation of a simple two-node (+ quorum disk) Proxmox cluster, you simply need 10Gb Ethernet. There's really no other sensible solution, and I'm surprised it's not mentioned in the wiki (I promise I'll put in some time on that soon).
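For context, this is roughly what the resource definition looks like in a dual-primary setup on protocol C (a sketch in drbd 8.3 syntax, with device names and addresses as placeholders; the wiki's DRBD page has the authoritative version):
Code:
resource r0 {
        protocol C;                     # synchronous: a write only completes once it
                                        # has reached the peer, hence the bandwidth limit
        startup {
                become-primary-on both;
        }
        net {
                allow-two-primaries;
        }
        on node1 {
                device    /dev/drbd0;
                disk      /dev/sdb1;
                address   10.1.1.1:7788;
                meta-disk internal;
        }
        on node2 {
                device    /dev/drbd0;
                disk      /dev/sdb1;
                address   10.1.1.2:7788;
                meta-disk internal;
        }
}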

In my current situation, I have two ports directly wired between the two servers. I found I was still only getting about 110MB/s when I did some simple dd testing:
Code:
# lvcreate -n test -L 10G <volume group>
# mkfs.ext3 /dev/<volume group>/test
# mount /dev/<volume group>/test /mnt
# pveperf /mnt
# dd if=/dev/zero of=/mnt/file.tmp bs=512M count=1 oflag=direct

I needed to switch the bond_mode to balance-rr in order to exceed 1gbps. I've heard balance-rr can be less than great when used for VM traffic, but this is a drbd mirroring and live migration interface only. You'll want to enable jumbo frames as well:
Code:
auto bond0
iface bond0 inet static
        slaves eth1 eth2
        bond_miimon 100
        bond_mode  balance-rr
        address  10.1.1.1
        netmask  255.255.255.0
        post-up ifconfig bond0 mtu 9000
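
Before blaming drbd, it's worth checking that the bond itself actually moves more than a single gigabit; a quick test with iperf between the two nodes (assuming the peer's bond address is 10.1.1.2):
Code:
# on the secondary node
iperf -s

# on the primary node: 30 seconds, 4 parallel streams across the bonded link
iperf -c 10.1.1.2 -t 30 -P 4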

With that, I got better results, but still nowhere near what I get when running in disconnected mode. To demonstrate, try rebooting the secondary server and testing on the primary while the secondary is offline, and you'll see just how much you're losing to network bandwidth and latency. For me, I would need 4Gbps of Ethernet bandwidth, which is unrealistic to expect from any sort of bonded interface. Clearly I will need to move to 10Gb cards; I just wish I had known that from the start. Keep in mind a decent 10Gb card is cheaper than a decent 4-port RAID controller with BBU, so it's not unreasonable to say that if you get one, you really need to get the other.
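A gentler way to get the same disconnected-mode baseline, instead of rebooting the secondary, is to drop the replication link from the drbd side for the duration of the test (resource name r0 is just an example):
Code:
# on the primary: temporarily stop replication, measure purely local write speed
drbdadm disconnect r0
dd if=/dev/zero of=/mnt/file.tmp bs=512M count=1 oflag=direct

# reconnect afterwards; drbd resyncs whatever was written in the meantime
drbdadm connect r0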

A couple of other things to tune with drbd (I'm not yet sure these are optimal; I'll need the 10Gb interfaces first): if you have a BBU-backed RAID, you should enable this in your drbd config:
Code:
disk {
    no-disk-barrier;    
    no-disk-flushes;
    no-md-flushes;
}
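That's the 8.3 syntax; if you're on drbd 8.4, as far as I know the same knobs are spelled like this instead:
Code:
disk {
    disk-barrier no;
    disk-flushes no;
    md-flushes no;
}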
I also had better throughput when I let the send buffer size grow dynamically:
Code:
net {
    ...
    sndbuf-size 0;
}
I'm still experimenting with unplug-watermark.

So, long story short: if you're going to run a two-node drbd primary/primary setup, get 10Gb cards and invest some time in understanding the drbd tuning.

On another note, the quorum disk was trivial to set up, as was the IPMI fencing... overall I'm very impressed with the HA clustering. Setting up backups (via an NFS share on the same FreeNAS server that provides the iSCSI target) was also very straightforward, no surprises.
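For anyone doing the same, it's worth test-firing the fence agent by hand before trusting HA with it, and then checking the vote count; something like this (IPMI address and credentials are placeholders):
Code:
# query one node's IPMI interface with the same agent the cluster will use
fence_ipmilan -a 10.2.2.11 -l ADMIN -p secret -o status

# the cluster should show the expected number of votes, quorum disk included
pvecm status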

Overall, Proxmox is really a fantastic set of technologies put together in a relatively easy-to-use package. Kudos to the developers.
 
First and most important thing to understand, and I'm not sure why this isn't presented in BIG BOLD LETTERS somewhere in the wiki: drbd with protocol C (fully synchronous replication, which is what a primary/primary setup requires) is constrained by the bandwidth connecting the two nodes.

Thank you captain obvious :| How do you think the data is replicated with drbd? Avian carriers?
 
Thank you captain obvious :| How do you think the data is replicated with drbd? Avian carriers?

The attitude isn't appreciated. I assumed the updates to the other node would happen semi-asynchronously, and thus that write speeds on the local node would not be constrained by the connecting bandwidth. For example, I've learned that people will go to the length of doing stupid things like disconnecting the nodes in order to do VM restores, because of this limitation.

You may feel this limitation is obvious. It's not. There's no suggestion anywhere in the Proxmox wiki, or much of anywhere else that I've seen, that if you have a BBU-backed RAID and are doing drbd, you NEED a 10Gb connection. If what I said was so "obvious", then everything in the wiki is so blindingly obvious that it should just be deleted, and people should just use their ability to spot the "obvious" to set everything up. Right?

Edit: It occurs to me that perhaps I wasn't clear in my post, and that maybe you thought I was saying "the rate at which the local node can update the connected node is constrained by the connecting bandwidth"... that's indeed obvious. What I'm saying, and what wasn't obvious to me, is "the rate at which the local node can be written to is constrained by the speed of the link to the connected node".
 
