Proxmox 4.1 2 nodes

Also, 3.4 does not work reliably with 2 nodes.

For HA, 3 nodes were always needed and recommended.
 
Hi Tom, thank you so much for your reply. I have been running 2-node clusters for over a year, most of them with very few issues (all using quorum disks). Are you saying that 4.1 cannot do 2-node clusters at all? If we do go to 3 nodes, how can I do disk replication without a SAN, i.e. a 3-node DRBD setup? I really liked the 2-node + DRBD setup: it's a great budget cluster and works very well.
 
PVE 4 really does not have the software to use a quorum disk, so you need to go to 3 nodes. I find that even running non-HA with two nodes is inconvenient, because when one of the boxes is down the cluster loses quorum and VMs don't start automatically on the other. To solve that, my project this week is to procure a nice cheap box to act purely as a quorum participant. My leading contender right now is http://store.netgate.com/ADI/RCC-DFF-2220.aspx but I need to make sure PVE will install on a system with no graphics display device (i.e. serial console only).
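For what it's worth, you can see that state directly when it happens; a quick check (output labels from memory of pvecm on PVE 4):

pvecm status
# in the votequorum section, look for "Quorate: No" - without quorum
# /etc/pve is read-only and "start at boot" VMs stay down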
 
I have a 3-node cluster: 2 nodes with DRBD9 storage (fingers crossed...), and the third is a cheap Mitac Pluto E220 with a small mSATA SSD and 8 GB of RAM. The only drawback so far is that if you buy enterprise repository access, a key for the cheap node must also be included in the purchase.
 
Can you confirm this, please:
- if you don't need automagical HA, you can have a 2-node cluster, but you will have to fight when you want to migrate VMs from a down node (is it at least possible?)
- if you want automagical HA, you need at least 3 working nodes because of the quorum algorithm (and when a node fails, resolve the problem quickly, because a problem on a second node could lead to an unstable cluster)
 

You always have to fight to migrate VMs (I've asked multiple times to be able to do it through the GUI, search the forum): you have to "mv" the config files into the cluster path of the surviving server, whether you have quorum or not. Without quorum you additionally have to manually force the expected votes to 1 to be able to manipulate the configuration for the mv above and have the node work normally.
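For reference, a minimal sketch of that manual recovery (node names and VMID are made-up examples, adapt to your cluster):

# only if quorum is lost, on the surviving node:
pvecm expected 1
# then move the VM config into the surviving node's directory:
mv /etc/pve/nodes/deadnode/qemu-server/100.conf /etc/pve/nodes/survivor/qemu-server/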
Only HA does the migration "automatically"; you can't click on the dead host in the GUI and select "migrate", nor is there any other way to do it from the GUI.
In any case, a cluster needs a minimum of 3 nodes; forcing the expected votes to 1 COULD be dangerous if you are not aware of the real (physical) state of the dead server (starting the same VM on 2 nodes destroys it), and with Proxmox 4.1, HA with 2 nodes is out of the question.
All the above is AFAIU and from what I've experienced so far (testing "bad conditions" in a virtualized 3-node cluster).
AFAIU, e.g. in the case of a SAN, VMware has a better model where HA with 2 nodes is safe (a node is considered alive based on multiple conditions, and the cleverest one is the ability to contact the storage, not only the other nodes as in Proxmox).
 
mmenaz: thanks for sharing your experience.
I thought about a potential improvement to Proxmox HA communication:
- either the HA layer could talk to the other nodes on every Ethernet interface, on several IPs or at the Ethernet level (the latter is less desirable in terms of WAN topology), i.e. communication over the data or service interfaces
- or a big Ethernet bond (2, 3, or 4 NICs) with VLANs and virtual bridges on it (admin, service, data) - no need to modify the HA software - but less secure in terms of flow segregation (physical vs. software)

And while we're discussing bonding: my point of view differs from the PVE docs: if you have to choose between bonding the admin network or the data network, prefer the data network, because there is nothing worse than a VM that can't flush fresh data (except a VM started several times - and that case should be managed by the HA software).
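To illustrate the second option, a rough /etc/network/interfaces sketch (interface names, VLAN IDs and addresses are invented, only the layering matters):

auto bond0
iface bond0 inet manual
    bond-slaves eth0 eth1
    bond-mode active-backup
    bond-miimon 100

# admin/cluster traffic on VLAN 10
auto vmbr0
iface vmbr0 inet static
    address 192.168.10.11
    netmask 255.255.255.0
    bridge_ports bond0.10
    bridge_stp off
    bridge_fd 0

# storage/data traffic on VLAN 20
auto vmbr1
iface vmbr1 inet static
    address 192.168.20.11
    netmask 255.255.255.0
    bridge_ports bond0.20
    bridge_stp off
    bridge_fd 0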
 
Since we tried building a similar setup over the last few weeks and are currently in the process of giving up, I want to share our thoughts/experiences/concerns/... maybe we will find a more workable solution in the near future:

Our first plan was also to build a cluster with two identical nodes (enough CPU/RAM/storage to host all our VMs if necessary) paired with a quorum disk on an old Zotac ZBox to keep our cluster quorate. When we realised that this wouldn't be possible anymore in PVE 4, we made a small change of plans and decided to set up a third diskless node instead, as a regular cluster member but without the capability to host any VMs.

Our plan for shared storage was to use the whole storage of our two big nodes in a dual-primary DRBD setup, which didn't work at all with the current DRBD 9 beta. It turns out this is also not best practice, and having two separate DRBD volumes in a primary/secondary configuration is better, making the whole process easier should a split-brain happen. But even this setup only gave us headaches and did not behave as expected (both nodes staying secondary after reboot, ...). Downgrading to DRBD 8 doesn't seem to be a very attractive solution either (http://coolsoft.altervista.org/it/b...rnel-panic-downgrade-drbd-resources-drbd-9-84), so we have more or less run out of options for the moment and will use our two nodes separately, sacrificing every convenience we had hoped to get from the cluster.

The last thing left to do is ask Linbit whether they can say when they expect to get dual-primary up and running in DRBD 9; then we can maybe wait, or we will do as I suggested and give up our HA/DRBD setup.

Cheers, Johannes
 
I have the feeling you have done something wrong.
First of all, with DRBD9 you don't have to create 2 volumes to better cope with split brain, since in DRBD9 every VM is a resource and is active only on the node where it runs, so there are no split-brain problems like there were in DRBD8.
The setup I have (2 storage servers plus 1 server for quorum) is working; the only bad thing is "thin provisioning", which does not work at all, and I will have to retry and rebuild without it as soon as I have time to do so.
Finally, I'm wondering what a "diskless node" means - a node that boots from PXE? Our "quorum node" has a small disk (so a small local storage) and can boot autonomously (and then join the cluster).
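As a quick check, drbdadm in DRBD9 shows every resource and which node currently holds the Primary role; the resource name below is only an example of how the plugin names VM disks:

drbdadm status                  # all resources, local and peer roles
drbdadm status vm-100-disk-1    # a single VM disk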
 
Hi mmenaz,

I have the feeling you have done something wrong.
That's basically the feeling I somehow deep down still have, but I have spent so much time on it now without success. And that's also why I made this "last" post before giving up. Thank you very much for your answer!

since in DRBD9 every VM is a resource and is active only on the node where it runs
Somehow I'm beginning to understand where my thinking went wrong. My plan was to take my big RAID volume, set up one big DRBD resource in dual-primary mode on it and use that resource as a physical volume for LVM, which I could then use more or less like I have done for the last five years on my old KVM machine, by creating logical volumes for my VMs. As far as I understand, that's the way it was done in the old PVE (which is not best practice for PVE 4?). So did I really miss the fact that PVE is now able to create and use whole DRBD resources directly for providing storage to the VMs? Or do you set up a big LVM device and put DRBD resources on top of it? I'm a little bit confused about what to stack on top of what by now... ;-)

Regarding your setup: so you have absolutely no problems with resources not becoming primary as expected during host reboots etc.? Aside from our by-now-solved problems with the InfiniBand drivers, which prevented the whole DRBD service from starting, we regularly saw our DRBD resource being secondary/secondary after reboots (that's what I meant by "unexpected behaviour").

"thin provisioning" that does not work at all
Thin provisioning is no thing at all for me. I have enough disk space in both my nodes to host all VMs at one time and there are still 4 HDD slots left for future upgrades - I think I'lll get along without it. And increasing the size of a drbd-resource should be doable and is unlikely to happen in the future. I'm not concerned with "wasting" a little bit of space by over-dimensioning the virtual disks for my VMs.

Finally, I'm wondering what a "diskless node" means
I mean/have the same thing as you do. "Diskless" of course is not meant literally; I just wanted to point out that it does not have enough disk space to host even one of our VMs.

Cheers, Johannes
 
Hi mmenaz,
So did I really miss the fact that PVE is now able to create and use whole DRBD resources directly for providing storage to the VMs? Or do you set up a big LVM device and put DRBD resources on top of it? I'm a little bit confused about what to stack on top of what by now... ;-)
I've just followed the wiki, but since I have 2 HDs per node, I created the volume group from /dev/sdb1 and /dev/sdc1 (# vgcreate drbdpool /dev/sdb1 /dev/sdc1)
http://pve.proxmox.com/wiki/DRBD9
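So the stacking is: physical disks -> drbdpool volume group -> drbdmanage carves out one DRBD resource (backed by a logical volume) per VM disk. On the PVE side the wiki then has you add a storage entry in /etc/pve/storage.cfg roughly like this (written from memory of that page, double-check it there):

drbd: drbd1
    content images,rootdir
    redundancy 2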

Regarding your setup: so you have absolutely no problems with resources not becoming primary as expected during host reboots etc.?
So far I've only had problems when I "ifdown eth1": the sync never started again after ifup eth1. I searched a lot and then found that a "drbdadm adjust all" is needed, which I've also put in the post-up of /etc/network/interfaces.
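For the record, the stanza on the DRBD link looks roughly like this here (the address is just an example for a dedicated replication network):

auto eth1
iface eth1 inet static
    address 10.10.10.1
    netmask 255.255.255.0
    post-up drbdadm adjust all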

Thin provisioning is not an issue for me at all. I have enough disk space in both my nodes to host all VMs at once, and there are still 4 HDD slots left for future upgrades - I think I'll get along without it. Increasing the size of a DRBD resource should be doable anyway and is unlikely to be needed in the future. I'm not concerned about "wasting" a little bit of space by over-dimensioning the virtual disks for my VMs.
I don't need thin provisioning either, but following the wiki I have it. Now I would love to know how to do the same setup without it (I don't understand where it is activated, since I don't see a "--thinpool" flag in lvcreate, and not installing "thin-provisioning-tools" is not a solution either, since I've seen in git that they added that package as a dependency for the 4.2 setup).
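One way to see where it comes from is to look at what drbdmanage actually created inside the volume group; if there is a thin pool underneath, lvs will show it (standard LVM columns, the pool name is whatever drbdmanage chose):

lvs -o lv_name,lv_attr,lv_size,pool_lv,data_percent drbdpool
# a thin pool has "t" as the first lv_attr character,
# and thin volumes list their pool in the pool_lv column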
I'll try with 3 virtualized PVE nodes if I have time; if you find the solution in the meantime, let me know :)
 
Hey mmenaz,

thank you so much for your help! After solving my SSH configuration problems (I had locked out my DRBD interface by restricting access to the main network interface, so the cluster join failed), I now have a running DRBD9 cluster with 2 nodes and an IPoIB InfiniBand connection bonded with a backup Gigabit Ethernet connection, and all seems to work as it should.

About your thoughts on thin provisioning: as far as I understand, the wiki is already outdated, because DRBD9 seems to support qcow now (at least it's listed in the web interface - I haven't tested it deeply yet, but the VM creation didn't fail at least). qcow is able to grow, and because of that the underlying logical volume/DRBD device must be able to grow too (at least that is the only way it makes sense to me). And that's the whole point of thin provisioning(?): being able to tell a VM that it has loads of space on a huge virtual disk, which is only allocated when it's needed. Unfortunately we cannot multiply our real available space by just setting higher values than our actual storage has - that would be the solution to all storage problems ;-P But a normally initialized LVM storage added to PVE via the web interface doesn't allow you to choose anything other than "raw", so I think it should be fine like this. Although I must say I haven't looked into the git repo (I'm not that deeply interested, to be fair), so I can't tell you how they do it behind the curtains.

EDIT: It somehow changed its type to "raw", or I made a mistake, so forget the part about qcow. But nevertheless: when I add a new VM with a virtual disk (raw, 1 TB), I would expect half of the available space (2 TB) in my volume group to be gone afterwards, which isn't the case. So thin provisioning seems to work with raw images as well, in that it does not create a logical volume with the actual size of 1 TB - and that's all.

Cheers, Johannes
 
Hey mmenaz,
It somehow changed its type to "raw", or I made a mistake, so forget the part about qcow. But nevertheless: when I add a new VM with a virtual disk (raw, 1 TB), I would expect half of the available space (2 TB) in my volume group to be gone afterwards, which isn't the case. So thin provisioning seems to work with raw images as well, in that it does not create a logical volume with the actual size of 1 TB - and that's all.

Have a look at the usage on the OTHER node (if you create the VM on node1, have a look at node2): it DOES allocate all the space. And if you perform a restore, all the space is allocated (this time on all nodes).
Also, the size is miscalculated when you look at the storage size (the available size is doubled, or something like that).
In short, a real mess!
Can you test, confirm, and cry loud like I do? :)
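A quick way to compare, if you want to confirm it (VG name as in the wiki example, adjust to yours):

# on each node: how much of the volume group is really allocated
vgs drbdpool
lvs drbdpool
# versus what PVE reports for the storage
pvesm status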
 
Hey,
Yeah, I somehow stumbled upon all that last week too, but didn't have the time to reply over Easter. On a second try with another 1 TB virtual disk, suddenly 95% of my 2 TB of disk space was used... And it looked soooo good on the first try...

Yeah, the size calculation is not very useful when it takes the total amount of storage across both nodes, ignoring the fact that half of it isn't usable because of the need for redundancy. At least the calculation of free space starts at 50%, making it arithmetically correct, but that's still not really how it should work. Although I must say that the setup seems to work all in all, at least regarding the DRBD storage (except for the mentioned size miscalculation and the missing thin provisioning). No more split-brains etc. like before.

I'm only having some problems with online-migrating HA-managed machines: sometimes they get stuck in the migration process and corrupt their disks. So far there is no recognizable pattern as to whether it only happens with machines where HA was activated correctly (during downtime) or incorrectly (while they were already running). See for example here: https://forum.proxmox.com/threads/filesystem-corruption-on-ha.12246/

UPDATE: It seems to have something to do with the fact that some resources are in secondary/secondary state, which is OK, I would say, for machines that are shut down, but can't be a good state for a machine trying to boot on a node.
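If you want to check that before starting or migrating, something like this should show it (the resource name is only an example of how the plugin names VM disks):

drbdadm status vm-101-disk-1    # local and peer roles (Primary/Secondary)
# if both sides are Secondary and UpToDate, the node that should run the VM
# can be promoted by hand:
drbdadm primary vm-101-disk-1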
 