LizardFS / Comments req (Modest size cluster HA Storage options) ?

fortechitsolutions · Feb 2, 2019

Hi, I am curious to ask if anyone can comment.

I'm looking at setting up a modest-sized Proxmox cluster for a client. It will be running in an OVH hosted environment, ie, we have access to various pre-configured server builds, and my goal is to get a config which optimizes monthly spend against features (CPU, RAM, and storage).

Client has a requirement for HA VM storage. So my options for this appear to be:
(a) Something like hyper-converged storage solution built in at proxmox node level. Ceph is not a great fit given the size of their environment I think (ie, probably 5 physical server nodes approx, and we lack true 10gig ether for ceph, the 'vrack' OVH offers now seems to cap at 3gig or 30% of a 10gig pipe for the most part?)
(b) possibly something like LizardFS is good candidate. I've used this before in collaboration with a colleague, but not as HA primary storage for VMs with proxmox, rather just for "VM Backup storage target" where a few dedicated nodes acted as storage servers; the lizard server roles were not on proxmox at all/ proxmox hosts acted purely as Lizard clients. In this config it has been smooth/stable but I do realize that primary VM storage is very different role /config than this.
(c) not sure if I am missing some other 'good option' which is actually stable and manageable.

I am not keen on anything based on DRBD - I tested it a few years ago and it always felt 'too delicate' in case of any server going offline / the recovery process was 'delicate and sometimes outright painful or terrible' at least in my testing. Possibly I just didn't do things right, but end of day it was enough hassle I was not open to going this route.

I realize there are now more ZFS Integration features in proxmox under the hood than was true a few years ago. Including async ZFS replication, which is 'pretty close to HA' for many use case scenarios. I don't think (?) there is a ZFS realtime sync HA I can easily do which is also 'reliable, and easy' but if I'm wrong on this please let me know.

Anyhow. Any comments (ideally grounded in real-world experience) are very much appreciated.

Thanks,

Tim

fortechitsolutions · Feb 2, 2019

I forgot to say / can't edit my post (?) - The storage performance requirements are not 'heavy' - this is more about 'reliable HA performance with modest IO performance'. So we may in fact be ok with gig-ether connectivity between nodes for syncing storage / not requiring faster vrack than this. The bigger concerns are - that it works - reliably and is able to really tolerate a fail <> HA event and then an associated recovery event without a lot of drama. The storage capacity requirement is not 'huge' (ie, modest Tb or few - maybe 2 Tb usable space ?) - although possibly it will grow to ~4-8Tb in coming years so a growth path option in this scope would be nice.

Clearly lizard with extra standalone block servers (not running proxmox at all) would be one way to just add more capacity. Or even just 'maybe better way to do lizard outright'. Setup 4-5 proxmox nodes, no lizard other than lizard client. And use a pool of ~4-6 standalone lizard hosts which do no proxmox, just lizard roles. Use the 'unused' storage local space on proxmox nodes for 'misc backups storage' and get on with life.

Also I know OVH offer a "HA NAS" storage product which I have actually tested, and I think it is an OK candidate. It is slightly less optimal as it has only 'public facing access routing' so we can't use Vrack/ private gig ether for this - which is 'meh not great' - would be far nicer if we had HA NAS over dedicated private gig to talk to proxmox nodes. But it is functional enough / works well and is easy. So nearly is a tempting 'good enough' option. Pricepoint is not 'cheap' but as old adage says, we don't generally get to pick all 3 of (cheap, reliable, easy). (or cheap/reliable/fast)

Anyhow. Maybe end of day I am just seeking,
(a) real world experience comments from someone using LizardFS in production with proxmox for VM HA repository
(b) or similar config with other straightforward (?!) option (ZFS or ?other? based - probably not ceph, drbd, since they are not really good candidates here IMHO)

Partially I just wish this project client would give me the OK for a shared-nothing proxmox cluster (ie, not with HA) - since I find generally the uptime and reliability of the OVH servers is 'better than good enough'. But in part the client fears the lack of HA means we will have lots of outages. Which in my own real-world experience is not really the case. But anyhow. The client is the one who sets the requirements.

Tim

guletz · Feb 3, 2019

Hi Tim,

You must ask yourself why you want HA. Which services do you run that need HA?

In many situations it is cheaper (in terms of resources) and simpler to use other tools that can behave like HA. As a simple example, if you have an HTTP server,
it is far simpler to have 2 VMs for this service and use an HTTP/TCP balancer (load balancing and failover).

I also use LizardFS but, like you, only for storing backups. For very light VMs like a DNS server it is OK. The free version has had no update in almost a year.

fortechitsolutions · Feb 3, 2019

I agree, asking "Why HA" is important. In this case I'm asking for feedback assuming proxmox HA is on the 'required features list'. Not so much that I agree with it being on the required feature list for this project, but that the client wants the feature absolutely, and it is not negotiable. I also agree that HA can be made a non-issue via other levels of redundancy (ie, pair of HA Nginx hosts, pair of back-end web app servers, etc). But just for the sake of the question here to the forum I was asking about feedback specifically around Lizard and other possible options for filesystems that could meet the target of providing 'shared storage to a proxmox HA cluster'.

Thanks very much for your feedback re: Lizard use / as good candidate for backups datastore for proxmox!

Tim

guletz · Feb 5, 2019

fortechitsolutions said:
Lizard use / as good candidate for backups datastore for proxmox!

... more like: not as bad as you think

Alwin · Feb 5, 2019

fortechitsolutions said:
(a) Something like hyper-converged storage solution built in at proxmox node level. Ceph is not a great fit given the size of their environment I think (ie, probably 5 physical server nodes approx, and we lack true 10gig ether for ceph, the 'vrack' OVH offers now seems to cap at 3gig or 30% of a 10gig pipe for the most part?)

Either way, Ceph or LizardFS, if you don't have a dedicated bandwidth with a low latency, then any distributed storage will not be reliable.

fortechitsolutions said:
I realize there are now more ZFS Integration features in proxmox under the hood than was true a few years ago. Including async ZFS replication, which is 'pretty close to HA' for many use case scenarios. I don't think (?) there is a ZFS realtime sync HA I can easily do which is also 'reliable, and easy' but if I'm wrong on this please let me know.

Our storage replication (pvesr) has a configurable sync interval (minimum 1 minute) and can be used with HA (see limitations).
https://pve.proxmox.com/pve-docs/chapter-pvesr.html
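
As a rough illustration of what "pretty close to HA" means with interval-based replication, here is a minimal sketch of the worst-case window of lost writes. It assumes that everything written since the start of the last completed sync is lost on failover; the function name and the example duration are hypothetical, and nothing here calls pvesr itself.

```python
def worst_case_loss_seconds(sync_interval_s: float, sync_duration_s: float) -> float:
    """Upper bound on the age of the newest replicated data at failover time.

    Assumption: a node fails just before the next sync finishes, so the
    replica still holds the state from the start of the previous sync.
    """
    if sync_interval_s < 60:
        # The thread states a minimum sync interval of 1 minute.
        raise ValueError("sync interval below the 1 minute minimum")
    if sync_duration_s < 0:
        raise ValueError("sync duration cannot be negative")
    return sync_interval_s + sync_duration_s


if __name__ == "__main__":
    # Example: 1 minute interval, each sync run taking about 5 seconds.
    print(worst_case_loss_seconds(60, 5))
```

Whether a window of this size is acceptable depends on the workload, which is why async replication is not a drop-in replacement for shared storage.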

fortechitsolutions said:
Thanks very much for your feedback re: Lizard use / as good candidate for backups datastore for proxmox!

Besides the above, you can use the client and configure a directory storage.

guletz · Feb 5, 2019

Alwin said:
Either way, Ceph or LizardFS, if you don't have a dedicated bandwidth with a low latency, then any distributed storage will not be reliable.

Hi,

Low latency and high bandwidth together (even on a dedicated link) are not possible. As bandwidth use rises from 0 to the maximum limit, latency gets worse. And for LizardFS latency is not so important. If you have a goal like raidz2, then once at least one replica (plus the data block) has been written, you can go forward. In the background Lizard will try to write the second replica block when possible. And it is very reliable in this scenario.
And latency must be defined: is it ICMP, or something else? The same goes for 'low' latency.

Is LizardFS reliable? Yes. I can stop 3 (of 5 in total) of my Lizard nodes/bricks. At that moment my PMX nodes cannot operate on Lizard (3 data nodes + 2 parity nodes).
And if I start the stopped nodes again within 12 seconds, every write op completes successfully with no data loss (as a side note, all of my Lizard servers run on top of ZFS).
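
A minimal sketch of the arithmetic behind a 3 data + 2 parity layout, assuming a generic erasure-coding model in which any `data_parts` of the stored parts are enough to read a chunk and each part lives on its own node. The function names are hypothetical and nothing here talks to LizardFS.

```python
def chunks_readable(data_parts: int, parity_parts: int, nodes_down: int) -> bool:
    """True if enough parts survive to reconstruct the data (one part per node)."""
    total = data_parts + parity_parts
    if not 0 <= nodes_down <= total:
        raise ValueError("nodes_down must be between 0 and the node count")
    return total - nodes_down >= data_parts


def raw_capacity_needed(usable_tb: float, data_parts: int, parity_parts: int) -> float:
    """Raw disk space needed to hold usable_tb under data + parity encoding."""
    return usable_tb * (data_parts + parity_parts) / data_parts


if __name__ == "__main__":
    # With 3 data + 2 parity parts, up to 2 nodes may be down; 3 is too many.
    for down in range(4):
        print(down, chunks_readable(3, 2, down))
    # Raw space for the ~2 TB usable figure mentioned earlier in the thread.
    print(round(raw_capacity_needed(2.0, 3, 2), 2))
```

Under this model, losing 3 of 5 nodes leaves only 2 parts, which is why the storage becomes unavailable until the nodes come back.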

So, if you know your tool, and if you use it for the right task, it is OK.

I do not say that my opinions are right in every case, but this is what I see from my own experience. Any other divergent ideas or experiences are welcome from my point of view. And I am very happy to hear other opinions because I can LEARN something (from unsuccessful and successful setups).

Good luck !

fortechitsolutions · Feb 6, 2019

Hi Alwin, I guess for clarity: the environment would have dedicated 10gig "VRack" interfaces which are capped at 3gig throughput, if I understand the current OVH offer correctly. So - I am not sure whether you would consider a 3gig dedicated interface sufficiently low latency and high bandwidth to meet a 'modest performance storage back-end' target goal or not. In my own testing I think this is operational and sufficient for my needs (ie, test was done using HA-NFS running over same vrack back end). ... But your input is certainly appreciated to clarify!

fortechitsolutions · Feb 6, 2019

Hi guletz, many thanks for your feedback! If I understand correctly, you are using Lizard FS as a "VM storage" primary datastore for Proxmox, and it is working well for you in terms of reliability. May I ask, for clarity: (a) were you using the proxmox nodes as pure Lizard clients, or do you have Lizard server roles distributed across your proxmox nodes? (b) If you have Lizard server roles on the proxmox nodes, do you have them running on the 'base proxmox' host, or wrapped up inside LXC Containers? I've read one discussion on proxmox Lizard deploy which suggested for simpler management, wrapping the Lizard Servers as Proxmox LXC Containers 'made sense'. I am not entirely convinced, but would be interested to better understand the build / detail of your lizardFS Proxmox environment. Many!! thanks for your comments/feedback! --Tim

Alwin · Feb 6, 2019

fortechitsolutions said:
The environment would have dedicated 10gig "VRack" interfaces which are capped at 3gig throughput, if I understand the current OVH offer correctly. So - I am not sure if you would believe that 3gig dedicated interface can be sufficiently low latency and high bandwidth to meet a 'modest performance storage back-end' target goal or not.

10 GbE with max 3 GbE throughput, seems odd and is certainly only QoS. This form of "dedicated network" (vRack) only provides a private network segment over a public (suppose so) network (to OVH services). All the other participants on that "public" network can and most likely will interfere.

Besides the storage, for HA you need low and stable latency, as otherwise corosync will not be able to keep a stable membership for quorum. With resources under HA, the nodes will reset if they lose quorum. In my experience with OVH, this will only work if those nodes are connected to the same switch.
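
For reference, a minimal sketch of the majority rule behind quorum, assuming one vote per node. The function names are illustrative only; this models the rule and is not how corosync is queried in practice.

```python
def quorum_votes(total_nodes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    if total_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, reachable_nodes: int) -> bool:
    """True if the reachable partition holds a strict majority of votes."""
    return reachable_nodes >= quorum_votes(total_nodes)


if __name__ == "__main__":
    # A 5-node cluster, as discussed in this thread.
    print(quorum_votes(5))
    print(has_quorum(5, 3), has_quorum(5, 2))
```

A node that cannot see a majority of its peers, even briefly because of latency spikes, counts as being in the losing partition.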

Or see it from this perspective. OVH's own openstack clusters are only located within one data center. So, if OVH itself is not distributing its cluster over different data centers, then it might not be a viable solution for your setup either.

fortechitsolutions · Feb 6, 2019

Hi Alwin, I've done testing (2+ years ago) FYI and I had zero problems with Proxmox cluster on OVH dedicated hardware with the Vrack solution. I encourage you to read a bit about Vrack at OVH maybe before speculating further. (They do have fair bit of info on their website / Docs / FAQs / etc.) Broadly speaking the vrack network hardware appears to be separate from the 'public network' interfaces / hardware. I do know for sure VRack network supports multicast and works fine with proxmox cluster. In contrast multicast does NOT work on the public interfaces of OVH hosts. OVH do go out of their way to indicate that when you have a 'vrack network connection' you do get the bandwidth promised / and privacy assured for your connectivity. I assume it is being done via some form of switch-level management (VLan or otherwise, hidden from our view). Openstack does not enter into it in any way. Not sure why you mention openstack. I'm talking about 'dedicated hardware server nodes' for the ovh environment. I do agree that the 3Gb cap on the vrack performance sounds like it would be enforced via QoS measures on the 'Vrack switch infrastructure' ie, again hidden from our view. I believe as far as the dedicated hosts are concerned, they have 10Gig connections lit up that just happen to never go faster than 30% theoretical throughput max (ie, 3/10 Gbps approx).

Otherwise. As another reference point, I did set up another 'test proxmox cluster' using cheaper "So You Start" OVH dedicated hosts, which lack vrack features. They have only 'public interface' network. For getting proxmox cluster working here, I used "TINC" to create a private VPN bridge layer between my 3 physical hosts / bridged to a secondary private interface. Then used this private VPN network range for creation of my multicast friendly proxmox cluster subnet. As well as for bridging a 'private lan segment' for my proxmox VM guests. This config also worked flawlessly, and I never had any problem with TINC VPN / proxmox cluster / heartbeat or quorum. Despite running on top of the 'public ovh network' as the underlying network transport. So this is actually a viable way to get proxmox clusters alive on OVH if you want 'shared nothing' proxmox hosts with 'modest price' OVH hardware which lacks VRack features.

Anyhow. That last point - simply to say - that in the real world tests that I have done, I did not have any problem or concern around the OVH network / in terms of it being able to support Proxmox cluster.

Rather, my question to you was more around specifically trying to clarify your comments re: Latency and Performance concerns of 10gig vs 3gig vs 1gig for a 'shared storage fabric' for proxmox cluster. But I guess end of the day, it really just comes down to

-- what kind of IO performance requirements you have for your VMs running on proxmox
-- how good or poor the network is that supports this 'storage fabric'
-- if the latency and bandwidth are suitable and sufficient, then it will work within scope sufficiently well.
-- and if requirements exceed capacity of the underlying network (ie, bandwidth or latency) - then clearly performance will be 'less than sufficient' (ie, poor to bad). Which is more a matter of 'goodness of fit'. ie, don't try to put 10 pounds of potatoes into a 5 pound bag.
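
A back-of-the-envelope sketch of that 'goodness of fit' check. It assumes a worst case where every write crosses the same link once per stored copy, and it ignores protocol overhead and latency entirely; the function names and the 100 MB/s requirement are made up for illustration.

```python
def link_mb_per_s(link_gbit: float) -> float:
    """Convert a link speed in Gbit/s to MB/s (decimal units)."""
    return link_gbit * 1000 / 8


def write_ceiling_mb_per_s(link_gbit: float, copies: int) -> float:
    """Upper bound on write throughput if each copy crosses the same link."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return link_mb_per_s(link_gbit) / copies


def fits(required_mb_per_s: float, link_gbit: float, copies: int) -> bool:
    """True if the required write rate is within the link's ceiling."""
    return required_mb_per_s <= write_ceiling_mb_per_s(link_gbit, copies)


if __name__ == "__main__":
    # Compare the 1, 3 and 10 Gbit/s figures from this thread with 2 copies.
    for gbit in (1, 3, 10):
        print(gbit, write_ceiling_mb_per_s(gbit, 2), fits(100, gbit, 2))
```

Real throughput will be lower than these ceilings, so this is only useful for ruling configurations out, not for confirming them.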

Tim

guletz · Feb 6, 2019

Hi Tim, again!

fortechitsolutions said:
(a) were you using the proxmox nodes as pure Lizard clients, or do you have Lizard server roles distributed across your proxmox nodes?

- mixed: 2 Lizard servers on 2 PMX nodes/hosts, and 3 other Lizard servers on other non-PMX nodes

I think it is possible and more convenient to run Lizard servers in a CT, but in that case I think it is not as reliable (if the node reboots, the downtime is longer, because you need to wait until the CT has started).

guletz · Feb 6, 2019

fortechitsolutions said:
I am not entirely convinced, but would be interested to better understand the build / detail of your lizardFS Proxmox environment

The big problem was making the master (M) and shadow-master (SM) interchangeable (M is down -> SM is promoted to M). Anyway, with ucarp and some custom scripts (up/down) I was able to solve this problem. M and SM each use a dedicated SSD with ZFS on top (M and SM each run on a separate non-PMX server). These 2 non-PMX servers are connected to the PMX dedicated storage network (2 x Gb switches, with a 10 Gb inter-switch link). On the same servers as M and SM I also have the Lizard data storage servers, plus 1 other non-PMX server. The remaining 2 are located on 2 PMX nodes.
I am waiting to get more network interfaces, so that I can use 2 NICs on the PMX nodes and on the non-PMX nodes (one for the PMX storage network and one for LizardFS).

fortechitsolutions · Feb 6, 2019

Hi Guletz, thanks for the added detail. Much appreciated! I agree with your thought, ie, wrapping lizard inside LXC might add some delays due to 'added time for container startup' which makes me feel a bit uneasy about that approach. I think what might work most simply, is to run Lizard right on proxmox host layer. I also agree, it sounds like the failover process around the master / shadow-master is going to be either 'manual' or 'semi-automated via scripts'. I'm not sure if you are open to the idea of sharing your scripts for possible use by others?

Short term I think my client is going to use something 'simpler' and 'modest performance' with OVH for shared storage; and we'll see how that works out of the gate. But if Lizard is an option I could add in later without too much drama, it will be a good/tempting option to test and use if tests look good.

Thanks again for all your feedback/comments!

Tim

guletz · Feb 6, 2019

fortechitsolutions said:
sounds like the failover process around the master / shadow-master is going to be either 'manual' or 'semi-automated via scripts'.

Hi,

It is not rocket science. Ucarp uses a cluster IP (the VIP). When the VIP comes up on a node, you must kill any mfs backup and start the mfs master. When the VIP goes down, you kill any mfs master and start the mfs backup.
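
The rule can be sketched as a small decision function. This only models the logic: the role names and action strings are hypothetical placeholders, not real ucarp hooks or LizardFS commands, and actual up/down scripts would have to be written against your own installation.

```python
from enum import Enum
from typing import List


class Role(Enum):
    MASTER = "master"
    BACKUP = "backup"


def role_after(vip_is_up: bool) -> Role:
    """The node holding the VIP acts as master; any other node as backup."""
    return Role.MASTER if vip_is_up else Role.BACKUP


def actions_for_vip_event(vip_is_up: bool) -> List[str]:
    """Ordered placeholder steps for a VIP up/down event on this node."""
    target = role_after(vip_is_up)
    other = Role.BACKUP if target is Role.MASTER else Role.MASTER
    # Stop the role we are leaving before starting the one we are taking.
    return ["stop " + other.value, "start " + target.value]


if __name__ == "__main__":
    print(actions_for_vip_event(True))
    print(actions_for_vip_event(False))
```

Stopping the old role first keeps a node from briefly running both roles at once, which is the situation the up/down scripts are there to prevent.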

Alwin · Feb 7, 2019

@fortechitsolutions AFAICS on the website, the 10Gb/s option for a vRack, has a bandwidth of 3Gb/s and burst 10Gb/s.

As you can see from the vRack examples[1], it only isolates network segments by putting them into a VXLAN. The media underneath is still shared with everyone/everything else. Also, from their description, the vRack can span a network[2] over different datacenters and can therefore have widely varying latency.

For corosync[3], the underlying system of our HA stack, a stable and low latency is imperative. Otherwise quorum will never be achieved or kept reliably. The consequence is that with varying latency, corosync might run into a flapping state or a diverging truth. This in turn makes the decision-making for the rest of the HA stack unreliable and worthless. This applies to distributed storage as well (eg. Ceph/LizardFS).

[1] https://www.ovh.ie/solutions/vrack/
[2] https://www.ovh.ie/solutions/vrack/network-technology.xml
[3] https://en.wikipedia.org/wiki/Corosync_Cluster_Engine
