OFED for PVE

Gurn_Blanston

May 13, 2016
Greetings,

This is my first post so please be gentle.

Has anybody had success in installing OFED for Debian Jessie? I tried the install script from Mellanox but it seems to be hard-coded to Debian 8.2 or 8.1. PVE 4.1 is on Debian 8.3, which is what I was running when I spent two days trying to get it to install. I am now on PVE 4.2, which is on an even newer version of Jessie.

Assuming Mellanox will never keep up with versions of PVE, has anyone found a way to better integrate Mellanox Infiniband ConnectX cards? I am finding IP over IB to be especially unstable in the new PVE cluster hardware I am using. It worked fairly well in my lab environment, which was built on five-year-old Dell Precision workstations, although never getting up past about 12 Gbps. The new hosts are brand new 4U Supermicro servers with X10 motherboards, 256 GB RAM and 2 x 10-core Xeon v4 CPUs, and while I can get about twice the throughput with iperf (25 Gbps), file transfers tend to crash partway in.

A few details:
Using a ZFS dataset on a 9-disk all-SSD zpool as the source (the VM lives on it in raw format); the target is a ZFS dataset on a 4-disk spindle-based pool with the sharenfs property set.

Is this a hopeless case? I am less interested in achieving the full 40 Gbps of QDR Infiniband than I am in getting vzdump backups to run without bringing down the host server, which is generally what I have to do to make NFS work again.

Thanks to anyone who has read this all the way to the end and an extra thanks to anyone who is kind enough to reply!

Justin
 
Could you please explain what your technical problem really is?

I have no idea about those IB HBAs, but I suppose they need some kind of driver in the form of a kernel module provided by Mellanox (as I interpret your text). The problem is that PVE and Debian Jessie have one BIG difference: the kernel in use. Therefore a module built for Jessie will (as a binary blob) not work on PVE. PVE itself uses the Ubuntu LTS kernel (or at least uses it as the base of the PVE kernel), so maybe you can use the Mellanox drivers for Ubuntu LTS if they provide binary drivers/modules for it.

Maybe I'm totally wrong, but there are often problems with binary drivers and non-RHEL/non-SLES environments.
 
Do you know which LTS version? There are several Ubuntu versions available:

Ubuntu 15.10
Ubuntu 15.04
Ubuntu 14.10
Ubuntu 14.04
Ubuntu 12.04

Is it on 14.04 or 15.04? I don't know how to pick the OS apart enough to reveal this.

GB
 
Our current (pve-kernel-4.4.x-...) kernel is based on Ubuntu's 4.4 kernel from Xenial (16.04). We provide a pve-headers-xx package for each kernel image so you can build kernel modules yourself (if you have the module source ;)).
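
If you want to check which kernel you are actually running (and which headers package matches it), something along these lines should do:
Code:
uname -r                                  # e.g. 4.4.6-1-pve
pveversion -v | grep -i kernel            # lists the installed pve-kernel packages
apt-get install pve-headers-$(uname -r)   # headers matching the running kernel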
 
Well slow down Fabian! Let Mellanox catch up! ;) By the time they release a package for 16.04, PVE will be on 17.04. See my dilemma?

I don't know if Mellanox provides source. Do you think it will result in a more stable experience with IP over IB? Can you point me to a recipe? In my short history with a half dozen *nix-type operating systems, one lesson I have learned is that you deviate from the package repository at your peril. The Infiniband docs on the PVE Wiki are pretty old and only cover IPoIB, although there is a reference to OFED. The reference to OFED is a statement saying that Proxmox has been unable to compile it into the PVE kernel, so I don't anticipate that I would have a better outcome. What about RDMA? Has anyone been able to integrate that with PVE? I did find a post in the forums about using SRP, but it wasn't the step-by-step hand holding that I would need. Also, SRP is kind of long in the tooth, although so is iSCSI. My customer is unenthusiastic about iSCSI and I agree that it is less than ideal in the day of 25,000 IOPS, 500 MB/s commodity SSDs.

Regarding troubleshooting NFS issues, I have done more experimentation on our new cluster. The new Supermicro hosts shipped with 10 Gbit Ethernet, so I straight-piped a two-interface bond between two hosts and redid my NFS shares to use that interface. It iperfs at about 16 Gbps (round-robin Linux bond) and I had no issues whatsoever. Well, performance was less than awesome: actual throughput never went faster than about 2 Gbps. Sort of disappointing, but far more stable than IPoIB, and I don't yet know where the actual bottleneck is. ZFS has been difficult to make go really fast so far, especially in conjunction with NFS.

I guess I am really trying to decide whether to write off Infiniband for this cluster. I want fast backups and fast live migrations. Gbit Ethernet is so 2002 and seems barely adequate for an enterprise storage network nowadays. Gbit bonding helps some, but it is also a mixed bag in terms of weird side effects and troubleshooting complications. Are most PVE customers tossing Infiniband onto the pile of Betamax video recorders and going with 10Gb Ethernet? The glut of cheap Infiniband gear on eBay might be the answer to this question, but it is also what makes it attractive to the consumer. Also, leaving iSCSI behind is appealing, at least on paper. What would you do if it were your environment? What does Proxmox use for its own virtual environment? Could I open a ticket regarding my IPoIB instability? Is that a supported configuration? I only have so many tickets to spend, and if I am on a lost-cause type of mission it isn't worth it; writing off Infiniband while we have only sunk a few thousand dollars into it might be the best course. I need some fatherly advice!

GB

 
Well slow down Fabian! Let Mellanox catch up! ;) By the time they release a package for 16.04, PVE will be on 17.04. See my dilemma?

As I said already, 17.04 will not be an LTS release. Proxmox will probably go with 18.04 (which will be LTS), but by then there will (hopefully) be Debian 9 Stretch in play.
 
Well slow down Fabian! Let Mellanox catch up! ;) By the time they release a package for 16.04, PVE will be on 17.04. See my dilemma?

We will stay (based) on the 16.04 kernel for some time (it is Ubuntu's LTS kernel after all).

I don't know if Mellanox provides source. Do you think it will result in a more stable experience with IP over IB? Can you point me to a recipe? In my short history with a half dozen *nix-type operating systems, one lesson I have learned is that you deviate from the package repository at your peril. The Infiniband docs on the PVE Wiki are pretty old and only cover IPoIB, although there is a reference to OFED. The reference to OFED is a statement saying that Proxmox has been unable to compile it into the PVE kernel, so I don't anticipate that I would have a better outcome. What about RDMA? Has anyone been able to integrate that with PVE? I did find a post in the forums about using SRP, but it wasn't the step-by-step hand holding that I would need. Also, SRP is kind of long in the tooth, although so is iSCSI. My customer is unenthusiastic about iSCSI and I agree that it is less than ideal in the day of 25,000 IOPS, 500 MB/s commodity SSDs.

This is probably better answered by users that use various IB products together with Proxmox in production. Unfortunately, the IB kernel stack is rather prone to breakage (lots of rarely updated, vendor-provided code..). Building a kernel module should be pretty straight-forward though if you have the source and install all the build tools.

Regarding troubleshooting NFS issues, I have done more experimentation on our new cluster. The new Supermicro hosts shipped with 10 Gbit Ethernet, so I straight-piped a two-interface bond between two hosts and redid my NFS shares to use that interface. It iperfs at about 16 Gbps (round-robin Linux bond) and I had no issues whatsoever. Well, performance was less than awesome: actual throughput never went faster than about 2 Gbps. Sort of disappointing, but far more stable than IPoIB, and I don't yet know where the actual bottleneck is. ZFS has been difficult to make go really fast so far, especially in conjunction with NFS.

I guess I am really trying to decide whether to write off Infiniband for this cluster. I want fast backups and fast live migrations. Gbit Ethernet is so 2002 and seems barely adequate for an enterprise storage network nowadays. Gbit bonding helps some, but it is also a mixed bag in terms of weird side effects and troubleshooting complications. Are most PVE customers tossing Infiniband onto the pile of Betamax video recorders and going with 10Gb Ethernet? The glut of cheap Infiniband gear on eBay might be the answer to this question, but it is also what makes it attractive to the consumer. Also, leaving iSCSI behind is appealing, at least on paper. What would you do if it were your environment? What does Proxmox use for its own virtual environment? Could I open a ticket regarding my IPoIB instability? Is that a supported configuration? I only have so many tickets to spend, and if I am on a lost-cause type of mission it isn't worth it; writing off Infiniband while we have only sunk a few thousand dollars into it might be the best course. I need some fatherly advice!

 
Hey guys,

this could become a long one, let's see if I can cover some of the questions.

First of all, I have a very similar setup, so I can say that I have already had my fair share of trouble with the thing too ;-) A Supermicro Twin server with 2 nodes, X10 mainboards and onboard Mellanox ConnectX-3 Infiniband cards. My planned setup differs a little from yours, as I want to use DRBD as the storage backend to get shared storage. A third small node on an old KVM/QEMU machine gives my cluster quorum, but doesn't contain any storage or VMs at any time. The Infiniband connection is used as a fast replication link for DRBD - but with IPoIB, because this allows me to set up an Ethernet bond with one of my spare Gigabit Ethernet interfaces, and also because the DRBD manual says that the simple solution with SDP is deprecated. The only alternative left would be to use regular RDMA, but I didn't go further into this, as I never saw any CPU limitation during my tests (it could be because at the moment I only have 4 disks in RAID 6 in my nodes, giving me an overall throughput of around 500 MB/s). And on top of that, I don't need subnet managers and all the other typical Infiniband stuff.

All in all, that doesn't work YET - there are still some problems with DRBD and Proxmox itself, which is why I had to put the project on hold and wait for DRBD to become more than a technology preview. But I managed to get my Infiniband cards to run with the official Mellanox OFED drivers - with some workarounds.

seems to be hard-coded to Debian 8.2 or 8.1
You can override the distro check of the Mellanox OFED install script with the option "--distro debian8.2". This should work. I also added the maximum verbose level and two switches to tell the script what to install. All in all this results in:
Code:
./mlnxofedinstall -vvv --distro debian8.2 --basic --without-fw-update

During my two or three test installs I observed the following behaviour: when there were no previous OFED drivers installed, it went through the first steps of uninstalling the old packages, presented a list of packages it intended to install and then got stuck in a circular dependency with pve-manager and some other absolutely necessary Proxmox packages. You can only stop the script, as it will never finish. Later installations with OFED packages already installed (e.g. when updating) get stuck even earlier, at the uninstallation of those old packages. You can watch dpkg idle around forever with nearly no load and it will never finish, so we need to stop it, too.

But as it had already presented the list of packages it intended to install, I took that list, simply went into the "DEBS" subfolder and installed them manually with dpkg.
Code:
dpkg -i ofed-scripts* mlnx-ofed-kernel-utils* mlnx-ofed-kernel-dkms* iser-dkms* srp-dkms* libibverbs1* ibverbs-utils* libibverbs-dev* libibverbs1-dbg* libmlx4-1* libmlx4-dev* libmlx4-1-dbg* libmlx5-1* libmlx5-dev* libmlx5-1-dbg* libibumad* libibumad-static* libibumad-devel* ibacm* ibacm-dev* librdmacm1* librdmacm-utils* librdmacm-dev* libibmad* mstflint* ibdump* libibmad-static* libibmad-devel* libopensm* opensm* opensm-doc* libopensm-devel* infiniband-diags* infiniband-diags-compat* mft* kernel-mft-dkms* srptools* mlnx-ethtool*
That works; only one error pops up, related to a "...-guest..." package, which doesn't seem to be relevant. The dpkg messages also show that it does a regular DKMS installation, so everything seems to be fine - even for future, newer kernels.
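
If you want to double-check that the modules were really built for your running kernel, something like this should do (generic DKMS commands, nothing Mellanox-specific):
Code:
dkms status                           # lists each module, its version and the kernels it was built for
modinfo mlx4_core | grep -i ^version  # version of the mlx4 module the kernel will actually load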

As fabian already said, you need the matching "pve-headers-xx" package installed beforehand. The only downside at the moment is that the pve-headers package isn't kept up to date automatically during dist-upgrades. Apt installs all the new packages including the new kernel, but simply omits the corresponding headers package. But if you install it manually after the apt-get upgrade, you can safely reboot and won't be missing your Infiniband interface after the reboot. You don't even need to install the OFED packages again - hopefully this stays as it is. But this is also a reason why I'm totally happy with the IPoIB solution, as DRBD is then still able to work (at least at 1 Gbit/s).
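
In practice that means something along these lines after each kernel upgrade (a rough sketch; substitute whatever pve-kernel version apt actually pulled in):
Code:
apt-get dist-upgrade
apt-get install pve-headers-4.4.6-1-pve   # match the newly installed pve-kernel version
reboot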

I am finding IP over IB to be especially unstable in the new PVE cluster hardware I am using
I can't completely understand what you mean by "unstable". Leaving all the driver install issues aside, I can say that it works.

although never getting up past about 12 Gbps
Could it be that your old machines don't have (fast) enough PCIe lanes for the full Infiniband speed? In the company where I work, we tested some Infiniband cards and it took us some time to realise that PCIe 1.0 was the limit, not the CPUs and not the limitations of the IPoIB stack. Speaking of which, I achieved 37 Gbit/s with nearly no CPU load when I played around with the iperf settings. With DRBD my hard drives/RAID were the limit, but that is what I expected in the first place.

I don't know if Mellanox provides source.
Yes they do. But I haven't tried them out yet, as the whole project is on hold at the moment. You can find them here: http://www.mellanox.com/downloads/ofed/MLNX_OFED-3.2-2.0.0.0/MLNX_OFED_SRC-debian-3.2-2.0.0.0.tgz

16 Gbps (round-robin Linux bond)
Why do you use round robin? AFAIK round robin only allows one interface for receiving data and is only able to use both of them for sending. This could at least be one reason for your low rate.

Building a kernel module should be pretty straight-forward though if you have the source and install all the build tools.
Maybe I'll give it a try, although I'm a little bit confused by the whole source package at first glance. I'm not that experienced with building stuff ;-)
 
Hello CBdVSdFSMB,

Some of those consonants must be silent so I will just call you Jeff.

Thanks for the considerate response. I have pretty much written Infiniband off at this point but now you have me wondering if that was premature. I will try to reply to your points.

Regarding the Mellanox install script, I did try the command switches, or at least the "skip distro check" switch in the lab cluster with no success. I also played with the PVE-headers as you mention and just got myself royally confused. I went part way down the path of chasing down the individual package list and then got disgusted/frustrated and gave up on RDMA and decided to be content with the built in drivers and IPoIB. That approach worked OK in the lab, at least for NFS.

I also played with DRBD using bonded 1 Gb NICs (as many as 8 in a host) and I was able to get so so performance with replication. Maybe about 100 MB/s. My zpools could manage about 500 MB/s by themselves, but by the time you turn your zpool into zvols, then zvols into DRBD block devices, and then add them to PVE via LVM, you have pretty much put Rube Goldberg to shame. In other words, super complicated for our purposes.

Now the approach is to use PVE-zsync for replication (over Infiniband originally, now maybe 10G) between a 9-disk all-SSD zpool using 2 TB Samsung EVOs in two RAIDZ vdevs with one spare, and a large spindle-based pool of mirrors with SSD-based ZIL and L2ARC. This approach leaves out one crucial thing, however: how to live migrate? That leads us back to either NFS or SMB or shared storage on iSCSI. NFS seems to be the best integrated with ZFS, so I also have an NFS share on each node. Migrating is a little clumsy since you basically have to do it twice, once to the NFS share on the destination host and then once more from the NFS share to the ZFS dataset on SSD where the VMs want to live.

All this explanation is so that I can describe what I mean by NFS being unstable on Infiniband. In the lab environment, also using Infiniband (IPoIB) for NFS, this worked OK, although performance was generally limited by the lab-quality hodgepodge of disks we had lying around rather than the switching fabric. In the new cluster, I was unable to complete a single copy job over NFS. Not one. It would transfer a few GB and then the share would become unavailable, and I would have to go through all the hassle of trying to bring the share back (which usually required rebooting). These hosts take a long time to reboot. Not too good. This is what led me to looking for better drivers or newer kernel modules, and I don't know where to get them other than OFED. Also, Infiniband cables are really expensive!

At first I didn't know whether the lockups were because of NFS or because of IB. I was able to try the same setup using the 10Gb interfaces that came with the new servers. This is rock steady with NFS so I am laying blame at the feet of IB. It could be that the PCIe in the lab computers was not as fast but the slot the IB card sits in is an 8-lane slot. Near as I can tell, the backplane is PCIe 2.0. Dell Precision T7500 if you are interested. I had as many as 10 disks crammed into that thing! Now it is mostly there for quorum.

Regarding RR bond mode, I know of no other mode that gives more than one interface's worth of single-threaded throughput. I don't know how it handles receive packets, but in tests, overall throughput DOES stack with each interface you add. I could get about 300 MB/s with a four-NIC bond. I have never tried RR in production, however; it doesn't seem to like being on a switch. In production I will use LACP since the core switches support it. This will limit my single-threaded throughput to 1 Gbps, however. We will break down and get a 10G switch, I suppose.

I might try your suggestion with Mellanox OFED. Which package did you use? Which would you use for PVE 4.2? Also, will this break my support subscription? I am also nervous about trying to build my own modules. I am barely experienced with Linux in the first place. I have spent the past twenty years clicking "next" for a living. Compiling kernel drivers sounds over my head at this point and also a bit old fashioned.

Thanks for all the help Jeff.

GB
 
But as it had already presented the list of packages it intended to install, I took that list, simply went into the "DEBS" subfolder and installed them manually with dpkg.
Code:
dpkg -i ofed-scripts* mlnx-ofed-kernel-utils* mlnx-ofed-kernel-dkms* iser-dkms* srp-dkms* libibverbs1* ibverbs-utils* libibverbs-dev* libibverbs1-dbg* libmlx4-1* libmlx4-dev* libmlx4-1-dbg* libmlx5-1* libmlx5-dev* libmlx5-1-dbg* libibumad* libibumad-static* libibumad-devel* ibacm* ibacm-dev* librdmacm1* librdmacm-utils* librdmacm-dev* libibmad* mstflint* ibdump* libibmad-static* libibmad-devel* libopensm* opensm* opensm-doc* libopensm-devel* infiniband-diags* infiniband-diags-compat* mft* kernel-mft-dkms* srptools* mlnx-ethtool*
That works; only one error pops up, related to a "...-guest..." package, which doesn't seem to be relevant. The dpkg messages also show that it does a regular DKMS installation, so everything seems to be fine - even for future, newer kernels.

As fabian already said, you need the matching "pve-headers-xx" package installed beforehand. The only downside at the moment is that the pve-headers package isn't kept up to date automatically during dist-upgrades. Apt installs all the new packages including the new kernel, but simply omits the corresponding headers package. But if you install it manually after the apt-get upgrade, you can safely reboot and won't be missing your Infiniband interface after the reboot. You don't even need to install the OFED packages again - hopefully this stays as it is. But this is also a reason why I'm totally happy with the IPoIB solution, as DRBD is then still able to work (at least at 1 Gbit/s).

thanks for the detailed writeup! seems quite cumbersome, but at least it works..

Yes they do. But I haven't tried them out yet, as the whole project is on hold at the moment. You can find them here: http://www.mellanox.com/downloads/ofed/MLNX_OFED-3.2-2.0.0.0/MLNX_OFED_SRC-debian-3.2-2.0.0.0.tgz

Maybe I'll give it a try, although I'm a little bit confused by the whole source package at first glance. I'm not that experienced with building stuff ;-)

if DKMS works for you, you don't need to build from source manually (DKMS does basically the same thing).
 
@fabian:
if DKMS works for you, you don't need to build from source manually (DKMS does basically the same thing).
You're absolutely right. My only hope was/is to solve this more cleanly than with a workaround. Still, I'm not completely happy with the solution, because you always have to remember to update the kernel headers manually. Is there a technical reason why they aren't upgraded automatically when the kernel itself is upgraded?

@Gurn: My name is Johannes - you were close with Jeff, at least the first letter ;-) It's a shared account for our group of administrators. Since we're all students doing this in our free time, our personnel changes very often, which is why we try to use one account the whole time and build up some history and knowledge for the people who will work with our servers in the future. It's (one more) stupid abbreviation - Germans like those ;-)

with the built in drivers and IPoIB.
I must say that I didn't waste even a single thought on using them... I know that in Red Hat/SUSE/... the whole Infiniband stack is included in the distro, but I didn't expect to find something similar in the Debian world, so I went straight to the Mellanox drivers.

I was able to get so so performance with replication
If you're talking about the initial replication - yes, that seems to be absolutely normal. DRBD does some indexing in the background and says (somewhere in the manual) that you cannot expect it to run at full speed in the beginning. I later did some real performance testing by writing random data directly to an already-synced DRBD volume with dd. Then it reached full RAID speed on both nodes.
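
The kind of test I mean looks roughly like this (the device name is just an example, and it overwrites the volume, so only do this on a test resource):
Code:
dd if=/dev/urandom of=/dev/drbd0 bs=1M count=4096 oflag=direct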

Which package did you use? Which would you use for PVE 4.2?
These ones: http://www.mellanox.com/page/mlnx_o...X_OFED_LINUX-3.2-2.0.0.0-debian8.2-x86_64.tgz
Just google for "Mellanox OFED", then select your version (newest)/distro/architecture and take the (non-SOURCE) tgz archive. Unpack it and then you should be able to follow along with the steps I already wrote up.

In other words, super complicated for our purposes.
I have the feeling that I still don't understand what you are stacking on top of what and why. But I understand the confusion, as I struggled with the new DRBD9 structure as well in the beginning. See here: https://forum.proxmox.com/threads/proxmox-4-1-2-nodes.26172/#post-132976
Maybe you could draw a small diagram and upload it? Or at least start with how many nodes you have (that information seems to be missing)? Do you have some sort of shared storage like a SAN (is that why you talk about NFS), or do you want to use internal disks? Depending on that, there are some very easy recommendations on the storage technique. What's the plan for backups? Do other servers need access to the storage too somehow? I'm quite confused at the moment. :confused:

Regarding bonding modes: I couldn't quickly find a very good explanation of the different modes in English (I used one in German in the past), but there must be some. I always try to use "the standard" LACP (802.3ad) when the network interfaces allow it, because then I know that the bond uses both interfaces at the same time for all purposes, and both the switch and the server only see one interface (group). Our switch supports this and it works very well on most of our servers (only the older ones don't support it). And you can even build direct back-to-back connections if you want to. Round robin should do the job too, but neither your local interface nor the switch can guarantee that all packets are transmitted and arrive in order. http://serverfault.com/questions/445839/what-are-differences-between-balance-rr-and-802-3ad
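
As a rough illustration, an 802.3ad bond in /etc/network/interfaces looks something like this (interface names and the address are made up; the switch ports have to be configured for LACP as well):
Code:
auto bond0
iface bond0 inet static
        address 10.10.10.81
        netmask 255.255.255.0
        bond-slaves eth2 eth3
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4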

Also, will this break my support subscription?
I would say definitely not - as long as you don't bother the Proxmox folks with Infiniband problems. In the end you should have interfaces acting as normal Ethernet interfaces, and PVE won't know anything about the underlying hardware. On top of that, if you use DRBD, Proxmox is not really interested in your dedicated replication link anyway.

I am barely experienced with Linux in the first place.
We all were once, and I might add that I still am when I look at my other colleague, who has been using Linux as his regular OS at home since he started using a computer.
 
Regarding your being confused, I am distilling about a year's worth of dicking around with PVE (and others like SmartOS) into a few paragraphs out of laziness on my part, so I don't take that personally and I am sorry if it is unclear. I take it you aren't using ZFS in your setup.

Here goes:

Initially, I was excited about DRBD because I came upon it while trying to get some understanding of PVE's storage model. I found some discussions about using DRBD in conjunction with ZFS. Consider ZFS a design requirement: my customer wants it, so I have agreed to figure out how to implement it. Another requirement is that we don't want a SAN. We have one (EMC CX-4) that is out of support, along with our two R710 hosts and redundant iSCSI switches for the storage network. The current production environment is VMware 5.x and has been since the installation of the Dell/EMC cluster we put in about six years ago. It has been pretty reliable, but every year EMC and VMware want about 20% of the original price as a support tax, and every year we feel less and less inclined to line EMC's and VMware's pockets for what is now pretty outdated stuff. Shared storage also comes with its own tax in the level of complexity it brings to a small environment. Thus, it seems like maybe nowadays we can jettison the storage network and SAN, which take up 1/3 of the cabinet, and use near-real-time replication like PVE-zsync or DRBD instead.

My experiments with DRBD weren't especially encouraging. Yes, I could make it work, but if you aren't working with bare disks (we are not, since we are using ZFS at the bare-disk level), you have to use "fake real disks", which in ZFS land are called zvols. Linux will treat a zvol just like any block device, and thus DRBD can use it as a DRBD backing device. Once you have your DRBD device created, you can then make it part of an LVM volume group, which is necessary so that PVE can add it to both the source server and the DRBD target server as if it were a single shared volume rather than a replicated DRBD-backed volume. So, in order from lowest to highest you have:

Physical Disk
ZFS zpool made of physical disks
ZFS zvol made from zpool to make it a virtual block device for DRBD
DRBD bound to zvol to make a DRBD volume
LVM using the DRBD block device as a PV (physical volume)
LVM using the PV in a VG (volume group)
You add the VG to PVE
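
In command form, that stack looks roughly like this (a sketch only; pool, resource and VG names are made up):
Code:
zfs create -V 500G tank/drbd-backing   # zvol acting as the "fake real disk"
# /etc/drbd.d/r0.res points at /dev/zvol/tank/drbd-backing on both nodes
drbdadm create-md r0
drbdadm up r0
pvcreate /dev/drbd0                    # LVM physical volume on top of the DRBD device
vgcreate drbdvg /dev/drbd0             # volume group that PVE can then use as LVM storage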

That, for me, is so complicated it is sort of silly. It also comes with some side effects that needed to be worked out. I just can't trust that design for a production system. It seems as complicated as a proper storage network and SAN, or even more so.

With this understanding, I looked for a simpler alternative that is also compatible with, or even complementary to, ZFS. This led to thinking about PVE-zsync, which at the time was a "technology preview". Initial testing worked pretty well, but it did have a tendency to crap out after a variable length of time - a few days, or a few minutes, depending. It was usually a problem with ZFS being unwilling to accept another snapshot. I set the number of allowed snapshots to 1 and that seemed to help some, but it still failed eventually. By default it runs at 15-minute intervals and I think that is good enough for us. Maybe we can make it run every minute or so, which would be better, assuming I can make it work stably.
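
For reference, setting up such a job looks something like this (VM ID, target host and dataset are only examples; -maxsnap 1 is the snapshot limit I mentioned):
Code:
pve-zsync create -source 204 -dest 10.0.0.80:spindlepool/zsync -maxsnap 1 -verbose
# this adds a cron entry that runs the sync every 15 minutes by default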

The problem with this approach is that it is not trivial to start the VM from the sync target if something happens to your source volume. You have to edit some files manually (I never actually tried it). It also isn't a substitute for a proper backup. Lastly, you can't use PVE-zsync for guest migration between hosts. For that you need shared storage on iSCSI/FC, or NFS/SMB, or the like. NFS is fairly well integrated with ZFS and PVE seems to like it. This leads to the current plan for the final design:

  • Three hosts: two badass Supermicro 4U hosts, each with 9 x 2 TB Samsung EVO SSDs and 5 x 4 TB Constellation spindles, plus a third host for quorum. The third host is just a leftover lab machine (Dell Precision T7500).
  • Each host has a Mellanox ConnectX-2 card on its own network connected to an Infiniband switch. You can find this stuff cheap on eBay.
  • IPoIB makes up the "storage network". There is no SAN, just the SSD zpool and the spindle zpool. The idea is to replicate from the SSDs on host A to the spindle zpool on host B and vice versa using PVE-zsync. For migration between hosts, I created an NFS ZFS dataset (a way of creating a discrete share for NFS within a given zpool) on the spindle zpool of each host, then added them to PVE as NFS storage (rough commands below). This way, I can live migrate from the native ZFS SSD pool on host A to the NFS share on host B (but the OS thinks it is on host A in /mnt/pve/nfssharename) and then one more time from the NFS share on host B to the native ZFS pool on host B. It sucks, but what can you do? We want ZFS at the top level for a variety of reasons.
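
The per-host NFS dataset part boils down to something like this (pool, dataset and storage names are examples; in reality you would restrict the export options):
Code:
zfs create spindlepool/nfsshare
zfs set sharenfs=on spindlepool/nfsshare
# on the other node, add it to PVE as NFS storage (10.0.0.80 being this host's IB address)
pvesm add nfs nfs-hostA --server 10.0.0.80 --export /spindlepool/nfsshare --content images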

If this is at all comprehensible to you, then you will understand why I care about NFS stability over an Infiniband fabric. I haven't even started playing with PVE-zsync on the new cluster until I can get NFS to be reliable over IB. I could just use NFS and skip PVE-zsync, but NFS doesn't allow for native ZFS snapshots, which we want. Also, my customer, who is an experienced Debian developer, HATES NFS, so I have to keep that part on the down low ;). At this time, NFS over IB doesn't work reliably for me in the new cluster. I have tried the same approach using 10G, which the Supermicro servers shipped with. I have no 10G switch, however, so it only works between two nodes. NFS over 10G works great, by the way.

I have spent the better part of today trying to find an affordable 10G switch that doesn't suck and costs less than a new car. My findings have not been encouraging. Therefore, I have resolved to give IB another try. Next week. I am going home for the weekend!

If one of you fine individuals out there has read through all this and has a better idea, I would love to hear about it. By now, I would have built a SAN and called it a day, but this is offensive to my customer's cheapskate sensibilities and I don't particularly blame him. He can pay me to flail around all he wants, but eventually I have to deliver a working cluster. Did I mention that we want it to be fast? I thought I would mention that part before getting replies about doing Ceph or GlusterFS and the like. Performance comes first, reliability hopefully a close second. Also cheapness, or else we would have just stuck with VMware or even Hyper-V (which is my background). Actually, Hyper-V would probably be the cheapest way to go, but my customer HATES, really HATES Microsoft.

Sorry about all the idiomatic expressions, but your English reads as if written by a native. I took one term of German in college but remember very little. Let me know if you need more clarity. I would be interested to hear about your DRBD setup. How are you presenting the DRBD-backed volume to PVE? Is your DRBD bound to individual spindles or to some sort of logical volume in a RAID setup? ZFS wants complete access to the physical disk, which is the reason for all the complication described above. Some people have talked about using DRBD to back each spindle, then using the DRBD block devices as the disks in a zpool. This might be doable, but ZFS would prefer you let it see the actual physical disk rather than a virtual block device or RAID virtual volume.

Thanks Jeff!

GB
 
Hello again,

This morning I had a thought. The Infiniband adapters each have two ports, and on two of my three nodes I had both ports patched into a switch with IPs assigned. I did that by way of troubleshooting connectivity issues I was encountering during setup. It dawned on me that this could be my problem with backup jobs running over the IPoIB network: if the sending and receiving hosts are both, in essence, "multihomed" on the IB subnet, then maybe packets could get mixed up in some way.

So, I took steps to ensure that only one interface per host was active and tried a vzdump job again. Just to keep things as simple as I could, I set compression to "none". It worked! I got all the way through the backup without errors. It wasn't very fast, though: it backed up 106 GB in 563 seconds, an average throughput of ~200 MB/s. I was expecting (hoping for) something more like 500 MB/s. It was, after all, backing up from a zpool of SSDs to a zpool of SSDs that I know can write at around 750 MB/s sustained. iperf shows the IPoIB connection able to handle about 27 Gbps, so why can't I even attain 2 Gb/s?

It DID complete, however, so I thought I would try once more, but using LZO compression. Well, it was off to a fast start, ~400 MB/s up to 21% completion, and then it just stopped working as before. Here is what I find in the syslog.


May 23 12:16:46 pve-1 pvedaemon[27384]: INFO: Starting Backup of VM 204 (qemu)
May 23 12:16:47 pve-1 qm[27387]: <root@pam> update VM 204: -lock backup
May 23 12:16:47 pve-1 systemd[1]: [/run/systemd/system/204.scope.d/50-Description.conf:3] Unknown lvalue 'ase' in section 'Unit'
May 23 12:16:47 pve-1 kernel: device tap204i0 entered promiscuous mode
May 23 12:16:47 pve-1 kernel: vmbr0: port 2(tap204i0) entered forwarding state
May 23 12:16:47 pve-1 kernel: vmbr0: port 2(tap204i0) entered forwarding state
May 23 12:17:01 pve-1 CRON[27478]: pam_unix(cron:session): session opened for user root by (uid=0)
May 23 12:17:01 pve-1 CRON[27479]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 23 12:17:01 pve-1 CRON[27478]: pam_unix(cron:session): session closed for user root
May 23 12:19:06 pve-1 pmxcfs[9300]: [status] notice: received log
May 23 12:19:29 pve-1 pvedaemon[26115]: worker exit
May 23 12:19:29 pve-1 pvedaemon[9446]: worker 26115 finished
May 23 12:19:29 pve-1 pvedaemon[9446]: starting 1 worker(s)
May 23 12:19:29 pve-1 pvedaemon[9446]: worker 27682 started
May 23 12:19:58 pve-1 kernel: INFO: task lzop:27437 blocked for more than 120 seconds.
May 23 12:19:58 pve-1 kernel: Tainted: P O 4.4.6-1-pve #1
May 23 12:19:58 pve-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 23 12:19:58 pve-1 kernel: lzop D ffff8815eafd3a58 0 27437 27384 0x00000000
May 23 12:19:58 pve-1 kernel: ffff8815eafd3a58 0000000000017180 ffff881ff29144c0 ffff881fbefb8dc0
May 23 12:19:58 pve-1 kernel: ffff8815eafd4000 ffff881fff997180 7fffffffffffffff ffffffff81843eb0
May 23 12:19:58 pve-1 kernel: ffff8815eafd3bb0 ffff8815eafd3a70 ffffffff818435c5 0000000000000000
May 23 12:19:58 pve-1 kernel: Call Trace:
May 23 12:19:58 pve-1 kernel: [<ffffffff81843eb0>] ? bit_wait_timeout+0xa0/0xa0
May 23 12:19:58 pve-1 kernel: [<ffffffff818435c5>] schedule+0x35/0x80
May 23 12:19:58 pve-1 kernel: [<ffffffff81846805>] schedule_timeout+0x235/0x2d0
May 23 12:19:58 pve-1 kernel: [<ffffffff810381f9>] ? sched_clock+0x9/0x10
May 23 12:19:58 pve-1 kernel: [<ffffffff810b8b95>] ? put_prev_entity+0x35/0x750
May 23 12:19:58 pve-1 kernel: [<ffffffff810f562c>] ? ktime_get+0x3c/0xb0
May 23 12:19:58 pve-1 kernel: [<ffffffff81843eb0>] ? bit_wait_timeout+0xa0/0xa0
May 23 12:19:58 pve-1 kernel: [<ffffffff81842adb>] io_schedule_timeout+0xbb/0x140
May 23 12:19:58 pve-1 kernel: [<ffffffff81843ecb>] bit_wait_io+0x1b/0x70
May 23 12:19:58 pve-1 kernel: [<ffffffff8184397f>] __wait_on_bit+0x5f/0x90
May 23 12:19:58 pve-1 kernel: [<ffffffff81843eb0>] ? bit_wait_timeout+0xa0/0xa0
May 23 12:19:58 pve-1 kernel: [<ffffffff81843a31>] out_of_line_wait_on_bit+0x81/0xb0
May 23 12:19:58 pve-1 kernel: [<ffffffff810c3d70>] ? autoremove_wake_function+0x40/0x40
May 23 12:19:58 pve-1 kernel: [<ffffffffc0b3cbb4>] nfs_wait_on_request+0x34/0x40 [nfs]
May 23 12:19:58 pve-1 kernel: [<ffffffffc0b418ee>] nfs_updatepage+0x15e/0x920 [nfs]
May 23 12:19:58 pve-1 kernel: [<ffffffffc0b31d14>] nfs_write_end+0x154/0x500 [nfs]
May 23 12:19:58 pve-1 kernel: [<ffffffff813fe1bf>] ? iov_iter_copy_from_user_atomic+0x8f/0x230
May 23 12:19:58 pve-1 kernel: [<ffffffff8118cda4>] generic_perform_write+0x114/0x1c0
May 23 12:19:58 pve-1 kernel: [<ffffffff8118ef36>] __generic_file_write_iter+0x1a6/0x1f0
May 23 12:19:58 pve-1 kernel: [<ffffffff81229523>] ? touch_atime+0x33/0xd0
May 23 12:19:58 pve-1 kernel: [<ffffffff8118f064>] generic_file_write_iter+0xe4/0x1e0
May 23 12:19:58 pve-1 kernel: [<ffffffffc0b3148a>] nfs_file_write+0x9a/0x160 [nfs]
May 23 12:19:58 pve-1 kernel: [<ffffffff8120bf3b>] new_sync_write+0x9b/0xe0
May 23 12:19:58 pve-1 kernel: [<ffffffff8120bfa6>] __vfs_write+0x26/0x40
May 23 12:19:58 pve-1 kernel: [<ffffffff8120c619>] vfs_write+0xa9/0x190
May 23 12:19:58 pve-1 kernel: [<ffffffff8120d3f5>] SyS_write+0x55/0xc0
May 23 12:19:58 pve-1 kernel: [<ffffffff818476f6>] entry_SYSCALL_64_fastpath+0x16/0x75
May 23 12:21:58 pve-1 kernel: INFO: task lzop:27437 blocked for more than 120 seconds.


This series of messages just repeats in a gradual loop as timeouts are reached. The NFS share is no longer "visible" to the hosts, although the IPoIB network is still up (I can ping between hosts on its subnet). As a matter of fact, the NFS shares are still connected and the folder contents are there. There is an 18 GB .dat file, which is how far the backup job progressed before becoming hosed.

I will have to force the job to quit now and forcibly remove the lock on the image.

Can any of you out there interpret the syslog entries? I can see that lzop is timing out as it is "blocked", whatever that means. I don't know what "tainted" means in this context but it doesn't sound good. Is any of this a clue?

Before I run the risk of making things worse by attempting to install new drivers/modules per Jeff's kind advice, perhaps someone could identify the offending agent; if it turns out NOT to be related to the fact that NFS is shared over IPoIB, then all the time and effort to replace the drivers would be for nothing. The way I read the syslog entries is that something happens during the backup, and what I am seeing in the syslog are the various consequences of that failure, not the actual thing that is causing the errors in the first place. At first I blamed it on lzop, but subsequent attempts with compression turned off also exhibited this problem - same errors, minus the bits about lzop. Also, I don't understand the message about "promiscuous mode". Is it relevant? I am not sure what that interface (tap204i0) is in the first place.

Fabian, I will open a ticket if you think that is the best way forward.

Thanks for the help,

GB
 
Good day, ladies and gentlemen,

I am just coming back around to this and upon revisiting Mellanox.com to get the packages for manual installation, I find now that there is a package for Ubuntu 16.04. I am going to attempt to install this using the intended install script. If it seems to be another rat hole I will try installing manually as Jeff (Johannes) suggests. More to come...
 
Good morning,

sorry for the long silence on my end, I'm quite busy at the moment. Nevertheless, I tried the new package yesterday and it shows the same behaviour as the other ones - stuck at package uninstall.

The package list stayed the same and the manual install works; one additional error pops up, so now the installation of two packages encounters errors (infiniband-diags-guest and mstflint).

Cheers, Johannes
 
Hello,

I have the same result. I did a short ping test prior to updating the drivers and averaged 0.110 ms with a 0.025 ms mean deviation. The min/max values are 0.049 and 0.149 ms. Do you get that much noise in a simple ping? That is a threefold difference, although I will grant that it is overall much lower than my 1 Gb pings. After updating the drivers, a similar test shows an average ping time of 0.102 ms, an mdev of 0.031, with min/max of 0.075/0.170 ms. It seems quicker but no less noisy. I am not sure how valid these numbers are statistically; I only let the ping run for about 20 counts. To the naked eye it did seem quicker.

Once I had two nodes with updated drivers, I did another test backup to an NFS share backed by a ZFS dataset. The destination zpool is only a four disk pool of two mirrors but it does have an SSD (of middling quality) ZIL. The job generally ran at around 200 MB/s but at 100% completion it just hung. Check out the syslog:

Jun 02 16:15:59 pve-1 pvedaemon[16091]: INFO: starting new backup job: vzdump 204 --compress lzo --mode snapshot --remove 0 --storage pve3backup --node pve-1
Jun 02 16:15:59 pve-1 pvedaemon[16091]: INFO: Starting Backup of VM 204 (qemu)
Jun 02 16:16:00 pve-1 qm[16094]: <root@pam> update VM 204: -lock backup
Jun 02 16:16:00 pve-1 systemd[1]: [/run/systemd/system/204.scope.d/50-Description.conf:3] Unknown lvalue 'ase' in section 'Unit'
Jun 02 16:16:00 pve-1 systemd[1]: Failed to reset devices.list on /system.slice: Invalid argument
Jun 02 16:16:00 pve-1 kernel: device tap204i0 entered promiscuous mode
Jun 02 16:16:00 pve-1 kernel: vmbr0: port 2(tap204i0) entered forwarding state
Jun 02 16:16:00 pve-1 kernel: vmbr0: port 2(tap204i0) entered forwarding state
Jun 02 16:17:01 pve-1 CRON[16301]: pam_unix(cron:session): session opened for user root by (uid=0)
Jun 02 16:17:01 pve-1 CRON[16302]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jun 02 16:17:01 pve-1 CRON[16301]: pam_unix(cron:session): session closed for user root
Jun 02 16:22:28 pve-1 pveproxy[15932]: worker exit
Jun 02 16:22:28 pve-1 pveproxy[10039]: worker 15932 finished
Jun 02 16:22:28 pve-1 pveproxy[10039]: starting 1 worker(s)
Jun 02 16:22:28 pve-1 pveproxy[10039]: worker 17064 started
Jun 02 16:23:48 pve-1 kernel: nfs: server 10.0.0.80 not responding, still trying
Jun 02 16:25:37 pve-1 pvedaemon[10030]: <root@pam> successful auth for user 'root@pam'
Jun 02 16:25:39 pve-1 kernel: nfs: server 10.0.0.80 OK
Jun 02 16:25:47 pve-1 pveproxy[10040]: worker exit
Jun 02 16:25:47 pve-1 pveproxy[10039]: worker 10040 finished
Jun 02 16:25:47 pve-1 pveproxy[10039]: starting 1 worker(s)
Jun 02 16:25:47 pve-1 pveproxy[10039]: worker 17529 started
Jun 02 16:33:05 pve-1 pveproxy[17064]: worker exit
Jun 02 16:33:05 pve-1 pveproxy[10039]: worker 17064 finished
Jun 02 16:33:05 pve-1 pveproxy[10039]: starting 1 worker(s)
Jun 02 16:33:05 pve-1 pveproxy[10039]: worker 18555 started
Jun 02 16:33:32 pve-1 rrdcached[9801]: flushing old values
Jun 02 16:33:32 pve-1 rrdcached[9801]: rotating journals
Jun 02 16:33:32 pve-1 rrdcached[9801]: started new journal /var/lib/rrdcached/journal/rrd.journal.1464903212.971187
Jun 02 16:33:33 pve-1 pmxcfs[9948]: [dcdb] notice: data verification successful
Jun 02 16:33:43 pve-1 kernel: nfs: server 10.0.0.80 not responding, still trying
Jun 02 16:35:15 pve-1 kernel: INFO: task task UPID:pve-1:16091 blocked for more than 120 seconds.
Jun 02 16:35:15 pve-1 kernel: Tainted: P O 4.4.6-1-pve #1
Jun 02 16:35:15 pve-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Jun 02 16:35:15 pve-1 kernel: task UPID:pve-1 D ffff883f8de7bd98 0 16091 10029 0x00000000
Jun 02 16:35:15 pve-1 kernel: ffff883f8de7bd98 ffff883fc37309b0 ffff881ff295c4c0 ffff883fad951b80
Jun 02 16:35:15 pve-1 kernel: ffff883f8de7c000 ffff883fc3730a5c ffff883fad951b80 00000000ffffffff
Jun 02 16:35:15 pve-1 kernel: ffff883fc3730a60 ffff883f8de7bdb0 ffffffff818435c5 ffff883fc3730a58
Jun 02 16:35:15 pve-1 kernel: Call Trace:
Jun 02 16:35:15 pve-1 kernel: [<ffffffff818435c5>] schedule+0x35/0x80
Jun 02 16:35:15 pve-1 kernel: [<ffffffff8184387e>] schedule_preempt_disabled+0xe/0x10
Jun 02 16:35:15 pve-1 kernel: [<ffffffff81845589>] __mutex_lock_slowpath+0xb9/0x130
Jun 02 16:35:15 pve-1 kernel: [<ffffffff8184561f>] mutex_lock+0x1f/0x30
Jun 02 16:35:15 pve-1 kernel: [<ffffffffc0a1d095>] nfs_file_fsync+0x45/0x130 [nfs]
Jun 02 16:35:15 pve-1 kernel: [<ffffffff8124084d>] vfs_fsync_range+0x3d/0xb0
Jun 02 16:35:15 pve-1 kernel: [<ffffffff812408dc>] vfs_fsync+0x1c/0x20
Jun 02 16:35:15 pve-1 kernel: [<ffffffffc0a1d596>] nfs_file_flush+0x46/0x60 [nfs]
Jun 02 16:35:15 pve-1 kernel: [<ffffffff8120959f>] filp_close+0x2f/0x70
Jun 02 16:35:15 pve-1 kernel: [<ffffffff8122ae23>] __close_fd+0xa3/0xc0
Jun 02 16:35:15 pve-1 kernel: [<ffffffff81209603>] SyS_close+0x23/0x50
Jun 02 16:35:15 pve-1 kernel: [<ffffffff818476f6>] entry_SYSCALL_64_fastpath+0x16/0x75
Jun 02 16:36:53 pve-1 kernel: nfs: server 10.0.0.80 OK
Jun 02 16:37:02 pve-1 kernel: zd64: p1 p2
Jun 02 16:37:02 pve-1 kernel: zd32: p1
Jun 02 16:37:02 pve-1 kernel: vmbr0: port 2(tap204i0) entered disabled state
Jun 02 16:37:04 pve-1 pvedaemon[16091]: INFO: Finished Backup of VM 204 (00:21:05)
Jun 02 16:37:04 pve-1 pvedaemon[16091]: INFO: Backup job finished successfully
Jun 02 16:39:52 pve-1 pvedaemon[10029]: worker exit

On a somewhat related ticket I opened with Proxmox I complained that the job failed because that is what it looked like when I had to leave work yesterday. I had to leave immediately or else I would have killed the backup job, unlocked the VM, then rebooted the server. Luckily, I didn't do those things. When I got back to the office today I found that the job finished. Fifteen minutes into the backup job is where it reported that 100% had been backed up. Then it just sat there for a few minutes and I had to leave because everyone was going home. I went home with a heavy heart. If I had been able to wait another five minutes I could have gone home with the feeling of a job well done!

Why is it doing this? I have highlighted some of the obviously backup related messages just to save you from having to sift through all the lines. I don't know what half of this stuff means but I definitely don't like the sound of the word "Tainted" at Jun 02 16:35:15.

The fact that it completes at all is an improvement over the previous state. I could speculate that the ZIL is fooling the NFS client into thinking that all of the backup's bits have been sent and therefore received by the NFS server; in actuality, the receiving server is still trying to get everything written to spindles. Eventually everything is written to spindle, the NFS server will then agree to respond to the NFS client, and the backup process can proceed with closing and unlocking the VM. It could be that if I forced synchronous writes I could get the backup to complete without errors, but I bet it would take a long time. Something about the NFS server is not willing to trust that the bits have all been committed to storage. I believe that ZFS by design is supposed to convince the kernel that the bits actually HAVE been committed, but for some reason this isn't happening.

I will experiment further by backing up to a faster SSD based pool. I have a feeling that I am still missing something. It seems like my zpool should be able to write a backup file of 91 GB in less than 21 minutes. That is only a rate of 74 MB/s. I could go that fast with 1 Gb Ethernet! I would expect my zpool to handle 200 MB/s for a bulk copy even without an SSD ZIL. I will post what I find out.

Any guidance is appreciated.

Sincerely,

GB
 
The ZIL is flushed every 5 seconds and is only used for sync writes (which is why it is important to use SSDs with good sync performance if you want to see a real benefit). You would need to post more details of your setup (NFS mount options, ZFS settings, ...) to find out what causes the delay you experienced. The error message in the syslog itself does not say much - NFS experienced a hang when a file descriptor (most probably of the backup archive) was closed (which triggers a flush/fsync). The "tainted" flag is because of the ZFS module being loaded; that is nothing to worry about.

I would first separately benchmark your zpool locally (with and without sync to see the effect of the ZIL), and then over NFS and post the exact benchmarks and results.
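
A rough sketch of what I mean (pool and dataset names are just examples; note that with compression enabled, zeros from /dev/zero compress away, so a pre-generated random file is more honest):
Code:
zfs create tank/bench
dd if=/dev/zero of=/tank/bench/test.bin bs=1M count=8192 conv=fdatasync   # async baseline
zfs set sync=always tank/bench                                            # force every write through the ZIL
dd if=/dev/zero of=/tank/bench/test.bin bs=1M count=8192 conv=fdatasync   # sync-heavy run
zfs inherit sync tank/bench                                               # restore the default
nfsstat -m                                                                # on the client: shows the NFS mount options in use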
 
Thanks Fabian. I have been focused on problems I am having with using a Linux bond for the vmbr0 bridge. I just can't make it work on PVE 4. I have upgraded to the 4.4.8 PVE kernel with some difficulty (thanks to installing OFED).

On that subject, for what it is worth, here is the short version of what happens when you try to do a dist-upgrade after installing OFED.

The upgrade fails a dependency check. One of the packages installed by Jeff's script, mstflint, requires version 5.2 or higher of libstdc++6. That package is essential to Proxmox, but Proxmox ships an older version, 4.9. Therefore mstflint's installation fails, and because of this the kernel upgrade refuses to install.

# apt-get purge mstflint

This removes the "bad package". Then do

# apt-get autoremove

This removes a few unnecessary libraries that came with mstflint. Now you can do the kernel upgrade. Then make sure to install the new kernel headers

# apt-get install pve-headers-4.4.8-1-pve

Reboot to apply the new kernel and guess what happens.

Yup, the Mellanox drivers are back to the old version from 2014 - back to the stock PVE kernel's version, as it happens. So you have to install OFED again. Remember Jeff's dpkg command? Here it is again, minus the mstflint package. Mstflint is the Mellanox firmware installer. I asked Wolfgang at Proxmox if it would be OK to upgrade libstdc++6 to version 5.2 and he didn't think it was a good idea, so I will live without mstflint or maybe flash my cards on a different physical system.

Code:
dpkg -i ofed-scripts* mlnx-ofed-kernel-utils* mlnx-ofed-kernel-dkms* iser-dkms* srp-dkms* libibverbs1* ibverbs-utils* libibverbs-dev* libibverbs1-dbg* libmlx4-1* libmlx4-dev* libmlx4-1-dbg* libmlx5-1* libmlx5-dev* libmlx5-1-dbg* libibumad* libibumad-static* libibumad-devel* ibacm* ibacm-dev* librdmacm1* librdmacm-utils* librdmacm-dev* libibmad*  ibdump* libibmad-static* libibmad-devel* libopensm* opensm* opensm-doc* libopensm-devel* infiniband-diags* infiniband-diags-compat* mft* kernel-mft-dkms* srptools* mlnx-ethtool*

After rebooting you are once again back to the new driver set and everything still works as it did before upgrading the kernel. After I had done this process a few times, I begged Wolfgang to see about including the newer drivers in the PVE kernel, and he has kindly agreed to attempt it, so this procedure may not be necessary in the near future.

I will delve into the VZDump/NFS/ZFS situation as time allows. It is important to point out that the new drivers seem to do a much better job. I was able to migrate a 100 GB Windows Server 2012 R2 raw image on native ZFS to an NFS share on another host over the updated IP over IB fabric in right around 90 seconds. That is the sort of performance I had hoped for originally in my design.

Regarding Fabian's comments around ZIL limitations, I do realize that I can't expect super-fast performance from a four-spindle zpool of two mirrored vdevs and a second-rate, non-enterprise-grade SSD for both ZIL and L2ARC. It probably can't manage more than 300 MB/s sustained sequential write performance, but it is a heck of a lot faster than my 4 TB 7200 RPM Seagate Constellation spindles, so I would expect better throughput than what I am getting. I would expect something on the order of 180 MB/s sequential, which would translate to a vzdump backup job of a 100 GB image lasting about 10 minutes or slightly less. I am already using compression on the zpool, so vzdump might run faster with compression set to none. Hmm, I wonder if that has something to do with why it seems to hang.

Fabian, I appreciate the comments about how the PVE internals work together. I have an all-SSD zpool made up of two 4-drive raidz vdevs and it seems able to sustain a steady 1000 MB/s sequential write (as in a virtual disk migration). I expected raidz to come with a higher performance penalty and was reluctant to use it at all, but my customer insisted because you lose so much potential storage space with mirrored vdevs. The 2 TB Samsung EVO 850s we got use "3D NAND", which I guess is still potentially outperformed by SLC, but they are pretty fast, reasonably long-lived and far cheaper than SLC, making an all-SSD 12 TB zpool feasible. Anyway, I digress. My point is that I should be able to do vzdump backups really quickly with what I have. I will do more tests and report back.

Sincerely,

GB
 
you should not need to reinstall all those packages (unless they were removed together with mstflint or by the autoremove command) - a simple "dkms install --all" should rebuild all DKMS kernel modules for all installed kernels (note: you need the header packages for this to work).
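
For completeness, a quick way to check the DKMS state (generic commands, nothing OFED-specific):
Code:
dkms status                              # lists modules, versions and the kernels they were built for
apt-get install pve-headers-$(uname -r)  # headers for the running kernel must be present for a rebuild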
 
