Multipath link to SAN for concurrent VMs with high and constant write rates

zamo2k
Jul 8, 2014
Hello everybody,

I'm new to Proxmox and to Linux/iSCSI/multipath in general, and I'm trying to learn, so I'm asking for your suggestions on how to tune my system, or maybe change approach if you have better ideas. It will be a long post... :)

I'm trying to set up Proxmox on a server machine in order to run Windows 2003 Server guests. Each guest runs a video surveillance application that takes live H.264/MPEG-4 unicast streams and redirects them to clients as multicast, and at the same time takes MJPEG streams and writes them to disk for recording. The "disk" is a SAN.

The server is a Dell PowerEdge R720 with 2 x six-core CPUs, 32 GB of RAM, and one quad-port Gigabit Ethernet NIC (4 x 1 Gbit ports).
The SAN is an Enhance ES3160P4 with 2 controllers x 4 x 1 Gbit ports, accessed through iSCSI.

On the SAN, a 4 x 1 Gbit LACP link has been set up for each controller; both are attached to a Juniper Virtual Chassis with LACP enabled.

On the server I set up 2 separate networks: the "stream network" uses two physical interfaces configured as an LACP bond (mirrored on the switches), while the "SAN network" uses the remaining 2 interfaces. Each of the latter is configured separately with its own IP address on the "SAN network".

For the "stream network" I set up a virtual bridge on top of the bond, which is then used to connect the W2k3 VMs to the "stream network" itself.

On the "SAN network" I set up iSCSI and multipath and got that every LUN of the SAN is seen through 4 paths (2 interfaces x 2 controllers) with this default multipath configuration:

defaults {
        # check path state every 2 seconds
        polling_interval 2
        # send I/O to the path with the fewest outstanding requests
        path_selector "queue-length 0"
        # keep all paths in a single active/active priority group
        path_grouping_policy multibus
        getuid_callout "/lib/udev/scsi_id -g -u -d /dev/%n"
        rr_min_io 1000
        # return to the preferred path group as soon as it is usable again
        failback immediate
        # fail I/O immediately instead of queueing when all paths are down
        no_path_retry fail
}



I modified the timeouts in the iSCSI target configuration files as suggested in the Proxmox documentation.
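In practice the main change is lowering the iSCSI session recovery timeout in /etc/iscsi/iscsid.conf so that multipath can fail a path quickly instead of open-iscsi queueing I/O for the default 120 seconds (the values shown here are what I believe I ended up with, matching the docs at the time; adjust to your setup):

# fail a session after 15 s so multipath can switch to another path
node.session.timeo.replacement_timeout = 15
# log in to discovered targets automatically at boot
node.startup = automatic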

Using multipath aliases I get the devices in /dev/mapper/, so with pvcreate, vgcreate and lvcreate I set up one PV/VG/LV per LUN, created an EXT3 filesystem on each, and mounted each one in a folder under /mnt/.
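For one LUN the sequence was roughly the following (the VG/LV/mount names here are invented for illustration; vol-storage-231 is one of my multipath aliases):

pvcreate /dev/mapper/vol-storage-231
vgcreate vg_storage_231 /dev/mapper/vol-storage-231
lvcreate -n lv_storage_231 -l 100%FREE vg_storage_231
mkfs.ext3 /dev/vg_storage_231/lv_storage_231
mkdir -p /mnt/storage-231
mount /dev/vg_storage_231/lv_storage_231 /mnt/storage-231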

Then, in each of those mount folders (which the Proxmox web interface sees as "local" directory storage) I created a virtual disk, which is attached to a single W2k3 machine as a VIRTIO disk and used only for the recording part of the software.

That is, each VM sees two virtual disks, one for the system and one as "recorder", and the recorder disk lives on a mount that corresponds to a LUN of the SAN. The VMs also see one VIRTIO network card, connected to the "stream network" through the aforementioned virtual bridge.

If I run only one VM (which handles 10 digital video cameras), everything is fine. A single VM needs to write approximately 8 MB/s to the SAN.
If I start more VMs, the I/O delay of the server increases (reaching 5, 7, even 10% with 8 VMs) and the recording part of the software run by the VMs becomes so slow as to be unusable, while the live part keeps running fine.

Looking at the statistics of the physical links on the switch, I see that both links on the "SAN network" carry traffic and that it is balanced between the two, as expected. However, the total traffic leaving the server NIC never exceeds about 300 Mbit/s (about 150 Mbit/s per link), whereas 8 VMs writing 8 MB/s each should need roughly 64 MB/s, i.e. around 512 Mbit/s. Starting the 8 VMs one after the other, the total traffic increases and then saturates at this threshold once 4-5 VMs are running. It looks to me like there is some sort of bottleneck in the link between the server and the SAN, perhaps due to a sub-optimal configuration... which is why I'm here asking for your suggestions.

I also had a look at the server's syslog, and I found many repeated multipath-related errors that I can't interpret, such as (for example):


Jul 8 16:31:56 vmcluster-tvcc01 multipathd: vol-storage-236: sdq - directio checker is waiting on aio
Jul 8 16:31:56 vmcluster-tvcc01 multipathd: vol-storage-231: sdu - directio checker is waiting on aio
Jul 8 16:31:56 vmcluster-tvcc01 multipathd: vol-vm-base: sdy - directio checker is waiting on aio
Jul 8 16:31:58 vmcluster-tvcc01 multipathd: vol-storage-237: sde - directio checker is waiting on aio


.............

Jul 8 16:35:37 vmcluster-tvcc01 kernel: session20: session recovery timed out after 15 secs
Jul 8 16:35:37 vmcluster-tvcc01 multipathd: vol-storage-235: sdk - directio checker is waiting on aio
Jul 8 16:35:37 vmcluster-tvcc01 multipathd: checker failed path 8:160 in map vol-storage-235
Jul 8 16:35:37 vmcluster-tvcc01 multipathd: vol-storage-235: remaining active paths: 3
Jul 8 16:35:37 vmcluster-tvcc01 kernel: device-mapper: multipath: Failing path 8:160.


.............

Jul 8 16:35:39 vmcluster-tvcc01 kernel: sd 28:0:0:232: rejecting I/O to offline device
Jul 8 16:35:39 vmcluster-tvcc01 kernel: sd 28:0:0:232: [sdw] killing request
Jul 8 16:35:39 vmcluster-tvcc01 kernel: sd 28:0:0:232: rejecting I/O to offline device
Jul 8 16:35:39 vmcluster-tvcc01 kernel: sd 28:0:0:232: [sdw] killing request
Jul 8 16:35:39 vmcluster-tvcc01 kernel: sd 28:0:0:232: rejecting I/O to offline device

.............

Jul 8 16:35:39 vmcluster-tvcc01 kernel: end_request: I/O error, dev sdm, sector 4470628224
Jul 8 16:35:39 vmcluster-tvcc01 kernel: end_request: I/O error, dev sdm, sector 4471730304
Jul 8 16:35:39 vmcluster-tvcc01 kernel: end_request: I/O error, dev sdm, sector 4470500480
Jul 8 16:35:39 vmcluster-tvcc01 kernel: end_request: I/O error, dev sdm, sector 4472330752



I hope I've been clear enough...
Any help is welcome!
Many thanks in advance!
 
I think those errors might be related to LVM trying to access the "dead" path (I see many similar lines when accessing LVM metadata on my SAN, which uses SAS instead of iSCSI). If so, they're unrelated to the problem you're seeing.

Try benchmarking the iSCSI storage from within the Proxmox host, not from a VM (maybe by using bonnie++), while monitoring the bandwidth usage: if it saturates the link (around 100 MB/s) then the problem lies in the Windows virtio driver that can't keep up, or in the mapping through the filesystem layer (ext3 is not the best for huge files... have you tried LVM-over-iSCSI, without having to mount anything on the host?).
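Something like this, run directly on the host (the path and sizes are placeholders; make the test file much larger than the host RAM so you're not just measuring the page cache):

# sequential throughput test on a mounted LUN, bypassing the VMs
bonnie++ -d /mnt/storage-231 -s 64g -u root -f

# or a quick-and-dirty direct-I/O write test
dd if=/dev/zero of=/mnt/storage-231/testfile bs=1M count=16384 oflag=direct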

Another thing to check is the brand of the network controller (Intel is usually good). Or, if at all possible, use a dedicated iSCSI HBA that offloads the CPU.

Just a bit of brainstorming, but hope it helps...

PS: you know that your current solution won't scale if you need to add another server, right?
 
Many thanks for your reply NdK73.
I will try your suggestions. In the meantime I also contacted Enhance to get the iSCSI target parameters that work best with their hardware... let's see whether they answer.


PS: you know that your current solution won't scale if you need to add another server, right?

What do you mean? Sorry for my probably trivial questions...
 
Many thanks for your reply NdK73.
I will try your suggestions. In the meantime I also contacted Enhance to get the iSCSI target parameters that work best with their hardware... let's see whether they answer.
Another thing to try: use Proxmox's "balance-alb" or "balance-tlb" bonding modes instead of LACP. IIUC, LACP gives you only single-link bandwidth for each connection, so keep an eye on CPU and link utilization on your switch. If you really need LACP for other reasons, try using a single trunk of 4 interfaces with two VLANs instead of two trunks of two interfaces: that should allow better use of the bandwidth.
Moreover, LACP won't allow you to spread the bond over two switches (for HA), so if your switch dies, you're toast.
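As a rough sketch (interface names and address are placeholders), the bond stanza in /etc/network/interfaces would become something like:

auto bond1
iface bond1 inet static
        address 192.168.20.10
        netmask 255.255.255.0
        slaves eth2 eth3
        bond_mode balance-alb
        bond_miimon 100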

What do you mean? Sorry for my probably trivial questions...
Remember to never allow two VMs (or Proxmox hosts) to access the same LUN unless you're using a cluster filesystem. IIUC you are currently mounting the LUNs on the Proxmox host (to use them as local storage). So if you add another node to the cluster, you won't be able to mount the same LUNs on it -> no shared storage -> no live migration of VMs between nodes!
And you're adding a (useless?) filesystem layer to a latency-critical path.
Try using LVs directly as disks for the VMs (see the wiki and experiment on a separate LUN) and maybe you'll be able to scale better. Remember that an LV can still be active on only one VM at a time unless the VMs themselves use a clustered filesystem (too slow for your use, I fear), but at least you'll be able to live-migrate that VM to another host.
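As a sketch of what that looks like (names invented, syntax from memory, so double-check it against the wiki and your Proxmox version): the VG gets registered as LVM storage in /etc/pve/storage.cfg, and VM disks are then created directly as LVs in it (named vm-<vmid>-disk-<n>), with no filesystem or mount on the host:

lvm: san-storage
        vgname vg_storage_231
        content images
        shared 1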

Hope it helps!
 
Good to know, but I fear it's really non-standard.
IIUC (from many online references), IEEE 802.3ad mandates that the aggregated ports be on the same physical switch. Maybe 802.1AX (formerly 802.3ax) relaxes that, or it's a vendor-specific extension, or I didn't understand correctly...
 
The idea behind stackable switches is that they are perceived as one physical switch. The concept is standard and implemented by all major switch manufacturers. The only catch is that you can only stack switches from the same manufacturer.
 
Then we're using the same term for different things.
My experience is limited mostly to HP switches, where stacking only helps with management: you access the (single!) controller and from there you can reach all the members, but the members remain independent.
So at least one big vendor is not implementing it that way... Given that, I'd prefer not to count on it, and would rather use balance-alb or even simple round-robin...
 
That is probably because you have not used the more advanced ProCurve switches?
All our switches are HP ProCurve, but mainly the 2510 and 2610 series :(

In HP terms it is called 'distributed trunking'. Read -> http://cdn.procurve.com/training/Ma...0-8200-MCG-June2009-59923059-12-PortTrunk.pdf

So yes, HP supports LACP across multiple switches, provided the switch has the right feature set. I think the cheapest HP switch to support distributed trunking is the ProCurve 2920.
Thanks for the info.
Just seen some prices... I don't need that feature so badly :)
Good to know they can do it, but I prefer to look at more affordable (and cross-platform) solutions :)
 
For a 'cheap' solution I would use two LACP bonds each connected to a different switch and then use multipath (active/active for performance or active/passive for safety)
 
I'm back!
Many thanks to everybody for the suggestions.
Regarding the discussion about LACP: I was using it only for the "stream network", on 2 aggregated ports, each linked to a different Juniper switch. Both switches are part of a so-called "Virtual Chassis", i.e. they appear as if they were a single switch (VC is a Juniper proprietary technology). However, I believe the issues about LACP vs. number of switches are not relevant to my problems here.

Following the suggestions and some more googling, I made the following changes:
- instead of creating an LV with an EXT3 filesystem in each VG, I added the VG directly to the storage view of the Proxmox web GUI, as an LVM storage based on "Existing volume groups"
- I added the following line to lvm.conf:
filter = [ "r/disk/", "a/.*/" ]

- I changed the values in multipath.conf to match what is suggested in the Proxmox FAQs (I realized there were some differences left over from earlier tests)

I could not tune the iSCSI parameters to the vendor's recommendations, simply because Enhance hasn't given me any yet... they answered that they are working on it and will let me know...... :confused:

After these changes things improved: no more warnings or errors in dmesg regarding disks or multipath, and the I/O delay is lower. I tried launching 7 VMs together (handling 70 video cameras in total) and each one still gave a good "user experience", i.e. it was possible to use Windows with no problems. Live video is streamed fine to the clients, and the machines correctly store the streams on the SAN (through their virtual disks).

However, there are still problems when a VM has to read the recordings back. Reading is very slow, so slow that it becomes unusable for actual clients that want to retrieve recordings.

I also tried to change the way the server connects to the SAN: instead of two separate links (each with its own IP address, and thus 4 paths to the SAN, which has 2 controllers), I aggregated the two links with balance-tlb (so now I have 2 paths to the SAN), but it doesn't seem to change the final result much. Outgoing traffic is correctly balanced across the two interfaces, although incoming traffic is not. But incoming traffic is only needed when a client requests a recording, and that's not much traffic.

Note that I'm sure the downloads can be faster, because I have other physical servers doing the same job. On each of them the video surveillance software runs in a real (not virtual) Windows environment and handles 30 cameras. When I download videos from one of those servers I get them about 3-4 times faster than from a VM handling 10 cameras on a host with 3 VMs running (so, again, 30 cameras per server).

Any more suggestions on how to improve reads? Or maybe some test I can run to find where the bottleneck is?

Anyway, many thanks again for the suggestions, things have already improved a lot!
Ciao!
 
You could try another filesystem; I would suggest ext4 or xfs. Relevant test tools for you would be iozone, iometer and other tools like that which measure filesystem speed.
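For example, something along these lines (the path, file size and record size are placeholders to adapt to your setup):

# sequential write (-i 0) and read (-i 1) with 1 MB records on a 16 GB test file
iozone -i 0 -i 1 -r 1m -s 16g -f /mnt/test/iozone.tmp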
 
I'm not sure I understood your suggestion correctly.
At the moment there's no filesystem involved anymore. I was using EXT3 before, mounting each LUN in a folder, but now I've removed that filesystem layer by adding the VGs directly to Proxmox's storage view as LVM storage. This seemed to improve performance.

Do you mean I should again manually create LVs on the VGs, format them as EXT4 or XFS, mount them in folders, and then add the folders as storage to Proxmox? Basically the same as I was doing before, but with other filesystems?

Many thanks again!
 
