Serious performance and stability problems with Dell Equallogic storage

Jiri Pokorny

Apr 17, 2018
Hello,

some time ago we switched from VMware to Proxmox. Everything was smooth at the beginning and we were very happy. These days we have about 120 VMs connected via iSCSI to network storage consisting of two Dell PSM4110XS arrays. For partitioning we use shared LVM. The storage has both SSD and HDD drives, so high I/O rates are available, and we expect a throughput of about 150-250MB/s from each VM. The Dell EqualLogic monitoring shows us the following:

- max possible IOPS - 4667
- total IOPS - 752
- I/O load - low (so there is still some reserve)
- read/write ratio - 30/70%
- iSCSI connections - 34
- network usage - 0.5%
- total link speed - 2.35 GB/s
- TCP retransmit < 0.1%

Each blade server (node) in our chassis is connected to the storage via a 10Gbit/s interface, so there is no bottleneck there. We ran a few tests with dd (read and write) and got only around 8MB/s. During these tests the online storage monitoring shows an average queue depth in the thousands, e.g. 100,000. With VMware it showed a maximum of 10. We have tried changing network card drivers and tuning TCP and iSCSI parameters, but nothing helped.
To eliminate the real production load and get more objective results, we brought our backup storage, a PS4100 (only a 1Gbit/s connection), online. We created a new volume for performance tests only and ran the same dd command (dd if=/dev/urandom of=/root/img bs=4M count=1024 status=progress) from VMware 6.5 and then from the latest Proxmox. VMware runs at 106MB/s (the interface limit), Proxmox again only at 8MB/s. Both systems have one iSCSI connection to the backup storage and the same hardware.
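
For what it's worth, dd from /dev/urandom is CPU-bound and a buffered write mostly measures the page cache, so a direct-I/O variant is a bit more representative of the storage itself. A rough sketch, reusing the /root/img scratch file from the command above:

# write test that bypasses the page cache and uses /dev/zero instead of the CPU-bound /dev/urandom
dd if=/dev/zero of=/root/img bs=4M count=1024 oflag=direct status=progress
# read test; drop the caches first so the reads actually hit the storage
echo 3 > /proc/sys/vm/drop_caches
dd if=/root/img of=/dev/null bs=4M iflag=direct status=progress
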
We also see read-only filesystems: some VMs randomly switch to a read-only FS and we must reboot them. It is not frequent, but it happens sometimes. We think this issue is related to the iSCSI problem too.
Where could the problem be?
We like the Proxmox interface and the ecosystem around it (Debian) very much, but we are at a dead end ... we don't know where we are making a mistake.

Thank you for any help, we don't want to switch back to VMware.
 
Hi,

the problem with such hardware is that it is highly configurable, and a misconfiguration can lead to exactly these problems.

I would start with a plain Linux like Ubuntu 16.04 LTS and test what speed you get when you benchmark with fio:
4K/1M block sizes, read and write, sync and async.
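
A minimal fio sketch for those cases (assuming fio is installed and /mnt/test is a scratch file on the iSCSI-backed volume; adjust filename and size to your setup):

# async: 4K random read/write through libaio with queue depth 32, direct I/O
fio --name=rand4k --filename=/mnt/test --size=4G --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=32 --runtime=60 --time_based
# sync: 1M sequential write with an fsync after every write
fio --name=seq1m-sync --filename=/mnt/test --size=4G --direct=1 --rw=write --bs=1M --ioengine=sync --fsync=1 --runtime=60 --time_based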
 
We have found several causes.

1) Every device in the network path must use MTU 9000 (see the check sketched after this list).
2) VMs with kernel 2.6.x or 3.2.x have performance issues; kernel 4.9.0-4-amd64 runs fine.
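
A quick way to check point 1 end to end on a Debian/Proxmox node (a sketch; eth1 and the storage group IP are placeholders for your iSCSI interface and target):

# set the MTU on the iSCSI interface; add "mtu 9000" to /etc/network/interfaces to make it persistent
ip link set dev eth1 mtu 9000
# verify the whole path passes jumbo frames without fragmentation
# (8972 = 9000 minus 20 bytes IP header and 8 bytes ICMP header)
ping -M do -s 8972 <storage-group-ip>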

So my question is, which kernel version do you recommend for guests? Why is there such a big performance gap, 8MB/s vs 100MB/s? The VMs are on the same node and the same volume over iSCSI, with the same hardware settings (scsi0); only the kernel differs.

Both kernels have the same modules enabled:

Module Size Used by
binfmt_misc 20480 1
ppdev 20480 0
joydev 20480 0
bochs_drm 20480 1
evdev 24576 1
ttm 98304 1 bochs_drm
pcspkr 16384 0
serio_raw 16384 0
drm_kms_helper 155648 1 bochs_drm
sg 32768 0
drm 360448 4 bochs_drm,ttm,drm_kms_helper
parport_pc 28672 0
shpchp 36864 0
parport 49152 2 parport_pc,ppdev
button 16384 0
ip_tables 24576 0
x_tables 36864 1 ip_tables
autofs4 40960 2
ext4 585728 1
crc16 16384 1 ext4
jbd2 106496 1 ext4
crc32c_generic 16384 2
fscrypto 28672 1 ext4
ecb 16384 0
glue_helper 16384 0
lrw 16384 0
gf128mul 16384 1 lrw
ablk_helper 16384 0
cryptd 24576 1 ablk_helper
aes_x86_64 20480 0
mbcache 16384 2 ext4
hid_generic 16384 0
usbhid 53248 0
hid 122880 2 hid_generic,usbhid
sd_mod 45056 3
sr_mod 24576 0
cdrom 61440 1 sr_mod
ata_generic 16384 0
virtio_scsi 20480 2
virtio_net 28672 0
psmouse 135168 0
ata_piix 36864 0
uhci_hcd 45056 0
ehci_hcd 81920 0
floppy 69632 0
virtio_pci 24576 0
virtio_ring 24576 3 virtio_net,virtio_scsi,virtio_pci
virtio 16384 3 virtio_net,virtio_scsi,virtio_pci
i2c_piix4 24576 0
usbcore 249856 3 usbhid,ehci_hcd,uhci_hcd
usb_common 16384 1 usbcore
libata 249856 2 ata_piix,ata_generic
scsi_mod 225280 5 sd_mod,virtio_scsi,libata,sr_mod,sg
 
So my question is, which kernel version do you recommend for guests?
IMHO newer kernels are generally better, but depending on the workload they can be slower or faster.

Why is there such a big performance gap, 8MB/s vs 100MB/s?
I don't know, but maybe the newer kernel uses different settings, e.g. write barriers, or ignores sync.
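
One way to compare that between the two guest kernels (a sketch, run inside the guests; /root/testfile is just a scratch path):

# effective mount options of the root filesystem (look for barrier/nobarrier)
grep ' / ' /proc/mounts
# buffered write vs. a write that is flushed to storage at the end
dd if=/dev/zero of=/root/testfile bs=4M count=256 status=progress
dd if=/dev/zero of=/root/testfile bs=4M count=256 conv=fdatasync status=progress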
 
Hi,

we will try to compare these parameters (write barriers, sync behaviour). Thank you.

After the kernel change it seems almost perfect; throughput is 10x better. But the queue issue remains. As you can see in the attached pictures, before the kernel change the average queue depth went up to 400,000, after the change "only" 20,000. Even if I switch off all VMs, it cycles between 20,000 and 0 average queue depth. With VMware it was up to 10 only. So there must be something in the node itself. Does anybody have the same experience with Dell EqualLogic and Proxmox? Or with another storage connected via iSCSI?
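
In case it helps to narrow this down on the node side, the negotiated iSCSI parameters and the per-device SCSI queue depth can be inspected like this (a sketch; sdb is an example device):

# negotiated session parameters and attached SCSI devices
iscsiadm -m session -P 3
# queue depth the node allows per SCSI device
cat /sys/block/sdb/device/queue_depth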

Thank you
 

Attachments

  • queue.PNG
  • queue2.PNG
  • ononff.PNG
OK, we are not happy with this decision, but we have to migrate back to VMware. It simply works far better than Proxmox on the Dell EqualLogic storage, and nobody could help us.
We're afraid of touching any VM on our nodes; you need really good luck to migrate or restore anything ... and if you have no luck, you lose partitions on some disk, which is terrible ... Proxmox + EqualLogic + LVM simply don't play well together. So be careful what you choose for virtualization.
 
EqualLogic iSCSI multipathing works quite well on VMware, that is true. The throughput is around 150MB/s in all VMs, with Proxmox only 10MB/s .. it is very strange. Also, when we tried to restore a disk from a backup, the node management froze; once the restore finished, it worked again. Everything is stable with local storage, that is the strong part of Proxmox. So thank you.
 
Hi, did you solve the average queue depth issue with Proxmox and the EQL? We're seeing the same problem here! The average queue depth is above 5,000, but performance in the guests is good, around 500MB/s read/write. Firmware 10.0.3 on the EQL.
 
Just to update... I got a great improvement in queue depth by disabling LRO/GRO on the iSCSI interfaces of the Proxmox node. The main problem is that Debian has no way to disable delayed ACK on TCP transmission, so disabling these hardware acceleration options on the interfaces had a big impact on queue depth; in my case it dropped from 30,000 in-flight queued I/Os to only 1,000.
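
For anyone wanting to reproduce this, the offload change looks roughly like this (eth2 stands for your iSCSI interface; persist it e.g. with a post-up line in /etc/network/interfaces):

# check the current offload settings
ethtool -k eth2 | grep -E 'generic-receive-offload|large-receive-offload'
# disable GRO and LRO
ethtool -K eth2 gro off lro off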

I also tuned these options in iscsid.conf to help with the problem (applying them is sketched after the list):

node.session.cmds_max = 1024 (default 128)
node.session.queue_depth = 128 (default 32)
node.session.iscsi.FastAbort = No (default Yes)
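
On Debian these go into /etc/iscsi/iscsid.conf. Note that node records that were already discovered keep their old values, so they need to be updated (or rediscovered) and the sessions logged out and back in before the change takes effect; a sketch:

# push the new values into the existing node records (applies to all recorded nodes)
iscsiadm -m node -o update -n node.session.cmds_max -v 1024
iscsiadm -m node -o update -n node.session.queue_depth -v 128
iscsiadm -m node -o update -n node.session.iscsi.FastAbort -v No
# then log the sessions out and back in (careful: this drops all iSCSI sessions on the node)
iscsiadm -m node -u && iscsiadm -m node -l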

After all these iSCSI tunings, the problem is almost resolved...

Hope this information helps someone!
 
