Mild Disk IO brings VM to its knees

Cecil

Member
Sep 22, 2017
I'm running a Windows Server 2016 VM on Proxmox. So far it has no roles defined or anything, just core + GUI installed.

I have one shared folder, and I'm busy copying data from our current file shares onto it over the network using Allway Sync.
While this copy is happening the whole VM becomes extremely slow, like when you click on a floppy disk in Windows XP and the entire OS freezes up until it reads the contents. It seems like some kind of insane hardware interrupt storm or something. CPU and RAM usage are not spiking or high at all.
Clicking on the Start menu takes about 20 seconds before it pops up, and typing a search into it takes another 20 seconds before the typed text appears.
Task Manager shows disk response times averaging between 1,500 and 4,000 ms.

I'm using the VirtIO SCSI driver for the disk. Should I be using something else, or turning on write cache?
(All these options are greyed out, so is it still possible to change them?)
For the network I'm using the VirtIO network driver.
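For reference, the greyed-out options can usually be changed from the host shell once the VM is powered off. A minimal sketch, shown dry-run style; VMID 100 and the disk/storage names are assumptions, so check `qm config 100` for the real values:

```shell
#!/bin/sh
# Hypothetical sketch: switch the controller type and enable write-back cache
# on a powered-off VM from the Proxmox host CLI. VMID and disk name assumed.
VMID=100

run() { echo "$@"; }   # dry run: print each command (drop the echo to apply)

run qm set "$VMID" --scsihw virtio-scsi-pci                          # VirtIO SCSI controller
run qm set "$VMID" --scsi0 local-lvm:vm-100-disk-1,cache=writeback   # write-back cache on the disk
```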

Has anyone else noticed something similar?
 


Cecil
You're right, sorry, I should have given more detail.
This is a fresh install of PVE 5.1 on an IBM SR550 server. Specs:

CPU(s) 32 x Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz (2 Sockets)
Kernel Version Linux 4.13.4-1-pve #1 SMP PVE 4.13.4-25 (Fri, 13 Oct 2017 08:59:53 +0200)
PVE Manager Version pve-manager/5.1-36/131401db

I have two drive controllers:
1. Onboard RAID 1 with SSDs (2× 32 GB); Proxmox is installed on this.
2. LSI 930-8i RAID controller (2 GB cache) with 6× 4 TB SAS drives in a RAID 5 configuration.

I extended the LVM volume that starts on the SSDs (about 7 GB) to also include the larger RAID 5 volume, with raw format for the VM disks.
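For anyone trying to see how a volume group like this is spread across both controllers, the standard LVM reporting tools show which physical volume each logical volume actually lives on. Shown dry-run style for safety; drop the echo to run it on the host:

```shell
#!/bin/sh
# Read-only LVM inspection: the "devices" column of lvs reveals whether a
# given LV ended up on the SSD mirror or on the RAID 5 physical volume.
run() { echo "$@"; }   # dry run: print each command (drop the echo to run)

run pvs -o pv_name,vg_name,pv_size,pv_free     # one row per physical volume
run lvs -o lv_name,vg_name,lv_size,devices     # backing PV(s) for each LV
```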

I read in another thread that someone loaded the new VirtIO 1.141 drivers and that helped them, so I loaded those as well for disk/network/balloon, plus the QEMU guest agent. I powered down the VM and then also turned on the write-back cache option.

I'm trying the sync again, but it's still in the analyzing phase; I'll report back the IO waits from the PVE screen as soon as the copy starts again.
 

Cecil
Initially after the restart it looked better, but after a few minutes it became so slow that clicking on things over RDP took more than a minute before anything happened, and soon after that it completely stopped responding to any new commands.

The IO meter on the host showed a maximum spike of 10% IO delay, but it sat around the 5% mark most of the time.

In fact it's so slow now that I tried logging into the machine via the console in Proxmox, and it's been spinning on the "Welcome..." screen for about 4 minutes with still nothing happening :(

Looking at the dmesg log in PuTTY, it seems like something is hung or unresponsive. Here is a paste of the entire log (it also includes the previous session before the restart):

https://pastebin.com/skW33Ump
 

udo

Well-Known Member
Apr 22, 2009
Ahrensburg; Germany
I extended the LVM volume that starts on the SSDs (about 7 GB) to also include the larger RAID 5 volume, with raw format for the VM disks.
Hi,
that's a bad idea!
Your 2× 32 GB SSD setup doesn't sound like a high-performance (or high write-endurance) setup!
I would say it's dangerous!

If there is free space on the first PV (the SSD RAID 1), that space gets used first, with bad performance.

Use your big RAID as its own LVM storage and I guess you will get much better performance.

Udo
 

Cecil
The SSDs are IBM PCIe riser cards, enterprise grade. This server has been running all of one week; I can't imagine write endurance would be the issue here.
I'm not sure how LVM fills the drives, but the SSDs make up 7 GB out of 20 TB, and I've already put over 1 TB of data on the two combined VMs. Surely the SSDs are long out of play at this stage?

But I'm willing to try anything. Is there a way to gracefully separate the SSD section from the HDD section without having to lose everything and start over?
 

Cecil
OK, so I have more info that might help.

HD Tune read test:
The IO delay is < 1% while reading and the server is still extremely responsive with no problems; the read rate is ~200 MB/s.
In Windows, disk response time stayed < 50 ms for the entire test.
So reading = no issue.

Next I used CrystalDiskMark and ran a 1 GB read/write test (the default test).
Reading: same as before, nothing to talk about.
Writing: sequential write test, no problems.
First random write test: everything died; errors in syslog, the VM locks up and becomes unresponsive, etc.
IO delay on the node shoots to 8%.
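As an aside, that random-write pattern can be reproduced directly on the PVE host with fio, which takes Windows and the network out of the picture. A hedged sketch, shown dry-run style; the test-file path is an assumption and should point at the RAID 5-backed storage:

```shell
#!/bin/sh
# Hypothetical fio invocation approximating CrystalDiskMark's 4K random-write
# pass (1 GB file, queue depth 32, direct I/O bypassing the page cache).
run() { echo "$@"; }   # dry run: print the command (drop the echo to execute)

run fio --name=randwrite --rw=randwrite --bs=4k --size=1G \
        --iodepth=32 --ioengine=libaio --direct=1 \
        --filename=/var/lib/vz/fio-test.bin
```

Watching `iostat -x 1` in a second terminal while this runs shows whether device write latency on the host spikes the same way the guest's disk response time does.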

I'm guessing my only option is to start over and wipe everything out, create a separate LVM storage for the HDDs, and use that for my VMs :(

I was hoping it happened because I'm using the disk and network at the same time (I saw another similar post on here), but it seems to be tied to the local machine.
 

udo
Hi,
if you have a spare HDD (for performance reasons, not USB-connected) that is big enough for your VM data, you can do this without a new installation.

1. Connect the HDD, e.g. a 4 TB SATA disk.
2. Create a VG on this HDD, e.g. "raidvg", and create an LVM-thin pool on it.
3. Define this LVM-thin pool as PVE storage.
4. Move all VM disks from the GUI with storage migration.

Now the pve-data thin pool should be empty. If yes:

5. Remove pve-data.
6. Remove the SAS RAID from the pve VG with vgreduce.
7. Use pvremove to erase the LVM signature from the SAS RAID.
7b. Recreate pve-data (still on the 32 GB SSDs).

And now back to normal:
8. Extend the VG raidvg with the SAS RAID.
9. Do a pvmove to move all data from the temporary HDD to the SAS RAID.
10. Remove the temporary HDD from raidvg with vgreduce.
11. Shut down / disconnect the temporary HDD.
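The steps above can be sketched as shell commands. Every device path and name below is an assumption (/dev/sdc = the temporary disk, /dev/sdb = the RAID 5 array, "pve" = the default Proxmox VG), so double-check with pvs/vgs before doing any of this for real; the sketch is dry-run only:

```shell
#!/bin/sh
run() { echo "$@"; }   # dry run: print commands instead of executing them

# 2. VG + thin pool on the temporary disk
run pvcreate /dev/sdc
run vgcreate raidvg /dev/sdc
run lvcreate -l 95%FREE --thinpool data raidvg
# (3.-4. add the storage and migrate disks via the GUI, or "pvesm add" / "qm move_disk")

# 5.-7b. shrink the pve VG back onto the SSDs
run lvremove pve/data
run vgreduce pve /dev/sdb
run pvremove /dev/sdb
run lvcreate -l 90%FREE --thinpool data pve

# 8.-10. swing raidvg over to the RAID 5 array
run pvcreate /dev/sdb
run vgextend raidvg /dev/sdb
run pvmove /dev/sdc /dev/sdb
run vgreduce raidvg /dev/sdc
```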

After that you can expand your LVs to the right size and should (hopefully) get much better performance.

Udo
 

Cecil
Hey Udo

Thanks a bunch for the info. I don't really have any spare drives at the moment, only USB ones, so I'm just going to back up the entire VM to a NAS, then wipe everything out and start over using your instructions.

So what I want in the end is to create a new VG with the SAS drives only, like raidvg, and use that for my VM disk storage, right?

My problem before was that the local folder on the 32 GB drives was already too small to hold my ISO files etc., and I kept getting stuck trying to figure it out; in the end I just merged the large RAID into the LVM to resize the local storage to be big enough for ISOs.

I'm pretty noob at all this and struggling a lot to figure out how LVM works and how to properly separate the disks. Initially my thought was that installing Proxmox on the SSDs would ensure it has less impact on the overall system, and I could use the SAS drives for my main storage. I did not really want things merged together, but I read a bunch of guides on adding storage to Proxmox, and they suggested just extending the LVM to cover both drives.

So my question is: if I wipe out the LVM signature on the 20 TB RAID 5 disk and re-create it as a new, separate LVM volume group, how do I make a small 50 GB partition to take over the job of the "local" storage in Proxmox for uploading my ISOs?
(I think I can just create a logical volume, format it with ext3 or ext4, and then manually mount it with an entry in fstab?)
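That approach (LV + ext4 + fstab, then a "Directory" storage in the PVE GUI) does work. A hedged sketch, dry-run style; the LV name, size, and mount point are assumptions:

```shell
#!/bin/sh
# Hypothetical: carve a 50 GB ISO store out of the new raidvg and mount it.
run() { echo "$@"; }   # dry run: print commands instead of executing them

run lvcreate -L 50G -n iso raidvg
run mkfs.ext4 /dev/raidvg/iso
run mkdir -p /mnt/iso
run sh -c "echo '/dev/raidvg/iso /mnt/iso ext4 defaults 0 2' >> /etc/fstab"
run mount /mnt/iso
# Then add /mnt/iso as a "Directory" storage (content type: ISO image) in the GUI.
```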
 
Dec 22, 2014
Austin, Tejas
Writing: sequential write test, no problems.
First Random write test - everything died, errors on syslog VM locks up/becomes unresponsive etc.
IO delay on node shoots to 8%
My own experience with write performance on SSDs smaller than 128 GB is that they suck really, really badly.

A couple of cheap consumer-grade 64 GB SSDs that I own write about 512 MB and then lock up the entire system until the whole write finishes and the sync closes. They are horrible.
 

Cecil
My own experience with write performance on SSDs that have a capacity smaller than 128G is they suck really really bad.

A couple cheap consumer grade SSDs that I own which are 64G write about 512MB then lock up the entire system until the entire write finishes and the sync closes. They are horrible.
Well, even after I rebuilt everything on just the SAS drives, the result was the same. The problem in the end was my own stupidity... I had forgotten that someone else had deleted my array and recreated it when we got the server, and they did not re-enable the controller cache settings. Once I turned the cache back on, performance was 9001x better ><
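For anyone hitting the same wall: on Broadcom/LSI-based controllers the virtual-drive cache policy can be checked and changed from the OS with storcli. A hedged sketch, dry-run style; controller 0 and virtual drive 0 are assumptions, and write-back normally requires a healthy cache battery/capacitor:

```shell
#!/bin/sh
# Hypothetical storcli check of the RAID controller's write-cache policy.
# WB = write-back, WT = write-through; write-through on parity RAID 5 matches
# the random-write collapse described above.
run() { echo "$@"; }   # dry run: print commands instead of executing them

run storcli /c0/v0 show all          # look for the cache policy in the VD properties
run storcli /c0/v0 set wrcache=wb    # enable write-back caching
```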
 
