Very high load at end of restore proxmox 4.3

jmpfas

Member
Oct 29, 2015
I am in the process of upgrading a cluster from Proxmox 3 to 4.3.
I am doing the upgrade by migrating all VMs off a node, removing that node from the cluster, and doing a clean install of 4.3, partly just so it is clean and partly because I am changing how I set up the volumes for VM disk images. I am moving to lvm-thin.

When I do a restore on ANY of the volumes, either the SSD LVM-thin volume or the HDD local LVM volume, I see load on the server and on every VM spike very, very high. This does NOT occur while the actual restore from the backup is running, but instead after the restore hits 100%. From that point to when it finally finishes is about ten minutes, and during that time the load spikes to 25 or higher. VMs go unresponsive etc.

It appears to me that it occurs when the restore starts the process of actually adding the VM or container to the cluster. It occurs even if all the existing VMs are on one SSD volume and I am restoring to a different SSD volume, so it is not contention there.

I was assuming that the load starts on the base server, but it seems to get worse the more running VMs there are on the node. For example, when I restored the second VM the load spiked to 7; when I did the third it spiked to 10. Now, with 10 VMs, it is spiking to 25 (on the base server) and to about 10 or 15 on each VM.
So, now I am wondering if something in the cluster is driving load way up in each VM.

After the restore finishes and everything settles down it is all perfect. No performance issues. No errors in logs.

Anyone seen this?
 
Anyone seen this?
Yes, but not with such a big impact on the whole system. It must have something to do with LVM thin, because with qcow2 this does not occur.
We're seeing extremely high write IOPS when restoring a VM to LVM thin (up to 50k/sec on the SSD, see the attached image; this was only the import of a small 50G VM from VMA to LVM thin!), which does not seem normal.

The chunk size PVE uses on LVM thin is 1 MB (the LVM default is 64k). Maybe it has something to do with that, because it has to allocate lots of these large chunks?

Code:
# lvs -o+chunksize pve/data
  LV   VG  Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Chunk
  data pve twi-aotz-- 1.62t             30.46  15.30                            1.00m
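
If someone wants to test the chunk size theory, a quick comparison could be done with something like this (the volume group and pool names are placeholders, and 100G is an arbitrary test size):

Code:
# show the chunk size of the existing thin pool
lvs -o+chunksize pve/data

# create a small test thin pool with the 64k LVM default chunk size
lvcreate -L 100G --chunksize 64k --thinpool testpool64 pve
lvs -o+chunksize pve/testpool64

Restoring the same backup into a storage backed by that test pool should show whether the 1 MB chunks are really the cause.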
 

Attachments

  • 4TkOnzNE3EY9VABf.png (48.4 KB)
Must have something to do with LVM thin, because with qcow2 this does not occur.
Yes, I can agree; I am running into the same issue. But I would say it is not really an issue as such. Thin LVM is a very young extension, and a lot depends on how old your server is and how much power it has. With new, current machines there are no problems... but better follow my story: https://forum.proxmox.com/threads/r...pgarded-server-pve4-lvm-thin-very-slow.30341/

I am currently running a benchmark on the storages; I will post the results there.
What does
Code:
pveperf
say on your system?
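
By default pveperf benchmarks the root filesystem; it can also be pointed at a path on the storage in question (the path below is only an example):

Code:
# benchmark the root filesystem (default)
pveperf
# benchmark a specific mount point instead
pveperf /var/lib/vz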
 
With new actual machines there are no problems.
We're using an HP DL160 Gen9 with dual Xeon E5 CPUs and 2 x Samsung SM863 SSDs, so IMHO this does happen on current machines, too :/
The problem occurred with the very first import on that new, empty machine (20+ load average and 50k IOPS while importing).

Code:
CPU BOGOMIPS:      134284.32
REGEX/SECOND:      2076866
HD SIZE:           49.09 GB (/dev/dm-0)
BUFFERED READS:    1011.51 MB/sec
AVERAGE SEEK TIME: 0.14 ms
FSYNCS/SECOND:     7269.14

Our problem is not slowness, but heavy load when importing ;) It is not a very big problem right now because imports are fast, but as more and more VMs end up on this host it could become a serious one.
 
This is really, really strange... so what does it actually depend on? o_O
BTW: I posted the benchmark in the other thread.
 
I do not think it has anything to do with chunk size or anything like that. It is something more... profound.
Two things to consider:
1) As I said, the high load occurs AFTER the restore hits 100%, at which point it has already finished writing the disk image to the volume. It only takes about 16 seconds to get to that point. Then it sits there for up to ten minutes with the load growing and VMs dying.
2) Other VMs actually have their disks time out and start dumping errors to their consoles.
It is almost like the last stage of the restore somehow crashes the volume group in some way.
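
It might also be worth watching what the storage layer is doing during that last stage. A rough sketch using standard tools, run in a second shell while the restore sits at 100%:

Code:
# per-device utilisation and throughput, refreshed every 2 seconds
iostat -xm 2

# allocation/metadata state of the thin pools while the load climbs
watch -n 2 'dmsetup status | grep thin-pool'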
 
And... it is getting weirder.
I just did 6 restores on one server that had no other existing VMs. Each restore took longer, and load and IO delay went higher. This was without even starting the VMs: just restore after restore, creating multiple new VMs.
Then I tried doing a restore to an NFS volume. No issue; the total time from when it said 100% to finishing was about 8 seconds.
Then I did a disk move for that VM from the NFS storage to local-LVM (not a thin volume). That spiked the load.
Then I did a disk move from that volume to the lvm-thin volume, and that spiked the load as well.
Each time I saw the load start to climb after the physical image had been written to the volume.
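
For anyone who wants to reproduce that sequence, the commands would look roughly like this (VMID, disk name, backup file and storage names are placeholders, not my real ones):

Code:
# restore to NFS first - no load spike, done a few seconds after hitting 100%
qmrestore /mnt/pve/nfs-backup/vzdump-qemu-100.vma.lzo 105 --storage nfs-images

# move the disk to the plain LVM storage - load spikes
qm move_disk 105 virtio0 lvm-plain

# move it again, onto the lvm-thin storage - load spikes again
qm move_disk 105 virtio0 lvm-thin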
 
