KVM guests freeze (hung tasks) during backup/restore/migrate

gkovacs

Active Member
Dec 22, 2008
503
45
28
Budapest, Hungary
This issue has been with us since we upgraded our cluster to Proxmox 4.x, and converted our guests from OpenVZ to KVM. We have single and dual socket Westmere, Sandy Bridge and Ivy Bridge nodes, using ZFS RAID10 HDD or ZFS RAIDZ SSD arrays, and every one of them is affected.

Description
When there is high IO load on the ZFS pools during vzdump, restore or migrate, the guests' IO slows down extremely or even freezes for a few seconds, resulting in:
- lost network connectivity (Windows guests often lose Remote Desktop connections)
- huge latency in network services
- CPU soft lockups
- CPU / rcu_scheduler stalls
- "task blocked" (hung task) messages in syslog, sometimes with a stack dump
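
A minimal sketch for checking a Linux guest for these symptoms after a backup window. It assumes the usual kernel log wording ("blocked for more than", "soft lockup", "rcu_sched ... stall"); the exact text varies between kernel versions, and the function name count_stalls is hypothetical.

```shell
#!/bin/sh
# count_stalls: read kernel log lines on stdin and print how many of them
# look like hung-task, soft-lockup or RCU-stall reports.
# The patterns are assumptions based on common kernel message wording.
count_stalls() {
    # grep -c prints the count; "|| true" keeps the exit status at 0
    # when nothing matches (grep itself would return 1 in that case).
    grep -c -E 'blocked for more than [0-9]+ seconds|soft lockup|rcu_sched.*stall' || true
}

# Usage inside a guest (reading the kernel ring buffer may need root):
#   dmesg | count_stalls
```

A non-zero count after the nightly backup is a hint to look at the timestamps of the matching lines and compare them with the backup log on the host.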

We run monitoring services that poll the websites and other network services served by these guests every minute. That is how we realized we had a problem: we started getting alerts during nightly backups.

This is today's soft lockup in a Debian 7 guest during the restore of another KVM to zfs-local (6x HDD ZFS RAID10) on a single socket Ivy Bridge system. There was no load on the system apart from the restore, other guests are mostly inactive:

A Windows KVM guest was also unreachable during that time.

Mitigation steps
We have tried many tweaks to eliminate the problem:
- disabling C-states on Westmere systems
- enabling performance governor
- recommended swap settings from the ZFS wiki, also vm.swappiness=1
- increasing vm.min_free_kbytes on both hosts and guests
- decreasing vm.dirty_ratio to 5-15, vm.dirty_background_ratio to 1-5
- installing NVME SSDs as SLOG/L2ARC, also for swap

Some of these tweaks helped a little, but the issue is still happening, maybe less during backups but still heavily during restores and migrations. The issue seems to be connected to the Linux kernel's VM (virtual memory) subsystem, because if you set vm.vfs_cache_pressure to a high value (like 1000) in a KVM guest, the lockups happen much more often. Also Debian looks more sensitive to it than Ubuntu for example, but Windows guests are also affected.
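
For anyone comparing notes, a small sketch that prints the current values of the tunables mentioned above, so settings on different hosts and guests can be compared side by side. The function name show_vm_knobs and the VM_SYSCTL_DIR override are hypothetical; the values come from /proc/sys/vm, which is the same data "sysctl vm.<name>" reports.

```shell
#!/bin/sh
# show_vm_knobs: print the current value of each VM tunable discussed above.
# VM_SYSCTL_DIR is a hypothetical override (handy for testing); by default
# the values are read from /proc/sys/vm.
show_vm_knobs() {
    dir="${VM_SYSCTL_DIR:-/proc/sys/vm}"
    for knob in swappiness min_free_kbytes dirty_ratio \
                dirty_background_ratio vfs_cache_pressure; do
        if [ -r "$dir/$knob" ]; then
            printf 'vm.%s = %s\n' "$knob" "$(cat "$dir/$knob")"
        else
            printf 'vm.%s = (not readable)\n' "$knob"
        fi
    done
}

# Usage on a host or inside a Linux guest:
#   show_vm_knobs
```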

Help needed
I am looking for input from others who also run KVM guests on zfs-local (zvols), and would like to know whether they experience these symptoms too (you won't notice a 1-2 minute freeze unless you run some kind of monitoring). I would also welcome advice on how to diagnose this further, to decide whether the issue is in QEMU/KVM, ZFS or some other part of the kernel.
 

Nemesiz

Active Member
Jan 16, 2009
678
42
28
Lithuania
Under high load, ZFS can hold read/write requests in a queue, or even stop accepting new requests until it has finished the earlier ones. You can feel it even with ZFS commands such as "zfs list": under high load they respond only after a few seconds.

I suggest taking a snapshot and, if you need a copy, copying from it with rsync --bwlimit.

To see the HDD/SSD load I use atop on the host machine. It shows how heavily the disks are loaded.
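
A rough sketch of the snapshot-plus-throttled-copy idea. It only prints the commands for review instead of running them. The dataset, mountpoint, snapshot name, destination and rate are all hypothetical, and the approach assumes a filesystem dataset: zvol-backed VM disks do not expose their contents as files under .zfs/snapshot.

```shell
#!/bin/sh
# plan_snapshot_copy: print (not run) the commands for a throttled copy
# from a ZFS snapshot. All arguments are hypothetical examples.
#   $1 dataset (e.g. tank/data)   $2 its mountpoint (e.g. /tank/data)
#   $3 snapshot name              $4 destination directory
#   $5 rsync bandwidth limit (KiB/s when no unit suffix is given)
plan_snapshot_copy() {
    printf 'zfs snapshot %s@%s\n' "$1" "$3"
    # Snapshots of filesystem datasets are browsable under .zfs/snapshot.
    printf 'rsync -a --bwlimit=%s %s/.zfs/snapshot/%s/ %s/\n' "$5" "$2" "$3" "$4"
    printf 'zfs destroy %s@%s\n' "$1" "$3"
}

# Usage: review the plan, then run the printed lines by hand as root.
#   plan_snapshot_copy tank/data /tank/data nightly /mnt/backup 20000
```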
 

fireon

Well-Known Member
Oct 25, 2010
3,052
190
63
Austria/Graz
iteas.at
Can you also post one VM config, please?
This looks strange. I also have 6 HDDs in RAID 10, and a RAIDZ with SSDs. When the backup runs, you do not notice anything at all. I have some LXC containers and some KVM guests here, including Windows Server 2016 and Ubuntu.

Can you also post the SMART data from all your disks (smartctl -a /dev/disk/by-id/....)? Only these sections:
Code:
=== START OF INFORMATION SECTION === 

=== START OF READ SMART DATA SECTION ===
And the output of "zpool get all".

Thanks
 

gkovacs

fireon said:
Can you also post one VM config, please?
This looks strange. I also have 6 HDDs in RAID 10, and a RAIDZ with SSDs. When the backup runs, you do not notice anything at all. I have some LXC containers and some KVM guests here, including Windows Server 2016 and Ubuntu.
There is not much point in posting vm configs or S.M.A.R.T. reports, because this issue affects all of our servers and many different VMs.

But you can reproduce it easily: according to our tests, when VMs have their RAW disks on ZFS zvols and you start restoring another (big) VM to the same host, the already running VMs get starved of IO: their apps hang, their CPUs lock up and their kernels freeze. To reproduce this, it's very important to use ZVOLs with cache=none (the Proxmox recommended configuration), as guests having QCOW2 disks with cache=writeback are much less sensitive to the IO starvation, probably because their virtual memory is cached on the host (and not by ZFS).

It currently looks like it's a ZFS scheduling / caching bug, and the ZFS on Linux issue tracker has a few similar problems already:
https://github.com/zfsonlinux/zfs/issues/1538
https://github.com/zfsonlinux/zfs/issues/5857
https://github.com/zfsonlinux/zfs/issues/5867

@fabian do you have the chance to try to reproduce this?
 

fabian

Proxmox Staff Member
Staff member
Jan 7, 2016
3,399
529
113
just tried, cannot reproduce. no message on the host side, no messages on the vm side.
 

guletz

Active Member
Apr 19, 2017
998
140
43
Brasov, Romania
Hello gkovacs,

1. What kind of HDDs do you have (4k sector size?)
2. Do you use the same kind of RAID controller?
3. Do you use dedup and/or compression?
4. Do you have the 8k zvol block size (the default in Proxmox)?
5. Do you have ECC RAM?

I cannot tell what the problem is (I read the linked bug reports), but here is what I would try if I had this problem:

1. Use at least 32k for the zvol block size (the IOPS load is much higher with 8k than with 32k).
2. Put all the disks from an affected server into another, non-Intel server (it is good to have servers with different CPUs in case of ...) and see whether you still have problems.
3. Try a vanilla Debian kernel with the latest ZFS modules available (we got a new ZFS upgrade this week).

I would guess that your problems are related only to Debian/the kernel, because I do not see people complaining about this on other Linux distributions, but maybe I am wrong, as usual ;)
 

gkovacs

guletz said:
Hello gkovacs,

1. What kind of HDDs do you have (4k sector size?)
2. Do you use the same kind of RAID controller?
3. Do you use dedup and/or compression?
4. Do you have the 8k zvol block size (the default in Proxmox)?
5. Do you have ECC RAM?
All of our servers are affected by this.
- some nodes use SSD drives connected to Intel ICH controller (ZFS RAIDZ)
- some nodes use Toshiba DT01ACA drives connected to Adaptec 6805E controllers (disks are Morphed JBOD, ZFS RAID10)
- we use default compression, no dedup
- we use default ZVOL block size (8k)
- some servers have ECC RAM, some don't

A bit more info:
- every node boots from ZFS
- swap is on ZFS in most nodes (rpool/swap), but in some nodes it's on NVME partition
- we have SLOG+L2ARC on NVME in some RAID10 nodes

guletz said:
I cannot tell what the problem is (I read the linked bug reports), but here is what I would try if I had this problem:

1. Use at least 32k for the zvol block size (the IOPS load is much higher with 8k than with 32k).
2. Put all the disks from an affected server into another, non-Intel server (it is good to have servers with different CPUs in case of ...) and see whether you still have problems.
3. Try a vanilla Debian kernel with the latest ZFS modules available (we got a new ZFS upgrade this week).

I would guess that your problems are related only to Debian/the kernel, because I do not see people complaining about this on other Linux distributions, but maybe I am wrong, as usual ;)
Thanks for your ideas. Unfortunately, as I wrote above, this issue does not simply affect a single node or a set of disks, it is happening on all our nodes.

Is there a way to migrate VM disks to different blocksize ZVOLs? IIRC if you change the blocksize of a ZVOL, it will not change the data that's already written...
 

fireon

gkovacs said:
- some nodes use Toshiba DT01ACA drives connected to Adaptec 6805E controllers (disks are Morphed JBOD, ZFS RAID10)
Not real SAS SATA, JBOD is not optimal...
guletz said:
1. what kind of hdd do you have (4k sector size?)
@gkovacs
Please post the SMART info and the output of "zpool get all".

Thanks.
 

gkovacs

fireon said:
Not real SAS SATA, JBOD is not optimal...
Morphed JBOD is the only way to give a full disk to ZFS while still being able to boot from the Adaptec card. Any more insight on why this is not optimal?

fireon said:
Please post the Smartinfos. Thanks.
We have a whole cluster of servers, each full of hard drives and some with solid state drives. As I wrote above, all of the servers exhibit the same problem, which is very likely a ZFS scheduler issue.

This is an HDD (one of 6 in ZFS RAID10), connected through the Adaptec card in a single socket Ivy Bridge node:
Code:
=== START OF INFORMATION SECTION ===
Vendor:               Adaptec
Product:              Morphed JBOD 00
Revision:             V1.0
User Capacity:        999,643,152,384 bytes [999 GB]
Logical block size:   512 bytes
Logical Unit id:      0xb110884700d00000
Serial number:        478810B1
Device type:          disk
Local Time is:        Thu Apr 27 21:08:13 2017 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
This is an SSD (out of 3 in RAIDZ), connected through Intel ICH in a dual socket Westmere node:
Code:
=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 850 EVO 500GB
Serial Number:    S2RBNX0H404795M
LU WWN Device Id: 5 002538 d40ca2381
Firmware Version: EMT02B6Q
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Apr 27 21:11:04 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
 

guletz

gkovacs said:
Is there a way to migrate VM disks to different blocksize ZVOLs? IIRC if you change the blocksize of a ZVOL, it will not change the data that's already written...

Yes. You need to create a new virtual disk (at least as large as your current disk) using the zfs command line (see zpool history, and use volblocksize=32k or even more). Then attach a Clonezilla live CD ISO (or any cloning live CD) to the VM, boot from this ISO, and clone your 8k disk to the new 32k disk. Then use the new disk as the boot disk. If all is OK, remove the 8k disk from the VM.
This is only one possibility (others: dd, or migrating the vdisk to NFS and then copying it back to the VM...). I am describing what I have done many times, and it has worked for me. Cloning a vdisk is the safe option, as long as you do not mix up the source and destination vdisks. With more disks in a pool/vdev, use a bigger volblocksize (you can lose some space, but you gain a lot of speed and better IOPS, and scrubs are faster too).
 

fireon

gkovacs said:
This is an SSD (out of 3 in RAIDZ), connected through Intel ICH
I also have a RAIDZ with 3 Samsung SSDs here, but enterprise models, not cheap consumer ones. A copy test exceeds 1 GB/s.

We use only HDDs with "4096 bytes physical" sectors. This is recommended for ZFS.

We ran tests with JBOD some time ago on HP servers and got really bad results with ZFS. The same server with a real SAS HBA controller works really fast :) And we use only HDDs with 4K sectors.

Your problem is a bit strange, because you say it was working fine before the update. Maybe it really is the bug you found. But all our ZFS servers are running PVE 4.4, and none of them currently has performance problems.

Can you tell me how much memory one of your servers has? And have you set a ZFS memory limit?
https://pve.proxmox.com/wiki/ZFS_on_Linux#_limit_zfs_memory_usage
Thank you.
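
For reference, a small sketch of what setting such a limit looks like. zfs_arc_max is the ZFS module parameter that caps the ARC and takes a value in bytes. The helper names and the 8 GiB figure are only illustrations, not a recommendation for these servers.

```shell
#!/bin/sh
# gib_to_bytes: convert whole GiB to bytes (zfs_arc_max is given in bytes).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# arc_max_line: print the modprobe option line for a given ARC cap in GiB.
arc_max_line() {
    printf 'options zfs zfs_arc_max=%s\n' "$(gib_to_bytes "$1")"
}

# Usage (8 GiB is an illustrative value):
#   arc_max_line 8 >> /etc/modprobe.d/zfs.conf                # applied at module load
#   gib_to_bytes 8 > /sys/module/zfs/parameters/zfs_arc_max   # running system, as root
```

On nodes that boot from ZFS, the wiki page linked above should be checked for the extra step of refreshing the initramfs after changing the file.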
 

fabian

gkovacs said:
Is there a way to migrate VM disks to different blocksize ZVOLs? IIRC if you change the blocksize of a ZVOL, it will not change the data that's already written...
guletz said:
Yes. You need to create a new virtual disk (at least as large as your current disk) using the zfs command line (see zpool history, and use volblocksize=32k or even more). Then attach a Clonezilla live CD ISO (or any cloning live CD) to the VM, boot from this ISO, and clone your 8k disk to the new 32k disk. Then use the new disk as the boot disk. If all is OK, remove the 8k disk from the VM.
This is only one possibility (others: dd, or migrating the vdisk to NFS and then copying it back to the VM...). I am describing what I have done many times, and it has worked for me. Cloning a vdisk is the safe option, as long as you do not mix up the source and destination vdisks. With more disks in a pool/vdev, use a bigger volblocksize (you can lose some space, but you gain a lot of speed and better IOPS, and scrubs are faster too).
or you could just configure the ZFS storage in PVE with a different (vol)blocksize, and use the "move disk" feature (you can even do that with a running VM, but if you already experience I/O bottlenecks when restoring a backup this might trigger the same issue..).

you cannot change the block size of a zvol which has ever been written to, so you need to create a new zvol and rewrite/copy all the data once (one way or another - zfs send/receive, dd, qemu-img convert/offline move disk, qmp drive mirror/online move disk, ...).
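
A minimal sketch of the manual "new zvol plus dd" route, for illustration only. It prints the commands for review rather than running them, since a mistyped target would overwrite a disk. The zvol names, size and block size are hypothetical, and it assumes the VM is stopped during the copy.

```shell
#!/bin/sh
# plan_zvol_copy: print (not run) the commands for copying a VM disk to a
# new zvol with a different volblocksize. All names are hypothetical.
#   $1 source zvol (e.g. rpool/data/vm-100-disk-1)
#   $2 new zvol    (e.g. rpool/data/vm-100-disk-2)
#   $3 size of the new zvol, at least the source size (e.g. 32G)
#   $4 volblocksize for the new zvol (e.g. 32k)
plan_zvol_copy() {
    # Check the current size and block size of the source first.
    printf 'zfs get volsize,volblocksize %s\n' "$1"
    # volblocksize can only be chosen when the zvol is created.
    printf 'zfs create -V %s -o volblocksize=%s %s\n' "$3" "$4" "$2"
    # Copy with the VM stopped so the source does not change mid-copy.
    printf 'dd if=/dev/zvol/%s of=/dev/zvol/%s bs=1M\n' "$1" "$2"
}

# Usage: review the plan, then run the printed lines by hand as root.
#   plan_zvol_copy rpool/data/vm-100-disk-1 rpool/data/vm-100-disk-2 32G 32k
```

The VM config would still need to be pointed at the new disk afterwards. The "move disk" feature mentioned above handles that bookkeeping itself.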
 

linum

Member
Sep 25, 2011
65
1
8
Hmm, this is exactly what I see on two systems here. Take a look at these threads:

https://forum.proxmox.com/threads/proxmox-4-4-13-restore-issue.34379/
https://forum.proxmox.com/threads/high-io-delay-on-restore.34358/

I think there's a general problem with how the restore writes data to disk. In my opinion, too many different setups show these high IO loads and waits. And at least to me it seems that it doesn't matter which filesystem is used.

We could try to collect all affected hardware combinations in a Google sheet to see if there's something in common...
 

guletz

Again, my own experience is that I run many servers with ZFS (Linux CentOS, kernel 2.6.x, the latest ZFS version, Intel and AMD CPUs, with 2-8 HDDs and/or SSDs for cache/SLOG) and I have not seen this kind of bug, even though I sometimes have a very heavy load (>20) for 24-36 hours.
So I guess it is somehow related to the Debian/kernel version.
 

linum

@guletz I think this is also a valid option. The question is what we can do to get this fixed. Maybe it is possible to run the CentOS kernel with Proxmox, but I think this wouldn't be a drop-in replacement ... Any ideas?
 

gkovacs

SATA disks and JBOD expander is not recommended with ZFS. If you use a JBOD expander you should only use SAS disks.
Not sure where you read about a JBOD expander. Never used one.

We are using an Adaptec 6805E RAID controller in many of our nodes for connecting the member SATA disks, in Morphed JBOD mode, which passes through whole disks to the OS but keeps them bootable via the controller's BIOS.
 

gkovacs

fabian said:
or you could just configure the ZFS storage in PVE with a different (vol)blocksize, and use the "move disk" feature (you can even do that with a running VM, but if you already experience I/O bottlenecks when restoring a backup this might trigger the same issue..).

you cannot change the block size of a zvol which has ever been written to, so you need to create a new zvol and rewrite/copy all the data once (one way or another - zfs send/receive, dd, qemu-img convert/offline move disk, qmp drive mirror/online move disk, ...).
@fabian I have tried to move all the affected VMs to 32k blocksize ZVOLs, same errors. Also tried to convert all VMs to QCOW2 disks (stored on ZFS). No change. We have even tried to disable KSM, did not help either.

Unfortunately, the problem remains: every night during vzdump backups, some of the VMs become inaccessible via HTTP, and on their console there are numerous hung task timeout errors:

proxmox4-newlinux-20170616.jpg

Any idea why KVM guests' userland processes hang when their disk is being backed up by vzdump (or migrated/restored)?

As @linum posted, others are experiencing the same problem, and not just on ZFS:
https://forum.proxmox.com/threads/proxmox-4-4-13-restore-issue.34379/
https://forum.proxmox.com/threads/high-io-delay-on-restore.34358/
 

guletz

Hello @gkovacs,

I have seen something like this in the past, in a non-Proxmox (and also non-Debian) environment. In my case it happened when my server went into swap (no available memory) under high IO load. At the time, I concluded that it was a combination of factors (kernel version, ZFS version, and so on). The most important observation was that another server did not show the problem (it was a cluster with 2 hardware nodes, with the same software stack as the server with the problem, except that this one had more memory installed, so no swap was used).
One or two months later, after a ZFS and kernel upgrade, the problem disappeared, and my load was the same (some cron scripts generate the load). Another thing I remember is that, in my case, the heavy load was generated by a script that read the metadata of a very big folder tree (searching for one small file name among 7,000,000 files).

I am not 100% sure that what happened to me is the same as your case, but in such a case maybe the best solution is to avoid the effects rather than fix the cause. So maybe you could use something else, like zfs send/receive, to avoid the problem. Another idea would be to reserve more memory for ZFS metadata only. By default ZFS will use at most 50% of total memory. I do not know how much of that is for metadata, but I would try a minimum of 25% of total RAM.
I am sure that vzdump needs a lot of RAM for metadata. By the way, do these errors happen while vzdump is creating the gz archive, or before that (when it is working out which files/folders need to be archived)?
I am very sure that qcow2 is not the solution, because it puts additional load on ZFS (compared with the raw format), so it needs more IOPS from your storage.

Because you see the same problem during migration (where no vzdump is involved), I would guess that metadata is again the prime suspect for your errors.
Good luck, and reply if you have any news. I will keep my fingers crossed for you, because I have been in your situation in the past.
 

gkovacs

guletz said:
So maybe you could use something else, like zfs send/receive, to avoid the problem. Another idea would be to reserve more memory for ZFS metadata only. By default ZFS will use at most 50% of total memory. I do not know how much of that is for metadata, but I would try a minimum of 25% of total RAM.
I am sure that vzdump needs a lot of RAM for metadata.
Thank you for your post. I don't think that metadata is the problem, because we are backing up single QCOW2 files one at a time, which hardly involves a lot of metadata. We don't store big filesystem trees on ZFS, only a few QCOW2 disk files on every node.

guletz said:
By the way, do these errors happen while vzdump is creating the gz archive, or before that (when it is working out which files/folders need to be archived)?
The errors are always happening well into the backup (or disk restore / migration), but as I said above, there are no files/folders to archive, only a single ZVOL or QCOW2 disk for every VM, so it's probably the huge volume of sequential IO causing some buffers / caches to fill up and in turn block the IO of KVM guests.

guletz said:
I am very sure that qcow2 is not the solution, because it puts additional load on ZFS (compared with the raw format), so it needs more IOPS from your storage.
We now know that QCOW2 is not a solution (we tried it as an alternative to ZVOLs), and we also know now that this issue is not caused by ZFS, since this forum is full of threads reporting the same problem for KVM since Proxmox 3.x with LVM+ext4, NFS, etc.

Currently I think it's some kind of Linux kernel virtual memory / IO blocking issue that happens during high memory pressure and large sequential transfers. Interestingly, it happens more frequently on multi-socket (NUMA) hosts, and in Debian (7 and 8) guests. We already suspected KSM (especially with merging between NUMA nodes), but it has been ruled out (like many other suspects).

Now I suspect vm.dirty_ratio and vm.dirty_background_ratio again. I will see if I'm right tonight by changing them from 15/5 to 50/1.
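
To put the two settings side by side, a back-of-the-envelope sketch. It treats both ratios as a simple percentage of a memory figure; the kernel actually applies them to its own "dirtyable memory" estimate, so the numbers are only approximate. The helper names and the 4 GiB example are hypothetical.

```shell
#!/bin/sh
# dirty_limit_kb: approximate dirty-page threshold in kB for a ratio.
#   $1 memory in kB (e.g. MemAvailable from /proc/meminfo)   $2 ratio in percent
# The kernel bases the real limit on its own "dirtyable memory" figure,
# so this is only an estimate.
dirty_limit_kb() {
    echo $(( $1 * $2 / 100 ))
}

# compare_dirty: print background/foreground thresholds for one setting.
#   $1 memory in kB   $2 dirty_ratio   $3 dirty_background_ratio
compare_dirty() {
    printf 'ratio %s/%s: background flush at ~%s kB, writers throttled at ~%s kB\n' \
        "$2" "$3" "$(dirty_limit_kb "$1" "$3")" "$(dirty_limit_kb "$1" "$2")"
}

# Usage, with a hypothetical 4 GiB (4194304 kB) of available memory:
#   compare_dirty 4194304 15 5
#   compare_dirty 4194304 50 1
# Applying the new values at runtime (as root):
#   sysctl -w vm.dirty_ratio=50 vm.dirty_background_ratio=1
```

With 50/1, background writeback starts much earlier while the point at which writing processes are throttled moves much further out, which is the gap this experiment widens.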
 