KVM guests freeze (hung tasks) during backup/restore/migrate

Discussion in 'Proxmox VE: Installation and configuration' started by gkovacs, Apr 26, 2017.

  1. gkovacs

    gkovacs Active Member

    Joined:
    Dec 22, 2008
    Messages:
    500
    Likes Received:
    43
    This issue has been with us since we upgraded our cluster to Proxmox 4.x, and converted our guests from OpenVZ to KVM. We have single and dual socket Westmere, Sandy Bridge and Ivy Bridge nodes, using ZFS RAID10 HDD or ZFS RAIDZ SSD arrays, and every one of them is affected.

    Description
    When there is high IO load on the ZFS pools during vzdump, restore or migrate operations, guest IO slows down drastically or even freezes for a few seconds, resulting in:
    - lost network connectivity (Windows guests often lose Remote Desktop connections)
    - huge latency in network services
    - CPU soft lockups
    - CPU / rcu_scheduler stalls
    - blocked application messages in syslog, or even stack dumps

    We run monitoring services that poll the websites and other network services served by these guests every minute; that is how we realized we had a problem, because we started getting alerts during the nightly backups.

    This is today's soft lockup in a Debian 7 guest during the restore of another KVM guest to zfs-local (6x HDD ZFS RAID10) on a single socket Ivy Bridge system. There was no load on the system apart from the restore; the other guests were mostly inactive:
    [screenshot: soft lockup console output]
    Also a Windows KVM was unreachable during that time.

    Mitigation steps
    We have tried many tweaks to eliminate the problem (a rough sysctl sketch follows the list):
    - disabling C-states on Westmere systems
    - enabling the performance governor
    - applying the recommended swap settings from the ZFS wiki, plus vm.swappiness=1
    - increasing vm.min_free_kbytes on both hosts and guests
    - decreasing vm.dirty_ratio to 5-15 and vm.dirty_background_ratio to 1-5
    - installing NVMe SSDs as SLOG/L2ARC, and also for swap
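
    For reference, roughly what the sysctl side of these tweaks looks like (values are illustrative only; the exact numbers varied per node):
    Code:
    # /etc/sysctl.d/local.conf -- illustrative values, not a recommendation
    vm.swappiness = 1
    vm.min_free_kbytes = 262144
    vm.dirty_ratio = 10
    vm.dirty_background_ratio = 3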

    Some of these tweaks helped a little, but the issue is still happening: perhaps less during backups, but still heavily during restores and migrations. The issue seems to be connected to the Linux kernel's VM (virtual memory) subsystem, because if you set vm.vfs_cache_pressure to a high value (like 1000) in a KVM guest, the lockups happen much more often. Debian also looks more sensitive to it than, for example, Ubuntu, but Windows guests are affected as well.

    Help needed
    I am looking for input from others who also run KVM guests on zfs-local (zvols), and I am interested in whether they experience the same symptoms (you won't notice a 1-2 minute freeze unless you run some kind of monitoring). I would also welcome advice on how to diagnose this further, to decide whether the issue is in QEMU/KVM, ZFS or some other part of the kernel.
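
    For anyone who wants to check their own guests: a crude one-minute poll like the one below (URL and timeout are placeholders) is enough to catch these freezes from the outside:
    Code:
    #!/bin/bash
    # poll a service inside a guest once a minute and log status code and response time
    while true; do
        printf '%s ' "$(date '+%F %T')"
        curl -s -o /dev/null -m 10 -w '%{http_code} %{time_total}s\n' http://guest.example.com/
        sleep 60
    done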
     
    #1 gkovacs, Apr 26, 2017
    Last edited: Jun 24, 2017
    William Edwards likes this.
  2. Nemesiz

    Nemesiz Active Member

    Joined:
    Jan 16, 2009
    Messages:
    627
    Likes Received:
    36
    Under high load ZFS can hold read/write requests, queueing them or even refusing to accept new requests until it has finished the first ones. You can feel this even with ZFS commands like "zfs list" under high load: they only respond after a few seconds.

    I suggest you take a snapshot and, if you need a copy, copy from the snapshot with rsync --bwlimit.
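
    For a filesystem dataset, something along these lines (dataset name, paths and the bandwidth limit are only examples; --bwlimit is in KB/s):
    Code:
    zfs snapshot rpool/data@copyjob
    rsync -a --bwlimit=50000 /rpool/data/.zfs/snapshot/copyjob/ /mnt/backup/
    zfs destroy rpool/data@copyjob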

    To see HDD/SSD load I use atop on the host machine. It shows how heavily the disks are loaded.
     
  3. fireon

    fireon Well-Known Member
    Proxmox Subscriber

    Joined:
    Oct 25, 2010
    Messages:
    2,903
    Likes Received:
    164
    Can you also post one VM config, please?
    It looks strange: I also have 6 HDDs in RAID 10 here, and a RAIDZ with SSDs. When the backup runs you do not notice anything at all. I have some LXC containers, some KVM guests, also Windows Server 2016 and Ubuntu.

    Can you also post the SMART data from all your disks (smartctl -a /dev/disk/by-id/....)? Only these sections:
    Code:
    === START OF INFORMATION SECTION === 
    
    === START OF READ SMART DATA SECTION === 
    
    And the output of "zpool get all".
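
    For example, something like this prints just those sections for every disk (the by-id pattern is only an example, adjust it for your disks):
    Code:
    for d in /dev/disk/by-id/ata-*; do
        echo "== $d =="
        smartctl -a "$d" | sed -n '/START OF INFORMATION SECTION/,/^$/p; /START OF READ SMART DATA SECTION/,/^$/p'
    done
    zpool get all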

    Thanks
     
  4. gkovacs

    gkovacs Active Member

    Joined:
    Dec 22, 2008
    Messages:
    500
    Likes Received:
    43
    There is not much point in posting vm configs or S.M.A.R.T. reports, because this issue affects all of our servers and many different VMs.

    But you can reproduce it easily: according to our tests, when VMs have their raw disks on ZFS zvols and you start to restore another (big) VM to the same host, the already running VMs get starved of IO, their applications hang, their CPUs lock up and their kernels freeze. It is important to use zvols with cache=none (the Proxmox recommended configuration), as guests with QCOW2 disks and cache=writeback are much less sensitive to the IO starvation, probably because their IO is cached on the host (and not by ZFS).
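
    A rough reproduction recipe (archive name, VMID and storage name are placeholders): with one or more VMs already running from zvols with cache=none, restore a large backup to the same pool and watch both the pool and a running guest:
    Code:
    # on the host: restore a big backup onto the pool the running VMs use
    qmrestore /var/lib/vz/dump/vzdump-qemu-100.vma.lzo 999 --storage local-zfs
    # in another shell on the host: watch pool latency/throughput
    zpool iostat -v rpool 2
    # inside an already running guest: watch for hung task / soft lockup messages
    tail -f /var/log/kern.log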

    It currently looks like it's a ZFS scheduling / caching bug, and the ZFS on Linux issue tracker has a few similar problems already:
    https://github.com/zfsonlinux/zfs/issues/1538
    https://github.com/zfsonlinux/zfs/issues/5857
    https://github.com/zfsonlinux/zfs/issues/5867

    @fabian do you have the chance to try to reproduce this?
     
  5. fabian

    fabian Proxmox Staff Member
    Staff Member

    Joined:
    Jan 7, 2016
    Messages:
    3,191
    Likes Received:
    493
    just tried, cannot reproduce. no message on the host side, no messages on the vm side.
     
  6. guletz

    guletz Active Member

    Joined:
    Apr 19, 2017
    Messages:
    802
    Likes Received:
    108
    Hello gkovacs,

    1. What kind of HDDs do you have (4k sector size)?
    2. Do you use the same kind of RAID controller everywhere?
    3. Do you use dedup and/or compression?
    4. Do you use the 8k zvol block size (the Proxmox default)?
    5. Do you have ECC RAM?

    I cannot know what the problem is (I read the linked bug reports), but here is what I would try if I had this problem:

    1. Use at least 32k for the zvol block size (the IOPS load is huge for 8k compared with 32k); see the sketch below.
    2. Put all the disks from an affected server into another, non-Intel server (it is good to have servers with different CPUs, just in case...) and see if the problem follows.
    3. Try a vanilla Debian kernel with the latest ZFS modules available (a new ZFS upgrade arrived this week).
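
    For point 1, you can check what your current zvols use like this (the zvol name is only an example; volblocksize is fixed at creation time):
    Code:
    zfs get volblocksize rpool/data/vm-100-disk-1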

    My guess is that your problems are related to Debian / the kernel, because I do not see people complaining about this on other Linux distributions, but maybe I am wrong, as usual ;)
     
  7. gkovacs

    gkovacs Active Member

    Joined:
    Dec 22, 2008
    Messages:
    500
    Likes Received:
    43
    All of our servers are affected by this.
    - some nodes use SSD drives connected to Intel ICH controller (ZFS RAIDZ)
    - some nodes use Toshiba DT01ACA drives connected to Adaptec 6805E controllers (disks are Morphed JBOD, ZFS RAID10)
    - we use default compression, no dedup
    - we use default ZVOL block size (8k)
    - some servers have ECC RAM, some don't

    A bit more info:
    - every node boots from ZFS
    - swap is on ZFS in most nodes (rpool/swap), but in some nodes it's on NVME partition
    - we have SLOG+L2ARC on NVME in some RAID10 nodes

    Thanks for your ideas. Unfortunately, as I wrote above, this issue does not affect just a single node or a single set of disks; it is happening on all our nodes.

    Is there a way to migrate VM disks to different blocksize ZVOLs? IIRC if you change the blocksize of a ZVOL, it will not change the data that's already written...
     
  8. fireon

    fireon Well-Known Member
    Proxmox Subscriber

    Joined:
    Oct 25, 2010
    Messages:
    2,903
    Likes Received:
    164
    SATA disks behind the controller's JBOD mode instead of a real SAS HBA is not optimal...
    @gkovacs
    Please post the SMART info, and the output of "zpool get all".

    Thanks.
     
  9. gkovacs

    gkovacs Active Member

    Joined:
    Dec 22, 2008
    Messages:
    500
    Likes Received:
    43
    Morphed JBOD is the only way to give a full disk to ZFS while still being able to boot from the Adaptec card. Any more insight on why this is not optimal?

    We have a whole cluster of servers, each full of hard drives and some with solid state drives. As I wrote above, all of the servers exhibit the same problem, which is very likely a ZFS scheduler issue.

    This is a HDD (out of 6 in ZFS RAID10), connected through Adaptec card in a single socket Ivy Bridge node:
    Code:
    === START OF INFORMATION SECTION ===
    Vendor:               Adaptec
    Product:              Morphed JBOD 00
    Revision:             V1.0
    User Capacity:        999,643,152,384 bytes [999 GB]
    Logical block size:   512 bytes
    Logical Unit id:      0xb110884700d00000
    Serial number:        478810B1
    Device type:          disk
    Local Time is:        Thu Apr 27 21:08:13 2017 CEST
    SMART support is:     Available - device has SMART capability.
    SMART support is:     Enabled
    Temperature Warning:  Disabled or Not Supported
    
    === START OF READ SMART DATA SECTION ===
    This is an SSD (out of 3 in RAIDZ), connected through Intel ICH in a dual socket Westmere node:
    Code:
    === START OF INFORMATION SECTION ===
    Model Family:     Samsung based SSDs
    Device Model:     Samsung SSD 850 EVO 500GB
    Serial Number:    S2RBNX0H404795M
    LU WWN Device Id: 5 002538 d40ca2381
    Firmware Version: EMT02B6Q
    User Capacity:    500,107,862,016 bytes [500 GB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    Solid State Device
    Form Factor:      2.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
    Local Time is:    Thu Apr 27 21:11:04 2017 CEST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    
     
  10. guletz

    guletz Active Member

    Joined:
    Apr 19, 2017
    Messages:
    802
    Likes Received:
    108
    gkovacs said: "Is there a way to migrate VM disks to different blocksize ZVOLs? IIRC if you change the blocksize of a ZVOL, it will not change the data that's already written..."

    Yes. You need to create a new virtual disk (at least as large as your current disk) using the ZFS command line (see zpool history, and use volblocksize=32k or even more). Then attach a Clonezilla live CD ISO (or any cloning live CD) to the VM, boot from it and clone your 8k disk onto the new 32k disk. Then use the new disk as the boot disk, and if everything is OK, remove the 8k disk from the VM.
    This is only one possibility (dd, migrating the vdisk to NFS and copying it back, ...). I am describing what I have done many times, and it has worked for me. Cloning a vdisk is the safe option, as long as you do not mix up the source and destination vdisks. With more disks in a pool/vdev, use a bigger volblocksize (you may lose some space, but you gain a lot of speed and better IOPS, and scrubs are faster too).
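
    If you prefer the dd route instead of a cloning CD, the offline copy looks roughly like this (names and size are examples; the VM must be powered off):
    Code:
    # create the new zvol with the bigger volblocksize, at least as large as the old one
    zfs create -V 100G -o volblocksize=32k rpool/data/vm-100-disk-2
    # copy the old 8k zvol onto it block by block
    dd if=/dev/zvol/rpool/data/vm-100-disk-1 of=/dev/zvol/rpool/data/vm-100-disk-2 bs=1M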
     
    #10 guletz, Apr 27, 2017
    Last edited: Apr 27, 2017
    gkovacs likes this.
  11. fireon

    fireon Well-Known Member
    Proxmox Subscriber

    Joined:
    Oct 25, 2010
    Messages:
    2,903
    Likes Received:
    164
    I also have a RAIDZ with 3 Samsung SSDs here, but enterprise models, not cheap consumer ones. A copy test goes over 1 GB/s.

    We only use HDDs with 4096-byte physical sectors; this is recommended for ZFS.

    We did tests with JBOD a while ago on HP servers and had really bad results with ZFS. The same server with a real SAS HBA controller works really fast :) And we only use 4K HDDs.

    Your problem is a little bit strange, because you say it was working fine before the upgrade. Maybe it really is the bug you found. But all our ZFS servers are running under PVE 4.4, and we currently have performance problems with none of them.

    Can you tell me how much memory one of your servers has? And have you set the ZFS memory limit?
    https://pve.proxmox.com/wiki/ZFS_on_Linux#_limit_zfs_memory_usage
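
    For example, limiting the ARC to 8 GiB would look roughly like this (the value is only an example, size it for your RAM, then update the initramfs and reboot):
    Code:
    # /etc/modprobe.d/zfs.conf
    options zfs zfs_arc_max=8589934592
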
    Thank you.
     
  12. mir

    mir Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 14, 2012
    Messages:
    3,476
    Likes Received:
    95
    SATA disks behind a JBOD expander are not recommended with ZFS. If you use a JBOD expander, you should only use SAS disks.
     
  13. fabian

    fabian Proxmox Staff Member
    Staff Member

    Joined:
    Jan 7, 2016
    Messages:
    3,191
    Likes Received:
    493
    or you could just configure the ZFS storage in PVE with a different (vol)blocksize, and use the "move disk" feature (you can even do that with a running VM, but if you already experience I/O bottlenecks when restoring a backup this might trigger the same issue..).

    you cannot change the block size of a zvol which has ever been written to, so you need to create a new zvol and rewrite/copy all the data once (one way or another - zfs send/receive, dd, qemu-img convert/offline move disk, qmp drive mirror/online move disk, ...).
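
    for example, the storage definition in /etc/pve/storage.cfg could look roughly like this (storage and pool names are just examples); newly created or moved disks then get the bigger volblocksize:
    Code:
    zfspool: local-zfs-32k
            pool rpool/data
            blocksize 32k
            content images,rootdir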
     
  14. linum

    linum Member

    Joined:
    Sep 25, 2011
    Messages:
    65
    Likes Received:
    1
    Hmm, this is exactly what I see on two systems here. And take a look at these threads:

    https://forum.proxmox.com/threads/proxmox-4-4-13-restore-issue.34379/
    https://forum.proxmox.com/threads/high-io-delay-on-restore.34358/

    I think there is a general problem with how the restore writes data to disk. In my opinion there are too many different setups showing these high IO loads and waits, and it seems to me that it doesn't matter which filesystem is used.

    We could try to collect all the affected hardware combinations in a Google sheet to see if there's something in common...
     
    gkovacs likes this.
  15. guletz

    guletz Active Member

    Joined:
    Apr 19, 2017
    Messages:
    802
    Likes Received:
    108
    Again, from my own experience: I run many servers with ZFS (CentOS Linux, kernel 2.6.x, the latest ZFS version, Intel and AMD CPUs, with 2-8 HDDs and/or SSDs for cache/SLOG) and I have not seen this kind of bug, despite sometimes having a very heavy load (>20) for 24-36 hours.
    So I guess it is somehow related to the Debian/kernel version.
     
  16. linum

    linum Member

    Joined:
    Sep 25, 2011
    Messages:
    65
    Likes Received:
    1
    @guletz I think this is also a valid possibility. The question is what we can do to get this fixed. Maybe it is possible to run the CentOS kernel with Proxmox, but I think this wouldn't be a drop-in replacement... Any ideas?
     
  17. gkovacs

    gkovacs Active Member

    Joined:
    Dec 22, 2008
    Messages:
    500
    Likes Received:
    43
    Not sure where you read about a JBOD expander. Never used one.

    We are using an Adaptec 6805E RAID controller in many of our nodes for connecting the member SATA disks, in Morphed JBOD mode, which passes through whole disks to the OS but keeps them bootable via the controller's BIOS.
     
  18. gkovacs

    gkovacs Active Member

    Joined:
    Dec 22, 2008
    Messages:
    500
    Likes Received:
    43
    @fabian I have tried moving all the affected VMs to 32k blocksize zvols: same errors. I also tried converting all VMs to QCOW2 disks (stored on ZFS): no change. We have even tried disabling KSM, which did not help either.

    Unfortunately, the problem remains: every night during vzdump backups, some of the VMs become inaccessible via HTTP, and on their console there are numerous hung task timeout errors:

    [screenshot: proxmox4-newlinux-20170616.jpg]

    Any idea why KVM guests' userland processes hang while their disk is being backed up by vzdump (or migrated/restored)?

    As @linum posted, others are experiencing the same problem, and not just on ZFS:
    https://forum.proxmox.com/threads/proxmox-4-4-13-restore-issue.34379/
    https://forum.proxmox.com/threads/high-io-delay-on-restore.34358/
     
    #18 gkovacs, Jun 16, 2017
    Last edited: Jun 20, 2017
  19. guletz

    guletz Active Member

    Joined:
    Apr 19, 2017
    Messages:
    802
    Likes Received:
    108
    Hello @gkovacs ,


    I have seen something like this in the past, in a non-Proxmox (and also non-Debian) environment. In my case it happened when my server went into swap (no available memory) combined with high IO load. At the time I concluded it was a combination of factors (kernel version, ZFS version, and so on). The most important observation was that another server (it was a cluster of 2 hardware nodes with the same software stack as the problematic one, except that it had more memory installed, so no swap was used) did not show the problem.
    One or two months later, after a ZFS and kernel upgrade, the problem disappeared, even though my load stayed the same (some cron scripts generate the load). Another thing I remember is that in my case the heavy load was generated by a script that read the metadata of some very big folders (searching for one small file name among 7,000,000 files).

    I am not 100% sure that what happened to me is the same as what is happening to you, but in such a case maybe the best solution is to work around the effects rather than the cause of the problem. So maybe you can consider using something else, like zfs send/receive, and avoid the problem entirely. Another idea could be to reserve more memory for ZFS metadata (a sketch follows below); by default ZFS will eat at most 50% of total memory, and I do not know how much of that goes to metadata, but I would try at least 25% of total RAM.
    I am sure that vzdump needs a lot of RAM for metadata. By the way, do these errors happen after vzdump starts creating the gz archive, or before (while it is still working out which files/folders need to be archived)?
    I am quite sure that qcow2 is not the solution, because it puts additional load on ZFS (compared with the raw format), so it needs more IOPS from your storage.
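
    If you want to experiment with that, I believe the relevant ZFS module parameter is zfs_arc_meta_limit (the value below is only an example, about 8 GiB; check the documentation for your ZFS version first):
    Code:
    # /etc/modprobe.d/zfs.conf
    options zfs zfs_arc_meta_limit=8589934592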

    Because you see the same problem during migration (so no vzdump is involved), my guess is that the same metadata pressure is the prime suspect for your errors.
    Good luck, and reply if you have any news. I will cross my fingers for you, because I have been in your situation in the past.
     
  20. gkovacs

    gkovacs Active Member

    Joined:
    Dec 22, 2008
    Messages:
    500
    Likes Received:
    43
    Thank you for your post. I don't think that metadata is the problem, because we are backing up single QCOW2 files at a time, which use hardly any metadata. We don't store big filesystem trees on ZFS, only a few QCOW2 disk files on each node.

    The errors always happen well into the backup (or disk restore / migration), but as I said above, there are no files/folders to archive, only a single zvol or QCOW2 disk per VM, so it's probably the huge volume of sequential IO causing some buffers or caches to fill up and in turn block the IO of the KVM guests.

    We now know that QCOW2 is not a solution (we tried it as an alternative to zvols), and we also know that this issue is not caused by ZFS, since this forum is full of threads reporting the same problem for KVM since Proxmox 3.x with LVM+ext4, NFS, etc.

    Currently I think it's some kind of Linux kernel virtual memory / IO blocking issue that happens under high memory pressure and large sequential transfers. Interestingly, it happens more frequently on multi-socket (NUMA) hosts and in Debian (7 and 8) guests. We had already thought of KSM (especially with merging between NUMA nodes), but it has been ruled out (like many other suspects).

    Now I suspect vm.dirty_ratio and vm.dirty_background_ratio again; we will see if I'm right tonight after changing them from 15/5 to 50/1.
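
    The change itself, applied at runtime on the hosts:
    Code:
    sysctl -w vm.dirty_ratio=50
    sysctl -w vm.dirty_background_ratio=1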
     
    #20 gkovacs, Jun 19, 2017
    Last edited: Jun 24, 2017