> 3000 ms ping and packet drops with VirtIO under load

Discussion in 'Proxmox VE: Installation and configuration' started by Andreas Piening, Sep 2, 2017.

  1. Andreas Piening

    Joined:
    Mar 11, 2017
    Messages:
    58
    Likes Received:
    7
    I'm running PVE 5.0-30 with two KVM VMs with Windows Server 2016 installed.
    I did everything exactly as explained in this video: https://www.proxmox.com/de/training/video-tutorials/item/install-windows-2016-server-on-proxmox-ve.
    So I use ZFS and VirtIO for storage (SCSI) and network. The driver ISO is "virtio-win-0.1.126.iso", which was the most recent one when I downloaded it about 8 weeks ago.

    Both systems are running stable. However, if I put load on the network, like a full backup from inside the VM over the network or a Datev DB check (which basically accesses the DB from a network share), my network gets unstable: during normal usage I have ping response times between 20 and 35 ms on my local bridge, but under network load they are between 600 and 4,000 ms, sometimes with packets dropped for several seconds. My RDP connection gets dropped and reconnection attempts fail.
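    For what it's worth, I'm measuring with a plain timestamped ping against the VM while the backup runs (the IP is just a placeholder for my VM):

    Code:
    # -D prefixes every reply with an epoch timestamp (iputils ping),
    # so the latency spikes can be matched against the backup window
    ping -D 192.168.1.100 | tee ping-under-load.log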
    After the network load is over, everything is fine and stable again. No network hiccups and smooth RDP. Even copying several GBs over the network with SMB is no problem.

    I never had this behavior before, and the only differences to my other installations are the newer PVE (5.0 instead of 4.4) and Windows Server 2016 instead of 2012, but I don't believe this is a general issue.
    I'm speculating: could the VirtIO network drivers be the issue? Is anyone running Windows Server 2016 with a stable network even under load? Which VirtIO version are you using? My first idea is to replace the network drivers.

    All other suggestions are welcome.
    Is it a good idea to disable IPv6 completely when it is an IPv4-only network? Or to disable QoS and the topology discovery protocols in the network card configuration?

    Kind regards

    Andreas
     
  2. Andreas Piening

    Joined:
    Mar 11, 2017
    Messages:
    58
    Likes Received:
    7
    I thought about it and should add a few details about my network setup:
    The NICs of both Windows Server 2016 VMs are connected to a bridge:

    Code:
    brctl show vmbr1
    bridge name    bridge id            STP enabled    interfaces
    vmbr1          8000.aace816c169a    no             tap0
                                                       tap100i0
                                                       tap101i0
    tap0 is an OpenVPN server tap device and the other two are the VMs.
    So there is no physical device attached to this particular bridge. The NICs show up as 10 GBit devices inside the VMs. Could it just be that the VMs can send faster than the bridge can handle the traffic?
    Is it a good idea to use the "Rate Limit" option on the NICs in the PVE configuration? Since the "outside" is connected via WAN / OpenVPN, the traffic would probably never exceed 300 MBit/s, and that would still be enough for backups and everything else.
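    If I try that, I guess the CLI equivalent would be something like this (VM ID and MAC are placeholders; as far as I know the rate value is given in MB/s, so about 37.5 for 300 MBit/s):

    Code:
    # cap net0 of VM 100 at roughly 300 MBit/s (rate is in MB/s);
    # reuse the VM's existing MAC address, otherwise PVE generates a new one
    qm set 100 -net0 virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr1,rate=37.5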

    Does anyone have experience with this? Or am I thinking in the wrong direction?
     
  3. aderumier

    aderumier Member

    Joined:
    May 14, 2013
    Messages:
    203
    Likes Received:
    18
    >> during normal usage I have ping response times between 20 and 35 ms on my local bridge
    Is it really from the host bridge to the VM?
    I'm around 0.1 ms from the host bridge to the guest VM.

    More than 1 ms locally is really abnormal.
     
  4. micro

    micro Member
    Proxmox Subscriber

    Joined:
    Nov 28, 2014
    Messages:
    58
    Likes Received:
    12
    Not exactly the same setup (Linux guests here), but I notice in PVE 5, too, that there are huge network hiccups with VirtIO NICs under VirtIO disk load. IO wait induces huge network latency (1000-2000 ms) and packet reordering.
     
  5. Andreas Piening

    Joined:
    Mar 11, 2017
    Messages:
    58
    Likes Received:
    7
    No, you are right, same here: my ping values were from my local DSL line through OpenVPN to the bridge. I get a 0.13 ms response time on the local bridge from the PVE host.
     
  6. Andreas Piening

    Joined:
    Mar 11, 2017
    Messages:
    58
    Likes Received:
    7
    Oh, this makes sense: it happens especially when I do a backup, which causes high IO and network load at the same time.
    Is this an "official" issue? Is there a bug open for it?
    I wonder which component introduces the issue: the KVM version?
    Are there any known workarounds that can make it less bad? Have you tried throttling IO?
    It would be quite painful for me to switch back to PVE 4.4 because I noticed the problem right after going productive with the system.

    I tried to throttle the network to 30 MB/s, but I did not notice much of a difference. My RDP sessions still get dropped while I'm doing a backup or other IO-intensive things.
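    If I also try throttling disk IO, I assume it would be something along these lines (VM ID and disk name are placeholders from my setup):

    Code:
    # limit reads and writes of the first SCSI disk to 100 MB/s each
    qm set 100 -scsi0 local-zfs:vm-100-disk-1,mbps_rd=100,mbps_wr=100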
     
  7. aderumier

    aderumier Member

    Joined:
    May 14, 2013
    Messages:
    203
    Likes Received:
    18
    Note that QEMU uses a single thread by default, shared between disk, NIC, etc.

    Maybe you can try enabling iothread on the disk? (virtio, or SCSI with the virtio-scsi-single controller.)
    (Note that it's not yet compatible with Proxmox backup.)
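    Something like this, assuming VM ID 100 and an example storage/disk name:

    Code:
    # switch to the single virtio-scsi controller, then give the disk its own iothread
    qm set 100 -scsihw virtio-scsi-single
    qm set 100 -scsi0 local-zfs:vm-100-disk-1,iothread=1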
     
  8. micro

    micro Member
    Proxmox Subscriber

    Joined:
    Nov 28, 2014
    Messages:
    58
    Likes Received:
    12
    This is exactly what I'm wondering too. I was forced to move Elasticsearch off the guest machine in an emergency, because every 10-20 seconds there were network ping hiccups caused by Elasticsearch writing its shards to disk, and those were not really big IO writes. Everything (disk/network) is VirtIO on the guests and the storage is a SAN with about 600 fsync/s. I tried different cache modes (no cache, directsync, writethrough) with no difference.
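    For reference, I was switching the cache modes roughly like this (VM ID and disk name are examples):

    Code:
    # try a different cache mode on the VirtIO disk
    qm set 100 -virtio0 san-storage:vm-100-disk-1,cache=directsync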
     
    Andreas Piening likes this.
  9. micro

    micro Member
    Proxmox Subscriber

    Joined:
    Nov 28, 2014
    Messages:
    58
    Likes Received:
    12
    Some iostat and ping logs to show what is happening during the hiccups (this VM is a router; the ping is to hosts behind it):
     
    Andreas Piening likes this.
  10. Andreas Piening

    Joined:
    Mar 11, 2017
    Messages:
    58
    Likes Received:
    7
    Looks similar to my ping tests, but I get even over 8,000 ms and dropped packets while doing a backup job from inside the VM (not a PVE backup).
    I can't try the iothread option at the moment because the system is in use during working hours, but I will try it tonight.

    @micro Have you tried this option already?
     
  11. micro

    micro Member
    Proxmox Subscriber

    Joined:
    Nov 28, 2014
    Messages:
    58
    Likes Received:
    12
    I didn't. I'm not sure it would help in my case. I have iostat running on the cluster nodes. On all of them, during this surge, there is 100% utilization of the SAN for 2-3 seconds. I don't know why this is; I don't have any big writes that could drive the SAN to 100% utilization.
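    I'm watching it roughly like this on each node:

    Code:
    # extended per-device stats every second; the %util column
    # shows the SAN LUN at 100% during the spikes
    iostat -x 1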

    But what I'm wondering right now is why CPU iowait in the guest VM (caused by the host's SAN 100% utilization spikes) delays the routing/forwarding/shaping services provided by the guest. Is this behavior normal? I hope somebody from the Proxmox staff can answer this question.
     
    Andreas Piening likes this.
  12. Andreas Piening

    Joined:
    Mar 11, 2017
    Messages:
    58
    Likes Received:
    7
    Good point, and I really crossed my fingers for this to help, but it did not: same issue with iothread enabled for both virtual disks.
    This time I did a ping test while booting the machine and noticed dropped packets and ping response times over 4,000 ms. Even starting applications on the terminal server causes ping times to rise to multiple seconds. It is even worse than I thought at first.
    I don't know what to do next, since I can't downgrade PVE and I don't have a second server to migrate to.
     
  13. Andreas Piening

    Joined:
    Mar 11, 2017
    Messages:
    58
    Likes Received:
    7
    micro likes this.
  14. mac.linux.free

    Joined:
    Jan 29, 2017
    Messages:
    106
    Likes Received:
    5
    Did you try OVS (Open vSwitch) on your host?
     
  15. Andreas Piening

    Joined:
    Mar 11, 2017
    Messages:
    58
    Likes Received:
    7
    No, just a simple bridge setup. I don't know much about Open vSwitch; I just want my network to be reliable under load.
     
  16. mac.linux.free

    Joined:
    Jan 29, 2017
    Messages:
    106
    Likes Received:
    5
    It is really simple to try, and to switch back if it's not working for you... For me it works really well on all my PVE hosts.

    https://pve.proxmox.com/wiki/Open_vSwitch
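    A minimal OVS bridge in /etc/network/interfaces looks roughly like this (the address is an example; see the wiki for the full setup):

    Code:
    # replace the Linux bridge with an OVS bridge of the same name
    allow-ovs vmbr1
    iface vmbr1 inet static
        address 10.0.0.1
        netmask 255.255.255.0
        ovs_type OVSBridge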
     
  17. Andreas Piening

    Joined:
    Mar 11, 2017
    Messages:
    58
    Likes Received:
    7
    Sounds interesting. However, I don't think it is related to my problem: there is nothing wrong with a bridge setup if I don't need additional switching features. I have this setup running on a few other PVE installs with 4.4 and earlier versions and never had a problem with it.
    My network is completely unusable when I have IO load in my VMs, to the point that I can't even ping the VM directly from the bridge anymore. I can't see how an additional switching topology would make this better. Also, my system is connected to two sites via OpenVPN; it would be too much effort to change the whole network setup.
    Thank you anyway.
     
  18. mac.linux.free

    Joined:
    Jan 29, 2017
    Messages:
    106
    Likes Received:
    5
    I see. Where are you from? I'm from Stuttgart. Perhaps I can help you.
     
    Andreas Piening likes this.
  19. Andreas Piening

    Joined:
    Mar 11, 2017
    Messages:
    58
    Likes Received:
    7
    I'm from Hamburg. We probably both speak German, right? However, for a personal conversation let's use PM or something else; I want to stay on topic in this thread.
     
  20. micro

    micro Member
    Proxmox Subscriber

    Joined:
    Nov 28, 2014
    Messages:
    58
    Likes Received:
    12
    I'm using OVS. Still have the issue.
     
    Andreas Piening likes this.