30% Performance Regression after upgrading Proxmox 5.0 to 5.4

Discussion in 'Proxmox VE: Installation and configuration' started by Joffrey, May 2, 2019.

  1. Joffrey

    Joffrey New Member
    Proxmox Subscriber

    Hello,

    I recently ran into an issue similar to https://forum.proxmox.com/threads/proxmox-5-4-issues-with-vm-performance.53769/ but I thought it was better to open a new thread rather than mix up the details. Please merge it into that thread if you prefer.

    I am running a pool of Proxmox physical hosts, each hosting KVM web servers (Laravel with nginx/PHP 7.1) for a high-traffic web application.

    On April 19th, we upgraded 2 hosts from Proxmox 5.0 to 5.4 to solve network issues (a VM would suddenly become unreachable, could no longer be pinged, and had to be restarted).

    The API1 host was upgraded during the day.
    The API3 host was upgraded the same day in the evening.
    You can see the throughput in Graph 1.

    During daily traffic, we observed that the 2 newly upgraded hosts have been providing lower throughput than usual, and lower throughput than the host that was not upgraded.
    vmstat output showed:
    - 30% user CPU and 30% less traffic served on the 2 upgraded hosts.
    - 20% user CPU and normal traffic served on the non-upgraded host.
    perf output showed the same system calls and CPU usage (although 2x more events on the upgraded hosts).
    All VMs are running with the TSC clocksource.
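
    For reference, figures like these can be collected with standard tools; the commands below are illustrative, not the exact invocations we used:

    # Clocksource in use inside a VM (expected: tsc)
    cat /sys/devices/system/clocksource/clocksource0/current_clocksource

    # CPU utilisation breakdown, sampled every 5 seconds
    vmstat 5

    # System-wide event counts / profile over 30 seconds on the host
    perf stat -a sleep 30
    perf record -a -g -- sleep 30 && perf report --stdio | head -50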

    Initially we suspected the newly introduced kernel security patches (Spectre/Meltdown, etc.), and we disabled them on one of the hosts. Performance was still lower than before.
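
    How exactly the mitigations were disabled is not detailed here; on 4.15-era kernels a common way (sketched below as an example only) is to check the sysfs status and turn the mitigations off via kernel boot parameters:

    # Current mitigation status, one file per known vulnerability
    grep . /sys/devices/system/cpu/vulnerabilities/*

    # Example: disable KPTI and the Spectre v2 mitigation by adding
    # "nopti nospectre_v2" to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
    update-grub
    reboot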

    While still investigating, and to rule out the kernel as a possible root cause, we rebooted one host on April 30th with a previous kernel (without downgrading Proxmox). We did not change any kernel security features on this host.
    After rolling the kernel back from 4.15.18-13-pve to 4.10.15-1-pve, we are observing correct performance again.
    You can see Throughput in Graph 2.

    No change was made on the host or on the VMs except the Proxmox upgrade itself, performed via apt-get dist-upgrade.

    We are planning to test other Proxmox kernels between 4.10 and 4.15 to pinpoint where the regression was introduced. It is not yet known whether the performance hit comes from a Proxmox patch or from the upstream Linux kernel.

    As specific Linux kernel versions are tagged as dependencies of specific Proxmox releases, we also don't know whether running an older kernel has any impact. We have not encountered any instabilities so far.

    We still have 1 host running Proxmox 5.4 with the latest kernel, and we can provide any perf/vmstat or similar output before downgrading the kernel or changing settings.

    Joffrey
     

    Attached Files:

  2. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Hi,

    it is hard to say with so little information about your system.
    But after Meltdown/Spectre the kernel received huge changes in its core components.

    I would generally recommend updating your BIOS to the latest available version, which includes the latest microcode.
    Alternatively, for testing, you can use the Debian microcode package [1][2].
    We have seen that odd things happen when the microcode does not match the kernel.

    1.) https://wiki.debian.org/Microcode
    2.) https://packages.debian.org/de/stretch/intel-microcode
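
    For completeness, installing the package from [2] boils down to something like the following (the "non-free" component has to be enabled in the APT sources first; treat this as a sketch, not an official procedure):

    # Add "non-free" to the stretch entries in /etc/apt/sources.list, then:
    apt-get update
    apt-get install intel-microcode
    reboot

    # Check which microcode revision the kernel loaded
    dmesg | grep -i microcode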
     
  3. Joffrey

    Joffrey New Member
    Proxmox Subscriber

    Hello,

    We investigated further and realized that the server was configured with pvetest as its repository.

    We have booted the following kernels to narrow down where the drop in performance comes from (how we switched kernels is sketched below):
    - 4.10.15-1-pve: no performance issue
    - 4.13.16-2-pve: no performance issue
    - 4.15.18-9-pve: no performance issue
    - 4.15.18-12-pve: performance drop
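
    On Proxmox 5.x each kernel ships as its own package, so switching to an older one for a test looks roughly like this (a sketch; the package name is taken from the list above):

    # Install the kernel to test; it is kept alongside the current one
    apt-get install pve-kernel-4.13.16-2-pve

    # List the installed PVE kernel packages
    dpkg -l 'pve-kernel-*' | grep ^ii

    # Reboot, pick the wanted kernel from the GRUB "Advanced options"
    # submenu, then confirm which one is running:
    uname -r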

    The server is an HP 360 Gen9 with the latest BIOS.

    Regards,
    Joffrey
     
  4. robertb

    robertb New Member

    Hello,

    I am experiencing issues too - packet processing performance has dropped significantly.
    It seems like newer kernel versions are affected somehow; the same workload is just fine on a 4.15.18-3-pve #1 SMP PVE 4.15.18-22 machine.
    However, 4.15.18-13-pve #1 SMP PVE 4.15.18-37 causes massive inbound discards when receiving higher pps counts.
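
    For anyone who wants to compare, the discards show up in the standard interface counters; a few ways to watch them (eth0 is just an example interface name):

    # Driver/hardware counters exposed by the NIC
    ethtool -S eth0 | egrep -i 'drop|discard|fifo|miss'

    # Kernel RX/TX statistics, including dropped packets
    ip -s link show dev eth0

    # Refresh the per-interface counters every second
    watch -n1 cat /proc/net/dev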
     
  5. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    What hardware do you use?
    The NIC types would also be interesting.
     
  6. robertb

    robertb New Member

    Hello,

    in our case the setup is:
    2x Xeon E5-2690 v2 per server on a Supermicro mainboard, with the two onboard NICs (Intel i350) connected via LACP to the switches.
    All machines have the same specs. However, the freshly booted ones running the newer kernel struggle massively; we sometimes measure up to 100 kpps in discards. If we move the virtual machines to hosts with older kernel versions, everything is fine.
    That's why we think this could be related to the kernel.

    driver: igb
    version: 5.3.5.18

    Maybe the kernel isn't able to drain the NIC buffers fast enough? This value is quite high on the freshly booted servers (10x higher than on the machines that have been running for some months already):
    ethtool -S eth0|grep fifo
    rx_fifo_errors: 62972750
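
    A possible next step (untested here) would be the usual mitigation for RX ring overruns, enlarging the igb descriptor ring:

    # Show current and maximum ring sizes
    ethtool -g eth0

    # Grow the RX ring (the i350 usually reports a 4096 maximum)
    ethtool -G eth0 rx 4096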


    Thanks for your consideration.
     
  7. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    I can't reproduce this here on a Supermicro board with the same settings; only the CPU differs (an E5-2620 v3).
    Do you use any special network settings like OVS, NAT, or a bridged network?
     
  8. robertb

    robertb New Member

    Hi,

    No, there is no special setup involved; it's a regular (VLAN-aware) Linux bridge that the virtual machines are connected to.
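
    For context, "regular" here means an ifupdown configuration roughly like the sketch below (interface names, addresses and VLAN range are placeholders, not our real values):

    # /etc/network/interfaces (sketch)
    auto bond0
    iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3

    auto vmbr0
    iface vmbr0 inet static
        address 192.0.2.10
        netmask 255.255.255.0
        gateway 192.0.2.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094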
    Could it be related to some of the CPU bug fixes in the recent kernel changes? Maybe the v2 CPUs are affected differently than the v3, but that's just a guess. In any case, the microcode version on the machines is the same, according to /proc/cpuinfo.
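
    Comparing the loaded microcode revision across hosts can be done with a one-liner like this (the hostnames are examples):

    # Print the microcode revision reported by each host
    for h in host1 host2 host3; do
        echo -n "$h: "
        ssh "$h" 'grep -m1 microcode /proc/cpuinfo'
    done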

    I tried multiple - yet spec-wise identical - machines with the newer kernel; they all had the same trouble.
    The NIC drivers are even the same on the older kernel version.

    The hypervisor itself is also quite unresponsive while those packet discards are happening on the i350 NIC, even though the management SSH & cluster network uses a dedicated NIC with its own LACP bond (Intel X520) in a different VLAN.
     
  9. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    I guess it has something to do with the i350 igb driver, because we also see customers with this NIC having problems with udev renaming.
    The problem is that I need a test case to debug it.
    Do you have dual-socket systems, or just one socket on the board?
     
  10. robertb

    robertb New Member
