30% Performance Regression after upgrading Proxmox 5.0 to 5.4

Hello,

I recently had an issue similar to https://forum.proxmox.com/threads/proxmox-5-4-issues-with-vm-performance.53769/, but I thought it was better to open a new thread so as not to mix up the details. Please merge it into that thread if you feel it belongs there.

I am running a pool of Proxmox physical hosts, each hosting KVM web servers running Laravel with nginx/PHP 7.1 for a high-traffic web application.

On April 19th, we upgraded 2 hosts from Proxmox 5.0 to 5.4 to solve network issues (a VM would suddenly become unreachable, not even responding to ping, and had to be restarted).

The API1 host was upgraded during the day.
The API3 host was upgraded the same day in the evening.
You can see the throughput in Graph 1.

During daily traffic, we observed that the 2 newly upgraded hosts have been delivering lower throughput than usual, and lower than the host which was not upgraded.
vmstat output showed:
- 30% user CPU and 30% less traffic served on the 2 upgraded hosts.
- 20% user CPU and normal traffic served on the non-upgraded host.
perf output showed the same system calls and CPU usage profile (although 2x more events on the upgraded hosts).
All VMs are running with TSC clocksource.
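For completeness, this is how we check the active clocksource inside a guest (a minimal sketch using the standard sysfs files):
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
cat /sys/devices/system/clocksource/clocksource0/available_clocksource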

Initially we suspected the newly introduced kernel security mitigations (Spectre/Meltdown etc.), so we disabled them on one of the hosts. Performance was still lower than before the upgrade.
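For reference, this is roughly how we checked and disabled the mitigations on that host (a sketch; the exact boot parameters depend on the kernel version, so treat them as assumptions and check the kernel documentation):
grep . /sys/devices/system/cpu/vulnerabilities/*    # show which mitigations the running kernel applies
# then add e.g. "pti=off spectre_v2=off" to GRUB_CMDLINE_LINUX in /etc/default/grub
update-grub
reboot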

Still investigating, and to rule out the kernel as a possible root cause, we rebooted one host on April 30th with a previous kernel (without downgrading Proxmox). We did not change any kernel security features on this host.
After rolling back the kernel from 4.15.18-13-pve to 4.10.15-1-pve, we are observing correct performance again.
You can see the throughput in Graph 2.
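For anyone who wants to reproduce this, the rollback was just a matter of booting the older, still-installed kernel from GRUB; roughly like this (a sketch; the menu entry names will differ on your system, and grub-reboot requires GRUB_DEFAULT=saved in /etc/default/grub):
ls /boot/vmlinuz-*                   # list the installed kernels
grep menuentry /boot/grub/grub.cfg   # find the entry for 4.10.15-1-pve
grub-reboot '<menu entry for 4.10.15-1-pve>'   # or grub-set-default to make it permanent
reboot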

No change was made on the host or on the VMs, except for the Proxmox upgrade, performed via apt-get dist-upgrade.
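In other words, nothing more elaborate than the usual:
apt-get update
apt-get dist-upgrade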

We are planning to test other Proxmox kernels between 4.10 and 4.15 to pinpoint where the regression was introduced. It is not yet known whether the performance hit comes from a Proxmox patch or from the upstream Linux kernel.
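The plan is simply to install the intermediate kernel packages one by one and reboot into each, along these lines (a sketch; the package names follow the pve-kernel-<version>-pve pattern and are assumptions, check apt-cache for what is actually available in your repository):
apt-cache search pve-kernel | sort
apt-get install pve-kernel-4.13.16-2-pve
reboot
uname -r    # confirm which kernel is running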

As specific Linux kernel versions are tagged as dependencies of specific Proxmox releases, we also don't know whether running an older kernel has any impact. We have not encountered any instabilities so far.

We still have 1 host running Proxmox 5.4 with the latest kernel; we are available to provide any perf/vmstat or similar outputs before downgrading the kernel or changing settings.
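These are the kind of commands we would run, if that helps (a sketch; durations are arbitrary):
vmstat 5 12                       # CPU / run queue / memory snapshot over one minute
perf record -a -g -- sleep 30     # system-wide profile for 30 seconds
perf report --stdio > perf_report.txt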

Joffrey
 

Attachments

  • proxmox5.4_performance_graph1.png
  • proxmox5.4_performance_graph2.png
Hi,

it is hard to say with so little information about your system.
But after Meltdown/Spectre, the kernel received huge changes in its core components.

I would generally recommend updating your BIOS to the latest version, which includes the latest microcode.
Alternatively, for testing, you can use the microcode Debian package [1][2].
We have seen that odd things happen if the microcode does not match the kernel.

1.) https://wiki.debian.org/Microcode
2.) https://packages.debian.org/de/stretch/intel-microcode
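Roughly like this (a sketch for Debian Stretch; the non-free component must be enabled and an Intel CPU is assumed):
# /etc/apt/sources.list needs the non-free component, e.g.:
#   deb http://deb.debian.org/debian stretch main contrib non-free
apt-get update && apt-get install intel-microcode
reboot
dmesg | grep -i microcode    # verify the loaded microcode revision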
 
Hello,

We investigated further and realized that the server was configured with pvetest as its repository.
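For reference, the repository lines in question look roughly like this (a sketch; the exact file under /etc/apt/sources.list.d/ may differ):
# what the server was configured with:
deb http://download.proxmox.com/debian/pve stretch pvetest
# what we normally use:
deb http://download.proxmox.com/debian/pve stretch pve-no-subscription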

We tested the following kernels to investigate where the drop in performance comes from:
- 4.10.15-1-pve - no performance issue
- 4.13.16-2-pve - no performance issue
- 4.15.18-9-pve - no performance issue
- 4.15.18-12-pve - performance drop

The server is an HP 360 Gen9 with the latest BIOS.

Regards,
Joffrey
 
Hello,

I am experiencing issues too - packet processing performance has dropped significantly.
Seems like newer kernel versions are affected somehow; the same workload is just fine on a 4.15.18-3-pve #1 SMP PVE 4.15.18-22 machine.
However, 4.15.18-13-pve #1 SMP PVE 4.15.18-37 causes massive in-discards when receiving higher pps counts.
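This is roughly how I watch the discards, for the record (a sketch; eth0 stands in for the actual bond/NIC interface):
ip -s -s link show dev eth0                    # RX dropped / missed counters
ethtool -S eth0 | grep -iE 'drop|miss|fifo'    # driver-level drop counters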
 
What HW do you use?
Also, NIC types are interesting?
 
Hello,

in our case the setup is:
2x Xeon E5-2690 v2 per server on a Supermicro mainboard, with the two onboard NICs (Intel i350) connected via LACP to the switches.
All machines have the same specs. However, the freshly booted ones with the newer kernel struggle massively; we sometimes measure up to 100 kpps of in-discards. If we move the virtual machines to hosts with older kernel versions, everything is fine.
That's why we think this could be related somehow.
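For context, the relevant part of /etc/network/interfaces looks roughly like this (a sketch from memory; interface names are placeholders and the exact bond/bridge option spellings depend on the ifupdown version in use):
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2            # the two onboard i350 ports (placeholder names)
    bond-mode 802.3ad                # LACP towards the switches
    bond-miimon 100
    bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes            # VLAN-aware Linux bridge for the VMs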

driver: igb
version: 5.3.5.18

Maybe the kernel isn't able to drain the NIC buffers fast enough? This counter is quite high on the freshly booted servers (10x higher than on the machines that have been running for some months already):
ethtool -S eth0 | grep fifo
rx_fifo_errors: 62972750
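We also looked at the ring buffer settings, roughly like this (a sketch; eth0 is a placeholder, and the usable maximum is whatever ethtool -g reports):
ethtool -g eth0                                   # current vs. maximum RX/TX ring sizes
ethtool -S eth0 | grep -iE 'rx_queue|drop|fifo'   # per-queue and drop counters
# as an experiment, the RX ring could be enlarged towards the reported maximum:
ethtool -G eth0 rx 4096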


Thanks for your consideration.
 
I can't see this here on a Supermicro board with the same setup; only the CPU differs (an E5-2620 v3).
Do you use any special network settings like OVS, NAT, or a bridged network?
 
Hi,

no, there is no special setup involved; it's a regular (VLAN-aware) Linux bridge that the virtual machines are connected to.
Could it be related to some of the CPU bug fixes in the recent kernel changes? Maybe v2 is affected differently than v3, just a guess. Anyway, the microcode version is the same on all machines, according to /proc/cpuinfo.
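We compare it like this, in case the method matters (a trivial sketch):
grep -m1 microcode /proc/cpuinfo    # revision as seen by the kernel
dmesg | grep -i microcode           # revision loaded/updated at boot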

I tried multiple - yet spec-wise identical - machines with the newer kernel; they all had the same trouble.
The NIC driver version is even the same on the older kernel.

The hypervisor itself is also quite unresponsive while those packet discards occur on the i350 NIC, even though the management SSH & cluster network uses a dedicated NIC with its own LACP bond (Intel X520) in a different VLAN.
 
I guess it has something to do with the i350 igb driver.
We also see customers with this NIC having problems with interface renaming by udev.
The problem is that I need a test case to debug it.
Do you have dual-socket systems, or just one socket on the board?
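For example (a sketch):
lscpu | grep -iE 'socket|numa'    # socket and NUMA node count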
 
