> 3000 mSec Ping and packet drops with VirtIO under load

@AlBundy, did you also have the network issues under load, like lag and packet loss/high latency, before?

The load is better; it's only when you're using the network a lot (like during remote backups) that it still goes up again!

I tried the newer kernel again and the numbers are worse again; I'm rebooting to the older one tonight (and setting it as the default from now on).
 
Hello,
I just installed kernel Linux 4.4.83-1-pve, but the IO issue remains. I have switched a few VMs that did a lot of IO over to IDE; load is much better now (20% -> 4%) and the packet drops are gone.
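In case it helps anyone, the bus change itself is just a couple of CLI calls; this is only a rough sketch (VM ID, storage and volume names are examples, and the VM has to be shut down first):

qm config 100                                 # note the current virtio0 volume name
qm set 100 --delete virtio0                   # detach the disk (it shows up again as an unused volume)
qm set 100 --ide0 local-zfs:vm-100-disk-1     # re-attach the same volume on the IDE bus
qm set 100 --bootdisk ide0                    # keep booting from that disk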

[Attached screenshot: Screenshot at Oct 15 19-06-00.png]
 
Changing from SCSI Virtio to IDE did not solve the problem for me.

I'm seeing 900 ms ping times from the gateway to the KVM VM; it should be 0.36 ms. The VM is nothing special: Apache serving lots of static content.

VM's drive is on local-ZFS storage. Latest Proxmox 5 with all updates, VirtIO net. Gentoo guest with 4.13 kernel.

Until this is fixed, I'm kinda stuck on a half 4.4 & half 5.0 cluster.
 
Until this is fixed, I'm kinda stuck on a half 4.4 & half 5.0 cluster.
Have you checked whether the same VM (an exact clone) from the PVE 5.0 host runs fine on 4.4?
If that is the case, I would like to downgrade my hosts to PVE 4.4, since IDE runs stably for me but the disk performance is relatively poor then.
 
Nope, but I could. Just not right now. With all the reboots to test, there's been enough downtime on that server for the day (week!).

Going back to SCSI Virtio and changing to E1000 for the network is currently behaving decently. (I have CPU to burn.)
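For reference, the NIC switch is a single setting; something like this (VM ID, bridge and MAC are placeholders, reuse the MAC from qm config if you want the guest to keep its address):

qm config 100 | grep net0                                        # note the current MAC address
qm set 100 --net0 e1000,bridge=vmbr0,macaddr=DE:AD:BE:EF:00:01   # placeholder MAC, use the one from above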

 

Since you are running ZFS, it should be easy to create a snapshot and send/receive it to a test VM on your PVE 4.4 host without any downtime, at least if you're running ZFS on the PVE 4.4 host as well and have some spare storage left there to create the VM.
It would be really interesting to find out whether this problem really is a PVE 5.0 thing.
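Roughly what I have in mind, assuming the VM disk is a ZFS volume on both hosts (pool paths, VM IDs and the target host name are placeholders):

zfs snapshot rpool/data/vm-100-disk-1@pve44-test
zfs send rpool/data/vm-100-disk-1@pve44-test | ssh pve44-node zfs receive rpool/data/vm-9100-disk-1
# then create a test VM (e.g. ID 9100) on the 4.4 host and point its disk at the received volume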

I'm running a terminal server, so E1000 is probably not an option for me: the RDP protocol doesn't like slow networks or high latency, and the users are picky about that.
 
Easy doesn't mean I have the time or inclination to do so on your schedule.

Moving the VM between hosts on a mixed cluster is not the problem. Dealing with downtime from failures due to 700 ms+ ping times is. Not to mention my time constraints this week.

Plus, the issue manifests under NETWORK LOAD, which cannot be properly replicated in a test VM.

The evidence in this thread and the other related ones all points to something that changed between 4.4 and 5.0. A Google Doc or some other shared spreadsheet where folks can record their observations would be useful. If no one else gets a chance to set that up before I can revisit this issue, I will.

-J

 
OK, something is going on. On my test system (the software stack is the same as on my production system; one Windows Server 2016 VM was cloned over to it so I could analyze this issue) I upgraded to the latest version (upgrade / dist-upgrade, now PVE 5.1.36) and then downloaded the newest VirtIO drivers (0.1.141), because I stumbled upon this bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1451978
The bug report is for a different version (not the 1.126 release, which is still marked as stable), but since the issue sounds related and I had nothing to lose, I gave it a try. I updated all VirtIO drivers and services (NetKVM, virtio-scsi, Balloon driver and service, Guest Agent ...) and rebooted. Guess what? Problem gone!
I did a ping test while running a backup of the machine and a complete DB check at the same time. When I did this before, I was kicked out of my RDP session for minutes and had huge packet loss and delays. Now: not a single glitch. I can scroll web pages in the browser under heavy load without a single lost packet (ping max was 0.3 ms).
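If anyone wants to reproduce the test, it was basically this (VM ID, storage and the guest IP are examples; the DB check runs inside the guest in parallel):

vzdump 100 --storage local --mode snapshot   # backup on the host to generate load
ping -i 0.2 192.168.1.50                     # at the same time, latency to the guest from another machine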
I did a few more checks on my test system, and after two days I decided to apply the same updates to my production system with its two VMs. The issue completely disappeared. To be able to switch from IDE to SCSI, I had to create an additional small (32 GB) disk and attach it to the SCSI bus so that the Red Hat VirtIO SCSI controller was active and I could install the needed drivers. After a shutdown I removed the small disk, changed my drives from IDE to SCSI, and started the machine again. That worked well.
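On the CLI the whole dance looks roughly like this (VM ID, storage and volume names are examples, and the guest has to be shut down for the bus change):

qm set 100 --scsihw virtio-scsi-pci --scsi1 local-zfs:32   # temporary 32 GB disk so Windows sees the VirtIO SCSI controller
# boot the guest, install the driver from the virtio-win ISO, shut down again
qm set 100 --delete scsi1                                  # detach the helper disk (the volume can be removed from the storage afterwards)
qm set 100 --delete ide0                                   # detach the real disk from IDE ...
qm set 100 --scsi0 local-zfs:vm-100-disk-0                 # ... and re-attach it on the SCSI bus
qm set 100 --bootdisk scsi0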

I'm still not sure whether the latest PVE upgrade, the newer "testing" VirtIO drivers, or a combination of both brought the cure.
However, I'm really happy that it is finally working with VirtIO SCSI, since the boot times and initial application startup times are much lower compared to IDE, for obvious reasons.
 
Great.
So if anyone has background on this (Proxmox staff?): it would be interesting to know what the reason behind this issue was.
I can't find anything related to it in the release notes.
 
Unfortunately for me, even after the upgrade to PVE 5.1 I'm experiencing high IO wait and a huge jump in load during write operations to VirtIO storage. I did a quick test: adding a new VirtIO disk to an existing Linux VM and running a simple dd if=/dev/zero of=test bs=1M count=500 bumped the load of the host to 18, and the IO wait was huge again.
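For comparing numbers, this is roughly the same test plus what I'd watch on the host while it runs (conv=fdatasync is an addition on my side so dd reports the real write speed rather than the page cache; iostat comes from the sysstat package):

dd if=/dev/zero of=test bs=1M count=500 conv=fdatasync   # inside the guest, on the VirtIO disk
iostat -x 1    # on the host: per-device utilisation and await while the dd runs
vmstat 1       # the "wa" column is the IO wait shown in the graphs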
 
Great.
So if anyone has background on this (Proxmox staff?): it would be interesting to know what the reason behind this issue was.
I can't find anything related to it in the release notes.

Most likely the newer kernel (5.0 had a 4.10-based kernel, 5.1 has a 4.13-based one).
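For completeness, a quick way to check what a node is actually running:

uname -r         # the running kernel, e.g. 4.10.x vs 4.13.x
pveversion -v    # full package versions, including the installed pve-kernel packages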
 
Unfortunately for me, even after the upgrade to PVE 5.1 I'm experiencing high IO wait and a huge jump in load during write operations to VirtIO storage. I did a quick test: adding a new VirtIO disk to an existing Linux VM and running a simple dd if=/dev/zero of=test bs=1M count=500 bumped the load of the host to 18, and the IO wait was huge again.
What hardware do you use and how is your storage configured? What speed do you get with dd?
 
What hardware do you use and how is your storage configured? What speed do you get with dd?

The hardware, the storage and dd speed aren't relevant here. The only difference is IDE vs VirtIO (and this was discussed multiple times in this thread).
 
Just to mention here: 5.1 didn't solve the issue for me. I changed the storage and switched from FC to iSCSI, and the problems are gone. I don't know what resolved it; was it something with OS support for direct-attached FC, or the storage itself?
 
