> 3000 mSec Ping and packet drops with VirtIO under load

@AlBundy, did you also have the network issues under load, like lag and packet loss/high latency, before?

The load is better; it's only when you're using the network a lot (like during remote backups) that it still goes up again!

I tried the newer kernel again and the numbers are worse again; I'm rebooting to the older one tonight (and setting it as the default from now on).
 
Hello,
I just installed kernel Linux 4.4.83-1-pve, but the IO issue remains. I have switched a few VMs that did a lot of IO over to IDE; load is much better now (20% -> 4%) and the packet drops are gone.
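In case it helps anyone, the bus change itself is just a couple of CLI calls; this is only a rough sketch (VM ID, storage and volume names are examples, and the VM has to be shut down first):

qm config 100                                 # note the current virtio0 volume name
qm set 100 --delete virtio0                   # detach the disk (it shows up again as an unused volume)
qm set 100 --ide0 local-zfs:vm-100-disk-1     # re-attach the same volume on the IDE bus
qm set 100 --bootdisk ide0                    # keep booting from that disk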

[Attached screenshot: Screenshot at Oct 15 19-06-00.png]
 
Changing from SCSI Virtio to IDE did not solve the problem for me.

I'm seeing 900 ms ping times from the gateway to the KVM VM; it should be 0.36 ms. The VM is nothing special: Apache serving lots of static content.

VM's drive is on local-ZFS storage. Latest Proxmox 5 with all updates, VirtIO net. Gentoo guest with 4.13 kernel.

Until this is fixed, I'm kinda stuck on a half 4.4 & half 5.0 cluster.
 
Until this is fixed, I'm kinda stuck on a half 4.4 & half 5.0 cluster.
Have you checked whether the same VM (an exact clone) from the PVE 5.0 host runs fine on 4.4?
If that is the case, I would like to downgrade my hosts to PVE 4.4, since IDE runs stably for me but the disk performance is relatively poor then.
 
Nope, but I could. Just not right now. With all the reboots to test, there's been enough downtime on that server for the day (week!).

Going back to SCSI Virtio and changing to E1000 for the network is currently behaving decently. (I have CPU to burn.)
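For reference, the NIC switch is a single setting; something like this (VM ID, bridge and MAC are placeholders, reuse the MAC from qm config if you want the guest to keep its address):

qm config 100 | grep net0                                        # note the current MAC address
qm set 100 --net0 e1000,bridge=vmbr0,macaddr=DE:AD:BE:EF:00:01   # placeholder MAC, use the one from above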

 

Since you are running ZFS, it should be easy to create a snapshot and send/receive it to a test VM on your PVE 4.4 host without any downtime, at least if you're running ZFS on the PVE 4.4 host as well and have some spare storage left there to create the VM.
It would be really interesting to find out whether this problem really is a PVE 5.0 thing.
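Roughly what I have in mind, assuming the VM disk is a ZFS volume on both hosts (pool paths, VM IDs and the target host name are placeholders):

zfs snapshot rpool/data/vm-100-disk-1@pve44-test
zfs send rpool/data/vm-100-disk-1@pve44-test | ssh pve44-node zfs receive rpool/data/vm-9100-disk-1
# then create a test VM (e.g. ID 9100) on the 4.4 host and point its disk at the received volume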

I'm running a terminal server, so E1000 is probably not an option for me: the RDP protocol doesn't like slow networks or high latency, and the users are picky about that.
 
Easy doesn't mean I have the time or inclination to do so on your schedule.

Moving the VM between hosts on a mixed cluster is not the problem. Dealing with downtime from failures due to 700 ms+ ping times is. Not to mention my time constraints this week.

Plus, the issue manifests under NETWORK LOAD, which cannot be properly replicated in a test VM.

The evidence in this thread and the other related ones all points to something that changed between 4.4 and 5.0. A Google Doc or some other shared spreadsheet where folks can record their observations would be useful. If no one else gets a chance to set that up before I can revisit this issue, I will.

-J

 
OK, something is going on. On my test system (the software stack is the same as on my production system; one Windows Server 2016 VM was cloned over to it so I could analyze this issue) I upgraded to the latest version (upgrade / dist-upgrade, now PVE 5.1.36) and then downloaded the newest VirtIO drivers (0.1.141), because I stumbled upon this bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1451978
The bug report is for a different version (not the 1.126 release, which is still marked as stable), but since the issue sounds related and I had nothing to lose, I gave it a try. I updated all VirtIO drivers and services (NetKVM, virtio-scsi, Balloon driver and service, Guest Agent ...) and rebooted. Guess what? Problem gone!
I did a ping test while running a backup of the machine and a complete DB check at the same time. When I did this before, I was kicked out of my RDP session for minutes and had huge packet loss and delays. Now: not a single glitch. I can scroll web pages in the browser under heavy load without a single lost packet (ping max was 0.3 ms).
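If anyone wants to reproduce the test, it was basically this (VM ID, storage and the guest IP are examples; the DB check runs inside the guest in parallel):

vzdump 100 --storage local --mode snapshot   # backup on the host to generate load
ping -i 0.2 192.168.1.50                     # at the same time, latency to the guest from another machine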
I did a few more checks on my test system, and after two days I decided to apply the same updates to my production system with its two VMs. The issue completely disappeared. To be able to switch from IDE to SCSI, I had to create an additional small (32 GB) disk and attach it to the SCSI bus so that the Red Hat VirtIO SCSI controller was active and I could install the needed drivers. After a shutdown I removed the small disk, changed my drives from IDE to SCSI, and started the machine again. That worked well.
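On the CLI the whole dance looks roughly like this (VM ID, storage and volume names are examples, and the guest has to be shut down for the bus change):

qm set 100 --scsihw virtio-scsi-pci --scsi1 local-zfs:32   # temporary 32 GB disk so Windows sees the VirtIO SCSI controller
# boot the guest, install the driver from the virtio-win ISO, shut down again
qm set 100 --delete scsi1                                  # detach the helper disk (the volume can be removed from the storage afterwards)
qm set 100 --delete ide0                                   # detach the real disk from IDE ...
qm set 100 --scsi0 local-zfs:vm-100-disk-0                 # ... and re-attach it on the SCSI bus
qm set 100 --bootdisk scsi0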

I'm still not sure whether the latest PVE upgrade, the newer "testing" VirtIO drivers, or a combination of both brought the cure.
However, I'm really happy that it is finally working with VirtIO SCSI, since the boot times and initial application startup times are much lower compared to IDE, for obvious reasons.
 
Great.
So if anyone has background on this (Proxmox staff?): it would be interesting to know what the reason behind this issue was.
I can't find anything related to it in the release notes.
 
Unfortunately for me, even after the upgrade to PVE 5.1 I'm experiencing high IO wait and a huge jump in load during write operations to VirtIO storage. I did a quick test: adding a new VirtIO disk to an existing Linux VM and running a simple dd if=/dev/zero of=test bs=1M count=500 bumped the load of the host to 18, and the IO wait was huge again.
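For comparing numbers, this is roughly the same test plus what I'd watch on the host while it runs (conv=fdatasync is an addition on my side so dd reports the real write speed rather than the page cache; iostat comes from the sysstat package):

dd if=/dev/zero of=test bs=1M count=500 conv=fdatasync   # inside the guest, on the VirtIO disk
iostat -x 1    # on the host: per-device utilisation and await while the dd runs
vmstat 1       # the "wa" column is the IO wait shown in the graphs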
 
Great.
So if anyone has background on this (Proxmox staff?): it would be interesting to know what the reason behind this issue was.
I can't find anything related to it in the release notes.

Most likely the newer kernel (5.0 had a 4.10-based kernel, 5.1 has a 4.13-based one).
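For completeness, a quick way to check what a node is actually running:

uname -r         # the running kernel, e.g. 4.10.x vs 4.13.x
pveversion -v    # full package versions, including the installed pve-kernel packages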
 
Unfortunately for me, even after the upgrade to PVE 5.1 I'm experiencing high IO wait and a huge jump in load during write operations to VirtIO storage. I did a quick test: adding a new VirtIO disk to an existing Linux VM and running a simple dd if=/dev/zero of=test bs=1M count=500 bumped the load of the host to 18, and the IO wait was huge again.
What hardware do you use and how is your storage configured? What speed do you get with dd?
 
What hardware do you use and how is your storage configured? What speed do you get with dd?

The hardware, the storage and dd speed aren't relevant here. The only difference is IDE vs VirtIO (and this was discussed multiple times in this thread).
 
Just to mention here: 5.1 didn't solve the issue for me. I changed the storage and switched from FC to iSCSI, and the problems are gone. I don't know what resolved it; was it something with OS support for direct-attached FC, or the storage itself?
 
