High latency in Linux KVM virtual machines

Hi,

I'm currently facing some strange network latency issues on 2 of my Proxmox hosts. My current setup:

Cluster1: 5 hosts, latest stable PVE
Cluster2: 2 hosts, 1 latest stable, 1 latest PVEtest (upgraded to test repo to see if the problem went away, no luck)

On these clusters I have, among others, 8 identical virtual machines, 3 of which run on Cluster2. On these 3 VMs I am experiencing high latency and erratic ping times, while the other 5 VMs show no problem. The VMs are running Debian Squeeze with the latest updates.

Output of pveversion -v on one of the affected hosts:
Code:
root@node9:~# pveversion -v
pve-manager: 2.3-7 (pve-manager/2.3/1fe64d18)
running kernel: 2.6.32-18-pve
proxmox-ve-2.6.32: 2.3-88
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-18-pve: 2.6.32-88
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-4
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-36
qemu-server: 2.3-8
pve-firmware: 1.0-21
libpve-common-perl: 1.0-44
libpve-access-control: 1.0-25
libpve-storage-perl: 2.3-2
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.3-18
ksm-control-daemon: 1.1-1

Config of one of the affected VMs:
Code:
root@node9:~# cat /etc/pve/qemu-server/311.conf 
bootdisk: virtio0
cores: 3
ide2: none,media=cdrom
memory: 4096
name: br-app8
net0: virtio=36:8D:23:4F:51:33,bridge=vmbr451
ostype: l26
sockets: 2
virtio0: mainvol00:311/vm-311-disk-1.qcow2,cache=writethrough,size=32G

Latency on the affected host (seems OK):
Code:
root@node9:~# ping belnet.be
PING belnet.be (193.190.130.15) 56(84) bytes of data.
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=1 ttl=55 time=4.41 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=2 ttl=55 time=4.53 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=3 ttl=55 time=4.52 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=4 ttl=55 time=4.44 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=5 ttl=55 time=4.31 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=6 ttl=55 time=4.57 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=7 ttl=55 time=4.53 ms

Latency on an affected VM (not OK):
Code:
root@br-app8:~# ping belnet.be
PING belnet.be (193.190.130.15) 56(84) bytes of data.
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=1 ttl=54 time=0.751 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=2 ttl=54 time=9.63 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=3 ttl=54 time=0.035 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=4 ttl=54 time=4.99 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=5 ttl=54 time=12.4 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=6 ttl=54 time=0.030 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=7 ttl=54 time=10.7 ms

The strange thing is that the very low ping times (0.030-0.035 ms) are simply not possible for a destination several hops away.

Same ping from another VM on a different host (seems OK):
Code:
root@br-app2:~# ping belnet.be
PING belnet.be (193.190.130.15) 56(84) bytes of data.
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=1 ttl=54 time=5.05 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=2 ttl=54 time=5.09 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=3 ttl=54 time=4.91 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=4 ttl=54 time=4.90 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=5 ttl=54 time=4.91 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=6 ttl=54 time=4.90 ms

Network on the bad host:
Code:
root@node9:~# ifconfig vmbr451
vmbr451   Link encap:Ethernet  HWaddr 00:25:90:91:03:48  
          inet addr:127.45.1.9  Bcast:127.45.1.255  Mask:255.255.255.0
          inet6 addr: fe80::225:90ff:fe91:348/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1819 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:89148 (87.0 KiB)  TX bytes:468 (468.0 B)

root@node9:~# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:25:90:91:03:48  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:1353542 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1080980 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1626619667 (1.5 GiB)  TX bytes:704216510 (671.5 MiB)

Any help in resolving this would be greatly appreciated!

Kind regards,
Koen
 

Try a 3.2 kernel inside your Debian Squeeze guest.

Install it from squeeze-backports: add

"deb http://backports.debian.org/debian-backports squeeze-backports main"

to /etc/apt/sources.list
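
Or append it in one command (equivalent):

Code:
echo "deb http://backports.debian.org/debian-backports squeeze-backports main" >> /etc/apt/sources.list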


then:

Code:
apt-get update && apt-get install -t squeeze-backports linux-image-3.2.0-0.bpo.4-amd64
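
After rebooting into the new kernel, verify the guest is actually running it:

Code:
uname -r    # should print 3.2.0-0.bpo.4-amd64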
 

Yes, update to 3.2. I had the same latency problem since qemu-kvm 1.2.
(Dietmar reported it on the qemu mailing list a long time ago; it seems to be related to a bug in the 2.6.32 virtio drivers, the number of cores, and a new feature of qemu-kvm 1.2.)
 
Hi,

Thanks for the quick reply. Unfortunately this didn't seem to help much:

Good VM:
Code:
root@br-app2:~# uname -a
Linux br-app2 2.6.32-5-amd64 #1 SMP Sun Sep 23 10:07:46 UTC 2012 x86_64 GNU/Linux
root@br-app2:~# ping -c8 10.10.0.1
PING 10.10.0.1 (10.10.0.1) 56(84) bytes of data.
64 bytes from 10.10.0.1: icmp_req=1 ttl=64 time=0.430 ms
64 bytes from 10.10.0.1: icmp_req=2 ttl=64 time=0.430 ms
64 bytes from 10.10.0.1: icmp_req=3 ttl=64 time=0.470 ms
64 bytes from 10.10.0.1: icmp_req=4 ttl=64 time=0.459 ms
64 bytes from 10.10.0.1: icmp_req=5 ttl=64 time=0.313 ms
64 bytes from 10.10.0.1: icmp_req=6 ttl=64 time=0.513 ms
64 bytes from 10.10.0.1: icmp_req=7 ttl=64 time=0.348 ms
64 bytes from 10.10.0.1: icmp_req=8 ttl=64 time=0.421 ms

--- 10.10.0.1 ping statistics ---
8 packets transmitted, 8 received, 0% packet loss, time 6997ms
rtt min/avg/max/mdev = 0.313/0.423/0.513/0.060 ms

Bad VM:
Code:
root@br-app8:~# uname -a
Linux br-app8 3.2.0-0.bpo.4-amd64 #1 SMP Debian 3.2.35-2~bpo60+1 x86_64 GNU/Linux
root@br-app8:~# ping -c 8 10.10.0.1
PING 10.10.0.1 (10.10.0.1) 56(84) bytes of data.
64 bytes from 10.10.0.1: icmp_req=1 ttl=64 time=0.000 ms
64 bytes from 10.10.0.1: icmp_req=2 ttl=64 time=3.03 ms
64 bytes from 10.10.0.1: icmp_req=3 ttl=64 time=5.66 ms
64 bytes from 10.10.0.1: icmp_req=4 ttl=64 time=0.757 ms
64 bytes from 10.10.0.1: icmp_req=5 ttl=64 time=0.034 ms
64 bytes from 10.10.0.1: icmp_req=6 ttl=64 time=10.2 ms
64 bytes from 10.10.0.1: icmp_req=7 ttl=64 time=0.030 ms
64 bytes from 10.10.0.1: icmp_req=8 ttl=64 time=0.042 ms

--- 10.10.0.1 ping statistics ---
8 packets transmitted, 8 received, 0% packet loss, time 7000ms
rtt min/avg/max/mdev = 0.000/2.476/10.248/3.496 ms

Was worth a shot though. Basically the VMs on this host are unusable, since we're running a distributed calculation application there that requires very low latencies to things like Redis.
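
(As an aside, redis-cli can measure the latency the application actually sees, assuming a reasonably recent version that has the --latency mode:)

Code:
redis-cli -h <redis-host> --latency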

/K
 
Hmm, this is strange.
Can you test between 2 VMs (kernel 3.2) on the same bridge?
 
Can you test with one CPU inside the guest (sockets: 1, cores: 1)?
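
For reference, a sketch of how to do that from the host with qm, assuming VM 311 from earlier (the VM needs a stop/start for the topology change to take effect):

Code:
qm set 311 -sockets 1 -cores 1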
 
Can you test with one CPU inside the guest (sockets: 1, cores: 1)?

This seems to help. There is much less fluctuation in the latency when using 1 socket, 1 core. That said, it's still not as stable as the others (deviation of ±1.0 ms versus ±0.2 ms on the others).
 
Hmm, this is strange.
Can you test between 2 VMs (kernel 3.2) on the same bridge?
There's still quite a bit of fluctuation when using multiple cores, but not with 1 core.

Code:
root@br-app7:~# ping 10.10.0.10
PING 10.10.0.10 (10.10.0.10) 56(84) bytes of data.
64 bytes from 10.10.0.10: icmp_req=1 ttl=64 time=0.302 ms
64 bytes from 10.10.0.10: icmp_req=2 ttl=64 time=0.227 ms
64 bytes from 10.10.0.10: icmp_req=3 ttl=64 time=0.094 ms
64 bytes from 10.10.0.10: icmp_req=4 ttl=64 time=0.446 ms
64 bytes from 10.10.0.10: icmp_req=5 ttl=64 time=0.112 ms
64 bytes from 10.10.0.10: icmp_req=6 ttl=64 time=0.067 ms
64 bytes from 10.10.0.10: icmp_req=7 ttl=64 time=0.454 ms
64 bytes from 10.10.0.10: icmp_req=8 ttl=64 time=0.086 ms
64 bytes from 10.10.0.10: icmp_req=9 ttl=64 time=0.549 ms
 
It seems to be exclusively related to the number of sockets, not the number of cores. A problem with the SMP code in qemu?

Both pings below use a single socket but multiple cores. So instead of 2 sockets / 3 cores, try 1 socket / 6 cores, as in the sketch below.
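
A sketch, using the 311.conf posted earlier (sockets x cores stays at 6 vCPUs):

Code:
# /etc/pve/qemu-server/311.conf
sockets: 1
cores: 6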

A routed ping:
Code:
ping -c10 yggdrasil
PING yggdrasil.datanom.net (172.16.1.2) 56(84) bytes of data.
64 bytes from yggdrasil.datanom.net (172.16.1.2): icmp_req=1 ttl=63 time=0.648 ms
64 bytes from yggdrasil.datanom.net (172.16.1.2): icmp_req=2 ttl=63 time=0.758 ms
64 bytes from yggdrasil.datanom.net (172.16.1.2): icmp_req=3 ttl=63 time=0.676 ms
64 bytes from yggdrasil.datanom.net (172.16.1.2): icmp_req=4 ttl=63 time=0.566 ms
64 bytes from yggdrasil.datanom.net (172.16.1.2): icmp_req=5 ttl=63 time=0.784 ms
64 bytes from yggdrasil.datanom.net (172.16.1.2): icmp_req=6 ttl=63 time=0.696 ms
64 bytes from yggdrasil.datanom.net (172.16.1.2): icmp_req=7 ttl=63 time=0.726 ms
64 bytes from yggdrasil.datanom.net (172.16.1.2): icmp_req=8 ttl=63 time=0.669 ms
64 bytes from yggdrasil.datanom.net (172.16.1.2): icmp_req=9 ttl=63 time=0.654 ms
64 bytes from yggdrasil.datanom.net (172.16.1.2): icmp_req=10 ttl=63 time=0.575 ms


--- yggdrasil.datanom.net ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9013ms
rtt min/avg/max/mdev = 0.566/0.675/0.784/0.069 ms

A non-routed ping:
Code:
ping -c10 nas
PING nas.datanom.net (192.168.2.11) 56(84) bytes of data.
64 bytes from nas.datanom.net (192.168.2.11): icmp_req=1 ttl=64 time=0.392 ms
64 bytes from nas.datanom.net (192.168.2.11): icmp_req=2 ttl=64 time=0.404 ms
64 bytes from nas.datanom.net (192.168.2.11): icmp_req=3 ttl=64 time=0.388 ms
64 bytes from nas.datanom.net (192.168.2.11): icmp_req=4 ttl=64 time=0.402 ms
64 bytes from nas.datanom.net (192.168.2.11): icmp_req=5 ttl=64 time=0.401 ms
64 bytes from nas.datanom.net (192.168.2.11): icmp_req=6 ttl=64 time=0.430 ms
64 bytes from nas.datanom.net (192.168.2.11): icmp_req=7 ttl=64 time=0.425 ms
64 bytes from nas.datanom.net (192.168.2.11): icmp_req=8 ttl=64 time=0.399 ms
64 bytes from nas.datanom.net (192.168.2.11): icmp_req=9 ttl=64 time=0.429 ms
64 bytes from nas.datanom.net (192.168.2.11): icmp_req=10 ttl=64 time=0.447 ms


--- nas.datanom.net ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9011ms
rtt min/avg/max/mdev = 0.388/0.411/0.447/0.030 ms
 
Not fixed with 1 socket, 6 cores for me:

Code:
root@br-app7:~# ping -c10 belnet.be
PING belnet.be (193.190.130.15) 56(84) bytes of data.
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=1 ttl=55 time=6.47 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=2 ttl=55 time=4.91 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=3 ttl=55 time=3.65 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=4 ttl=55 time=4.59 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=5 ttl=55 time=0.037 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=6 ttl=55 time=16.4 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=7 ttl=55 time=5.88 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=8 ttl=55 time=5.33 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=9 ttl=55 time=17.6 ms
64 bytes from fiorano.belnet.be (193.190.130.15): icmp_req=10 ttl=55 time=4.15 ms

Anything else I can try to troubleshoot? Thanks.

/K
 
Hi,

I am still having this issue on some virtual machines, even on the latest 3.0 version of Proxmox.
Is there anything more I can do to troubleshoot?

/K
 

Hi, which kernel do you use in your guest? 2.6.32 is known to have this problem; I recommend you use a 3.x kernel.
 
There is no such known problem in 2.6.32.

If you disagree, provide details and a bug report. The 3.2 kernel is not recommended on Proxmox VE.
 
Thanks for the report.
I'll try to see if we can enable x2apic by default for the next Proxmox release. I'm not sure, but I think oVirt/RHEV enables it by default.
(It was a problem in the past because kernel_irqchip was disabled before qemu 1.3, but now it should work out of the box.)
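
If you want to check whether a guest already sees it, the flag shows up in /proc/cpuinfo. Forcing it on today is an untested sketch only (args is passed verbatim to kvm and may interact with the generated -cpu argument):

Code:
# inside the guest: count vCPUs that expose the x2apic flag
grep -c x2apic /proc/cpuinfo
# sketch: in /etc/pve/qemu-server/<vmid>.conf on the host
# args: -cpu kvm64,+x2apic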

 
Hello everyone

I do not know if this question is related to the same problem.

My scenario:
- I have a Windows Server 2008 R2 VM on PVE 2.3 using 4 cores
- The server is a DELL 2900
- My VM config:
Code:
boot: c
bootdisk: virtio0
cores: 4
cpu: host
ide2: none,media=cdrom,size=6604K
memory: 36864
name: Win2008R2
net0: virtio=72:60:0D:E5:29:65,bridge=vmbr0
net1: virtio=DE:B3:D6:AB:B5:6B,bridge=vmbr1
ostype: win7
sockets: 1
virtio0: local:107/vm-107-disk-1.raw,size=350G
virtio1: local:107/vm-107-disk-2.raw,size=32G
virtio2: local:107/vm-107-disk-3.raw,size=150G
virtio3: local:107/vm-107-disk-4.qcow2,size=50G

The behavior:
The PVE GUI (Tab: Summary, Section: Status) shows CPU usage between 8% and 10%,
while the Task Manager inside the Windows VM shows between 0% and 1%.

My question:
Why is there such a difference in the reported CPU usage?
 

Generally, this is related to interrupts that consume CPU on the host.
A USB device, for example, uses no CPU in the guest but generates a lot of interrupts.

The x2apic option tries to help with this; you can also try disabling the USB tablet device.
For the next Proxmox version I'll try to add some Hyper-V features, which could help Windows guests too.
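
A sketch of how disabling the tablet might look, assuming your qemu-server version already supports the tablet option (stop/start the VM afterwards):

Code:
qm set <vmid> -tablet 0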
 

Thanks spirit for your answer.
But do you think I must change the configuration, i.e. x2apic and the USB tablet?
And if the answer is yes, how do I do it?

For example:
In the "/etc/pve/qemu-server/200.conf" file, must I modify:

Before:
cpu: host
After:
cpu: host,+x2apic

or how do I make the change?

And how do I disable the USB tablet?

Best regards
Cesar

Re-edit: This is important to me because I run MS-SQL on this VM.
 