Network packet loss in high-traffic VMs

Hi,

I don't think you can do that, and I think that also makes sense.
In the end it is the physical interfaces that handle the input/output, so it makes sense that those are the interfaces where you can modify the buffer sizes etc. All the logical/virtual interfaces then benefit from that.
Have you been unable to get any improved results by modifying the buffer sizes etc. of the "real" interfaces?
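For reference, on the PVE host the ring sizes of the physical NIC can be checked and changed with ethtool; eno1 is just an example name, use your actual uplink:

ethtool -g eno1                  # show current and maximum RX/TX ring sizes
ethtool -G eno1 rx 4096 tx 4096  # raise them (must stay within the reported maximums)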
 
I set the buffer size of the real interface, restarted the network interface on the host, and restarted the VM in the PVE panel, but it still has ~30% packet loss.

Perhaps I should delete and rebuild the VM, or reboot the host? It is not clear why this happens.

I have now set Multiqueue to 2 and the problem is solved.
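For anyone landing here later: Multiqueue can be set in the GUI (VM > Hardware > Network Device) or from the host shell. Note that --net0 rewrites the whole net0 line, so keep your existing MAC and bridge; the values below are only examples:

qm set 100 --net0 virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0,queues=2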
 
Hi, in the upcoming Proxmox 7.3, the rx/tx buffer size of the QEMU NIC has been bumped to 1024, and there are also improvements to VM multi-queue.

With the default queues=1, only one VM core is used to handle incoming traffic. If you have a lot of small packets per second, this core can saturate and you can get packet drops.
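A quick way to see this from inside the VM (eth0 is just an example name): each virtio queue has its own interrupt, and with a single queue the softirq load piles up on one core:

grep virtio /proc/interrupts   # one input/output interrupt pair per queue
mpstat -P ALL 1                # watch %irq/%soft concentrate on a single CPU (sysstat package)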
 
@mika,
it is possible to change the ring buffer on the virtio-net vNIC (I have not tested it).
This should logically reduce the packet loss during intensive UDP traffic.

qemu-system-x86_64 -device virtio-net-pci,?
rx_queue_size=<uint16> - (default: 256)
tx_queue_size=<uint16> - (default: 256)

Don't forget to check other conditions: path L2 MTU, VM tx buffer (txqueuelen), VM CPU scaling governor, hypervisor PCI profile (max performance is mandatory for packet processing).
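A few one-liners to check those points from inside the VM (eth0 and the target address are just examples):

ip link show eth0                                           # MTU and current txqueuelen
ip link set eth0 txqueuelen 10000                           # raise the TX queue length
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # current CPU governor
tracepath 10.0.0.1                                          # discover the path MTU toward a target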
Additionally, it is important to note that the KVM hypervisor with virtio-net (vhost-net, really) is not very efficient at processing intensive network traffic across many VMs. It is a Linux kernel (and not only kernel) parallelisation issue.
For example, on a single node (SMP), processing 1M+ pps is possible with a single VM, but it is difficult to achieve that rate spread over 40 VMs.
When you notice packet loss, can you detail the node load and the load of each VM?
Finally, I suggest having a look at the VM CPU steal time: it represents the amount of CPU the VM needed but the hypervisor could not give it because it was used by another task or VM (command: top).
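Two simple ways to read the steal time inside a VM:

top -b -n 1 | grep '%Cpu'   # the 'st' value at the end of the CPU line
vmstat 1                    # the 'st' column, sampled every second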
 
The default value has been bumped to 1024 in Proxmox 7.3.
@spirit What is the CLI command to verify these new default settings? Should we check from the Proxmox host CLI or from the VM?

You need to increase queues to balance across multiple VM vCPUs
(as indeed, one core is limited to about 1-2 Mpps).
Any general advice on the number of queues relative to the number of VM vCPUs? Is 1-to-1 a good starting point?
 
@spirit What is the CLI command to verify these new default settings? Should we check from the Proxmox host CLI or from the VM?


Any general advice on the number of queues relative to the number of VM vCPUs? Is 1-to-1 a good starting point?
Queues should be lower than or equal to the number of vCPUs (1-to-1 at most; each queue uses one dedicated thread, so it needs one vCPU).
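One extra note, hedged since it depends on the guest driver/kernel: after raising queues on the PVE side, you may still have to activate the extra channels inside the VM (eth0 and the count of 4 are just examples for a 4-vCPU VM with queues=4):

ethtool -l eth0              # combined channels: pre-set maximum vs. currently active
ethtool -L eth0 combined 4   # activate all of them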
 
Hi, in the upcoming Proxmox 7.3, the rx/tx buffer size of the QEMU NIC has been bumped to 1024, and there are also improvements to VM multi-queue.

With the default queues=1, only one VM core is used to handle incoming traffic. If you have a lot of small packets per second, this core can saturate and you can get packet drops.

Thanks!!

I'm planning to upgrade from 7.2 to 7.3. Will these changes require us to delete and rebuild the VM?

Previously, I increased the queues and the problem was solved, but with more and more TCP connections, the bandwidth of the VM cannot be fully utilized. I am not sure whether this has something to do with the buffer size.
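One way to narrow it down from inside the VM (eth0 is just an example): check whether the guest NIC is actually dropping packets and what ring sizes it currently sees:

ip -s link show eth0   # RX/TX dropped and overrun counters
ethtool -g eth0        # current vs. maximum ring sizes in the guest
ethtool -S eth0        # per-queue statistics, if the driver exposes them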
 
The default value has been bumped to 1024 in Proxmox 7.3.
Hi,
I am doing some QEMU 7 tests; rx_queue_size=1024,tx_queue_size=1024 are present in the VM command line.
In the guest OS only the rx queue is increased; tx stays at 256.

Did you notice that too? Is this normal behavior?
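For anyone who wants to reproduce the check (VM ID 100 and eth0 are just examples):

qm showcmd 100 --pretty | grep queue_size   # on the PVE host: what QEMU is started with
ethtool -g eth0                             # inside the guest: what the driver actually reports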
 
Hi, I have a TrueNAS SCALE VM and an Ubuntu VM.
When I configure a VirtIO NIC and use SMB to transfer a file, the file gets corrupted.
Using E1000 or RTL8139 is normal.
I viewed the file in a hex editor and found a lot of 00 bytes in it.
Could this be related to this issue?
PVE version: 7.3-3
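To rule out the transfer path itself, one way is to checksum the file on both ends and locate the differing bytes (file names are just examples):

md5sum testfile.bin                        # on the source machine
md5sum /mnt/share/testfile.bin             # on the destination, after the SMB copy
cmp -l testfile.bin corrupted.bin | head   # byte offsets where the two copies differ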
 

Attachment: photo_2023-02-10_08-15-13.jpg
Hi, I have a TrueNAS SCALE VM and an Ubuntu VM.
When I configure a VirtIO NIC and use SMB to transfer a file, the file gets corrupted.
Using E1000 or RTL8139 is normal.
I viewed the file in a hex editor and found a lot of 00 bytes in it.
Could this be related to this issue?
PVE version: 7.3-3
Maybe it's a bug in the FreeBSD virtio driver?
 
Maybe it's a bug in the FreeBSD virtio driver?
But TrueNAS SCALE is based on Debian.
And I have the same issue using virtio in Ubuntu (although they are both based on Debian).
Could this be a Debian driver bug?
 
But TrueNAS SCALE is based on Debian.
And I have the same issue using virtio in Ubuntu (although they are both based on Debian).
Could this be a Debian driver bug?
This is strange. I have 4000 Debian VMs in production (Stretch, Buster, Bullseye with stock kernels), and I have never seen this problem.
 
This is strange. I have 4000 Debian VMs in production (Stretch, Buster, Bullseye with stock kernels), and I have never seen this problem.
I don't know if this problem has anything to do with my strange network structure.
I drew a diagram; it's a bit rough, but it should be readable XD

topology.png
 
I had solved the problem with multiqueue on kernel 5.15, but since kernel 6.x the problem is back. I can't figure out where, why, or how. The only workaround is to boot with the old kernel again.
Really frustrating.
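In case it helps anyone stuck on the same workaround: the older kernel can be kept as the default boot entry with proxmox-boot-tool (the version string below is just an example, pick one from the list):

proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 5.15.107-2-pve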
 
