Network throughput of the virtual machines collapses after a few days

BertrandBB

Member
Oct 13, 2019
Hi

I use Proxmox on a Soyoustart server (OVH). The installation dates from a few weeks before the release of Proxmox VE 6. I migrated to v6 hoping it would resolve the problem I describe here, but it did not.

The problem is that the network throughput of the virtual machines collapses after a few days. There are 4 VMs on the server and all of them are affected. The throughput ends up capping at about 1 MB/s, which is very low compared to the server's real capacity.

I have been installing servers to host websites for my personal use for 20 years. That has not turned me into a network expert, and this is the first time I have been confronted with this problem.

The only way to restore normal VM throughput is to reboot the server. This morning I first tried this:

Code:
systemctl restart networking
Job for networking.service failed because the control process exited with error code.
See "systemctl status networking.service" and "journalctl -xe" for details.
root@ns31****:/etc/init.d# systemctl start networking
Job for networking.service failed because the control process exited with error code.
See "systemctl status networking.service" and "journalctl -xe" for details.

# systemctl status networking.service
● networking.service - Raise network interfaces
   Loaded: loaded (/lib/systemd/system/networking.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2019-10-13 11:54:42 CEST; 16s ago
     Docs: man:interfaces(5)
  Process: 15283 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=1/FAILURE)
Main PID: 15283 (code=exited, status=1/FAILURE)

oct. 13 11:54:42 ns31**** systemd[1]: Starting Raise network interfaces...
oct. 13 11:54:42 ns31**** ifup[15283]: Waiting for vmbr0 to get ready (MAXWAIT is 2 seconds).
oct. 13 11:54:42 ns31**** ifup[15283]: RTNETLINK answers: File exists
oct. 13 11:54:42 ns31**** ifup[15283]: ifup: failed to bring up vmbr0
oct. 13 11:54:42 ns31**** systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
oct. 13 11:54:42 ns31**** systemd[1]: networking.service: Failed with result 'exit-code'.
oct. 13 11:54:42 ns31**** systemd[1]: Failed to start Raise network interfaces.

I ended up rebooting the machine because I could not afford to leave my users without service ...

Of course I contacted customer support, but they are asking me to run my tests in rescue mode ... Obviously the results will be fine there, and the conclusion will probably be, as always, that the problem lies between the screen and the chair, where the configuration was done ...

Thank you to anyone who can help!

Bertrand B.
 
You can't restart the networking service, because the VM tap interfaces are not defined in /etc/network/interfaces (so at best, vmbr0 would be restarted without any VM plugged into it).
Anyway, that shouldn't fix your problem.
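If you ever want to try recovering without a full reboot, here is a rough sketch (untested on your setup, and assuming the "RTNETLINK answers: File exists" comes from a leftover address on the bridge). Be careful: it drops host connectivity until ifup finishes, and the VM taps have to be re-plugged afterwards, e.g. by restarting the VMs:

Bash:
# Tear the bridge down; may complain if ifupdown thinks it is already down
ifdown vmbr0
# Flush the stale address/route that makes ifup fail with "File exists"
ip addr flush dev vmbr0
# Bring it back up from /etc/network/interfaces (re-adds address and gateway)
ifup vmbr0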

Can you post your /etc/network/interfaces file?
Do you have any logs in /var/log/kern.log or /var/log/messages?
What does pveversion -v show?
Have you tried a network benchmark between VMs on the same host? (For example with iperf, as sketched below.)
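A minimal iperf test between two VMs on the same bridge could look like this (iperf must be installed in both guests; 10.0.0.1 is a placeholder for the first VM's address):

Bash:
# In VM 1: start an iperf server
iperf -s

# In VM 2: run the client against VM 1 (placeholder address)
iperf -c 10.0.0.1 -i 2 -t 20

If VM-to-VM throughput on the bridge stays normal while traffic to the outside collapses, the problem is more likely on the physical NIC or the uplink than inside the guests.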
 
Thanks for your answer! Here are the elements:

Bash:
cat  /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback


# vmbr0: Bridging. Make sure to use only MAC adresses that were assigned to you.
auto vmbr0
iface vmbr0 inet static
    address 188.165.203.***/24
    gateway 188.165.203.254
    bridge_ports eno1
    bridge_stp off
    bridge_fd 0

Lines in kern.log after the reboot (lots of lines during the first startup seconds). At this time, the network is OK:
Bash:
Oct 13 12:12:44 ns31**** kernel: [  913.132254] sctp: Hash tables configured (bind 512/512)
Oct 13 13:44:25 ns31**** kernel: [ 6414.884788] hrtimer: interrupt took 31224 ns
Oct 13 16:49:04 ns31**** kernel: [17494.149919] perf: interrupt took too long (2576 > 2500), lowering kernel.perf_event_max_sample_rate to 77500

The same for /var/log/messages:
Bash:
Oct 13 12:12:44 ns3***** kernel: [  913.132254] sctp: Hash tables configured (bind 512/512)
Oct 13 13:44:25 ns31***** kernel: [ 6414.884788] hrtimer: interrupt took 31224 ns
Oct 13 16:49:04 ns31**** kernel: [17494.149919] perf: interrupt took too long (2576 > 2500), lowering kernel.perf_event_max_sample_rate to 77500

pveversion -v:
Bash:
pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-2-pve)
pve-manager: 6.0-7 (running version: 6.0-7/28984024)
pve-kernel-5.0: 6.0-8
pve-kernel-helper: 6.0-8
pve-kernel-4.15: 5.4-9
pve-kernel-5.0.21-2-pve: 5.0.21-6
pve-kernel-4.15.18-21-pve: 4.15.18-48
ceph-fuse: 12.2.12-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.12-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-9
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1

The iperf test requested by OVH support, run while the network problem was present:

Code:
iperf -c iperf.ovh.net -i 2 -t 20 -P 5 -f m -o 5
------------------------------------------------------------
Client connecting to iperf.ovh.net, TCP port 5001
TCP window size: 0.08 MByte (default)
------------------------------------------------------------
[  7] local 188.165.203.*** port 49106 connected with 188.165.12.136 port 5001
[  5] local 188.165.203.*** port 49102 connected with 188.165.12.136 port 5001
[  3] local 188.165.203.*** port 49104 connected with 188.165.12.136 port 5001
[  6] local 188.165.203.*** port 49100 connected with 188.165.12.136 port 5001
[  4] local 188.165.203.*** port 49098 connected with 188.165.12.136 port 5001
[ ID] Interval       Transfer     Bandwidth
[  7]  0.0- 2.0 sec  0.62 MBytes  2.62 Mbits/sec
[  5]  0.0- 2.0 sec  0.62 MBytes  2.62 Mbits/sec
[  6]  0.0- 2.0 sec  0.62 MBytes  2.62 Mbits/sec
[  4]  0.0- 2.0 sec  0.62 MBytes  2.62 Mbits/sec
[  3]  0.0- 2.0 sec  0.75 MBytes  3.15 Mbits/sec
[SUM]  0.0- 2.0 sec  3.25 MBytes  13.6 Mbits/sec
[  4]  2.0- 4.0 sec  0.38 MBytes  1.57 Mbits/sec
[  6]  2.0- 4.0 sec  0.38 MBytes  1.57 Mbits/sec
[  3]  2.0- 4.0 sec  0.38 MBytes  1.57 Mbits/sec
[  7]  2.0- 4.0 sec  0.50 MBytes  2.10 Mbits/sec
[  5]  2.0- 4.0 sec  0.50 MBytes  2.10 Mbits/sec
[SUM]  2.0- 4.0 sec  2.12 MBytes  8.91 Mbits/sec
[  4]  4.0- 6.0 sec  0.50 MBytes  2.10 Mbits/sec
[  6]  4.0- 6.0 sec  0.50 MBytes  2.10 Mbits/sec
[  3]  4.0- 6.0 sec  0.37 MBytes  1.57 Mbits/sec
[  7]  4.0- 6.0 sec  0.50 MBytes  2.10 Mbits/sec
[  5]  4.0- 6.0 sec  0.50 MBytes  2.10 Mbits/sec
[SUM]  4.0- 6.0 sec  2.37 MBytes  9.96 Mbits/sec
[  3]  6.0- 8.0 sec  0.38 MBytes  1.57 Mbits/sec
[  7]  6.0- 8.0 sec  0.38 MBytes  1.57 Mbits/sec
[  5]  6.0- 8.0 sec  0.38 MBytes  1.57 Mbits/sec
[  4]  6.0- 8.0 sec  0.50 MBytes  2.10 Mbits/sec
[  6]  6.0- 8.0 sec  0.50 MBytes  2.10 Mbits/sec
[SUM]  6.0- 8.0 sec  2.12 MBytes  8.91 Mbits/sec
[  7]  8.0-10.0 sec  0.38 MBytes  1.57 Mbits/sec
[  5]  8.0-10.0 sec  0.38 MBytes  1.57 Mbits/sec
[  6]  8.0-10.0 sec  0.38 MBytes  1.57 Mbits/sec
[  4]  8.0-10.0 sec  0.38 MBytes  1.57 Mbits/sec
[  3]  8.0-10.0 sec  0.50 MBytes  2.10 Mbits/sec
[SUM]  8.0-10.0 sec  2.00 MBytes  8.39 Mbits/sec
[  4] 10.0-12.0 sec  0.38 MBytes  1.57 Mbits/sec
[  6] 10.0-12.0 sec  0.38 MBytes  1.57 Mbits/sec
[  3] 10.0-12.0 sec  0.38 MBytes  1.57 Mbits/sec
[  7] 10.0-12.0 sec  0.50 MBytes  2.10 Mbits/sec
[  5] 10.0-12.0 sec  0.50 MBytes  2.10 Mbits/sec
[SUM] 10.0-12.0 sec  2.12 MBytes  8.91 Mbits/sec
[  5] 12.0-14.0 sec  0.38 MBytes  1.57 Mbits/sec
[  6] 12.0-14.0 sec  0.50 MBytes  2.10 Mbits/sec
[  4] 12.0-14.0 sec  0.50 MBytes  2.10 Mbits/sec
[  3] 12.0-14.0 sec  0.50 MBytes  2.10 Mbits/sec
[  7] 12.0-14.0 sec  0.50 MBytes  2.10 Mbits/sec
[SUM] 12.0-14.0 sec  2.38 MBytes  9.96 Mbits/sec
[  6] 14.0-16.0 sec  0.38 MBytes  1.57 Mbits/sec
[  4] 14.0-16.0 sec  0.38 MBytes  1.57 Mbits/sec
[  3] 14.0-16.0 sec  0.38 MBytes  1.57 Mbits/sec
[  7] 14.0-16.0 sec  0.38 MBytes  1.57 Mbits/sec
[  5] 14.0-16.0 sec  0.50 MBytes  2.10 Mbits/sec
[SUM] 14.0-16.0 sec  2.00 MBytes  8.39 Mbits/sec
[  5] 16.0-18.0 sec  0.38 MBytes  1.57 Mbits/sec
[  6] 16.0-18.0 sec  0.50 MBytes  2.10 Mbits/sec
[  4] 16.0-18.0 sec  0.50 MBytes  2.10 Mbits/sec
[  3] 16.0-18.0 sec  0.50 MBytes  2.10 Mbits/sec
[  7] 16.0-18.0 sec  0.50 MBytes  2.10 Mbits/sec
[SUM] 16.0-18.0 sec  2.38 MBytes  9.96 Mbits/sec
[  3] 18.0-20.0 sec  0.38 MBytes  1.57 Mbits/sec
[  3]  0.0-20.1 sec  4.50 MBytes  1.88 Mbits/sec
[  7] 18.0-20.0 sec  0.38 MBytes  1.57 Mbits/sec
[  7]  0.0-20.2 sec  4.62 MBytes  1.92 Mbits/sec
[  5] 18.0-20.0 sec  0.50 MBytes  2.10 Mbits/sec
[  5]  0.0-20.3 sec  4.62 MBytes  1.91 Mbits/sec
[  6] 18.0-20.0 sec  0.50 MBytes  2.10 Mbits/sec
[  6]  0.0-20.6 sec  4.62 MBytes  1.89 Mbits/sec
[  4] 18.0-20.0 sec  0.50 MBytes  2.10 Mbits/sec
[SUM] 18.0-20.0 sec  2.25 MBytes  9.44 Mbits/sec
[  4]  0.0-20.6 sec  4.62 MBytes  1.89 Mbits/sec
[SUM]  0.0-20.6 sec  23.0 MBytes  9.38 Mbits/sec


Thanks!

B.
 
As expected, after receiving the report of technical tests run on a freshly rebooted server in rescue mode, OVH support concluded in a single sentence that the problem is an internal configuration issue.
In 20 years, I have never seen a Debian system randomly throttle its Ethernet traffic like that, for no reason, after a week of normal operation. Either it works or it doesn't. In short, if anyone has an idea before everything ends up at Amazon ... Thank you!

B.
 
When the problem occurs, do you have global network usage stats for the NIC?

(Just an idea: maybe something is flooding you at that moment (DDoS, ...)?)

Was the iperf run directly from the Proxmox host?
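To check that, something like this on the host (plain iproute2 and procfs, nothing Proxmox-specific) would show whether the RX/TX counters, errors or drops on eno1 grow abnormally while the slowdown is present:

Bash:
# Refresh the physical NIC counters every second
watch -n 1 'ip -s link show eno1'

# One-shot view of all interfaces, including vmbr0 and the VM taps
cat /proc/net/dev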
 
Thank you for this answer and this idea! I had not thought about flooding! I will try to watch what happens (but wouldn't it be strange for a server reboot to stop a flood?).
The iperf command was indeed run on the host. Curiously, the VMs are also limited to 1 MB/s at that moment (wget of a file hosted on a VM, from the host or from my home computer). In short, it's all very curious.
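(Concretely, that check is just pulling a file and reading the reported rate, something like this, where the URL is a placeholder for a large file served by one of the VMs:)

Bash:
# Download to /dev/null just to observe the transfer rate wget reports
wget -O /dev/null http://vm.example.org/bigfile.bin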
Thank you!
 
