Hi all!
"A long time ago" we try to migrate our small office PBX (Asterisk 11 + CentOS6) to Proxmox (LXC container) and discovered problem with voice that called standstill (one way audio).
We have up to 7 concurent SIP calls (maximum 1 Mbps of voice traffic).
Only one container running on node, LA near 1.
We try to create test cluster for smooth migration of container and quick change hardware (3 different servers, from "High": 2xX5650,128GB RAM to "Low": 2xX5250, 16GB RAM), only one container running on the node at the same time and problem persist anywhere.
We try to capture traffic in 4 places and simultaneously ping each place with 100ms interval: host, container, switch (SPAN), ISP and find that container freeze for 1-1,5 seconds.
Then problem occur:
tcpdump on host showed icmp ping replies from container(!) to PC and one way RTP traffic (from ISP to container, but not on reverse direction).
tcpdump on container showing nothing (but ping replies exist on bridge|wire!), 0(zero) packets was captured in 1-1,5 sec interval - as if the container was freeze(hung), but packets existed in capture after and before this "one-way voice standstill".
In general, it looks as if the container is frozen for 1 - 1,5 sec.
At first we thought it was a problem with network bridge or veeth, but container replies to ping and not showing this on tcpdump (nothing showed then problem exist).
In dmesg, syslog, etc - no suspicious messages (such as hung, timeout, etc) - all fine everywhere.
For example from "Low" server:
Kernel, lxc, etc updates, tuning, reading and hardware changes during year doesn`t solve the problem and we ask community to help us, please - how to find the root cause the problem (veeth, lxc, zfs....)?
"A long time ago" we try to migrate our small office PBX (Asterisk 11 + CentOS6) to Proxmox (LXC container) and discovered problem with voice that called standstill (one way audio).
We have up to 7 concurent SIP calls (maximum 1 Mbps of voice traffic).
Only one container running on node, LA near 1.
We try to create test cluster for smooth migration of container and quick change hardware (3 different servers, from "High": 2xX5650,128GB RAM to "Low": 2xX5250, 16GB RAM), only one container running on the node at the same time and problem persist anywhere.
We try to capture traffic in 4 places and simultaneously ping each place with 100ms interval: host, container, switch (SPAN), ISP and find that container freeze for 1-1,5 seconds.
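Roughly, the captures and pings looked like this (interface names, addresses and file names below are only illustrative; the switch SPAN port and the ISP side were captured the same way on their own machines):
Code:
# On the Proxmox host: capture the container's traffic on the bridge
tcpdump -ni vmbr0 -w host.pcap host 192.168.1.50

# Inside the container: capture on its own interface
tcpdump -ni eth0 -w ct.pcap

# From a PC: ping the container every 100 ms so gaps are easy to spot
# (intervals below 200 ms need root)
ping -i 0.1 192.168.1.50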
When the problem occurs:
tcpdump on the host shows ICMP ping replies from the container(!) to the PC and one-way RTP traffic (from the ISP to the container, but nothing in the reverse direction).
tcpdump inside the container shows nothing (even though the ping replies are visible on the bridge/wire!): 0 (zero) packets are captured during that 1-1.5 s interval, as if the container were hung, yet packets are present in the capture before and after this "one-way voice standstill".
In short, it looks as if the container freezes for 1-1.5 seconds.
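A minimal way to confirm such a freeze from inside the container is a heartbeat log: a high-resolution timestamp every 100 ms, where any gap larger than about a second means the container's processes were not being scheduled (path and interval below are just examples):
Code:
# Inside the container: write a timestamp every 100 ms
while true; do date '+%s.%N'; sleep 0.1; done > /tmp/ct-heartbeat.log

# Afterwards, print gaps larger than 1 second
awk 'NR > 1 && $1 - prev > 1 {print prev, $1, $1 - prev} {prev = $1}' /tmp/ct-heartbeat.log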
At first we thought it was a problem with the network bridge or the veth pair, but the container keeps replying to ping while its own tcpdump shows nothing for as long as the problem lasts.
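One way to look for losses on the host side of that path is the interface counters of the bridge and the container's veth peer (interface names below are examples; the veth name depends on the container ID):
Code:
# On the host: error/drop counters of the bridge and the veth peer
ip -s link show vmbr0
ip -s link show veth100i0

# The bridge's forwarding entry for the container's MAC should stay present
bridge fdb show br vmbr0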
dmesg, syslog, etc. contain no suspicious messages (no hung tasks, timeouts, and so on) - everything looks fine.
For example, from the "Low" server:
Code:
top - 11:41:09 up 15 days, 3:12, 2 users, load average: 0,61, 0,63, 0,52
Tasks: 327 total, 1 running, 266 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0,9 us, 1,8 sy, 0,0 ni, 96,1 id, 1,2 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem : 12296532 total, 4123032 free, 5691732 used, 2481768 buff/cache
KiB Swap: 8388604 total, 8369916 free, 18688 used. 6074492 avail Mem
Code:
pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.18-7-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-10
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-29
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
pve-zsync: 1.7-1
qemu-server: 5.0-38
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1
A year of kernel and LXC updates, tuning, reading, and hardware changes hasn't solved the problem, so we are asking the community for help: how can we find the root cause of the problem (veth, LXC, ZFS, ...)?