Network outages on production servers

gslongo

New Member
Apr 20, 2015
Hi all,

We are facing a big problem on our production servers:

Packets sent to another machine are dropped somewhere, even with pve-firewall stopped. The setup is an OVS bridge with VirtIO NICs (I also tested an e1000 interface, same issue).

In the tcpdump you can see the SYN packet sent from the source, then the SYN+ACK sent from the destination, but nothing enters the tap device. This happens when a VM is restarted (VM 105 here). Maybe I'm wrong, but I'm pretty sure this was not happening in the last release. The problem seems to resolve itself after approximately one hour.

I also see this in the dmesg logs:


[862462.377048] device tap105i0 entered promiscuous mode
[862471.009538] kvm: zapping shadow pages for mmio generation wraparound
[862471.339032] kvm [10123]: vcpu0 unhandled rdmsr: 0x606
[862472.786883] kvm [10123]: vcpu0 unhandled rdmsr: 0x611
[862472.787020] kvm [10123]: vcpu0 unhandled rdmsr: 0x639
[862472.787129] kvm [10123]: vcpu0 unhandled rdmsr: 0x641
[862472.787242] kvm [10123]: vcpu0 unhandled rdmsr: 0x619
[862473.287933] kvm [10123]: vcpu0 unhandled rdmsr: 0x1ad
[862856.498283] device tap103i0 entered promiscuous mode
[862868.201655] kvm: zapping shadow pages for mmio generation wraparound
[862948.596922] kvm [10123]: vcpu0 unhandled rdmsr: 0x606
[862950.050088] kvm [10123]: vcpu0 unhandled rdmsr: 0x611
[862950.050226] kvm [10123]: vcpu0 unhandled rdmsr: 0x639
[862950.050342] kvm [10123]: vcpu0 unhandled rdmsr: 0x641
[862950.050473] kvm [10123]: vcpu0 unhandled rdmsr: 0x619
[862950.694569] kvm [10123]: vcpu0 unhandled rdmsr: 0x1ad
[863133.483743] device tap103i0 entered promiscuous mode
[863140.332821] kvm: zapping shadow pages for mmio generation wraparound


PID 10123 belongs to one of the affected machines: "Dump2" (VM 105), which receives the connection. It is a CentOS 7 guest (Zimbra).
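
To line up the network drops with VM starts, it may help to pull only the tap device events out of dmesg. Below is a minimal Python sketch, assuming the usual Proxmox naming scheme tap<VMID>i<N> and that the "entered promiscuous mode" message is logged when the tap interface is attached to the bridge. The names `TAP_EVENT`, `tap_events` and `SAMPLE` are made up for illustration; the sample lines are copied from the log above.

```python
import re

# Matches lines such as:
# [862462.377048] device tap105i0 entered promiscuous mode
TAP_EVENT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+device (?P<dev>tap(?P<vmid>\d+)i\d+) "
    r"entered promiscuous mode"
)


def tap_events(dmesg_text):
    """Return (seconds_since_boot, vmid, device) for each tap attach event."""
    events = []
    for line in dmesg_text.splitlines():
        match = TAP_EVENT.search(line)
        if match:
            events.append((float(match["ts"]), int(match["vmid"]), match["dev"]))
    return events


# A few lines taken from the dmesg output in this post.
SAMPLE = """\
[862462.377048] device tap105i0 entered promiscuous mode
[862471.009538] kvm: zapping shadow pages for mmio generation wraparound
[862856.498283] device tap103i0 entered promiscuous mode
[863133.483743] device tap103i0 entered promiscuous mode
"""

for ts, vmid, dev in tap_events(SAMPLE):
    print(f"{ts:.0f}s since boot: VM {vmid} interface {dev} attached")
```

On the host, the same function could be fed the full output of `dmesg` to list every VM (re)start and compare the timestamps with the moments the connections start failing.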


~# pveversion -v
proxmox-ve-2.6.32: 3.4-150 (running kernel: 3.10.0-8-pve)
pve-manager: 3.4-3 (running version: 3.4-3/2fc72fee)
pve-kernel-3.10.0-7-pve: 3.10.0-27
pve-kernel-3.10.0-8-pve: 3.10.0-30
pve-kernel-3.10.0-5-pve: 3.10.0-19
pve-kernel-2.6.32-37-pve: 2.6.32-150
pve-kernel-2.6.32-34-pve: 2.6.32-140
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.4-3
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-32
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Thank you for your help!
 

Very strange: after sending a ping to the destination first, the connection can be established (tested on 2 hosts):

[root@zimbra-proxy1 ~]# telnet 172.29.254.247 389
Trying 172.29.254.247...
^C
[root@zimbra-proxy1 ~]# ping 172.29.254.247
PING 172.29.254.247 (172.29.254.247) 56(84) bytes of data.
64 bytes from 172.29.254.247: icmp_seq=1 ttl=63 time=0.280 ms
64 bytes from 172.29.254.247: icmp_seq=2 ttl=64 time=0.137 ms
^C
--- 172.29.254.247 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.137/0.208/0.280/0.072 ms
[root@zimbra-proxy1 ~]# telnet 172.29.254.247 389
Trying 172.29.254.247...
Connected to 172.29.254.247.
Escape character is '^]'.
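
For anyone who wants to repeat this check on several hosts, here is a rough Python sketch of the same sequence (TCP connect, ping, TCP connect again). It is only an illustration: the function names `tcp_reachable`, `ping_once` and `reproduce` are made up, and the `ping -c 1 -W` flags assume the Linux iputils ping.

```python
import socket
import subprocess


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False


def ping_once(host, timeout_s=2):
    """Send one ICMP echo with the system ping (Linux iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def reproduce(host, port):
    """Run the connect / ping / connect sequence and return the three results."""
    before = tcp_reachable(host, port)
    pinged = ping_once(host)
    after = tcp_reachable(host, port)
    return {"tcp_before_ping": before, "ping_ok": pinged, "tcp_after_ping": after}
```

Calling `reproduce("172.29.254.247", 389)` from the client right after the VM restart would show whether the TCP connection only succeeds once the ping has gone through, as in the session above.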
 
Hi,
you can try the new kernel 3.10-9.