Network outages on production servers

gslongo

Apr 20, 2015
Hi all,

We are facing a serious problem on our production servers:

Packets sent to another machine are dropped somewhere along the path (pve-firewall is stopped). The setup is an OVS bridge with VirtIO NICs (we also tested the e1000 interface, same issue).

In the tcpdump you can see the SYN packet sent from the source and the SYN+ACK sent back from the destination, but nothing enters the tap device. This happens after a restart of VM 105, and maybe I'm wrong, but I'm pretty sure this was not happening with the previous release. The problem seems to resolve itself after roughly one hour.
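To narrow down where the frames disappear, it can help to capture on the bridge and on the tap device at the same time. A sketch, run on the Proxmox host — the interface names (tap105i0, vmbr0), the destination IP and the port are assumptions based on this setup, so adjust them to yours:

```shell
# Watch for the TCP handshake on the VM's tap device (VM 105 -> tap105i0):
tcpdump -ni tap105i0 'tcp[tcpflags] & (tcp-syn|tcp-ack) != 0'

# In a second terminal, compare with what reaches the bridge itself
# (bridge name assumed: vmbr0). If the SYN+ACK shows up here but never
# on tap105i0, the frame is dropped between the bridge and the tap port:
tcpdump -ni vmbr0 host 172.29.254.247 and port 389
```

Note that with an OVS bridge, a capture on the bridge device itself may not show all forwarded traffic, so the tap-side capture is the more reliable reference point.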

I also see the following in the dmesg logs:


[862462.377048] device tap105i0 entered promiscuous mode
[862471.009538] kvm: zapping shadow pages for mmio generation wraparound
[862471.339032] kvm [10123]: vcpu0 unhandled rdmsr: 0x606
[862472.786883] kvm [10123]: vcpu0 unhandled rdmsr: 0x611
[862472.787020] kvm [10123]: vcpu0 unhandled rdmsr: 0x639
[862472.787129] kvm [10123]: vcpu0 unhandled rdmsr: 0x641
[862472.787242] kvm [10123]: vcpu0 unhandled rdmsr: 0x619
[862473.287933] kvm [10123]: vcpu0 unhandled rdmsr: 0x1ad
[862856.498283] device tap103i0 entered promiscuous mode
[862868.201655] kvm: zapping shadow pages for mmio generation wraparound
[862948.596922] kvm [10123]: vcpu0 unhandled rdmsr: 0x606
[862950.050088] kvm [10123]: vcpu0 unhandled rdmsr: 0x611
[862950.050226] kvm [10123]: vcpu0 unhandled rdmsr: 0x639
[862950.050342] kvm [10123]: vcpu0 unhandled rdmsr: 0x641
[862950.050473] kvm [10123]: vcpu0 unhandled rdmsr: 0x619
[862950.694569] kvm [10123]: vcpu0 unhandled rdmsr: 0x1ad
[863133.483743] device tap103i0 entered promiscuous mode
[863140.332821] kvm: zapping shadow pages for mmio generation wraparound


PID 10123 belongs to one of the affected machines: "Dump2" (VM 105), the one receiving the connection. It is a CentOS 7 guest (Zimbra).
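As an aside, a kvm PID from dmesg can be mapped back to its VMID via the pidfiles that Proxmox VE keeps per guest (path as on a standard PVE 3.x install):

```shell
# List VMID -> PID for every running guest; PVE writes one pidfile per VM
# under /var/run/qemu-server/<vmid>.pid:
for f in /var/run/qemu-server/*.pid; do
    printf '%s: %s\n' "$(basename "$f" .pid)" "$(cat "$f")"
done
```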


~# pveversion -v
proxmox-ve-2.6.32: 3.4-150 (running kernel: 3.10.0-8-pve)
pve-manager: 3.4-3 (running version: 3.4-3/2fc72fee)
pve-kernel-3.10.0-7-pve: 3.10.0-27
pve-kernel-3.10.0-8-pve: 3.10.0-30
pve-kernel-3.10.0-5-pve: 3.10.0-19
pve-kernel-2.6.32-37-pve: 2.6.32-150
pve-kernel-2.6.32-34-pve: 2.6.32-140
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.4-3
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-32
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Thank you for your help!
 

Attachments

  • Dump1.txt (2.2 KB)
  • Dump2.txt (5 KB)
Very strange. After first sending a ping to the destination, the connection can be established (tested on two hosts):

[root@zimbra-proxy1 ~]# telnet 172.29.254.247 389
Trying 172.29.254.247...
^C
[root@zimbra-proxy1 ~]# ping 172.29.254.247
PING 172.29.254.247 (172.29.254.247) 56(84) bytes of data.
64 bytes from 172.29.254.247: icmp_seq=1 ttl=63 time=0.280 ms
64 bytes from 172.29.254.247: icmp_seq=2 ttl=64 time=0.137 ms
^C
--- 172.29.254.247 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.137/0.208/0.280/0.072 ms
[root@zimbra-proxy1 ~]# telnet 172.29.254.247 389
Trying 172.29.254.247...
Connected to 172.29.254.247.
Escape character is '^]'.
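That a single ping "unblocks" the connection, and that the problem also clears on its own after a while, would be consistent with a missing or stale MAC-learning (FDB) entry on the OVS bridge, which the ping forces to be relearned. One way to check while the problem is occurring — the bridge name vmbr0 is an assumption, and you would grep for the affected guest's MAC:

```shell
# Dump the OVS MAC-learning table. If the guest's MAC is absent (or on
# the wrong port) while the VM is unreachable, the bridge does not know
# where to forward its frames:
ovs-appctl fdb/show vmbr0

# Show the bridge/port layout to confirm which port the tap is on:
ovs-vsctl show
```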
 
Hi,
you can try the new kernel 3.10-9.
 
