Warning about OVS upgrade to 2.6 and feature request for vzdump

Hi,
I did an upgrade today (2 of 3 nodes). Due to the new openvswitch 2.6.0-2 package, network connectivity for all VMs was lost for a short time (approx. 2 min.) on the first node.
And the upgrade itself was disrupted, because my SSH access goes via the OVS bridge ;)

On the second node I use an IP which isn't controlled by OVS... and still the whole node wasn't accessible.
So I went to the console and saw that the node had rebooted (probably due to the HA feature) - but all OVS networks were down (restarting the openvswitch service didn't help).
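(For reference, "restarting the openvswitch service" here means roughly the following; the bridge name vmbr0 is just an example from my setup:)
Code:
systemctl restart openvswitch-switch               # restart the OVS daemons
ifdown --allow=ovs vmbr0; ifup --allow=ovs vmbr0   # re-run the ifupdown scripts for the OVS bridge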

I moved some important VMs to another node and started them there.
But one important VM uses local storage - I tried to use vzdump, but vzdump can't back up a VM in such an emergency situation. First it couldn't get the lock (ok, "pvecm expected 1" helps), but then vzdump tried to start the VM, which failed because the bridge wasn't there.
I think an -emergency flag for vzdump would be a good thing! (Without locking and without starting the VM - that would probably only work for shut-down guests, I think.)
In the end I used "qm move_disk" to migrate the disk to Ceph, then moved the config and started the VM on the other node (roughly as sketched below).

Long story short - another reboot brought the OVS bridge up again.

So be careful with this update (perhaps disable HA beforehand).

Udo
 
Hi,
I ran into similar problems.
On one system an /etc/init.d/networking stop followed by start did the trick for me. Not sure if this helps, but maybe it's worth a try.
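In command form, just as a sketch of what is meant:
Code:
/etc/init.d/networking stop
/etc/init.d/networking start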

Regards

Markus
 
I just reviewed the ovs package update scripts, and I cannot see any bug. It simply does

# ovs-ctl stop
# ovs-ctl start

This seems to be the standard way to restart the OVS services, so it is unclear to me what's wrong.
 
I just reviewed the ovs package update scripts, and I cannot see any bug. It simply does

# ovs-ctl stop
# ovs-ctl start

This seems to be the standard way to restart the OVS services, so it is unclear to me what's wrong.
Hi Dietmar,
perhaps it's not wrong - but it needs a "long" time without network.

I upgraded the last node yesterday (after I was able to online-migrate the VMs away) - the network was down for 61 seconds in my case.
The reboot of the second node was HA-related (self-fencing because the network was down) - and that happened during the upgrade process, which is what caused the trouble...

Udo
 
I upgraded the last node yesterday (after I was able to online-migrate the VMs away) - the network was down for 61 seconds in my case.

Strange - why does a simple restart need 61 seconds? Was there high load on the server?
 
Strange - why does a simple restart need 61 seconds? Was there high load on the server?
Hi Dietmar,
I don't have access to the terminal output right now (perhaps it's still in the shell history and I can look on Monday, but I have no exact correlation between the ping and the upgrade).
I had a ping running in a second terminal. The network connection was gone while the upgrade was still working (I don't remember at which point the network became unreachable). There were several lines of upgrade output (so not only the OVS start) between the network going down and coming back again.
It looks to me like the network either came back later because of another process, or went down earlier...

The load was very low, because there weren't any running VMs on this system. Only Ceph was running, and with low IO as well.

On the first try there were some VMs running, but very quiet ones.

Udo
 
So be careful with this update (perhaps disable HA beforehand).
As this is also the case for some other actions, IMHO it would be nice to have a "disable HA" button or command which completely disables HA.
This would be very useful to avoid fencing, reboots and various VM migrations before/during updates!
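In the meantime, a sketch of a manual workaround (the resource IDs are just examples, and whether this is enough depends on the setup):
Code:
# take the affected VMs out of HA management before the update...
ha-manager remove vm:101
ha-manager remove vm:102
# ...and add them again afterwards
ha-manager add vm:101
ha-manager add vm:102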
 
I can restart OVS services in 0.3 seconds here:

# time systemctl try-restart openvswitch-nonetwork.service
real 0m0.305s
user 0m0.000s
sys 0m0.000s

Note: The above command is executed when you update the package.
 
I can restart OVS services in 0.3 seconds here:

# time systemctl try-restart openvswitch-nonetwork.service
real 0m0.305s
user 0m0.000s
sys 0m0.000s

Note: The above command is executed when you update the package.
Hi Dietmar,
the restart doesn't take much time, but the bridge only starts working again much later!

I did the following:
SSH into pve02 (an updated node), start a ping to the OVS IP of pve03 (also an updated node) in the background, and then connect to pve03 via a non-OVS IP:
Code:
root@pve02:~# ping -W 1 -O 10.1.1.13 | grep "no answer yet"&
root@pve02:~# ssh 10.1.3.13
And see what happens:
Code:
root@pve03:~# time systemctl try-restart openvswitch-nonetwork.service

real  0m1.015s
user  0m0.000s
sys  0m0.000s
root@pve03:~# no answer yet for icmp_seq=117
  no answer yet for icmp_seq=118
  no answer yet for icmp_seq=119
  no answer yet for icmp_seq=120
  no answer yet for icmp_seq=121
  no answer yet for icmp_seq=122
  no answer yet for icmp_seq=123
  no answer yet for icmp_seq=124
  no answer yet for icmp_seq=125
  no answer yet for icmp_seq=126
  no answer yet for icmp_seq=127
  no answer yet for icmp_seq=128
  no answer yet for icmp_seq=129
  no answer yet for icmp_seq=130
  no answer yet for icmp_seq=131
  no answer yet for icmp_seq=132
  no answer yet for icmp_seq=133
  no answer yet for icmp_seq=134
  no answer yet for icmp_seq=135
  no answer yet for icmp_seq=136
  no answer yet for icmp_seq=137
  no answer yet for icmp_seq=138
  no answer yet for icmp_seq=139
  no answer yet for icmp_seq=140
  no answer yet for icmp_seq=141
  no answer yet for icmp_seq=142
  no answer yet for icmp_seq=143
  no answer yet for icmp_seq=144
  no answer yet for icmp_seq=145
  no answer yet for icmp_seq=146
  no answer yet for icmp_seq=147
  no answer yet for icmp_seq=148
  no answer yet for icmp_seq=149
The bridge was not reachable for 31 seconds. I don't know why it took twice as long during the actual upgrade, but 30 seconds on top of a one-second service restart is strange, isn't it?

Udo
 
The bridge was not reachable for 31 seconds. I don't know why it took twice as long during the actual upgrade, but 30 seconds on top of a one-second service restart is strange, isn't it?

Yes, strange. Did you run the ping on the same node where you restarted the service?
 
Yes, strange. Did you run the ping on the same node where you restarted the service?
No,
I ping from pve02 -> pve03 (OVS IP) in the background (with output only when there is no answer).
Then, in the same session, I SSH to pve03 (via the non-OVS IP) and restart the service on pve03.

Because the ping runs in the background, I get the "no answer" output in the same session. The commands are in my previous post.

Udo
 
Hi Udo,

I tried to reproduce it here on Intel NICs, but everything works, with only about 1-2 seconds of disruption on the ping.
7 VMs on the bridge, corosync on a VLAN on the bridge, bond with LACP.
Maybe we use a different configuration.
Can you send me your network config so I can set up your network here and try to reproduce it?
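(As a sketch, the output of these two would already help:)
Code:
cat /etc/network/interfaces
ovs-vsctl show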
 
The problem is related to the MTU.
It is no longer necessary to subtract the VLAN tag from the MTU.
So the MTU you are using does not match the device, and therefore it waits.
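To check whether the configured MTU matches the device, something like this can be used (interface names are just examples):
Code:
ip link show eth0                          # MTU of the physical NIC
ip link show vmbr0                         # MTU of the OVS bridge interface
ovs-vsctl list Interface eth0 | grep mtu   # what OVS itself reports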
 
The problem is related to the MTU.
It is no longer necessary to subtract the VLAN tag from the MTU.
So the MTU you are using does not match the device, and therefore it waits.
Hi Wolfgang,
is this new with openvswitch 2.6? Because I tested this some weeks ago, first with MTU 9000, which did not work...

But with MTU 9000 on a tagged bridge, the restart of openvswitch-nonetwork.service now works fast and without errors:
Code:
root@pve03:~# time systemctl try-restart openvswitch-nonetwork.service

real  0m0.786s
user  0m0.000s
sys  0m0.000s
Thanks!

Udo
 
I'm still searching for the exact change in the code, but they have made many changes around the MTU handling and introduced the new field mtu_requested.
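For what it's worth, the requested MTU can be inspected and set per interface via ovs-vsctl (assuming OVS 2.6+, where the Interface table column is called mtu_request; the interface name is just an example):
Code:
ovs-vsctl get Interface vmbr0 mtu_request
ovs-vsctl set Interface vmbr0 mtu_request=9000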
 
