problems after dist-upgrade

Jon morby

New Member
Mar 1, 2018
We've just run an apt-get dist-upgrade on a node and the result was a little surprising ... and I'm now trying to work out what to do next

root@pve-7:~# apt-get dist-upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
The following NEW packages will be installed:
proxmox-widget-toolkit pve-i18n pve-kernel-4.13.13-6-pve
The following packages will be upgraded:
bind9-host ceph ceph-base ceph-common ceph-mds ceph-mgr ceph-mon ceph-osd
cpp-6 curl dnsutils gcc-6-base iproute2 libatomic1 libbind9-140 libcephfs2
libcurl3 libcurl3-gnutls libdns-export162 libdns162 libgcc1 libgfortran3
libirs141 libisc-export160 libisc160 libisccc140 libisccfg140 liblwres141
libnvpair1linux libpve-access-control libpve-common-perl libquadmath0
librados2 libradosstriper1 librbd1 librgw2 libstdc++6 libtasn1-6
libuutil1linux libvorbis0a libvorbisenc2 libxml2 libzfs2linux libzpool2linux
linux-libc-dev lxcfs proxmox-ve pve-cluster pve-container pve-docs
pve-ha-manager pve-kernel-4.13.13-2-pve pve-manager pve-qemu-kvm
python-cephfs python-libxml2 python-rados python-rbd python-rgw qemu-server
radosgw spl zfs-initramfs zfsutils-linux
64 upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 185 MB of archives.
After this operation, 155 MB disk space will be freed.
Do you want to continue? [Y/n]

[snip]

Setting up pve-manager (5.1-46) ...
Job for pvestatd.service failed because the control process exited with error code.
See "systemctl status pvestatd.service" and "journalctl -xe" for details.
dpkg: error processing package pve-manager (--configure):
subprocess installed post-installation script returned error exit status 1
dpkg: dependency problems prevent configuration of proxmox-ve:
proxmox-ve depends on pve-manager; however:
Package pve-manager is not configured yet.

dpkg: error processing package proxmox-ve (--configure):
dependency problems - leaving unconfigured
Errors were encountered while processing:
pve-manager
proxmox-ve



I am very aware of the "do not use apt-get upgrade" advice, and we haven't done that on this node. The node itself has been working fine for months ... we were simply aiming to ensure that the latest versions of everything were installed.

Any thoughts / suggestions?

This is using the pve-no-subscription repo
 
Job for pvestatd.service failed because the control process exited with error code.
See "systemctl status pvestatd.service" and "journalctl -xe" for details.
What do those two commands say?
 
I can't paste the results into my reply because of the system's spam filters ... I've added them in the attached logfile which hopefully I can post
 

Attachments

  • proxmox.log (5.5 KB)
It also seems as though the corosync stuff has stopped working. Whilst pvecm status shows all nodes and reports the cluster as quorate, as soon as I try to view any files (in any directory) under /etc/pve/nodes, everything just grinds to a halt.

We have Juniper EX4200 configured as

set protocols igmp-snooping vlan all

and an MX104 configured as
set protocols igmp interface ae0.603 version 2

with all the PVE nodes default routing through an address on ae0.603
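(For reference, the querier/snooping state on a Junos setup like this can be sanity-checked with operational-mode commands along these lines; a rough sketch only, exact syntax varies by platform and release:)

# on the EX4200: list the multicast groups the snooping code has learned
show igmp-snooping membership

# on the MX104 (the IGMP querier for ae0.603): check querier state and joined groups
show igmp interface ae0.603
show igmp group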

This was all working perfectly until this Thursday when we did a dist-upgrade first thing in the morning :(

I now have 5.1 nodes on different sub-releases, all refusing to talk to each other.

I've tried downgrading from 5.1-46 back to 5.1-41 but to no avail ... we just seem to get the same errors

I am a little bamboozled :(
 
What does systemctl status pve-cluster corosync say?
Also, the whole journal output from the relevant times would be interesting (e.g. during a restart of corosync/pve-cluster).
 
It seems to think it is running; however, there are still major freezes when trying to access any of the shared directories.

root@pve-6:~# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2018-03-02 12:07:27 GMT; 19min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 3199 (corosync)
Tasks: 2 (limit: 4915)
Memory: 42.3M
CPU: 7.037s
CGroup: /system.slice/corosync.service
└─3199 /usr/sbin/corosync -f

Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:53 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:53 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360

root@pve-1:~# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2018-03-02 11:53:56 GMT; 34min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 2507 (corosync)
Tasks: 2 (limit: 4915)
Memory: 42.9M
CPU: 12.273s
CGroup: /system.slice/corosync.service
└─2507 /usr/sbin/corosync -f

Mar 02 12:07:37 pve-1 corosync[2507]: notice [TOTEM ] Retransmit List: 28d 28e 28f 290 291 292 293 294 295 296
Mar 02 12:07:37 pve-1 corosync[2507]: [TOTEM ] Retransmit List: 28d 28e 28f 290 291 292 293 294 295 296
Mar 02 12:07:37 pve-1 corosync[2507]: notice [TOTEM ] Retransmit List: 28d 28e 28f 290 291 292 293 294 295 296
Mar 02 12:07:37 pve-1 corosync[2507]: [TOTEM ] Retransmit List: 28d 28e 28f 290 291 292 293 294 295 296
Mar 02 12:09:45 pve-1 corosync[2507]: notice [TOTEM ] A new membership (84.246.192.121:15632) was formed. Members joined: 4
Mar 02 12:09:45 pve-1 corosync[2507]: [TOTEM ] A new membership (84.246.192.121:15632) was formed. Members joined: 4
Mar 02 12:09:45 pve-1 corosync[2507]: notice [QUORUM] Members[6]: 5 4 1 2 3 7
Mar 02 12:09:45 pve-1 corosync[2507]: notice [MAIN ] Completed service synchronization, ready to provide service.
Mar 02 12:09:45 pve-1 corosync[2507]: [QUORUM] Members[6]: 5 4 1 2 3 7
Mar 02 12:09:45 pve-1 corosync[2507]: [MAIN ] Completed service synchronization, ready to provide service.



Quorum information
------------------
Date: Fri Mar 2 12:28:51 2018
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 0x00000005
Ring ID: 5/15632
Quorate: Yes

Votequorum information
----------------------
Expected votes: 7
Highest expected: 7
Total votes: 6
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000005 1 84.246.192.121 (local)
0x00000004 1 84.246.192.122
0x00000001 1 84.246.192.123
0x00000002 1 84.246.192.124
0x00000003 1 84.246.192.201
0x00000007 1 84.246.192.202


I've removed one of the broken nodes and reformatted it to try and get a clean upgrade. That worked, but adding it back into the cluster just hangs as well.
 
Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:53 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:53 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
That indicates a network problem. Are you sure nothing changed on the switch side? Maybe try to confirm that multicast is working with omping;
see https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network for detailed commands.

Also confirm that you have a multicast querier on the segment if you have enabled IGMP snooping; one way to check is sketched below.
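A quick, generic way to do that (a sketch, using the cluster-facing interface name from this thread): with a querier present you should see periodic IGMP membership queries arriving, typically one every couple of minutes.

# run on a PVE node; vlan603 is the cluster-facing OVS internal port in this setup
tcpdump -ni vlan603 igmp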
 
Multicast all seems to be working, and nothing has changed on the switch or router side of things in months.
 
Also worth noting that the VMs are alive but I can't ping the containers ... I'm wondering if this is an issue with Open vSwitch and vlans?
 
Multicast all seems to be working, and nothing has changed on the switch or router side of things in months.
Can you post the output of omping? Even if you did not change anything on the switch side, if it was only barely working until now and there is now additional load, that could explain this.

Also worth noting that the VMs are alive but I can't ping the containers ... I'm wondering if this is an issue with Open vSwitch and vlans?
Impossible to say without more information; what does your network setup look like?
 
Each Proxmox node is set up similarly to this:

auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

iface eno3 inet manual

iface eno4 inet manual

iface enp5s0f0 inet manual

iface enp5s0f1 inet manual

allow-vmbr0 bond0
iface bond0 inet manual
ovs_bonds enp5s0f0 enp5s0f1
ovs_type OVSBond
ovs_bridge vmbr0
ovs_options lacp=active bond_mode=balance-tcp

auto vmbr0
iface vmbr0 inet manual
ovs_type OVSBridge
ovs_ports bond0 vlan603 vlan850

allow-vmbr0 vlan603
iface vlan603 inet static
address 84.246.192.203
netmask 255.255.255.0
gateway 84.246.192.1
ovs_type OVSIntPort
ovs_bridge vmbr0
ovs_options tag=603

iface vlan603 inet6 static
address 2a01:2c0:a::203
netmask 64
gateway 2a01:2c0:a::1

allow-vmbr0 vlan850
iface vlan850 inet static
address 172.16.50.247
netmask 255.255.255.0
ovs_type OVSIntPort
ovs_bridge vmbr0
ovs_options tag=850


Ceph traffic happens on vlan850; the access network is on vlan603.

We have 2 x 10 Gig NICs which are bonded using LACP, and those switch ports are configured as trunk ports.

FYI, I've been in contact to try and buy a support contract, but no one has yet reached out to seal the deal.
 
Which NIC is your corosync (cluster) traffic on? Can you ping the other nodes on that network? If possible, try to separate corosync onto its own physical network (see the sketch below); once corosync stabilizes, your cluster should respond again. Is Ceph still in HEALTH_OK?
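For reference, separating the cluster traffic usually means giving corosync its own ring address on a dedicated NIC/VLAN. A minimal sketch of the relevant parts of /etc/pve/corosync.conf, assuming a made-up 10.10.10.0/24 network reserved for corosync (config_version has to be bumped on every edit so the change propagates):

# /etc/pve/corosync.conf (excerpt, illustrative only)
totem {
  version: 2
  cluster_name: mycluster    # example name
  config_version: 8          # increment on every change
  interface {
    ringnumber: 0
    bindnetaddr: 10.10.10.0  # dedicated corosync network (example)
  }
}

nodelist {
  node {
    name: pve-1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # per-node address on the dedicated network (example)
  }
  # ... one node {} block per cluster member ...
}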
 
Ceph health is OK ... corosync is on vlan603, which is basically where the Proxmox nodes live and talk to each other.

Starting VMs now results in a timeout (after the upgrade)

TASK ERROR: start failed: command '/usr/bin/kvm -id 6026 -chardev 'socket,id=qmp,path=/var/run/qemu-server/6026.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/6026.pid -daemonize -smbios 'type=1,uuid=1016b907-a101-4f2e-b80e-d2324eed933b' -name ras-1 -smp '2,sockets=1,cores=2,maxcpus=2' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/6026.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 2048 -object 'memory-backend-ram,id=ram-node0,size=2048M' -numa 'node,nodeid=0,cpus=0-1,memdev=ram-node0' -k en-gb -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:3f8b20c8b5e' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:rbd/vm-6026-disk-1:mon_host=172.16.50.11;172.16.50.12;172.16.50.13:auth_supported=cephx:id=admin:keyring=/etc/pve/priv/ceph/proxmox.keyring,if=none,id=drive-scsi0,cache=writeback,format=raw,aio=threads,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap6026i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=46:ED:97:3A:E0:4A,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'' failed: got timeout

They're running, but they're not pingable
 
The pmxcfs uses corosync, and if that is not working then you get those timeouts. If Ceph is OK, you can try to stop the pmxcfs service and start it manually in local mode with 'pmxcfs -l' (see the sketch below). You might at least be able to get the VMs on that host online, but this is only a workaround for emergency work.

Please also add the requested outputs from the above postings, e.g. omping.
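For completeness, the local-mode workaround typically looks like this (a sketch; only use it while the cluster is broken, and revert once corosync is healthy again):

# stop the clustered config filesystem and restart pmxcfs in local mode
systemctl stop pve-cluster
pmxcfs -l

# ... start the needed VMs on this host ...

# once corosync is fixed, return to normal clustered operation
killall pmxcfs
systemctl start pve-cluster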
 
omping -m 239.192.51.129 pve-1.fido.net pve-2.fido.net pve-3.fido.net
gives

pve-1.fido.net : unicast, xmt/rcv/%loss = 1087/1087/0%, min/avg/max/std-dev = 0.147/0.294/1.305/0.055
pve-1.fido.net : multicast, xmt/rcv/%loss = 1087/1087/0%, min/avg/max/std-dev = 0.172/0.323/1.303/0.059

whilst
root@pve-1:~# omping -c 10000 -i 0.001 -F -q pve-1.fido.net pve-2.fido.net pve-3.fido.net
pve-2.fido.net : waiting for response msg
pve-3.fido.net : waiting for response msg
pve-3.fido.net : joined (S,G) = (*, 232.43.211.234), pinging
pve-2.fido.net : joined (S,G) = (*, 232.43.211.234), pinging
pve-2.fido.net : given amount of query messages was sent
pve-3.fido.net : given amount of query messages was sent

pve-2.fido.net : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.061/0.226/3.264/0.084
pve-2.fido.net : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.076/0.254/3.243/0.088
pve-3.fido.net : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.052/0.182/2.562/0.076
pve-3.fido.net : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.062/0.212/3.229/0.080

across the board

Ceph is external, used as a secondary file system ... the primary storage is still local files, and a number of VMs are on the local LVM storage only.

If I can get the standalone VMs online (the ones which aren't on Ceph), then I get most of my core/critical systems back.
 
OK, using pmxcfs -l I have managed to bring up the core services again ...

The problem remains, however, that the cluster seems to be fubar ... despite having been working fine for more than a year :(
 
pve-2.fido.net : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.061/0.226/3.264/0.084
pve-2.fido.net : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.076/0.254/3.243/0.088
pve-3.fido.net : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.052/0.182/2.562/0.076
pve-3.fido.net : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.062/0.212/3.229/0.080
When corosync doesn't receive its token in time it starts retransmitting after 4 ms (default), and your max is 3.243 ms. So if you have a "little more" traffic than usual, your cluster will fall apart. Separate your corosync traffic onto its own physical network.

While not recommended, you can increase the time the token is valid, but this only moves the problem further out. See the corosync manpage for that; a sketch of the relevant setting follows below.
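If you do go down that route, the knob in question is the token timeout in the totem section of corosync.conf; a minimal sketch (the 10000 ms value is only an example, and on PVE you would edit /etc/pve/corosync.conf and bump config_version so the change is distributed):

# /etc/pve/corosync.conf (excerpt, illustrative only)
totem {
  ...
  config_version: 9   # increment on every edit
  token: 10000        # token timeout in milliseconds (example value)
}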
 
