problems after dist-upgrade

Jon morby

New Member
Mar 1, 2018
We've just run an apt-get dist-upgrade on a node and the result was a little surprising ... and I'm now trying to work out what to do next

root@pve-7:~# apt-get dist-upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
The following NEW packages will be installed:
proxmox-widget-toolkit pve-i18n pve-kernel-4.13.13-6-pve
The following packages will be upgraded:
bind9-host ceph ceph-base ceph-common ceph-mds ceph-mgr ceph-mon ceph-osd
cpp-6 curl dnsutils gcc-6-base iproute2 libatomic1 libbind9-140 libcephfs2
libcurl3 libcurl3-gnutls libdns-export162 libdns162 libgcc1 libgfortran3
libirs141 libisc-export160 libisc160 libisccc140 libisccfg140 liblwres141
libnvpair1linux libpve-access-control libpve-common-perl libquadmath0
librados2 libradosstriper1 librbd1 librgw2 libstdc++6 libtasn1-6
libuutil1linux libvorbis0a libvorbisenc2 libxml2 libzfs2linux libzpool2linux
linux-libc-dev lxcfs proxmox-ve pve-cluster pve-container pve-docs
pve-ha-manager pve-kernel-4.13.13-2-pve pve-manager pve-qemu-kvm
python-cephfs python-libxml2 python-rados python-rbd python-rgw qemu-server
radosgw spl zfs-initramfs zfsutils-linux
64 upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 185 MB of archives.
After this operation, 155 MB disk space will be freed.
Do you want to continue? [Y/n]

[snip]

Setting up pve-manager (5.1-46) ...
Job for pvestatd.service failed because the control process exited with error code.
See "systemctl status pvestatd.service" and "journalctl -xe" for details.
dpkg: error processing package pve-manager (--configure):
subprocess installed post-installation script returned error exit status 1
dpkg: dependency problems prevent configuration of proxmox-ve:
proxmox-ve depends on pve-manager; however:
Package pve-manager is not configured yet.

dpkg: error processing package proxmox-ve (--configure):
dependency problems - leaving unconfigured
Errors were encountered while processing:
pve-manager
proxmox-ve



I am very aware of the "do not use apt-get upgrade" advice, and we haven't done that on this node. The node itself has been working fine for months ... we were simply aiming to ensure that the latest versions of everything were installed.

Any thoughts / suggestions?

This is using the pve-no-subscription repo
 
Job for pvestatd.service failed because the control process exited with error code.
See "systemctl status pvestatd.service" and "journalctl -xe" for details.
What do those two commands say?
 
I can't paste the results into my reply because of the system's spam filters ... I've added them in the attached logfile which hopefully I can post
 

Attachments

  • proxmox.log (5.5 KB)
It also seems as though the corosync stuff has stopped working. Whilst pvecm status shows all nodes and reports the cluster as quorate, as soon as I try to view any files (in any directory) under /etc/pve/nodes, everything just grinds to a halt.

We have Juniper EX4200 configured as

set protocols igmp-snooping vlan all

and an MX104 configured as
set protocols igmp interface ae0.603 version 2

with all the PVE nodes default routing through an address on ae0.603
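(For reference, the querier/snooping state on a Junos setup like this can be sanity-checked with operational-mode commands along these lines; a rough sketch only, exact syntax varies by platform and release:)

# on the EX4200: list the multicast groups the snooping code has learned
show igmp-snooping membership

# on the MX104 (the IGMP querier for ae0.603): check querier state and joined groups
show igmp interface ae0.603
show igmp group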

This was all working perfectly until this Thursday when we did a dist-upgrade first thing in the morning :(

I now have 5.1 nodes on different sub-releases, all refusing to talk to each other.

I've tried downgrading from 5.1-46 back to 5.1-41 but to no avail ... we just seem to get the same errors

I am a little bamboozled :(
 
What does systemctl status pve-cluster corosync say?
Also, the whole journal output from the relevant times would be interesting (e.g. during a restart of corosync/pve-cluster).
 
It seems to think it is running; however, there are still major freezes when trying to access any of the shared directories.

root@pve-6:~# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2018-03-02 12:07:27 GMT; 19min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 3199 (corosync)
Tasks: 2 (limit: 4915)
Memory: 42.3M
CPU: 7.037s
CGroup: /system.slice/corosync.service
└─3199 /usr/sbin/corosync -f

Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:53 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:53 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360

root@pve-1:~# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2018-03-02 11:53:56 GMT; 34min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 2507 (corosync)
Tasks: 2 (limit: 4915)
Memory: 42.9M
CPU: 12.273s
CGroup: /system.slice/corosync.service
└─2507 /usr/sbin/corosync -f

Mar 02 12:07:37 pve-1 corosync[2507]: notice [TOTEM ] Retransmit List: 28d 28e 28f 290 291 292 293 294 295 296
Mar 02 12:07:37 pve-1 corosync[2507]: [TOTEM ] Retransmit List: 28d 28e 28f 290 291 292 293 294 295 296
Mar 02 12:07:37 pve-1 corosync[2507]: notice [TOTEM ] Retransmit List: 28d 28e 28f 290 291 292 293 294 295 296
Mar 02 12:07:37 pve-1 corosync[2507]: [TOTEM ] Retransmit List: 28d 28e 28f 290 291 292 293 294 295 296
Mar 02 12:09:45 pve-1 corosync[2507]: notice [TOTEM ] A new membership (84.246.192.121:15632) was formed. Members joined: 4
Mar 02 12:09:45 pve-1 corosync[2507]: [TOTEM ] A new membership (84.246.192.121:15632) was formed. Members joined: 4
Mar 02 12:09:45 pve-1 corosync[2507]: notice [QUORUM] Members[6]: 5 4 1 2 3 7
Mar 02 12:09:45 pve-1 corosync[2507]: notice [MAIN ] Completed service synchronization, ready to provide service.
Mar 02 12:09:45 pve-1 corosync[2507]: [QUORUM] Members[6]: 5 4 1 2 3 7
Mar 02 12:09:45 pve-1 corosync[2507]: [MAIN ] Completed service synchronization, ready to provide service.



Quorum information
------------------
Date: Fri Mar 2 12:28:51 2018
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 0x00000005
Ring ID: 5/15632
Quorate: Yes

Votequorum information
----------------------
Expected votes: 7
Highest expected: 7
Total votes: 6
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000005 1 84.246.192.121 (local)
0x00000004 1 84.246.192.122
0x00000001 1 84.246.192.123
0x00000002 1 84.246.192.124
0x00000003 1 84.246.192.201
0x00000007 1 84.246.192.202


I've removed one of the broken nodes and reformatted it to try and get a clean upgrade. That worked, but adding it back into the cluster just hangs as well.
 
Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:52 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:53 pve-6 corosync[3199]: notice [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
Mar 02 12:17:53 pve-6 corosync[3199]: [TOTEM ] Retransmit List: 1354 1355 1356 1357 1358 1359 135a 135b 135c 135d 135e 135f 1360
That indicates a network problem. Are you sure nothing changed on the switch side? Maybe try to confirm that multicast is working with omping;
see https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network for detailed commands.

Also confirm that you have a multicast querier on the segment if you have enabled IGMP snooping; one way to check is sketched below.
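A quick, generic way to do that (a sketch, using the cluster-facing interface name from this thread): with a querier present you should see periodic IGMP membership queries arriving, typically one every couple of minutes.

# run on a PVE node; vlan603 is the cluster-facing OVS internal port in this setup
tcpdump -ni vlan603 igmp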
 
Multicast all seems to be working, and nothing has changed on the switch or router side of things in months.
 
Also worth noting that the VMs are alive but I can't ping the containers ... I'm wondering if this is an issue with Open vSwitch and vlans?
 
Multicast all seems to be working, and nothing has changed on the switch or router side of things in months.
Can you post the output of omping? Even if you did not change anything on the switch side, if it was only barely working until now and there is now additional load, that could explain this.

Also worth noting that the VMs are alive but I can't ping the containers ... I'm wondering if this is an issue with Open vSwitch and vlans?
Impossible to say without more information; what does your network setup look like?
 
Each Proxmox node is set up similarly to this:

auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

iface eno3 inet manual

iface eno4 inet manual

iface enp5s0f0 inet manual

iface enp5s0f1 inet manual

allow-vmbr0 bond0
iface bond0 inet manual
ovs_bonds enp5s0f0 enp5s0f1
ovs_type OVSBond
ovs_bridge vmbr0
ovs_options lacp=active bond_mode=balance-tcp

auto vmbr0
iface vmbr0 inet manual
ovs_type OVSBridge
ovs_ports bond0 vlan603 vlan850

allow-vmbr0 vlan603
iface vlan603 inet static
address 84.246.192.203
netmask 255.255.255.0
gateway 84.246.192.1
ovs_type OVSIntPort
ovs_bridge vmbr0
ovs_options tag=603

iface vlan603 inet6 static
address 2a01:2c0:a::203
netmask 64
gateway 2a01:2c0:a::1

allow-vmbr0 vlan850
iface vlan850 inet static
address 172.16.50.247
netmask 255.255.255.0
ovs_type OVSIntPort
ovs_bridge vmbr0
ovs_options tag=850


Ceph traffic happens on vlan850; the access network is on vlan603.

We have 2 x 10 Gig NICs which are bonded using LACP, and those switch ports are configured as trunk ports.

FYI, I've been in contact to try and buy a support contract, but no one has yet reached out to seal the deal.
 
Which NIC is your corosync (cluster) traffic on? Can you ping the other nodes on that network? If possible, try to separate corosync onto its own physical network (see the sketch below); once corosync stabilizes, your cluster should respond again. Is Ceph still in HEALTH_OK?
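For reference, separating the cluster traffic usually means giving corosync its own ring address on a dedicated NIC/VLAN. A minimal sketch of the relevant parts of /etc/pve/corosync.conf, assuming a made-up 10.10.10.0/24 network reserved for corosync (config_version has to be bumped on every edit so the change propagates):

# /etc/pve/corosync.conf (excerpt, illustrative only)
totem {
  version: 2
  cluster_name: mycluster    # example name
  config_version: 8          # increment on every change
  interface {
    ringnumber: 0
    bindnetaddr: 10.10.10.0  # dedicated corosync network (example)
  }
}

nodelist {
  node {
    name: pve-1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # per-node address on the dedicated network (example)
  }
  # ... one node {} block per cluster member ...
}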
 
Ceph health is OK ... corosync is on vlan603, which is basically where the Proxmox nodes live and talk to each other.

Starting VMs now results in a timeout (after the upgrade)

TASK ERROR: start failed: command '/usr/bin/kvm -id 6026 -chardev 'socket,id=qmp,path=/var/run/qemu-server/6026.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/6026.pid -daemonize -smbios 'type=1,uuid=1016b907-a101-4f2e-b80e-d2324eed933b' -name ras-1 -smp '2,sockets=1,cores=2,maxcpus=2' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/6026.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 2048 -object 'memory-backend-ram,id=ram-node0,size=2048M' -numa 'node,nodeid=0,cpus=0-1,memdev=ram-node0' -k en-gb -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:3f8b20c8b5e' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:rbd/vm-6026-disk-1:mon_host=172.16.50.11;172.16.50.12;172.16.50.13:auth_supported=cephx:id=admin:keyring=/etc/pve/priv/ceph/proxmox.keyring,if=none,id=drive-scsi0,cache=writeback,format=raw,aio=threads,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap6026i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=46:ED:97:3A:E0:4A,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'' failed: got timeout

They're running, but they're not pingable
 
The pmxcfs uses corosync, and if that is not working then you get those timeouts. If Ceph is OK, you can try to stop the pmxcfs service and start it manually in local mode with 'pmxcfs -l' (see the sketch below). You might at least be able to get the VMs on that host online, but this is only a workaround for emergency work.

Please also add the requested outputs from the above postings, e.g. omping.
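For completeness, the local-mode workaround typically looks like this (a sketch; only use it while the cluster is broken, and revert once corosync is healthy again):

# stop the clustered config filesystem and restart pmxcfs in local mode
systemctl stop pve-cluster
pmxcfs -l

# ... start the needed VMs on this host ...

# once corosync is fixed, return to normal clustered operation
killall pmxcfs
systemctl start pve-cluster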
 
omping -m 239.192.51.129 pve-1.fido.net pve-2.fido.net pve-3.fido.net
gives

pve-1.fido.net : unicast, xmt/rcv/%loss = 1087/1087/0%, min/avg/max/std-dev = 0.147/0.294/1.305/0.055
pve-1.fido.net : multicast, xmt/rcv/%loss = 1087/1087/0%, min/avg/max/std-dev = 0.172/0.323/1.303/0.059

whilst
root@pve-1:~# omping -c 10000 -i 0.001 -F -q pve-1.fido.net pve-2.fido.net pve-3.fido.net
pve-2.fido.net : waiting for response msg
pve-3.fido.net : waiting for response msg
pve-3.fido.net : joined (S,G) = (*, 232.43.211.234), pinging
pve-2.fido.net : joined (S,G) = (*, 232.43.211.234), pinging
pve-2.fido.net : given amount of query messages was sent
pve-3.fido.net : given amount of query messages was sent

pve-2.fido.net : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.061/0.226/3.264/0.084
pve-2.fido.net : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.076/0.254/3.243/0.088
pve-3.fido.net : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.052/0.182/2.562/0.076
pve-3.fido.net : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.062/0.212/3.229/0.080

across the board

Ceph is external, used as a secondary file system ... the primary storage is still local files, and a number of VMs are on the local LVM storage only.

If I can get the standalone VMs online (the ones which aren't on Ceph), then I get most of my core/critical systems back.
 
OK, using pmxcfs -l I have managed to bring up the core services again ...

The problem remains, however, that the cluster seems to be fubar ... despite having been working fine for more than a year :(
 
pve-2.fido.net : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.061/0.226/3.264/0.084
pve-2.fido.net : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.076/0.254/3.243/0.088
pve-3.fido.net : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.052/0.182/2.562/0.076
pve-3.fido.net : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.062/0.212/3.229/0.080
When corosync doesn't receive its token in time it starts retransmitting after 4 ms (default), and your max is 3.243 ms. So if you have a "little more" traffic than usual, your cluster will fall apart. Separate your corosync traffic onto its own physical network.

While not recommended, you can increase the time the token is valid, but this only moves the problem further out. See the corosync manpage for that; a sketch of the relevant setting follows below.
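If you do go down that route, the knob in question is the token timeout in the totem section of corosync.conf; a minimal sketch (the 10000 ms value is only an example, and on PVE you would edit /etc/pve/corosync.conf and bump config_version so the change is distributed):

# /etc/pve/corosync.conf (excerpt, illustrative only)
totem {
  ...
  config_version: 9   # increment on every edit
  token: 10000        # token timeout in milliseconds (example value)
}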
 
