8.1 Update Hung/Failed

This didn't help.
What I have done:
1. Fixed a line in code (removed the pvescheduler restart), thanks to advice from the previous page
2. Fixed the corosync.conf desynchronization (by copying the new version of corosync.conf to the buggy node)
3. Reconfigured with dpkg
4. Restarted daemons or rebooted
Thanks for sharing your workaround. I would however like to find the root cause of the issue so this can be avoided in the future.

Regarding (1.): Yes, this will work: by removing the restart of pvescheduler, you work around the hanging part of the postinstall script. Hence my question: were you able to successfully restart pvescheduler.service afterwards?

Regarding (2.): This might be the interesting part, as you stated that the cluster was quorate before you started the upgrade process. Could you share the systemd journal during the upgrade, maybe there is something to learn from there? journalctl --since <DATETIME> --until <DATETIME> > journal.txt dumps the logs into a file.

Regarding (3.): So once you had quorum again, for you it seems everything works again as expected. For @slekkus this seems to not be the case.
 
I have no pvescheduler.pid.lock file in var/run
systemctl daemon-reload works fine, thereafter pvesched start still hangs.
There's no /etc/pve/jobs.xcfg and locks folder is empty.

Tried pvecm delnode but no success. I am going to reinstall everything; this is too time-consuming.
I understand if this debugging session takes up too much time and you would rather fix this with the workaround suggested above. Before that, could you verify once again that you truly have write access to /etc/pve on this node, e.g. with touch /etc/pve/priv/lock/test and ls -l /etc/pve/priv/lock/? Also, please share the systemd journal since boot, as this might give further insight: journalctl -b > journal.txt
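If you want to repeat the write check on several nodes, it could be wrapped in a small helper along these lines. This is only a sketch: can_write is a hypothetical name, and the probe file name is arbitrary. It does the same thing as the touch above, but removes the probe file again.

```shell
# Hypothetical helper: succeed if a file can be created in the given directory.
# The probe file is removed again, so nothing is left behind.
can_write() {
    probe="$1/write-probe.$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    return 1
}
```

For example, can_write /etc/pve/priv/lock && echo writable || echo "not writable" would report the result for the lock directory.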
 

Cannot write in that lock folder it seems.

root@mosh:~# touch /etc/pve/priv/lock/test
touch: cannot touch '/etc/pve/priv/lock/test': Permission denied
root@mosh:~# ls -l /etc/pve/priv/
total 4
dr-x------ 2 root www-data    0 Sep 13 13:50 acme
-r-------- 1 root www-data 1675 Dec  6 19:40 authkey.key
-r-------- 1 root www-data 2152 Dec  6 19:49 authorized_keys
-r-------- 1 root www-data    0 Nov  5 12:07 authorized_keys.tmp.1411
dr-x------ 2 root www-data    0 Sep 20 17:36 ceph
-r-------- 1 root www-data  151 Sep 18 20:02 ceph.client.admin.keyring
-r-------- 1 root www-data  228 Sep 18 20:02 ceph.mon.keyring
-r-------- 1 root www-data 4500 Dec  6 19:49 known_hosts
dr-x------ 2 root www-data    0 Sep 13 13:50 lock
-r-------- 1 root www-data 3272 Sep 13 13:50 pve-root-ca.key
-r-------- 1 root www-data    3 Nov  5 16:06 pve-root-ca.srl

Attached journal
 

Well, according to the logs, you rebooted this node and it is currently not in the quorate partition of the cluster. You will have to check your corosync network connectivity and make sure that the cluster is healthy. After your initial quorum issues, I was under the assumption that your cluster was fine, as you showed in one of your posts; this, however, is not the case.

Did you make sure that the node was quorate before doing any of the above debugging? Because I suspect that all your current issues boil down to the node not being quorate and therefore the services not being able to acquire locks on the proxmox cluster filesystem.
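A quick way to script the quorum check before any further debugging might look like the sketch below. It assumes that the status output of pvecm status contains a line of the form "Quorate: Yes" when the node is quorate; is_quorate is a hypothetical helper name, and the pattern may need adjusting if your output looks different.

```shell
# Hypothetical helper: reads cluster status text on stdin and succeeds
# only if it contains a "Quorate:" line whose value is "Yes".
# Assumption: the status output uses this "Quorate: Yes/No" line format.
is_quorate() {
    grep -Eq '^[[:space:]]*Quorate:[[:space:]]*Yes'
}
```

Usage would be something like pvecm status | is_quorate && echo "node is quorate".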

Double check the /etc/corosync/corosync.conf and /etc/pve/corosync.conf for each of the nodes (see also [0] for details), and make sure that the nodes can reach each other via the cluster network. Also check the other nodes for errors.
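As a first pass for comparing the two files, the config_version value from each can be extracted and compared. The following is a minimal sketch, not a full comparison: conf_version and same_conf_version are hypothetical helper names, and matching versions do not guarantee that the rest of the files is identical, so a full diff is still worthwhile.

```shell
# Print the first config_version value found in a corosync.conf file.
conf_version() {
    sed -n 's/^[[:space:]]*config_version:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Succeed only if both files are readable and carry the same,
# non-empty config_version.
same_conf_version() {
    a=$(conf_version "$1" 2>/dev/null)
    b=$(conf_version "$2" 2>/dev/null)
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

On each node, same_conf_version /etc/corosync/corosync.conf /etc/pve/corosync.conf || echo "config_version differs or is missing" would flag a mismatch; the printed versions can then be compared across nodes by hand.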

Further, please share your systemd journal since the time of the upgrade to 8.1; maybe that tells the whole story of what went wrong at which point.
You can generate this with e.g. journalctl --since -2weeks > journal.txt, which dumps the journal for the last 2 weeks.

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_edit_corosync_conf
 
