8.1 Update Hung/Failed

Chris · Dec 7, 2023

Whatever said:
This didn't help.
What I have done:
1. Fixed a line in code (removed pvescheduler restart), thanks to advice from previous page
2. Fixed corosync conf desynchronization (by copying new version of corosync.conf to buggy node)
3. Reconfigured by dpkg
4. Restarted daemons or rebooted

Thanks for sharing your workaround. I would however like to find the root cause of the issue so this can be avoided in the future.

Regarding (1.): Yes, this will work: by removing the restart of pvescheduler, you work around the hanging part of the postinstall script. Hence my question: were you able to successfully restart pvescheduler.service afterwards?

Regarding (2.): This might be the interesting part, as you stated that the cluster was quorate before you started the upgrade process. Could you share the systemd journal during the upgrade, maybe there is something to learn from there? journalctl --since <DATETIME> --until <DATETIME> > journal.txt dumps the logs into a file.

Regarding (3.): So once you had quorum again, for you it seems everything works again as expected. For @slekkus this seems to not be the case.
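As a rough sketch of the journal dump requested for (2.), the command can be wrapped in a small helper. The function name dump_journal and the timestamps in the usage line are placeholders for illustration, not values from this thread; substitute the actual start and end of your upgrade.

```shell
#!/bin/sh
# Sketch: write the systemd journal for a given time window to a file.
# dump_journal is a hypothetical helper name, not a Proxmox tool.
dump_journal() {
    start=$1                 # e.g. "2023-12-06 19:00:00" (placeholder)
    end=$2                   # e.g. "2023-12-06 21:00:00" (placeholder)
    out=${3:-journal.txt}    # output file, defaults to journal.txt
    journalctl --since "$start" --until "$end" --no-pager > "$out"
}

# Usage (placeholder timestamps):
#   dump_journal "2023-12-06 19:00:00" "2023-12-06 21:00:00" journal.txt
```

Quoting the timestamps matters because they contain a space; without the quotes journalctl would see the time as a separate argument.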

Chris · Dec 7, 2023

slekkus said:
I have no pvescheduler.pid.lock file in var/run
systemctl daemon-reload works fine, thereafter pvesched start still hangs.
There's no /etc/pve/jobs.xcfg and locks folder is empty.

Tried pvecm delnode but no success. I am going to reinstall everything, this is too time-consuming.

I understand if this debugging session takes up too much time and you would rather fix it with the workaround suggested above. Before that, could you verify once again that you truly have write access to /etc/pve on this node, e.g. with touch /etc/pve/priv/lock/test and ls -l /etc/pve/priv/lock/? Also, please share the systemd journal since boot, as this might give further insights: journalctl -b > journal.txt
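A minimal sketch of that write test, assuming a POSIX shell on the node. The helper name check_write is made up for this example; it creates a probe file, removes it again, and reports the result.

```shell
#!/bin/sh
# Sketch: test whether a directory accepts new files.
# check_write is a hypothetical helper name, not a Proxmox tool.
check_write() {
    dir=$1
    probe="$dir/write-test.$$"    # $$ = shell PID, avoids clobbering real files
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "writable: $dir"
    else
        echo "NOT writable: $dir"
        return 1
    fi
}

# Usage on the affected node:
#   check_write /etc/pve/priv/lock
#   pvecm status    # shows, among other things, whether the node is quorate
```

If the directory is reported as not writable even for root, checking the quorum state with pvecm status is the natural next step, since /etc/pve becomes read-only on a node without quorum.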

slekkus · Dec 7, 2023

Chris said:
I understand if this debugging session takes up too much time and you would rather fix it with the workaround suggested above. Before that, could you verify once again that you truly have write access to /etc/pve on this node, e.g. with touch /etc/pve/priv/lock/test and ls -l /etc/pve/priv/lock/? Also, please share the systemd journal since boot, as this might give further insights: journalctl -b > journal.txt

Cannot write in that lock folder it seems.


root@mosh:~# touch /etc/pve/priv/lock/test
touch: cannot touch '/etc/pve/priv/lock/test': Permission denied
root@mosh:~# ls -l  /etc/pve/priv/
total 4
dr-x------ 2 root www-data    0 Sep 13 13:50 acme
-r-------- 1 root www-data 1675 Dec  6 19:40 authkey.key
-r-------- 1 root www-data 2152 Dec  6 19:49 authorized_keys
-r-------- 1 root www-data    0 Nov  5 12:07 authorized_keys.tmp.1411
dr-x------ 2 root www-data    0 Sep 20 17:36 ceph
-r-------- 1 root www-data  151 Sep 18 20:02 ceph.client.admin.keyring
-r-------- 1 root www-data  228 Sep 18 20:02 ceph.mon.keyring
-r-------- 1 root www-data 4500 Dec  6 19:49 known_hosts
dr-x------ 2 root www-data    0 Sep 13 13:50 lock
-r-------- 1 root www-data 3272 Sep 13 13:50 pve-root-ca.key
-r-------- 1 root www-data    3 Nov  5 16:06 pve-root-ca.srl

Attached journal

Chris · Dec 11, 2023

slekkus said:
Cannot write in that lock folder it seems.

root@mosh:~# touch /etc/pve/priv/lock/test
touch: cannot touch '/etc/pve/priv/lock/test': Permission denied
root@mosh:~# ls -l  /etc/pve/priv/
total 4
dr-x------ 2 root www-data    0 Sep 13 13:50 acme
-r-------- 1 root www-data 1675 Dec  6 19:40 authkey.key
-r-------- 1 root www-data 2152 Dec  6 19:49 authorized_keys
-r-------- 1 root www-data    0 Nov  5 12:07 authorized_keys.tmp.1411
dr-x------ 2 root www-data    0 Sep 20 17:36 ceph
-r-------- 1 root www-data  151 Sep 18 20:02 ceph.client.admin.keyring
-r-------- 1 root www-data  228 Sep 18 20:02 ceph.mon.keyring
-r-------- 1 root www-data 4500 Dec  6 19:49 known_hosts
dr-x------ 2 root www-data    0 Sep 13 13:50 lock
-r-------- 1 root www-data 3272 Sep 13 13:50 pve-root-ca.key
-r-------- 1 root www-data    3 Nov  5 16:06 pve-root-ca.srl

Attached journal

Well, according to the logs you rebooted this node, and it is currently not part of the quorate partition of the cluster. You will have to check your corosync network connectivity and make sure that the cluster is healthy. After your initial issues with the quorum, I was under the assumption that your cluster was fine, as you showed in one of your posts; this, however, is not the case.

Did you make sure that the node was quorate before doing any of the above debugging? Because I suspect that all your current issues boil down to the node not being quorate and therefore the services not being able to acquire locks on the proxmox cluster filesystem.

Double check the /etc/corosync/corosync.conf and /etc/pve/corosync.conf for each of the nodes (see also [0] for details), and make sure that the nodes can reach each other via the cluster network. Also check the other nodes for errors.
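One possible way to do this comparison on each node is sketched below. The helper names conf_version and compare_corosync are invented for this example; the sketch assumes the usual config_version: N line in the totem section of corosync.conf.

```shell
#!/bin/sh
# Sketch: compare the local corosync.conf with the copy on the cluster filesystem.
# conf_version and compare_corosync are hypothetical helper names.

# Print the config_version number found in a corosync.conf file.
conf_version() {
    sed -n 's/^[[:space:]]*config_version:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1"
}

# Report both versions and whether the two files are byte-identical.
compare_corosync() {
    local_conf=${1:-/etc/corosync/corosync.conf}
    pve_conf=${2:-/etc/pve/corosync.conf}
    echo "$local_conf config_version: $(conf_version "$local_conf")"
    echo "$pve_conf config_version: $(conf_version "$pve_conf")"
    if cmp -s "$local_conf" "$pve_conf"; then
        echo "identical"
    else
        echo "differ"
        return 1
    fi
}

# Usage on each node:
#   compare_corosync
#   diff -u /etc/corosync/corosync.conf /etc/pve/corosync.conf   # show the differences
#   corosync-cfgtool -s                                          # link status of this node
```

Running it on every node and comparing the reported config_version values should show quickly whether one node is still on an older configuration.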

Further, please share your systemd journal since the time of the upgrade to 8.1; maybe that tells the whole story of what went wrong at which point.
You can generate this e.g. with journalctl --since -2weeks > journal.txt, which dumps the journal for the last 2 weeks.

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_edit_corosync_conf

Chris

Proxmox Staff Member

slekkus

New Member