Small Issue Leading into Something More

emilhozan

Member
Aug 27, 2019
Hey all,

My predicament started with emails that had:
- subject: Cron <root@HOSTNAME> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
- body: /usr/bin/mandb: can't set the locale; make sure $LC_* and $LANG are correct

I have 9 nodes in a cluster and only 3 of them were reporting this issue.

When I started troubleshooting, I picked one node and worked through many suggestions from online searches, none of which helped. I tried many variations of the keywords; the issue is obviously with the locale, but nothing I found fixed it.
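
For anyone hitting the same mandb message, here is a minimal sketch of one way to check whether the locale named in the environment is actually generated on a node. The function names `normalize_locale` and `check_locale` are made up for illustration. The sketch assumes glibc naming, where `en_US.UTF-8` shows up in `locale -a` as `en_US.utf8`.

```shell
# normalize_locale NAME
# Lowercases the codeset and drops dashes: en_US.UTF-8 -> en_US.utf8
normalize_locale() {
    name=$1
    case $name in
        *.*)
            lang=${name%%.*}
            codeset=${name#*.}
            codeset=$(printf '%s' "$codeset" | tr 'A-Z' 'a-z' | sed 's/-//g')
            printf '%s.%s\n' "$lang" "$codeset"
            ;;
        *)
            printf '%s\n' "$name"
            ;;
    esac
}

# check_locale NAME [AVAILABLE_LIST]
# Succeeds if NAME is among the generated locales.
# AVAILABLE_LIST defaults to the output of `locale -a`.
check_locale() {
    wanted=$(normalize_locale "$1")
    available=${2:-$(locale -a 2>/dev/null)}
    printf '%s\n' "$available" | while IFS= read -r entry; do
        normalize_locale "$entry"
    done | grep -Fxq -- "$wanted"
}
```

Running something like `check_locale "$LANG" || echo "LANG=$LANG is not generated"` on a working node and on a failing node should show whether they differ. On Debian, missing locales are normally enabled in /etc/locale.gen and generated with `locale-gen`.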

I then decided to try and update the system:
- apt-get update; this worked fine
- apt-get dist-upgrade; this is where my issues started

After running this, I was prompted to run "apt --fix-broken install", which I did. It completed, and then I ran apt-get dist-upgrade again. I ran into another issue, so I double-checked PVE 5.x's official repo per this link: https://pve.proxmox.com/wiki/Package_Repositories#_proxmox_ve_5_x_repositories

This:
1581407355814.png
And this:
1581407382234.png

The two use different Debian versions, and I decided to switch from one to the other. Note that all 9 nodes had the same /etc/apt/sources.list; I changed it on only 3. I ran apt-get update and apt-get dist-upgrade again and seemed to be making some progress. The update was taking forever on the first node, which is why I went ahead with the other two (a mistake, I see that now; hindsight is 20/20).
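
A quick way to see which Debian suites a node's APT configuration actually points at is sketched below. The helper name `list_apt_suites` is hypothetical. It assumes the classic one-line `deb ...` format and ignores commented-out entries.

```shell
# list_apt_suites FILE...
# Prints each distinct suite named on active "deb" / "deb-src" lines.
list_apt_suites() {
    awk '$1 == "deb" || $1 == "deb-src" {
        i = 2
        # Skip an optional [option=value ...] block after the type.
        if ($i ~ /^\[/) {
            while (i <= NF && $i !~ /\]$/) i++
            i++
        }
        # Field i is now the URI; the suite follows it.
        print $(i + 1)
    }' "$@" | LC_ALL=C sort -u
}
```

For example, `list_apt_suites /etc/apt/sources.list /etc/apt/sources.list.d/*.list` run on each node. PVE 5.x is based on Debian stretch and PVE 6.x on buster, so a node listing buster (or a mix of both) is no longer on the same release as the rest of the cluster.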

To sum up my current issue: these three nodes are reported as offline, but they are accessible and respond to network commands. I've searched many resources on this but cannot find a solution that works.

One thing that may be important, and the reason I explained all of the above: after dist-upgrade, I was prompted with a message. I don't recall the exact wording, but it was about a file with differences and offered a few options: keep the current file (which is what I chose, since the file may have been configuration-related), view the differences (there were only a few changed lines, none of which stood out as harmful, but I still opted to retain the original file), and a few others.

After dist-upgrade, I rebooted the servers for good measure and noticed they took a while to come back up. I then tested a ping, saw that it worked, and slowly got to where I am now.

Can anyone help shed some light?
 
Oh, also, I am fairly certain this issue has something to do with corosync. An error from syslog:
Feb 10 23:55:49 t1n4 pveproxy[1715]: Cluster not quorate - extending auth key lifetime!
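
The "not quorate" message can be confirmed per node with `pvecm status`. Below is a small sketch that reads that output and reports whether the node sees quorum. The helper name `is_quorate` is made up, and it assumes the status text contains a `Quorate:` line followed by `Yes` or `No`.

```shell
# is_quorate
# Reads `pvecm status` style text on stdin.
# Succeeds only if a "Quorate:" line says "Yes".
is_quorate() {
    awk '$1 == "Quorate:" { found = ($2 == "Yes") }
         END { exit (found ? 0 : 1) }'
}
```

Usage would be along the lines of `pvecm status | is_quorate || echo "this node has no quorum"`. With 9 nodes at one vote each (the default, assumed here), quorum needs 5 votes, so 3 nodes that cannot talk to the other 6 would be non-quorate on their own.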
 
Not sure I understand the situation completely - but it sounds like you had a 9-node cluster running PVE 5/stretch,
and then tried to update/updated 3 nodes to PVE 6/buster?

Major version upgrades (PVE 5 -> PVE 6) need a bit of preparation, and in that case corosync also made a major version jump.

Check out the wiki-entry for the version-upgrade:
https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0

For the current situation - compare the output of `pveversion -v` on all hosts - that should help identify what is happening and what to do from there.
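
One possible way to do that comparison is sketched here. It assumes the output of `pveversion -v` from each host has been saved to its own file, with one `package: version` entry per line. The helper name `package_versions` and the file names in the usage note are hypothetical.

```shell
# package_versions PACKAGE FILE...
# For each saved `pveversion -v` output file, prints "FILE VERSION".
# Prints "missing" when the package is not listed in a file.
package_versions() {
    pkg=$1
    shift
    for f in "$@"; do
        ver=$(awk -v p="$pkg:" '$1 == p { print $2; exit }' "$f")
        printf '%s %s\n' "$f" "${ver:-missing}"
    done
}
```

For example, after running `pveversion -v > "$(hostname).txt"` on every node and copying the files to one place, `package_versions corosync *.txt` and `package_versions pve-manager *.txt` would list the versions side by side. Nodes showing corosync 3.x next to others on 2.x would point to the partial PVE 5 to PVE 6 upgrade described above.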

I hope this helps!