Proxmox VE 5.1-41 Cluster Totally Crashed

Benoit G

New Member
Jan 7, 2018
Hi guys,

I have been working with Proxmox VE 5.1-41 for a couple of days.
This is the second time I have encountered this issue, and I'm wondering if anybody has heard about it.

It's a 3-node cluster, classic setup.
Installation works well, the network works well, corosync and everything else works like a charm.

I installed a few VMs to do some testing, everything worked great, and out of nowhere the cluster started crashing.
The crash causes this:

- PVE binaries are deleted :

Code:
ls /usr/bin/|grep pve
pvecm

- PVE systemd files are removed
Code:
ls  /lib/systemd/system|grep pve
pve-cluster.service
pve-container@.service
pve-firewall.service
pvefw-logger.service
pve-ha-crm.service
pve-ha-lrm.service
system-pve\x2dcontainer.slice

- PVE scripts in /usr/share/perl5/PVE/ are removed
10 files have been deleted

Code:
> API2.pm
> API2Tools.pm
> APLInfo.pm
> AutoBalloon.pm
> CephTools.pm
> HTTPServer.pm
> pvecfg.pm
> Report.pm
> Status
> VZDump.pm

- PVE /usr/share/pve-manager/ directory is fully removed
This directory does not exist anymore


So this is terrible, because it happens to ALL the nodes in the cluster.
I can't explain how PVE decides to kill itself and delete all those files and folders without human interaction.
It does not look like it has been hacked twice.
Nobody updated or upgraded the system, or removed anything, in the meantime.

Have you ever heard about this? :(
Once, I could tell myself that yes, I did a reboot that messed everything up, but twice, no way; the cluster was working great.

Let me know if I can provide more logs.

I was able to fix it by copying the files, folders and binaries from another working cluster and rebooting, and everything is back, but this is not OK for production unless we can find out what really happened.

Thanks a lot for your help.

Regards,
Ben
 
Hi,

Do you have unattended upgrades configured, or something like that?
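
One quick way to check this is to look at the effective APT configuration. The sketch below assumes the usual Debian setup, where the `APT::Periodic::Unattended-Upgrade` option controls this; the function name `unattended_enabled` is made up for the example.

```shell
# Reads `apt-config dump` output on stdin.
# Succeeds (exit 0) if APT::Periodic::Unattended-Upgrade is set to anything but "0".
unattended_enabled() {
    awk -F'"' '
        /^APT::Periodic::Unattended-Upgrade / { v = $2 }
        END { exit !(v != "" && v != "0") }
    '
}
```

Usage would be something like: apt-config dump | unattended_enabled && echo "unattended upgrades are enabled"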
 
What you describe looks like an unclean uninstallation of proxmox-ve,
which can happen if you install conflicting Debian packages.

But I have never heard of it in combination with a node crash.
 
I'm surprised, because I really did not update or remove anything, and that's what worries me.
And this happened twice. I wish I could explain what happened.

Thanks for your answers. I will keep following this thread in case this happens to someone else.
 
Interesting. I agree it looks like some software is installed and it removes PVE packages. Maybe you could try one more time but this time make a "snapshot/backup" of the /etc/ directory and the full list of installed packages and versions with "dpkg -l". If it happens again make another 'snapshot' of the /etc directory and packages list and compare them with the 'working copy'.
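
A rough sketch of what that could look like. The function names, the `dpkg-l.txt` and `etc.tar.gz` file names, and the use of tar for the /etc copy are my own assumptions, not something PVE provides.

```shell
#!/bin/sh
# Sketch only: function names and file layout are made up for illustration.

# Save the package list and a copy of /etc into the directory given as $1.
snapshot_state() {
    mkdir -p "$1" || return 1
    dpkg -l > "$1/dpkg-l.txt"             # installed packages and versions
    tar -czf "$1/etc.tar.gz" -C / etc     # archive of /etc (run as root)
}

# Print packages that were installed ("ii") in snapshot $1 but not in snapshot $2.
removed_packages() {
    old_list=$(mktemp) || return 1
    new_list=$(mktemp) || return 1
    awk '$1 == "ii" { print $2 }' "$1/dpkg-l.txt" | sort > "$old_list"
    awk '$1 == "ii" { print $2 }' "$2/dpkg-l.txt" | sort > "$new_list"
    comm -23 "$old_list" "$new_list"
    rm -f "$old_list" "$new_list"
}
```

Run snapshot_state once on the working system (for example into /root/snap-good, a hypothetical path), run it again into a second directory after the problem reappears, then call removed_packages with the two directories to see which packages disappeared.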
 

Good idea, I will take a snapshot and compare if that happens again! Thanks.
 
Maybe you can add those to the snapshots too:
/var/lib/dpkg
/var/log/dpkg.log
/var/log/apt/history.log
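
Those two logs can also be searched directly for removals. A minimal sketch, assuming the standard Debian log formats (the third field of a dpkg.log line is the action; apt's history.log uses `Remove:` and `Purge:` lines); the function names are made up.

```shell
# Print dpkg.log entries whose action is "remove" or "purge".
dpkg_removals() {
    awk '$3 == "remove" || $3 == "purge"' "$1"
}

# Print the start date, command line and any Remove/Purge lines
# of every apt transaction recorded in history.log.
apt_removals() {
    grep -E '^(Start-Date|Commandline|Remove|Purge):' "$1"
}
```

For example: dpkg_removals /var/log/dpkg.log and apt_removals /var/log/apt/history.log. The Commandline entry just before a Remove line should show what triggered the removal.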

You can also go one step further and install and configure the "tripwire" package to report to you:
Tripwire is a tool that aids system administrators and users in
monitoring a designated set of files for any changes. Used with
system files on a regular (e.g., daily) basis, Tripwire can notify
system administrators of corrupted or tampered files, so damage
control measures can be taken in a timely manner.
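
If setting up Tripwire is too much for a test cluster, a plain checksum baseline gives a much cruder version of the same idea. This is not Tripwire, only a stand-in using GNU coreutils `sha256sum`; the function names and the baseline path are hypothetical.

```shell
# Record a checksum for every regular file under directory $1 into baseline file $2.
make_baseline() {
    find "$1" -type f -exec sha256sum {} + > "$2"
}

# Verify baseline file $1; prints only failures and exits non-zero
# if any recorded file is missing or has changed.
check_baseline() {
    sha256sum -c --quiet "$1"
}
```

For example, make_baseline /usr/share/perl5/PVE /root/pve-perl.sha256 on the working node, then run check_baseline /root/pve-perl.sha256 from cron and alert on a non-zero exit. Note that this only detects missing or modified files, not newly added ones.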