Hello everybody,
today my PVE host received an additional RAID1 storage upgrade.
Before the update, I had 2 x 240 GB SSD as Hardware-RAID1 (Adaptec controller).
To that controller, another 2 x 480 GB SSD were added as a second Hardware-RAID1. Ever since the upgrade, pve-cluster will not start.
Events:
* Before the downtime, I shut down the node using the web GUI - I confirmed the power-off via IPMI.
* The data center provider physically added the disks and created another RAID 1 array on the controller.
* PVE node was booted up again.
After this, I noticed I couldn't access the machine via Web GUI.
* I logged on as root via IPMI/KVM and assessed the situation. The network was fine, so I simply ran "reboot".
* After the reboot, the logon prompt came up and I was able to ping the host from the internet. So I tried to log in and saw these messages on the console (pve1.png).
* So I rebooted again, and got this:
* Then I ran fsck -n /dev/mapper/pve-root to see the errors:
* Then I ran the same command without "-n" to actually apply the changes, and it corrected all the errors. In the end, the filesystem was clean.
* I used the command "exit" to boot again and got the logon prompt.
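For anyone wanting to try the same dry-run-then-repair fsck pattern safely first, it can be rehearsed on a throwaway ext4 image file instead of a real volume (the image path and size below are made up for illustration; never run the repair pass on a mounted filesystem):

```shell
# Demo of the fsck workflow on a disposable ext4 image file,
# standing in for /dev/mapper/pve-root.
export PATH="$PATH:/sbin:/usr/sbin"

dd if=/dev/zero of=/tmp/fsck-demo.img bs=1M count=8 status=none
mkfs.ext4 -q /tmp/fsck-demo.img

fsck.ext4 -n /tmp/fsck-demo.img   # dry run: report problems, change nothing
fsck.ext4 -p /tmp/fsck-demo.img   # repair pass (like fsck without -n on pve-root)
```

On a healthy image both passes report the filesystem as clean.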
During login I noticed some errors (which were most probably present since the upgrade - I just hadn't noticed them):
* So I checked out systemctl status pve-cluster:
It seems my /var/lib/pve-cluster/config.db has a problem - although I am not really sure.
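Since config.db is just an SQLite database, its integrity can presumably be checked (with pve-cluster stopped) using SQLite's built-in PRAGMA. A minimal sketch of that check - run here against a scratch database path rather than the real /var/lib/pve-cluster/config.db:

```python
# Sketch: run SQLite's integrity check the way one might against
# /var/lib/pve-cluster/config.db (scratch path used here for safety).
import sqlite3

def integrity_check(db_path):
    """Return the result rows of PRAGMA integrity_check ('ok' means intact)."""
    con = sqlite3.connect(db_path)
    try:
        return [row[0] for row in con.execute("PRAGMA integrity_check;")]
    finally:
        con.close()

# Demo on a freshly created scratch database:
result = integrity_check("/tmp/scratch.db")
print(result)  # a healthy database reports ['ok']
```

Anything other than `ok` in the result would point at real corruption in the database file rather than a service-level problem.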
The node is standalone and I'm quite new to Proxmox.
Are my VMs still ok and is there any way to recover from this error?
* ls -l /var/lib/pve-cluster (notice the config.db.bak which I created manually after receiving these errors):
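For the record, the backup was a plain file copy. If restoring it turns out to be the right fix, I assume the sequence would look roughly like the sketch below - it uses a scratch directory so it is safe to run as-is; on the real host the directory would be /var/lib/pve-cluster and pve-cluster would have to be stopped first:

```shell
# Hypothetical restore sketch using a scratch directory instead of the
# real /var/lib/pve-cluster (on a real host: systemctl stop pve-cluster first).
DIR=/tmp/pve-cluster-demo
mkdir -p "$DIR"
echo "old-good-config" > "$DIR/config.db.bak"   # stand-in for my manual backup
echo "corrupt-config"  > "$DIR/config.db"       # stand-in for the damaged file

cp -a "$DIR/config.db.bak" "$DIR/config.db"     # restore the backup over the bad file
cat "$DIR/config.db"                            # now matches the backup again
```

Whether that is actually the correct recovery path for pve-cluster is exactly what I am hoping someone can confirm.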
Any help is deeply appreciated!
Kind regards and thanks
JS1