JS1

New Member
Nov 21, 2018
Hello everybody,

Today my PVE host received a storage upgrade in the form of an additional RAID 1 array.
Before the upgrade, I had 2 x 240 GB SSDs as a hardware RAID 1 (Adaptec controller).
Another 2 x 480 GB SSDs were added to that controller as a second hardware RAID 1. Ever since the upgrade, my pve-cluster will not start.

Events:

* Before the downtime, I shut down the node using the web GUI and confirmed the power-off via IPMI.
* The data center provider physically added the disks and created another RAID 1 on the controller.
* The PVE node was booted up again.

After this, I noticed I couldn't access the machine via Web GUI.

* I logged on as root via IPMI/KVM and checked the situation. The network was fine, so I simply ran "reboot".
* After the reboot, the logon prompt came up and I was able to ping the host from the internet. So I tried to log in and saw these messages on the console (pve1.png).

pve1.png

* So I rebooted again, and got this:

pve2.png

* Then I ran fsck -n /dev/mapper/pve-root to see the errors:

pve3.png

* Then I ran the same command without "-n" to actually apply the repairs, which corrected all the errors. In the end, the filesystem was clean.

* I used the command "exit" to boot again and got the logon prompt.
During login I noticed some errors (they had most probably been there since the upgrade - I just hadn't noticed them):

pve4.png

* So I checked out systemctl status pve-cluster:

pve5.png

It seems my /var/lib/pve-cluster/config.db has a problem - although I am not really sure.
The node is standalone and I'm quite new to Proxmox.
Are my VMs still ok and is there any way to recover from this error?

* ls -l /var/lib/pve-cluster (notice the config.db.bak which I created manually after receiving these errors):

upload_2018-11-21_17-31-32.png

Any help is deeply appreciated!

Kind regards and thanks

JS1
 
Does the journal provide any hints to what went wrong (maybe there still is a problem with the filesystem)?

Otherwise: /var/lib/pve-cluster/config.db is an SQLite database (along with its -wal and -shm files). The -wal and -shm files should not be present when nothing is accessing the database (and pmxcfs should be the only process accessing it) - see https://pve.proxmox.com/pve-docs/chapter-pmxcfs.html and https://www.sqlite.org/wal.html .

Be sure to make backups before doing any changes!

What's also odd is that your config.db-wal is a socket (instead of a regular file).
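
For anyone who wants to check the file types without reading `ls -l` output by eye, here is a minimal Python sketch. It assumes python3 is available on the node and that the database lives at the path mentioned above; the helper names `describe` and `inspect_sidecars` are made up for illustration. It only reads metadata and changes nothing.

```python
import os
import stat
from pathlib import Path

# Path taken from this thread; adjust if your setup differs.
DB_PATH = Path("/var/lib/pve-cluster/config.db")


def describe(path: Path) -> str:
    """Return a short label for the file type at path (symlinks are not followed)."""
    try:
        mode = os.lstat(path).st_mode
    except FileNotFoundError:
        return "absent"
    if stat.S_ISREG(mode):
        return "regular file"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISLNK(mode):
        return "symlink"
    return "other"


def inspect_sidecars(db_path: Path) -> dict:
    """Map the database and its -wal/-shm companions to their file types."""
    candidates = [db_path, Path(str(db_path) + "-wal"), Path(str(db_path) + "-shm")]
    return {p.name: describe(p) for p in candidates}


if __name__ == "__main__":
    for name, kind in inspect_sidecars(DB_PATH).items():
        print(f"{name}: {kind}")
```

With pmxcfs stopped, one would expect config.db to show up as a regular file and the two companions as absent. Anything else (such as the socket seen here) is worth a closer look.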

Hope that helps with recovering!
 

Thank you very much for your input.

This is what the journal looked like:

upload_2018-11-21_18-8-13.png

upload_2018-11-21_18-8-21.png

I renamed the -wal and -shm files, and after that I could start pve-cluster and all the other services!
So the node is back up. It seems that somehow those two files were not deleted after the last shutdown.
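
For anyone finding this thread later, the rename step can be sketched roughly like this in Python. This is only an illustration under assumptions: the function name `move_sidecars_aside` and the timestamp suffix are made up, the path is the one from this thread, and it should only be run while pve-cluster is stopped (`systemctl stop pve-cluster`) and after making a copy of config.db. It renames the files rather than deleting them, so the step can be undone.

```python
import os
import time
from pathlib import Path

# Path taken from this thread; adjust if your setup differs.
DB_PATH = Path("/var/lib/pve-cluster/config.db")


def move_sidecars_aside(db_path: Path, suffix: str) -> list:
    """Rename <db>-wal and <db>-shm to <name>.<suffix> and return the new paths.

    The database file itself is never touched.
    """
    moved = []
    for ext in ("-wal", "-shm"):
        src = Path(str(db_path) + ext)
        # lexists is also true for sockets and dangling symlinks.
        if not os.path.lexists(src):
            continue
        dst = src.with_name(f"{src.name}.{suffix}")
        if os.path.lexists(dst):
            raise FileExistsError(f"refusing to overwrite {dst}")
        os.rename(src, dst)
        moved.append(dst)
    return moved


if __name__ == "__main__":
    # Hypothetical suffix: a timestamp, so repeated runs do not collide.
    stamp = time.strftime("stale-%Y%m%d-%H%M%S")
    renamed = move_sidecars_aside(DB_PATH, stamp)
    if renamed:
        for path in renamed:
            print(f"moved aside: {path}")
    else:
        print("nothing to move")
```

Afterwards, `systemctl start pve-cluster` and `systemctl status pve-cluster` show whether the service comes up again.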

Any idea how this could have happened?


Kind regards and thanks again for the great support,
JS1
 
Glad the problem was solved!

* Given the corrupted filesystem and the fsck run needed to repair it, those 2 files may themselves have been corrupted.
* They could also have been left behind if the host was not shut down cleanly. pmxcfs only stops after all guests have shut down, which has quite a long timeout. Maybe the on-site people started the host to check that the RAID was indeed there, and then powered it off without a proper shutdown, possibly while it was still booting?

In any case - it sounds like a good time to make sure that the sqlite db is part of your backup (along with the disk-images, /etc, and whatever else you need).
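
One possible way to get a consistent copy of the database is SQLite's online backup API, which Python exposes as `sqlite3.Connection.backup` (Python 3.7+). The sketch below is a suggestion, not an official Proxmox procedure: the function name `backup_sqlite` and the destination path are hypothetical, and whether you want to run it while pmxcfs is active is your call (if unsure, do it while pve-cluster is stopped). It also runs `PRAGMA integrity_check` on the copy as a basic sanity test.

```python
import sqlite3
from pathlib import Path

# Source path taken from this thread; the destination is a made-up example.
DB_PATH = Path("/var/lib/pve-cluster/config.db")
BACKUP_PATH = Path("/root/config.db.backup")


def backup_sqlite(src: Path, dst: Path) -> str:
    """Copy src to dst via SQLite's backup API.

    Returns the first row of PRAGMA integrity_check on the copy,
    which is "ok" when SQLite finds no problems.
    """
    # Open the source read-only via a URI so a wrong path cannot
    # silently create a new, empty database.
    source = sqlite3.connect(src.resolve().as_uri() + "?mode=ro", uri=True)
    target = sqlite3.connect(dst)
    try:
        source.backup(target)
        result = target.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        target.close()
        source.close()
    return result


if __name__ == "__main__":
    if DB_PATH.exists():
        print("integrity_check on copy:", backup_sqlite(DB_PATH, BACKUP_PATH))
    else:
        print(f"{DB_PATH} not found, nothing to back up")
```

The resulting file is a plain SQLite database, so it can be picked up by whatever tool already backs up /etc and the disk images.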
 
