Broke my config.db due to hostname change

Archmatux · Nov 15, 2016

Hi All,

I have been migrating all of my servers to new subnets and have been changing the hostnames to more appropriate names as I go along.

The very last server was my Proxmox hypervisor which is where it's all gone wrong, mostly due to my own stupidity.

I attempted to follow: https://pve.proxmox.com/wiki/Renaming_a_PVE_node
My server is standalone and is not part of a cluster. However, I did leave guests (VMs and containers) on the server. (Yes, I know I'm an idiot.)

When I attempted to move the configuration files I got the error:
mv: cannot move ‘/etc/pve/nodes/sauron/lxc’ to ‘/etc/pve/nodes/hpv-01/lxc’: Directory not empty
mv: cannot move ‘/etc/pve/nodes/sauron/qemu-server’ to ‘/etc/pve/nodes/hpv-01/qemu-server’: Directory not empty

After this /etc/pve became unmounted.

I attempted to restart the cluster service:

~# systemctl status pve-cluster

● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: failed (Result: exit-code) since Tue 2016-11-15 18:42:56 GMT; 8s ago
Process: 28155 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
Process: 30032 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=255)
Main PID: 28153 (code=killed, signal=SEGV)

Nov 15 18:42:56 hpv-01 pmxcfs[30032]: [database] crit: found entry with duplicate name (inode = 00000000011C65D7, parent = 00000000011C653C, name = 'lxc')
Nov 15 18:42:56 hpv-01 pmxcfs[30032]: [database] crit: DB load failed
Nov 15 18:42:56 hpv-01 pmxcfs[30032]: [database] crit: found entry with duplicate name (inode = 00000000011C65D7, parent = 00000000011C653C, name = 'lxc')
Nov 15 18:42:56 hpv-01 pmxcfs[30032]: [database] crit: DB load failed
Nov 15 18:42:56 hpv-01 pmxcfs[30032]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Nov 15 18:42:56 hpv-01 pmxcfs[30032]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Nov 15 18:42:56 hpv-01 pmxcfs[30032]: [main] notice: exit proxmox configuration filesystem (-1)
Nov 15 18:42:56 hpv-01 systemd[1]: pve-cluster.service: control process exited, code=exited status=255
Nov 15 18:42:56 hpv-01 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Nov 15 18:42:56 hpv-01 systemd[1]: Unit pve-cluster.service entered failed state.

I was able to partially resolve the problems with the database by following this guide:
http://blog.sjas.de/posts/proxmox-unable-to-open-database.html

I removed duplicates for 'lxc' as well as 'qemu-server' and 'lrm_status'.
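
For anyone repeating this step, here is a minimal sketch of how duplicate names under one parent could be listed before deleting anything. It is not taken from the linked guide. The table name `tree` and the `inode` column appear later in this thread; the remaining column names (`parent`, `version`, `writer`, `mtime`, `type`, `name`, `data`) are assumptions, and the demo rows are made up around the values in the log line above.

```python
import sqlite3

# Assumed column layout of the pmxcfs "tree" table. Only the table name and
# the "inode" column appear in this thread; the other names are assumptions.
SCHEMA = """
CREATE TABLE tree (
    inode INTEGER PRIMARY KEY, parent INTEGER NOT NULL,
    version INTEGER NOT NULL, writer INTEGER NOT NULL,
    mtime INTEGER NOT NULL, type INTEGER NOT NULL,
    name TEXT NOT NULL, data BLOB
)
"""


def find_duplicate_names(conn):
    """Return (parent, name, inodes) for each name used more than once under one parent."""
    rows = conn.execute(
        "SELECT parent, name, group_concat(inode) FROM tree "
        "GROUP BY parent, name HAVING count(*) > 1 "
        "ORDER BY parent, name"
    ).fetchall()
    return [
        (parent, name, sorted(int(i) for i in inodes.split(",")))
        for parent, name, inodes in rows
    ]


# Demo on an in-memory database. The parent and the inode 0x11C65D7 come from
# the log line above; the inodes 0x11C6540 and 0x11C6541 are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO tree VALUES (?, ?, ?, 0, 0, 4, ?, NULL)",
    [
        (0x11C6540, 0x11C653C, 0x11C6540, "lxc"),
        (0x11C65D7, 0x11C653C, 0x11C65D7, "lxc"),
        (0x11C6541, 0x11C653C, 0x11C6541, "qemu-server"),
    ],
)
duplicates = find_duplicate_names(conn)
for parent, name, inodes in duplicates:
    print(f"parent={parent:016X} name={name!r} inodes={[f'{i:016X}' for i in inodes]}")
```

On a real system the same function would be pointed at a copy of /var/lib/pve-cluster/config.db (via `sqlite3.connect(path)`) while pve-cluster is stopped, so the original file stays untouched.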

The current status of the service is as follows:
~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: failed (Result: exit-code) since Tue 2016-11-15 19:21:49 GMT; 2s ago
Process: 28155 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
Process: 38065 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=255)
Main PID: 28153 (code=killed, signal=SEGV)

Nov 15 19:21:49 hpv-01 pmxcfs[38065]: [database] crit: missing directory inode (inode = 000000000000000A)
Nov 15 19:21:49 hpv-01 pmxcfs[38065]: [database] crit: DB load failed
Nov 15 19:21:49 hpv-01 pmxcfs[38065]: [database] crit: missing directory inode (inode = 000000000000000A)
Nov 15 19:21:49 hpv-01 pmxcfs[38065]: [database] crit: DB load failed
Nov 15 19:21:49 hpv-01 pmxcfs[38065]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Nov 15 19:21:49 hpv-01 pmxcfs[38065]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Nov 15 19:21:49 hpv-01 pmxcfs[38065]: [main] notice: exit proxmox configuration filesystem (-1)
Nov 15 19:21:49 hpv-01 systemd[1]: pve-cluster.service: control process exited, code=exited status=255
Nov 15 19:21:49 hpv-01 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Nov 15 19:21:49 hpv-01 systemd[1]: Unit pve-cluster.service entered failed state.

I have a copy of config.db from before I attempted to clean up the database which I can provide to anyone who can assist.

Will this be at all possible to recover from?
If not, what would my best recovery strategy be?

I would prefer not to restore all guests from last night's backup if possible as the VM and LXC container data is good.
If I reinstalled Proxmox can I simply re-attach the storage and restore the KVM and LXC configs from my backups?

Archmatux · Nov 15, 2016

Update!

I have made some progress.

I've figured out that in the database one of the rows I had deleted was still set to be the parent of another row.
I have restored the deleted row and have deleted the one that was not referenced.
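
In case it helps someone else, this is roughly the check I would run to spot that situation: rows whose parent inode no longer has a row of its own, which seems to be what the "missing directory inode" message is complaining about. It is only a sketch. The `parent` and `name` column names and the idea that parent 0 means the root of /etc/pve are assumptions, and the demo table is a reduced, hypothetical one.

```python
import sqlite3


def find_dangling_parents(conn):
    """Return (inode, parent, name) for rows whose parent inode has no row of its own.

    Assumption: parent 0 denotes the root of /etc/pve and needs no row.
    """
    return conn.execute(
        "SELECT c.inode, c.parent, c.name FROM tree AS c "
        "LEFT JOIN tree AS p ON p.inode = c.parent "
        "WHERE c.parent != 0 AND p.inode IS NULL "
        "ORDER BY c.inode"
    ).fetchall()


# Demo: reduced, hypothetical table with only the columns the check needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tree (inode INTEGER PRIMARY KEY, parent INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO tree VALUES (?, ?, ?)",
    [
        (5, 0, "nodes"),
        (6, 5, "sauron"),
        # inode 10 (0x0A) was deleted, but inode 11 still names it as parent
        (11, 10, "lxc"),
    ],
)
dangling = find_dangling_parents(conn)
for inode, parent, name in dangling:
    print(f"inode {inode} ({name!r}) points at missing parent {parent:016X}")
```

An empty result would mean every non-root row has an existing parent. It does not prove the database is otherwise consistent.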

I am now able to successfully start the pve-cluster service, however clearly the database is far from clean.

I am now looking at the best way to proceed.

I've been doing some reading on pmxcfs and as far as I understand it, if I manually update the values for the parents where applicable and remove the redundant rows this may resolve the issue?

As I have a backup of the (albeit broken) config.db file I will proceed anyway but would appreciate any input.

Archmatux · Nov 15, 2016

Another update..... this actually appears to be going well.

I have manually edited the database as follows:

I changed all rows back to their original parent values.
I then removed the record with the new node name and changed the name field for the old record.
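
Roughly, the second step looks like the sketch below. Treat it as an illustration of what I did by hand, not a supported procedure: the column names `parent` and `name`, the inode numbers, and the helper `rename_node_dir` are all hypothetical, and the demo runs on an in-memory table. The guard refuses to delete the new-name row while other rows still point at it, which is the mistake I made the first time around.

```python
import sqlite3


def rename_node_dir(conn, nodes_inode, old_name, new_name):
    """Drop an empty directory row called new_name, then rename old_name to it."""
    with conn:  # one transaction: commits on success, rolls back on error
        stale = conn.execute(
            "SELECT inode FROM tree WHERE parent = ? AND name = ?",
            (nodes_inode, new_name),
        ).fetchone()
        if stale is not None:
            (children,) = conn.execute(
                "SELECT count(*) FROM tree WHERE parent = ?", (stale[0],)
            ).fetchone()
            if children:
                raise RuntimeError(f"{new_name!r} still has {children} child rows")
            conn.execute("DELETE FROM tree WHERE inode = ?", (stale[0],))
        cur = conn.execute(
            "UPDATE tree SET name = ? WHERE parent = ? AND name = ?",
            (new_name, nodes_inode, old_name),
        )
        if cur.rowcount != 1:
            raise RuntimeError(f"expected one row named {old_name!r}, got {cur.rowcount}")


# Demo with hypothetical inode numbers: 'sauron' holds the guests' directories,
# 'hpv-01' is the empty directory created under the new hostname.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tree (inode INTEGER PRIMARY KEY, parent INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO tree VALUES (?, ?, ?)",
    [(5, 0, "nodes"), (6, 5, "sauron"), (7, 6, "lxc"), (8, 6, "qemu-server"), (20, 5, "hpv-01")],
)
rename_node_dir(conn, nodes_inode=5, old_name="sauron", new_name="hpv-01")
node_dirs = conn.execute("SELECT inode, name FROM tree WHERE parent = 5").fetchall()
children = [r[0] for r in conn.execute("SELECT name FROM tree WHERE parent = 6 ORDER BY name")]
print(node_dirs, children)
```

Because the old row keeps its inode, the lxc and qemu-server rows underneath it do not need to be touched.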

I have been able to restart the pve-cluster service and the filesystem is now as follows:

~# ls /etc/pve -R
/etc/pve:
authkey.pub datacenter.cfg firewall local lxc nodes openvz priv pve-root-ca.pem pve-www.key qemu-server storage.cfg user.cfg vzdump.cron

/etc/pve/firewall:
cluster.fw

/etc/pve/nodes:
hpv-01

/etc/pve/nodes/hpv-01:
host.fw lrm_status lxc openvz priv pve-ssl.key pve-ssl.pem qemu-server

/etc/pve/nodes/hpv-01/lxc:
100.conf 102.conf 105.conf 106.conf 107.conf 108.conf 109.conf

/etc/pve/nodes/hpv-01/openvz:

/etc/pve/nodes/hpv-01/priv:

/etc/pve/nodes/hpv-01/qemu-server:
101.conf 103.conf 104.conf

/etc/pve/priv:
authkey.key authorized_keys known_hosts lock pve-root-ca.key pve-root-ca.srl

/etc/pve/priv/lock:

As far as I can tell that looks correct to me.

Finally I have gone through the list of services https://pve.proxmox.com/wiki/Service_daemons
I have run systemctl status for each one and have restarted the ones that were showing errors.

I am now able to log into the web interface again and everything appears to be correct so far.

EDIT: As everything currently appears to be working I will reboot later today and make sure everything is still good.

From what I can tell the instructions on the wiki for renaming a node are a bit outdated and still refer to rrd files. Would it be worth my updating the wiki with better instructions?

fabian · Nov 16, 2016

the linked article states

Now move the configuration files, as the pmxcfs has a few restrictions to ensure consistency you cannot rename non empty folders. Thus if you have VMs or Containers on the node, which is not recommended when changing a nodes name, you have to recreate the folder structure and copy files per folder level.

which you did not do. the instructions for moving the rrd files are also there for a reason and are not outdated
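
to illustrate what "recreate the folder structure and copy files per folder level" means, here is a minimal sketch. it is not the wiki's procedure verbatim: the helper name `copy_tree_per_level` is hypothetical, the demo runs against a temporary directory instead of /etc/pve, and whether every call behaves the same on a mounted pmxcfs is an assumption you should verify. it only copies; removing the old directory is left out on purpose.

```python
import os
import shutil
import tempfile


def copy_tree_per_level(src, dst):
    """Create each directory under dst first, then copy that level's files into it."""
    copied = []
    for current, _dirs, files in os.walk(src):  # top-down: parents before children
        rel = os.path.relpath(current, src)
        target = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target, exist_ok=True)
        for name in sorted(files):
            # copyfile copies file contents only, no permissions or timestamps
            shutil.copyfile(os.path.join(current, name), os.path.join(target, name))
            copied.append(os.path.normpath(os.path.join(rel, name)))
    return sorted(copied)


# Demo in a scratch directory that mimics the layout from this thread.
tmp = tempfile.mkdtemp()
old = os.path.join(tmp, "nodes", "sauron")
new = os.path.join(tmp, "nodes", "hpv-01")
for rel in ("host.fw", os.path.join("lxc", "100.conf"), os.path.join("qemu-server", "101.conf")):
    path = os.path.join(old, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as fh:
        fh.write("# demo content\n")

copied = copy_tree_per_level(old, new)
print(copied)
```

the point is that nothing is renamed: every directory is created empty under the new name and filled afterwards, so the "Directory not empty" restriction never comes into play.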

Archmatux · Nov 16, 2016

Thanks fabian. I'll admit I skim read the wiki.

My main mistake was restarting pve-cluster before fully migrating the configuration.

I also copied and pasted the path to the rrd files from the wiki: /var/lib/rrdcache/db/pve2

Unless it's changed I believe there is a typo: /var/lib/rrdcache/db/pve2 instead of /var/lib/rrdcached/db/pve2

EDIT: Either way I've now sorted out the rrd files and I've done a reboot.
Everything appears to be working normally and I've learned a ton in the process.

Good thing it was my home setup I was working on and not a production system

fabian · Nov 17, 2016

correct, there was a typo (fixed now)

zenny · Feb 19, 2017

Hi,

I ran into a similar problem on Proxmox 4.4 to the one reported by the OP while changing the hostname:

# service pve-cluster status
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: failed (Result: exit-code) since Sun 2017-02-19 18:15:30 CET; 1min 21s ago
Process: 7676 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=255)

Feb 19 18:15:30 server2 pmxcfs[7676]: [database] crit: found entry with duplicate name (inode = 00000000000C6337, parent = 00000000000C4FF4, name = 'lxc')
Feb 19 18:15:30 server2 pmxcfs[7676]: [database] crit: DB load failed
Feb 19 18:15:30 server2 pmxcfs[7676]: [database] crit: DB load failed
Feb 19 18:15:30 server2 pmxcfs[7676]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Feb 19 18:15:30 server2 pmxcfs[7676]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Feb 19 18:15:30 server2 pmxcfs[7676]: [main] notice: exit proxmox configuration filesystem (-1)
Feb 19 18:15:30 server2 pmxcfs[7676]: [main] notice: exit proxmox configuration filesystem (-1)
Feb 19 18:15:30 server2 systemd[1]: pve-cluster.service: control process exited, code=exited status=255
Feb 19 18:15:30 server2 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Feb 19 18:15:30 server2 systemd[1]: Unit pve-cluster.service entered failed state.

I followed http://blog.sjas.de/posts/proxmox-unable-to-open-database.html but I don't see any duplicate 'lxc' entry in the SQLite database like the one reported above:

$ grep -Rn lxc Downloads/pve_sqlite.txt
7:INSERT INTO table VALUES(7,6,7,0,1486147739,4,'lxc',NULL);
21:INSERT INTO table VALUES(68,67,68,0,1486147796,4,'lxc',NULL);
38:INSERT INTO table VALUES(810686,806900,810687,0,1486147739,4,'lxc',X'');
39:INSERT INTO table VALUES(811831,806900,811831,0,1487501183,4,'lxc',NULL);

Following the link above, I changed the last value of the third entry to NULL, but the error did not change:

$ grep -Rn lxc Downloads/pve_sqlite.txt
7:INSERT INTO table VALUES(7,6,7,0,1486147739,4,'lxc',NULL);
21:INSERT INTO table VALUES(68,67,68,0,1486147796,4,'lxc',NULL);
38:INSERT INTO table VALUES(810686,806900,810687,0,1486147739,4,'lxc',NULL);
39:INSERT INTO table VALUES(811831,806900,811831,0,1487501183,4,'lxc',NULL);
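
One thing that may help when reading these listings: the log prints inode and parent in hexadecimal, while the dump shows them in decimal. The sketch below converts the values from the log line and looks for rows in the dump that share that parent and name. It assumes the first two fields of each row are inode and parent (the inode part matches the log's inode value), and the function names are made up for illustration.

```python
import re

LOG_LINE = (
    "[database] crit: found entry with duplicate name "
    "(inode = 00000000000C6337, parent = 00000000000C4FF4, name = 'lxc')"
)

DUMP = """\
7:INSERT INTO table VALUES(7,6,7,0,1486147739,4,'lxc',NULL);
21:INSERT INTO table VALUES(68,67,68,0,1486147796,4,'lxc',NULL);
38:INSERT INTO table VALUES(810686,806900,810687,0,1486147739,4,'lxc',X'');
39:INSERT INTO table VALUES(811831,806900,811831,0,1487501183,4,'lxc',NULL);
"""


def parse_log(line):
    """Extract (inode, parent, name) from the log line, converting hex to int."""
    m = re.search(
        r"inode = ([0-9A-Fa-f]+), parent = ([0-9A-Fa-f]+), name = '([^']*)'", line
    )
    if m is None:
        raise ValueError("not a 'duplicate name' log line")
    return int(m.group(1), 16), int(m.group(2), 16), m.group(3)


def parse_dump(text):
    """Return (inode, parent, name) per row, assuming fields 1, 2 and 7 hold them."""
    pattern = r"VALUES\((\d+),(\d+),\d+,\d+,\d+,\d+,'([^']*)',"
    return [(int(a), int(b), c) for a, b, c in re.findall(pattern, text)]


inode, parent, name = parse_log(LOG_LINE)
siblings = [row[0] for row in parse_dump(DUMP) if row[1] == parent and row[2] == name]
print(f"log entry: inode={inode} parent={parent} name={name!r}")
print(f"rows sharing that parent and name: {siblings}")
```

Read this way, the two rows with parent 806900 are the pair the error refers to, even though all four rows are named 'lxc'.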

Any input appreciated!

Wbr,
/z

zenny · Feb 19, 2017

SOLVED: by deleting two entries at the bottom:

sqlite> delete from tree where inode=811831;
sqlite> delete from tree where inode=810686;

However, 'service pveproxy status' still shows a security key issue as follows:

Feb 19 22:45:01 server2 pveproxy[22600]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_fi...1618.
Feb 19 22:45:01 server2 pveproxy[22599]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_fi...1618.
