Error during host rename

Airw0lf

Earlier this evening I tried renaming two standalone hosts.
One worked out as expected - the other did not: its /etc/pve folder ended up empty, and I'm not sure whether any other damage was done.

I compared the two hosts and was able to re-create some bits and pieces:
- Recreated the folder structure and logical links
- Restored the folder ./nodes with the VM-configs from a backup
- Recreated the storage.cfg
- Recreated the firewall config

The content of /etc/pve is now:
Code:
root@vigilant:/etc/pve# ls -l
total 32
-rw-r--r-- 1 root root   44 Feb 22 22:51 datacenter.cfg
drwxr-xr-x 2 root root 4096 Feb 22 23:30 firewall
lrwxrwxrwx 1 root root   14 Feb 22 22:54 local -> nodes/vigilant
lrwxrwxrwx 1 root root   18 Feb 22 22:57 lxc -> nodes/vigilant/lxc
drwxr-xr-x 3 root root 4096 Feb 22 22:38 nodes
lrwxrwxrwx 1 root root   21 Feb 22 22:58 openvz -> nodes/vigilant/openvz
drwxr-xr-x 2 root root 4096 Feb 22 22:52 priv
lrwxrwxrwx 1 root root   26 Feb 22 22:59 qemu-server -> nodes/vigilant/qemu-server
drwxr-xr-x 2 root root 4096 Feb 22 23:32 sdn
-rw-r--r-- 1 root root  762 Feb 22 22:35 storage.cfg
-rw-r--r-- 1 root root  107 Feb 22 22:31 user.cfg
drwxr-xr-x 2 root root 4096 Feb 22 23:14 virtual-guest

The ./nodes folder now:
Code:
root@vigilant:/etc/pve# ls -l nodes/vigilant
total 32
-rw-r----- 1 root root   34 Feb 22 22:40 host.fw
-rw-r----- 1 root root   83 Feb 22 22:40 lrm_status
drwxr-xr-x 2 root root 4096 Feb 22 22:40 lxc
drwxr-xr-x 2 root root 4096 Feb 22 22:40 openvz
drwx------ 2 root root 4096 Feb 22 22:40 priv
-rw-r----- 1 root root 1679 Feb 22 22:40 pve-ssl.key
-rw-r----- 1 root root 1692 Feb 22 22:40 pve-ssl.pem
drwxr-xr-x 2 root root 4096 Feb 22 22:40 qemu-server
root@vigilant:/etc/pve# ls -l ./nodes
total 4
drwxr-xr-x 6 root root 4096 Feb 22 22:40 vigilant


The host is now failing with lots of errors in the syslog.

Lots of these:
Code:
Feb 23 00:18:00 vigilant pveproxy[3177]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key)>
Feb 23 00:18:00 vigilant pveproxy[3178]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key)>

Some of these:
Code:
Feb 23 00:18:01 vigilant cron[964]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)

And before those there is this:
Code:
Feb 22 23:34:18 vigilant pmxcfs[1163]: [database] crit: found entry with duplicate name 'lxc' - A:(inode = 0x0000000000>
Feb 22 23:34:18 vigilant pmxcfs[1163]: [database] crit: found entry with duplicate name 'lxc' - A:(inode = 0x0000000000>
Feb 22 23:34:18 vigilant pmxcfs[1163]: [database] crit: DB load failed
Feb 22 23:34:18 vigilant pmxcfs[1163]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/c>
Feb 22 23:34:18 vigilant pmxcfs[1163]: [main] notice: exit proxmox configuration filesystem (-1)
Feb 22 23:34:18 vigilant pmxcfs[1163]: [database] crit: DB load failed
Feb 22 23:34:18 vigilant pmxcfs[1163]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/c>
Feb 22 23:34:18 vigilant pmxcfs[1163]: [main] notice: exit proxmox configuration filesystem (-1)
Feb 22 23:34:18 vigilant systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Feb 22 23:34:18 vigilant systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Feb 22 23:34:18 vigilant systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Feb 22 23:34:18 vigilant systemd[1]: Condition check resulted in Corosync Cluster Engine being skipped.
Feb 22 23:34:18 vigilant pve-firewall[975]: ipcc_send_rec[1] failed: Connection refused
Feb 22 23:34:18 vigilant pve-firewall[975]: ipcc_send_rec[2] failed: Connection refused
Feb 22 23:34:18 vigilant pve-firewall[975]: ipcc_send_rec[3] failed: Connection refused
Feb 22 23:34:18 vigilant pve-firewall[975]: Unable to load access control list: Connection refused
Feb 22 23:34:18 vigilant pve-firewall[975]: ipcc_send_rec[1] failed: Connection refused
Feb 22 23:34:18 vigilant pve-firewall[975]: ipcc_send_rec[2] failed: Connection refused
Feb 22 23:34:18 vigilant pve-firewall[975]: ipcc_send_rec[3] failed: Connection refused
Feb 22 23:34:18 vigilant systemd[1]: pve-firewall.service: Control process exited, code=exited, status=111/n/a
Feb 22 23:34:18 vigilant systemd[1]: pve-firewall.service: Failed with result 'exit-code'.
Feb 22 23:34:18 vigilant systemd[1]: Failed to start Proxmox VE firewall.

I tried to recreate the certs with the following command and results:
Code:
root@vigilant:/etc/pve# pvecm updatecerts --force
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused

The system name is vigilant. The content of the files /etc/hosts and /etc/hostname is:
Code:
root@vigilant:/etc/pve# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.139.251 vigilant.itv.lan vigilant

root@vigilant:/etc/pve# cat /etc/hostname
vigilant

I searched around the forum in an attempt to fix this.
But after several hours and numerous attempts there is still no noticeable improvement.

Any suggestions?
 
Hi,

Did you follow our wiki guide [0] for renaming the PVE node?

What does this command say: hostname -f?


[0] https://pve.proxmox.com/wiki/Renaming_a_PVE_node

Yes - I followed those instructions.

After that, I did some renaming/moving of files and folders in /etc/pve/nodes, because following the instructions resulted in a new nodes entry.
That worked fine on one of the two (standalone!) nodes - but not on the other.
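
Roughly what that moving amounted to is sketched below - from memory, so treat it as a sketch rather than the exact commands; <oldname> is a placeholder for the previous hostname:
Code:
# sketch only - <oldname> stands in for the previous hostname
mv /etc/pve/nodes/<oldname>/qemu-server/*.conf /etc/pve/nodes/vigilant/qemu-server/
mv /etc/pve/nodes/<oldname>/lxc/*.conf /etc/pve/nodes/vigilant/lxc/
rm -r /etc/pve/nodes/<oldname>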

I'm also able to read the (sqlite?) database /var/lib/pve-cluster/config.db.
However, /var/log/syslog says this (assuming the [database] tag refers to that file):
Code:
Feb 22 23:34:18 vigilant pmxcfs[1018]: [database] crit: found entry with duplicate name 'lxc' - A:(inode = 0x0000000000>
Feb 22 23:34:18 vigilant pmxcfs[1018]: [database] crit: found entry with duplicate name 'lxc' - A:(inode = 0x0000000000>
Feb 22 23:34:18 vigilant pmxcfs[1018]: [database] crit: DB load failed

And hostname -f gives me:
Code:
root@vigilant:/home/will# hostname -f
vigilant.itv.lan

=====

EDIT 9:35 AM:
Just realized that I didn't do the cleanup part mentioned at the end => I have now done this.
However, there is no difference in the outcome compared to the results above.
 
Thank you for the output!

Do you have two folders in the /etc/pve/nodes/ path, or do you see two nodes in the PVE GUI? (ls -la /etc/pve/nodes/)
 
Thank you for the output!

Do you have two folders in the /etc/pve/nodes/ path, or do you see two nodes in the PVE GUI? (ls -la /etc/pve/nodes/)

There is no PVE GUI - I guess because it cannot load the database and/or the certificates/keys; both error conditions show up in /var/log/syslog.

I have one folder in /etc/pve/nodes - it is called vigilant.
Code:
root@vigilant:/home/will# ls -la /etc/pve/nodes
total 12
drwxr-xr-x 3 root root 4096 Feb 22 22:38 .
drwxr-x--- 7 root root 4096 Feb 23 09:03 ..
drwxr-xr-x 6 root root 4096 Feb 22 22:40 vigilant

The content of that folder (and its subfolders) is:
Code:
root@vigilant:/home/will# ls -R -l /etc/pve/nodes/vigilant
/etc/pve/nodes/vigilant:
total 32
-rw-r----- 1 root root   34 Feb 22 22:40 host.fw
-rw-r----- 1 root root   83 Feb 22 22:40 lrm_status
drwxr-xr-x 2 root root 4096 Feb 22 22:40 lxc
drwxr-xr-x 2 root root 4096 Feb 22 22:40 openvz
drwx------ 2 root root 4096 Feb 22 22:40 priv
-rw-r----- 1 root root 1679 Feb 22 22:40 pve-ssl.key
-rw-r----- 1 root root 1692 Feb 22 22:40 pve-ssl.pem
drwxr-xr-x 2 root root 4096 Feb 22 22:40 qemu-server

/etc/pve/nodes/vigilant/lxc:
total 4
-rw-r----- 1 root root 319 Feb 22 22:40 103.conf

/etc/pve/nodes/vigilant/openvz:
total 0

/etc/pve/nodes/vigilant/priv:
total 0

/etc/pve/nodes/vigilant/qemu-server:
total 24
-rw-r----- 1 root root 565 Feb 22 22:40 100.conf
-rw-r----- 1 root root 490 Feb 22 22:40 101.conf
-rw-r----- 1 root root 380 Feb 22 22:40 102.conf
-rw-r----- 1 root root 480 Feb 22 22:40 104.conf
-rw-r----- 1 root root 409 Feb 22 22:40 105.conf
-rw-r----- 1 root root 427 Feb 22 22:40 106.conf
 
Hi again,

Could you also provide us with the output of the ls -l /etc/pve/local/pve-ssl* command? I also have a question: did you reboot the server after the node was renamed?
 
Hi again,

Could you also provide us with the output of the ls -l /etc/pve/local/pve-ssl* command? I also have a question: did you reboot the server after the node was renamed?
Hi @Moayad ,

First of all: thank you very much for your quick and to the point responses - really appreciated!

Yes - the server was rebooted with a shutdown (i.e. power cycle) - more than once, actually.
Also after doing the cleanup part mentioned in the article.

Requested output:
Code:
root@vigilant:/home/will# ls -l /etc/pve/local/pve-ssl*
-rw-r----- 1 root root 1679 Feb 22 22:40 /etc/pve/local/pve-ssl.key
-rw-r----- 1 root root 1692 Feb 22 22:40 /etc/pve/local/pve-ssl.pem


Cheers - Will
 
that looks very wrong! can you please post the output of mount | grep /etc/pve and systemctl status pve-cluster?
 
that looks very wrong! can you please post the output of mount | grep /etc/pve and systemctl status pve-cluster?

I'm aware of that, as there is no web UI and no mount - see below:

Code:
root@vigilant:/home/will# mount | grep /etc/pve

root@vigilant:/home/will# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2023-02-23 13:31:33 CET; 1h 11min ago
    Process: 1343 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)
        CPU: 8ms

Feb 23 13:31:33 vigilant systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Feb 23 13:31:33 vigilant systemd[1]: Stopped The Proxmox VE cluster filesystem.
Feb 23 13:31:33 vigilant systemd[1]: pve-cluster.service: Start request repeated too quickly.
Feb 23 13:31:33 vigilant systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Feb 23 13:31:33 vigilant systemd[1]: Failed to start The Proxmox VE cluster filesystem.

The other node, which was also renamed and is working as expected, has:
Code:
root@the-neb:/home/will# mount | grep /etc/pve
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)

root@the-neb:/home/will# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2023-02-22 22:14:19 CET; 16h ago
    Process: 1292 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 1296 (pmxcfs)
      Tasks: 7 (limit: 77034)
     Memory: 40.0M
        CPU: 40.123s
     CGroup: /system.slice/pve-cluster.service
             └─1296 /usr/bin/pmxcfs

Feb 22 22:14:18 the-neb systemd[1]: Starting The Proxmox VE cluster filesystem...
Feb 22 22:14:19 the-neb systemd[1]: Started The Proxmox VE cluster filesystem.
 
ah, you already mentioned in your first post that /etc/pve "is empty", I missed that.

so, you somehow corrupted the DB (did you manually attempt to edit it?).. if you have a backup of /etc/pve, I'd suggest the following

- move /var/lib/pve-cluster/* out of the way, e.g. mkdir /var/lib/pve-cluster/broken; mv /var/lib/pve-cluster/* /var/lib/pve-cluster/broken/
- move /etc/pve with the wrong content out of the way, e.g. mv /etc/pve /etc/pve-broken; mkdir /etc/pve
- now, attempt to (re)-start pve-cluster: systemctl restart pve-cluster
- if /etc/pve is still empty, check and post journalctl --since "-15min" -u pve-cluster
- if /etc/pve is now filled with content again, restore individual config files from your backup, and then reboot the node so that all services are restarted (or restart them one by one)
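
Collected into one sequence for copy-paste (the same commands as above; this assumes a standalone node like yours - adjust if anything differs on your side):
Code:
mkdir /var/lib/pve-cluster/broken
mv /var/lib/pve-cluster/* /var/lib/pve-cluster/broken/    # move the old DB out of the way
mv /etc/pve /etc/pve-broken                               # move the wrongly filled directory aside
mkdir /etc/pve
systemctl restart pve-cluster
journalctl --since "-15min" -u pve-cluster                # check this if /etc/pve stays empty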
 
ah, you already mentioned in your first post that /etc/pve "is empty", I missed that.

so, you somehow corrupted the DB (did you manually attempt to edit it?).. if you have a backup of /etc/pve, I'd suggest the following

- move /var/lib/pve-cluster/* out of the way, e.g. mkdir /var/lib/pve-cluster/broken; mv /var/lib/pve-cluster/* /var/lib/pve-cluster/broken/
- move /etc/pve with the wrong content out of the way, e.g. mv /etc/pve /etc/pve-broken; mkdir /etc/pve
- now, attempt to (re)-start pve-cluster: systemctl restart pve-cluster
- if /etc/pve is still empty, check and post journalctl --since "-15min" -u pve-cluster
- if /etc/pve is now filled with content again, restore individual config files from your backup, and then reboot the node so that all services are restarted (or restart them one by one)

Hi @fabian ,

No - I didn't try to edit the DB manually - no reason for that.
And no - I don't have a backup of the complete /etc/pve folder - only of /etc/pve/nodes.

To what extent would this make a difference to the outcome of your suggestion?


Cheers - Will
 
it would mean that you have to recreate any relevant files in /etc/pve (e.g., storage.cfg and user.cfg) and /etc/pve/priv (e.g., keys/passwords used to access storages, ACME accounts, ..), either from memory/documentation, or using the data contained in the old, broken sqlite DB (you can inspect it with sqlite3 after moving it out of the way, the contents are pretty straight-forward).
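
For example, a read-only peek with sqlite3 could look like this - a sketch that assumes the usual pmxcfs layout with a single 'tree' table (check with .schema first if unsure) and the DB already moved aside into a 'broken' directory:
Code:
# inspect the copied-aside DB (path assumes it was moved into a 'broken' subdirectory)
sqlite3 /var/lib/pve-cluster/broken/config.db ".schema"
sqlite3 /var/lib/pve-cluster/broken/config.db "SELECT inode, parent, name FROM tree;"
sqlite3 /var/lib/pve-cluster/broken/config.db "SELECT data FROM tree WHERE name = 'storage.cfg';"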

the error messages in your first post are cut-off, could you do the following and post the resulting "log" file as well?

journalctl --since "2023-02-22 23:00" --until "2023-02-22 23:59" --unit pve-cluster > log
 
it would mean that you have to recreate any relevant files in /etc/pve (e.g., storage.cfg and user.cfg) and /etc/pve/priv (e.g., keys/passwords used to access storages, ACME accounts, ..), either from memory/documentation, or using the data contained in the old, broken sqlite DB (you can inspect it with sqlite3 after moving it out of the way, the contents are pretty straight-forward).

the error messages in your first post are cut-off, could you do the following and post the resulting "log" file as well?

journalctl --since "2023-02-22 23:00" --until "2023-02-22 23:59" --unit pve-cluster > log

I already re-created bits and pieces (also mentioned when opening this thread).
But I'm not aware of what to put in for the remaining parts.

I also didn't trust myself to copy the DB and start from that - I didn't know (until your suggestion) how to put things back.

Attached is the zip of the "log" file.


Cheers - Will
 


the problem is you didn't actually restore the bits and pieces ;) /etc/pve is not a simple directory, pmxcfs needs to be mounted there. by moving both the DB and the mountpoint out of the way, pmxcfs should be able to start with a clean slate and you can then copy back the things you already recovered from /etc/pve-broken to /etc/pve
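
For instance (file names taken from earlier in this thread - copy only what you actually recovered, and only once /etc/pve shows up as a fuse mount again):
Code:
mount | grep /etc/pve     # should show /dev/fuse on /etc/pve before copying anything back
cp /etc/pve-broken/storage.cfg /etc/pve/
cp /etc/pve-broken/user.cfg /etc/pve/
cp /etc/pve-broken/nodes/vigilant/qemu-server/*.conf /etc/pve/nodes/vigilant/qemu-server/
cp /etc/pve-broken/nodes/vigilant/lxc/*.conf /etc/pve/nodes/vigilant/lxc/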
 
also, logs for pve-cluster covering the time period right before you did the rename up to "2023-02-22 23:01" would be interesting - you can use the same command but adapt --since and --until to get those!
 
also, logs for pve-cluster covering the time period right before you did the rename up to "2023-02-22 23:01" would be interesting - you can use the same command but adapt --since and --until to get those!

Well... if it helps: I can get you the logs for the past 2-3 days?
So that you can slice-and-dice things any way you want?
 
sure - if you indicate the (rough) time when you did the rename ;) if you have shell history that shows what you did when attempting the rename, that might shed some light on the root cause as well.
 
the problem is you didn't actually restore the bits and pieces ;) /etc/pve is not a simple directory, pmxcfs needs to be mounted there. by moving both the DB and the mountpoint out of the way, pmxcfs should be able to start with a clean slate and you can then copy back the things you already recovered from /etc/pve-broken to /etc/pve

Interesting concept... I removed the broken stuff, re-created the folders, rebooted, and got my web UI back.
That indeed allows me to add things back one by one, as I have these config bits documented.

I now have the LXC container and VMs running as if nothing happened...

Great stuff, this PVE technology - thanks guys! :D :D :D
 
>Feb 22 23:34:18 vigilant pmxcfs[1163]: [database] crit: found entry with duplicate name 'lxc' - A:(inode = 0x0000000000>
>Feb 22 23:34:18 vigilant pmxcfs[1163]: [database] crit: found entry with duplicate name 'lxc' - A:(inode = 0x0000000000>

I had this problem today, after a host rename.

I tried renaming /etc/pve/nodes/old -> new, which didn't work - so I created "new" and moved qemu-server and lxc into the new dir and deleted the old ones. But apparently that led to duplicate names lxc and qemu-server inside the database, and pmxcfs won't start.

I installed "visidata" from https://www.visidata.org , which is a TUI-based database editor, and could easily delete the old DB entries (the ones with the older revision); pmxcfs works again after this.

Use at your own risk...
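
An equivalent route without extra tools would be sqlite3 itself - shown only as a hedged sketch that assumes the usual pmxcfs 'tree' table; stop the service and back up the DB first, and take the actual inode value from the SELECT (the one below is a placeholder):
Code:
systemctl stop pve-cluster
cp /var/lib/pve-cluster/config.db /var/lib/pve-cluster/config.db.bak
# list the duplicate entries and their versions
sqlite3 /var/lib/pve-cluster/config.db "SELECT inode, parent, version, name FROM tree WHERE name IN ('lxc','qemu-server');"
# delete the stale row(s) by inode - <stale_inode> is a placeholder
sqlite3 /var/lib/pve-cluster/config.db "DELETE FROM tree WHERE inode = <stale_inode>;"
systemctl start pve-cluster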
 
