[SOLVED] This is bad... /etc/pve/* is missing on boot

LooneyTunes · May 7, 2023

Hi,

I thought I was going completely mad, but then I found this thread describing an almost identical issue, although mine was not fixed by adding my hostname back into /etc/hosts. I am not sure what caused it.

My setup (as it was):
- System (PVE 7.4-3) on a 250 MB NVMe drive, about a year old.
- Data (holding some VMs) on a 500 GB SSD, about 3 years old.

The data disk died miserably, or at least I believe so. Running xfs_repair did nothing, as the superblocks and magic number are gone. Judging by the log, the disk was clearly deteriorating. I am not sure what started it, but I have been trying out different networking configurations (guests didn't get the expected networks).

What is really strange and alarming is that my whole configuration is just gone, poof, vanished. There is not a trace of any file or folder below /etc/pve/.

As this was a rebuild after a crash a few weeks ago, a recent backup, hm, had not been taken yet. I had settled for snapshots "until done", at which point I was going to set up backups.

Is any of this a known issue, and what may have caused it? I have lately learned that pveproxy is very sensitive to missing/changed storage too... I really like Proxmox, but this is really no fun...

The question now is whether my drive actually died, or if it just looks that way. I believe I've read that some storage types in Proxmox don't have filesystems?

spirit · May 7, 2023

The pve-cluster service mounts /etc/pve.
(The real data is in /var/lib/pve-cluster/config.db, an SQLite database.)

Do you have any errors?

Can you try these and send the results:

systemctl status pve-cluster
systemctl start pve-cluster
journalctl -u pve-cluster
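
For readers who want to check the two facts above directly, here is a minimal Python sketch (standard library only). It reports whether /etc/pve is currently a mount point and whether the backing database file is present. The paths are the ones named in this post; the function name `pmxcfs_state` is made up for illustration, and the sketch says nothing about why the service failed.

```python
import os


def pmxcfs_state(mount_point="/etc/pve",
                 db_path="/var/lib/pve-cluster/config.db"):
    """Report whether the cluster filesystem looks mounted and backed by a DB file.

    Both default paths are taken from the thread; adjust them if your
    layout differs (that would be an assumption on your side).
    """
    db_exists = os.path.isfile(db_path)
    return {
        # An unmounted /etc/pve is just an empty directory, which matches
        # the "everything vanished" symptom described in the first post.
        "mounted": os.path.ismount(mount_point),
        "db_exists": db_exists,
        "db_size": os.path.getsize(db_path) if db_exists else 0,
    }


if __name__ == "__main__":
    print(pmxcfs_state())
```

If `mounted` is false while `db_exists` is true and `db_size` is non-zero, the configuration has most likely not been deleted; it is simply not being presented because the service is down.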

LooneyTunes · May 7, 2023

spirit said:
The pve-cluster service mounts /etc/pve.
(The real data is in /var/lib/pve-cluster/config.db, an SQLite database.)

Do you have any errors?

Can you try these and send the results:

systemctl status pve-cluster
systemctl start pve-cluster
journalctl -u pve-cluster

Hi,

thanks for responding

It is dead in the water, unfortunately. Networking isn't working on it; I have tried to fix that, but it won't boot to a usable state. I can log in locally on the console, though.

There are plenty of errors. Status of pve-cluster shows it in a "failed" state with "(code=exited, status=255/EXCEPTION)". A little below that it states "pve-cluster.service: Scheduled restart job, restart counter is at 5." But that is after I restarted; the log shows much higher counts.

The last command yields (omitting time/date):

- Starting The Proxmox VE cluster filesystem...
[main] crit: Unable to get local IP address
[main] crit: Unable to get local IP address
pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION

and repeating until end of log

I read this thread and ran the commands from post #4: the first returns "ok", but the others indicate issues with the database...

The output is very long, but I will type it out if it may help. The text "NOT NULL" is repeated a lot.

bbgeek17 · May 8, 2023

Sounds like the first thing is to get networking working; almost everything else will fall into place after that.

LooneyTunes · May 8, 2023

bbgeek17 said:
Sounds like the first thing is to get networking working; almost everything else will fall into place after that.

Great, sounds very easy. What I meant was that as pveproxy and pve-cluster are not running, I have no GUI.

I can, as I think I said before, log in through SSH and on the console. I ran some tests on the database, and it did not respond well at all, so I am wondering whether there is really any potential to resolve this...?

Code:

root@pve:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2023-05-08 07:23:38 CEST; 8min ago
    Process: 931 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)
        CPU: 10ms

May 08 07:23:38 pve systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
May 08 07:23:38 pve systemd[1]: Stopped The Proxmox VE cluster filesystem.
May 08 07:23:38 pve systemd[1]: pve-cluster.service: Start request repeated too quickly.
May 08 07:23:38 pve systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
May 08 07:23:38 pve systemd[1]: Failed to start The Proxmox VE cluster filesystem.
root@pve:~#

Code:

root@pve:~# journalctl -u pve-cluster
-- Boot 0702adfde93d4a379bdbba42dc890928 --
[snip]
May 07 18:39:39 pve systemd[1]: Starting The Proxmox VE cluster filesystem...
May 07 18:39:39 pve pmxcfs[904]: [main] crit: Unable to get local IP address
May 07 18:39:39 pve pmxcfs[904]: [main] crit: Unable to get local IP address
May 07 18:39:39 pve systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
May 07 18:39:39 pve systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
May 07 18:39:39 pve systemd[1]: Failed to start The Proxmox VE cluster filesystem.
May 07 18:39:39 pve systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 1.
May 07 18:39:39 pve systemd[1]: Stopped The Proxmox VE cluster filesystem.
May 07 18:39:39 pve systemd[1]: Starting The Proxmox VE cluster filesystem...
May 07 18:39:39 pve pmxcfs[917]: [main] crit: Unable to get local IP address
May 07 18:39:39 pve pmxcfs[917]: [main] crit: Unable to get local IP address
May 07 18:39:39 pve systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
May 07 18:39:39 pve systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
May 07 18:39:39 pve systemd[1]: Failed to start The Proxmox VE cluster filesystem.
[snip]

And well, I see I forgot to add the link to the thread where I found the database tests... So this is what I did.
First I backed up pve-cluster:
- root@pve:~# tar czf pve-cluster-bask.tgz -C /var/lib/pve-cluster ./

Then
- root@pve:~# sqlite3 /var/lib/pve-cluster/config.db 'PRAGMA integrity_check'

Code:

root@pve:~# sqlite3 /var/lib/pve-cluster/config.db 'PRAGMA integrity_check'
ok
root@pve:~# 
root@pve:~# sqlite3 /var/lib/pve-cluster/config.db .schema
CREATE TABLE tree (  inode INTEGER PRIMARY KEY NOT NULL,  parent INTEGER NOT NULL CHECK(typeof(parent)=='integer'),  version INTEGER NOT NULL CHECK(typeof(version)=='integer'),  writer INTEGER NOT NULL CHECK(typeof(writer)=='integer'),  mtime INTEGER NOT NULL CHECK(typeof(mtime)=='integer'),  type INTEGER NOT NULL CHECK(typeof(type)=='integer'),  name TEXT NOT NULL,  data BLOB);
root@pve:~# 
root@pve:~# 'SELECT inode,mtime,name FROM tree WHERE parent = 0'
-bash: SELECT inode,mtime,name FROM tree WHERE parent = 0: command not found
root@pve:~#
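
A note on the last command in the session above: it failed because the SQL statement was typed at the bash prompt on its own, without `sqlite3` and the database path in front of it, so bash tried to run the whole string as a program. The integrity check itself returned "ok" and the schema printed normally, so nothing shown here indicates a corrupt database. As a rough sketch, the same two checks can be run from Python with the standard `sqlite3` module. The table and column names come from the `.schema` output above; the function name `list_top_level` is made up for illustration, and the file is opened read-only so the check cannot modify it.

```python
import sqlite3


def list_top_level(db_path):
    """Run the integrity check and list entries whose parent inode is 0."""
    # mode=ro opens the file read-only; uri=True enables the file: syntax.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        integrity = conn.execute("PRAGMA integrity_check").fetchone()[0]
        rows = conn.execute(
            "SELECT inode, mtime, name FROM tree "
            "WHERE parent = 0 ORDER BY inode"
        ).fetchall()
    finally:
        conn.close()
    return integrity, rows


if __name__ == "__main__":
    import sys

    # Pass the database path explicitly, for example the backup copy
    # rather than the live file.
    status, entries = list_top_level(sys.argv[1])
    print("integrity_check:", status)
    for inode, mtime, name in entries:
        print(inode, mtime, name)
```

Running it against a copy from the tar backup taken earlier, rather than the live file, keeps the original untouched.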

This is it so far. Please advise, thanks.

Neobin · May 8, 2023

You seemingly have messed up your network configuration with this:

LooneyTunes said:
but I have been trying out different configurations for networking (guests didn't get expected nets).

which results in this:

LooneyTunes said:
Code:

May 07 18:39:39 pve pmxcfs[904]: [main] crit: Unable to get local IP address

So, bring your /etc/network/interfaces (and /etc/hosts) into a known working state and reboot the PVE host.

This is what the defaults look like (of course, you need to adapt them):
https://pve.proxmox.com/wiki/Network_Configuration#_default_configuration_using_a_bridge

spirit · May 8, 2023

"[main] crit: Unable to get local IP address"

You need to be able to resolve the IP address for the current hostname.

(through dns or /etc/hosts)
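
A minimal Python sketch of that check, for anyone who wants to test it before restarting the service: it resolves the current hostname through the system resolver (which consults /etc/hosts and DNS according to the system's configuration) and lists the addresses it gets back. The function name `resolve_hostname` is made up for illustration. Reporting loopback addresses separately is an assumption on my part that a node normally wants a non-loopback address; the thread itself only says the name must resolve.

```python
import ipaddress
import socket


def resolve_hostname(hostname=None):
    """Resolve a hostname and split the answers into all / non-loopback."""
    name = hostname or socket.gethostname()
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        # The name does not resolve at all: the situation behind
        # "Unable to get local IP address" according to this thread.
        return {"name": name, "addresses": [], "non_loopback": []}
    # Drop any IPv6 zone suffix such as "%eth0" before parsing.
    addresses = sorted({info[4][0].split("%")[0] for info in infos})
    non_loopback = [
        a for a in addresses if not ipaddress.ip_address(a).is_loopback
    ]
    return {"name": name, "addresses": addresses, "non_loopback": non_loopback}


if __name__ == "__main__":
    print(resolve_hostname())
```

An empty `addresses` list means the hostname entry is missing from both /etc/hosts and DNS.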

LooneyTunes · May 8, 2023

Neobin said:
You seemingly have messed up your network configuration with this:

which results in this:

So, bring your /etc/network/interfaces (and /etc/hosts) into a known working state and reboot the PVE host.

This is what the defaults look like (of course, you need to adapt them):
https://pve.proxmox.com/wiki/Network_Configuration#_default_configuration_using_a_bridge

Network has been restored. If you were hoping the GUI would magically heal, that is not the case. The database is still corrupt, or at least appears to be. I am grateful for the help, just a little frustrated, sorry. You are correct that I tried a lot of different network configs, as I failed to find one that allowed my VMs to use tagged networks.

LooneyTunes · May 8, 2023

spirit said:
"[main] crit: Unable to get local IP address"

You need to be able to resolve the IP address for the current hostname.

(through dns or /etc/hosts)

Thanks. Network has been restored and I now have SSH access, but the GUI won't come up. And the database seems corrupt... Any advice would be great, thanks.

LooneyTunes · May 8, 2023

Incredible as it may sound, some improvement! I can now start the pveproxy & pve-cluster services! And I have my /etc/pve/ back... It seems to be back in working order... Are there some sanity checks to run, perhaps, just to make sure all is good?

This would be the SMART data from the disk I thought had failed... To me it seems to be alright?

Code:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   086   086   010    Pre-fail  Always       -       82
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       14998
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       152
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       21
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   086   086   010    Pre-fail  Always       -       82
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   086   086   010    Pre-fail  Always       -       82
187 Reported_Uncorrect      0x0032   099   099   000    Old_age   Always       -       51
190 Airflow_Temperature_Cel 0x0032   073   046   000    Old_age   Always       -       27
195 Hardware_ECC_Recovered  0x001a   199   199   000    Old_age   Always       -       51
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       89
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       20664748363

I tried starting one of the VMs, and it booted just fine. What a scare... From good to terrible in a heartbeat. A thought: if this was in fact due to a misconfig on my part, wouldn't some simple config check be in order, warning about misconfiguration before making a veggie of everything? Anyway, I would appreciate hearing if there are any sanity checks to run.
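
As one possible sanity check on the SMART table above, here is a small Python sketch that parses attribute rows in that layout and reports two things: attributes whose normalized VALUE has reached THRESH, and non-zero raw counts on a few watched attributes. In the posted table no VALUE has reached its THRESH and WHEN_FAILED is "-" everywhere, but Reallocated_Sector_Ct has a raw count of 82 and Reported_Uncorrect a raw count of 51. Which attributes to watch, and treating a non-zero raw count as a reason to keep an eye on the disk, are assumptions of this sketch rather than anything stated in the thread. The names `parse_smart_table`, `review` and `WATCHED` are made up for illustration.

```python
# Three rows copied from the table posted above, plus its header line.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   086   086   010    Pre-fail  Always       -       82
187 Reported_Uncorrect      0x0032   099   099   000    Old_age   Always       -       51
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
"""

# Assumption: these are the attributes worth watching for this sketch.
WATCHED = ("Reallocated_Sector_Ct", "Reported_Uncorrect", "UDMA_CRC_Error_Count")


def parse_smart_table(text):
    """Parse whitespace-separated SMART attribute rows into a dict by name."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the header and anything that is not a full attribute row.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        attrs[parts[1]] = {
            "value": int(parts[3]),
            "thresh": int(parts[5]),
            "when_failed": parts[8],
            "raw": parts[9],
        }
    return attrs


def review(attrs, watched=WATCHED):
    """Return (attributes at or below threshold, non-zero watched raw counts)."""
    at_or_below = [
        name for name, a in attrs.items()
        if a["thresh"] > 0 and a["value"] <= a["thresh"]
    ]
    nonzero = {
        name: int(attrs[name]["raw"])
        for name in watched
        if name in attrs
        and attrs[name]["raw"].isdigit()
        and int(attrs[name]["raw"]) > 0
    }
    return at_or_below, nonzero


if __name__ == "__main__":
    print(review(parse_smart_table(SAMPLE)))
```

Comparing the raw counts between two readings taken some days apart shows whether they are still growing, which is usually more informative than a single snapshot.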
