Total Crash - How to rescue my VMs?

fpdragon
Feb 24, 2022
Hi,

While setting up my new/old server I managed to crash my whole Proxmox setup. I don't know why, but I had the inglorious idea to set up the system disk as ZFS RAID0. While having HW troubles I had to reboot several times, and in some cases I also had to kill the system. I don't know the details, but in the end my Proxmox Linux system seems to be corrupted. The web UI isn't loading any more. I can still log in via SSH and access the directory structure via SFTP, but the rest seems to be messed up. VMs are not starting.

syslog gives me several errors during boot:
Code:
[database] crit: unable to set WAL mode: disk I/O error#010
[main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.

And many more...

However...

New plan:
I'd like to rescue my VMs and their configs. I do have backups but they are old.

Normally I would have started looking in
/etc/pve/...
but this directory is empty. Bad sign?

Hope you can help me.
 
That "/var/lib/pve-cluster/config.db" is the SQLite DB that stores all your PVE configs. The files you usually see in "/etc/pve" aren't actual files but entries in that DB that are presented as files. So without a working DB, the pve-cluster.service will fail to mount /etc/pve and nothing will work.
In case you have a backup of your "/var/lib/pve-cluster/config.db" you could try to restore it (and with it all the PVE configs) as described here: https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs)#_recovery
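For reference, the recovery procedure on that wiki page boils down to roughly the following sketch. The backup path /root/config.db.backup is a placeholder, and the exact set of services to stop may differ between PVE versions:

```shell
# Sketch only: restore the config DB from a backup copy.
# /root/config.db.backup is a hypothetical path; adjust to where your backup lives.

# Stop everything that holds the database open
systemctl stop pve-cluster pvedaemon pveproxy pvestatd

# Move the damaged DB and its -wal/-shm sidecar files out of the way
mkdir -p /root/pve-cluster-broken
mv /var/lib/pve-cluster/config.db* /root/pve-cluster-broken/

# Put the backup in place and bring the services back up
cp /root/config.db.backup /var/lib/pve-cluster/config.db
systemctl start pve-cluster pvedaemon pveproxy pvestatd

# If the DB is intact, /etc/pve should be populated again
ls /etc/pve
```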
 
Thanks for your reply.
That sounds really bad. It seems my VM settings are lost.
However, more important:
Is it possible to copy the VM images/data?

I took another SSD and installed a fresh Proxmox on it. My naive thinking was that I should be able to connect all drives, boot from the new one, and access the partitions of the old, broken one. Currently I cannot boot at all if all disks are connected. Maybe something related to having two rpools?

Is there a way to get to the image data?
Either from the SSH shell of the broken install or from another, parallel system?
Btw: I am on ZFS, not LVM.

Thanks.
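Not confirmed in this thread, but the two-rpools name clash is commonly worked around by importing the old pool under a different name via its numeric pool ID. A hedged sketch, with a made-up ID:

```shell
# List pools that are available for import; with two pools both named
# "rpool" you have to address the old one by its numeric ID
zpool import

# Suppose the listing shows the old pool with id 1234567890123456789
# (made-up number). Import it read-only, under a new name, mounted
# below /mnt so nothing collides with the running system:
zpool import -f -o readonly=on -R /mnt 1234567890123456789 rpool-old

# On a default PVE install the VM disks live as zvols below <pool>/data
zfs list -t volume -r rpool-old
```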
 
Here is some additional information about the failed pool:

Code:
zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool   928G  46.7G   881G        -         -     0%     5%  1.00x  DEGRADED  -

Code:
zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:

        NAME                                                   STATE     READ WRITE CKSUM
        rpool                                                  DEGRADED     0     0     0
          ata-Samsung_SSD_850_EVO_500GB_S21JNXAG606866E-part3  DEGRADED     0     0    20  too many errors
          ata-Samsung_SSD_840_EVO_500GB_S1DHNSAF719482W-part3  ONLINE       0     0     2
 
You could extract the VM configs from your old VM backups. What's then lost are all the host/datacenter configs like security groups and so on.

Did you try to scrub the old pool? It's only degraded, not failed. Maybe ZFS can fix some of those checksum errors.
 
I tried the following:

Code:
zpool clear -F rpool
zpool import rpool
zpool import -F rpool

root@ProxHpDL380:~# zpool scrub rpool
root@ProxHpDL380:~#


root@ProxHpDL380:~# zpool status -v rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Thu Jan 25 22:21:00 2024
        46.7G / 46.7G scanned, 42.3G / 46.7G issued at 911M/s
        0B repaired, 90.46% done, 00:00:05 to go
config:

        NAME                                                   STATE     READ WRITE CKSUM
        rpool                                                  ONLINE       0     0     0
          ata-Samsung_SSD_850_EVO_500GB_#############-part3  ONLINE       0     0    28
          ata-Samsung_SSD_840_EVO_500GB_#############-part3  ONLINE       0     0    54

errors: Permanent errors have been detected in the following files:

        rpool/data/vm-1003-disk-2:<0x1>
        //var/log/journal/c2a350228c144dce86da5cd135528a96/system@00060fa2beeb927b-cd9aca9c96dfb169.journal~
        //var/lib/rrdcached/journal/rrd.journal.1706027451.072212
        //var/lib/pve-cluster/config.db-wal
 
Code:
root@ProxHpDL380:~# zpool status -v rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:01:02 with 11 errors on Thu Jan 25 22:22:02 2024
config:

        NAME                                                   STATE     READ WRITE CKSUM
        rpool                                                  ONLINE       0     0     0
          ata-Samsung_SSD_850_EVO_500GB_###############-part3  ONLINE       0     0    28
          ata-Samsung_SSD_840_EVO_500GB_###############-part3  ONLINE       0     0    54

errors: Permanent errors have been detected in the following files:

        rpool/data/vm-1003-disk-2:<0x1>
        //var/log/journal/c2a350228c144dce86da5cd135528a96/system@00060fa2beeb927b-cd9aca9c96dfb169.journal~
        //var/lib/rrdcached/journal/rrd.journal.1706027451.072212
        //var/lib/pve-cluster/config.db-wal
 
So I guess it's just about three files that are corrupted, and config.db-wal seems to be the only critical one.
Any chance to reset it to defaults and still get to the image data?
 
A virtual disk of your VM/LXC with VMID 1003 got corrupted (so I would restore that VM from a backup); the other affected files are metrics and logs (not that important) and the config DB.
You could clone the virtual disks via "zfs send ... | zfs recv ..." between ZFS pools. See here for examples: https://docs.oracle.com/cd/E18752_01/html/819-5461/gbchx.html
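A minimal sketch of that send/receive approach. The dataset names are examples (list yours with "zfs list -t volume"), and "rpool-new" stands in for the pool of the fresh install:

```shell
# Snapshot the zvol that backs the VM disk
zfs snapshot rpool/data/vm-100-disk-0@rescue

# Stream the snapshot into the other pool. Note that a disk with
# permanent errors (like vm-1003-disk-2 above) may abort the stream,
# so start with the intact disks.
zfs send rpool/data/vm-100-disk-0@rescue | zfs recv rpool-new/data/vm-100-disk-0

# Verify the copy
zfs list -r rpool-new/data
```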
 
Thanks a lot for your help!

I renamed /var/lib/pve-cluster/config.db-wal and the PVE GUI was loading again. Nice.
As you have seen... VM 1003 seems to be broken.
I tried to run a backup but it fails at 10%. I guess this VM needs to be thrown in the can.

Backup restore is currently not an option, since problems don't come alone...
I have a parallel thread where I discuss why my backups still use disk space but are no longer shown on my Windows disk.
The thread is here: https://forum.proxmox.com/threads/where-are-my-vm-backups-gone.140436/
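The rename that got the GUI going can be sketched like this. Stopping pve-cluster first is an assumption (to make sure SQLite isn't mid-transaction when the WAL disappears), and the integrity check is optional:

```shell
# Stop the cluster filesystem so the database is closed cleanly
systemctl stop pve-cluster

# Move the corrupted write-ahead log aside instead of deleting it;
# SQLite recreates it, losing only transactions that were WAL-only
mv /var/lib/pve-cluster/config.db-wal /root/config.db-wal.broken

# Optional sanity check of the main DB file (needs the sqlite3 CLI)
sqlite3 /var/lib/pve-cluster/config.db 'PRAGMA integrity_check;'

systemctl start pve-cluster
```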
 
