[SOLVED] Boot Drive Recovery Procedure

Giganic
Aug 19, 2016
After looking around for suitable answers to this particular question, I decided to put some of the advice to the test and figure out what the procedure would be in the event my boot drive (SSD) decided to pack it in.

The general consensus online and on these forums is that simply backing up the files located in /etc/pve/ is enough. I don't have any intention of backing up the VMs or containers, as all their data is stored on a ZFS striped mirror.
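
For reference, the kind of backup I mean is nothing more elaborate than an archive of that directory, roughly along these lines (the destination path under the pool is just an example):

Code:
# example only -- archive the cluster config to a dataset on the ZFS pool
# (assumes the target directory already exists)
tar czf /tank/backup/pve-config-$(date +%F).tar.gz /etc/pve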

So basically, when the drive fails, all I want to do is install Proxmox on a new drive and copy back all the files from my backup of /etc/pve. In my testing this didn't go quite to plan. When performing the installation on a new VM drive, I was presented with the following error after it restarted.

Code:
[FAILED] Failed to start Mount ZFS filesystems.
See 'systemctl status zfs-mount.service' for details.
[DEPEND] Dependency failed for ZFS startup target.
[DEPEND] Dependency failed for ZFS file system shares.

I had nothing to lose on this install but was concerned that this is what I could be presented with on my live server. I managed to rectify the error by running "zfs mount -a" and performing a restart.

In closing, I would like to ask those who can assist: is this the correct way to handle the situation? The future is always uncertain and I would rather be armed with the correct information.

Thanks in advance.
 
Bumping in the hope it will be picked up by someone who might be able to help. I understand I don't have a subscription, but surely someone can shed some light on my question.
 
I hope you are backing up your containers and VMs. Just because it is on ZFS et al doesn't mean it is safe from all disasters.
 
I hope you are backing up your containers and VMs. Just because it is on ZFS et al doesn't mean it is safe from all disasters.

While I appreciate your feedback, this doesn't address my original question.

I ask once again, could someone with some experience on this topic please provide some feedback, alternatively could @tom, @martin, @wolfgang or anyone else at Proxmox who has 5 minutes to spare provide some information?
 
please provide more information about your setup and concrete questions if you want concrete answers ;)
 
Thanks for your reply @fabian. I thought my initial post covered my server configuration (SSD + striped mirror), the situation, the problem and my question. What other information would you like? I would be more than happy to provide it.
 
  • what do you mean with "boot drive" (you probably mean that / is on a non-zfs ssd, and you have a zfs pool on separate disks?)
  • are you using an up-to-date version (hint: post the output of "pveversion -v")
  • the output of "zpool status" and "zfs list" before and after attempting your disaster scenario might be helpful as well
  • the error message you posted tells you to look at the logs - did you try that? zfs usually tells you why it cannot import a pool or mount a dataset..
  • did you shut down cleanly before removing the ssd?
 
  • what do you mean with "boot drive" (you probably mean that / is on a non-zfs ssd, and you have a zfs pool on separate disks?)
  • are you using an up-to-date version (hint: post the output of "pveversion -v")
  • the output of "zpool status" and "zfs list" before and after attempting your disaster scenario might be helpful as well
  • the error message you posted tells you to look at the logs - did you try that? zfs usually tells you why it cannot import a pool or mount a dataset..
  • did you shut down cleanly before removing the ssd?

Answers to your questions are below.

1. Correct. My SSD is where Proxmox is installed. All templates, storage and containers are kept on the ZFS pool.

2. I installed a fresh copy of Proxmox in my VirtualBox test environment; the output of pveversion -v is below.
Code:
proxmox-ve: 4.2-60 (running kernel: 4.4.15-1-pve)
pve-manager: 4.2-17 (running version: 4.2-17/e1400248)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.8-1-pve: 4.4.8-52
pve-kernel-4.4.15-1-pve: 4.4.15-60
pve-kernel-4.2.8-1-pve: 4.2.8-41
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve2
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-43
qemu-server: 4.0-85
pve-firmware: 1.1-8
libpve-common-perl: 4.0-72
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-56
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6-1
pve-container: 1.0-72
pve-firewall: 2.0-29
pve-ha-manager: 1.0-33
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.3-4
lxcfs: 2.0.2-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5.7-pve10~bpo80

3. Output of "zpool status"
Code:
  pool: tank
state: ONLINE
  scan: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        tank                                       ONLINE       0     0     0
          mirror-0                                 ONLINE       0     0     0
            ata-VBOX_HARDDISK_VB4b8c215d-263617ae  ONLINE       0     0     0
            ata-VBOX_HARDDISK_VB484ddea6-c17af634  ONLINE       0     0     0

errors: No known data errors

Output of "zpool list"
Code:
tank  9.94G   602M  9.35G         -     3%     5%  1.00x  ONLINE  -
The result of the above commands is identical before and after.

4. Output of systemctl status zfs-mount.service
Code:
● zfs-mount.service - Mount ZFS filesystems
   Loaded: loaded (/lib/systemd/system/zfs-mount.service; static)
   Active: failed (Result: exit-code) since Wed 2016-08-24 02:21:44 AEST; 5min ago
  Process: 1665 ExecStart=/sbin/zfs mount -a (code=exited, status=1/FAILURE)
Main PID: 1665 (code=exited, status=1/FAILURE)

Aug 24 02:21:44 test2 zfs[1665]: cannot mount '/tank': failed to create mountpoint
Aug 24 02:21:44 test2 zfs[1665]: cannot mount '/tank/subvol-101-disk-1': failed to create mountpoint
Aug 24 02:21:44 test2 zfs[1665]: cannot mount '/tank/template': failed to create mountpoint
Aug 24 02:21:44 test2 systemd[1]: zfs-mount.service: main process exited, code=exited, status=1/FAILURE
Aug 24 02:21:44 test2 systemd[1]: Failed to start Mount ZFS filesystems.
Aug 24 02:21:44 test2 systemd[1]: Unit zfs-mount.service entered failed state.

5. Yes.

I am going to leave the VM in the state it is currently in, as once I run "zfs mount -a" the error clears itself and doesn't return.

Hopefully this added information helps, apologies it wasn't included in my original post.
 
seems like the issue is that the mountpoint paths cannot be created at the time that zfs-mount.service starts (is "/" still mounted ro at that point?). I'll see whether I can reproduce this.
 
I am not sure if I have understood your question, but please see below for a screenshot of the moment the error occurs on boot and the output of df -h.

[Screenshot: the ZFS mount failure shown during boot]


df -h
Code:
Filesystem      Size  Used Avail Use% Mounted on
udev             10M     0   10M   0% /dev
tmpfs           401M  5.8M  395M   2% /run
/dev/dm-0       2.0G  1.2G  684M  63% /
tmpfs          1001M   25M  976M   3% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs          1001M     0 1001M   0% /sys/fs/cgroup
cgmfs           100K     0  100K   0% /run/cgmanager/fs
tmpfs           100K     0  100K   0% /run/lxcfs/controllers
/dev/fuse        30M   16K   30M   1% /etc/pve
 
the problem is that zfs-mount happens very early in the boot. when you reinstall your system, the mountpoint directories are not there (/tank and the subdirectories in your case). at the point in the boot process when the zfs-mount service runs, it is unable to create those directories (because it runs before systemd remounts "/"). normally these directories exist, and mounting works. when you run "zfs mount -a" (which is exactly what the service does) after the boot has completed, creating the directories works (because "/" is mounted normally).

to make a long story short: either do a manual "zfs mount -a" after reinstalling, or create the needed directories manually after reinstalling. this behaviour is expected.
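
concretely, after the reinstall something along these lines should be enough (dataset names taken from your output above):

Code:
# once the system has finished booting ("/" is mounted read-write by then)
zfs mount -a                          # creates /tank etc. and mounts all datasets
zfs list -o name,mountpoint,mounted   # verify every dataset shows "yes" under MOUNTED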
 
Excellent, thanks for the detailed answer; it provided the confirmation I was looking for. It is easy enough to run "zfs mount -a", so that is the course of action I will take.

Hopefully this information will also assist others in a similar situation.
 
the problem is that zfs-mount happens very early in the boot. when you reinstall your system, the mountpoint directories are not there (/tank and the subdirectories in your case). at the point in the boot process when the zfs-mount service runs, it is unable to create those directories (because it runs before systemd remounts "/"). normally these directories exist, and mounting works. when you run "zfs mount -a" (which is exactly what the service does) after the boot has completed, creating the directories works (because "/" is mounted normally).

to make a long story short: either do a manual "zfs mount -a" after reinstalling, or create the needed directories manually after reinstalling. this behaviour is expected.
Great Info shared here, thanks...

I am using a separate SATADOM just for OS and integrated Ceph for storage.

Would there be anything different and/or unique to restoring the "Boot Drive" with Ceph instead of ZFS?

Would the restore still just be copying a backup of the /etc/pve folder to the replacement drive?

Using PBS, which is preferable for boot drive backup: a .img backup, so I can just restore the image to a new drive when the old one fails, or a .pxar backup, restoring the /etc/pve folder after reinstalling Proxmox on the replacement drive?

Any suggestions on how often to back up the boot drive to keep a good enough copy to restore from? Daily?

Can the normally generated logs be redirected to another storage instead of the boot drive?


df -h
Code:
Filesystem                             Size  Used Avail Use% Mounted on
udev                                   252G     0  252G   0% /dev
tmpfs                                   51G  2.1M   51G   1% /run
/dev/mapper/pve-root                    29G   12G   17G  41% /
tmpfs                                  252G   60M  252G   1% /dev/shm
tmpfs                                  5.0M     0  5.0M   0% /run/lock
/dev/sda2                              511M  312K  511M   1% /boot/efi
tmpfs                                  252G   28K  252G   1% /var/lib/ceph/osd/ceph-7
tmpfs                                  252G   28K  252G   1% /var/lib/ceph/osd/ceph-17
tmpfs                                  252G   28K  252G   1% /var/lib/ceph/osd/ceph-23
tmpfs                                  252G   28K  252G   1% /var/lib/ceph/osd/ceph-5
tmpfs                                  252G   28K  252G   1% /var/lib/ceph/osd/ceph-6
tmpfs                                  252G   28K  252G   1% /var/lib/ceph/osd/ceph-4
192.168.1.50:/vol4/Prox-store-Cluster   37T   18T   19T  48% /mnt/pve/NAS-Store
192.168.1.50:/vol4/Prox-back-Cluster    37T   18T   19T  48% /mnt/pve/NAS-Back
/dev/fuse                              128M  196K  128M   1% /etc/pve
tmpfs                                   51G     0   51G   0% /run/user/0

pveversion -v
Code:
proxmox-ve: 7.2-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-5.15: 7.2-3
pve-kernel-helper: 7.2-3
pve-kernel-5.13: 7.1-9
pve-kernel-5.0: 6.0-11
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-6
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 
a full backup of the root disk is probably easier to restore from (less manual work) if it's recent enough, but both approaches should work. note that just /etc/pve will not be enough in case of a clustered system, you also need some parts of /etc/ (corosync config, network config, ..) and /var/ (pmxcfs state).
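
if you go the .pxar route, a rough sketch of such a host backup could look like this (archive names, user and datastore are placeholders, not anything from your setup):

Code:
# example only -- back up the host config and pmxcfs state to PBS
proxmox-backup-client backup \
    pve-etc.pxar:/etc \
    pmxcfs.pxar:/var/lib/pve-cluster \
    --repository backup@pbs@<pbs-host>:<datastore>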

Can the normally generated logs be redirected to another storage instead of the boot drive?

sure, just configure syslog/journal accordingly, but beware of boot-time ordering ;)
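
for example, a minimal sketch (the remote address is just a placeholder): keep the journal in RAM and forward everything to a remote syslog host:

Code:
# /etc/systemd/journald.conf -- keep the systemd journal off the boot drive (RAM only, lost on reboot)
[Journal]
Storage=volatile

# /etc/rsyslog.d/remote.conf -- forward all syslog messages to a remote host via TCP
*.* @@logserver.example:514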
 
