CT/VM corrupted on reboot when on encrypted ZFS

proxbear

New Member
Apr 25, 2020
I'm running my VMs & CTs from image files stored on an encrypted ZFS pool. It's the same pool I boot from; the boot partition is not encrypted. There is a separate dataset called "encrypted" for this.

Everything works nicely: after a reboot I just run a ZFS mounting script, which asks for the passphrase and mounts all the encrypted datasets.
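A mounting script of the kind described here could look roughly like the sketch below. It is not the poster's actual script. It assumes the encrypted dataset is `rpool/encrypted` (the name used later in this thread) and that the key is a passphrase entered at the prompt; the function names are hypothetical.

```python
import subprocess

def unlock_commands(dataset):
    """Return the zfs commands that load keys for a dataset tree and mount it."""
    return [
        # -r loads the keys for the dataset and all of its descendants;
        # with keylocation=prompt, zfs asks for the passphrase on the terminal.
        ["zfs", "load-key", "-r", dataset],
        # Mount every ZFS filesystem that can be mounted now.
        ["zfs", "mount", "-a"],
    ]

def unlock_and_mount(dataset="rpool/encrypted"):
    """Run the commands in order, stopping at the first one that fails."""
    for argv in unlock_commands(dataset):
        subprocess.run(argv, check=True)
```

Calling `unlock_and_mount()` as root after boot would prompt for the passphrase once per encryption root and then mount the filesystems.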

The only problem: when I reboot the Proxmox host, all the VMs & CTs get corrupted. Inside the subvol dirs I can confirm that most of the root dirs are missing, or the subvol is completely empty.

This is very ugly. Is there a possible workaround (a script to write or a procedure to follow) so that I don't have to restore all the corrupted machines from backup after every reboot? I've tried stopping the machines first, but this didn't help: the reboot destroyed them anyway.

Thank you,
Andrej
 
Hmm - do you have the container set to autostart?
If yes, does this also happen if you disable autostart and start them manually after you've entered the passphrase (so that the actual subvols are mounted when you start the containers)?

If no - please provide the journal from the last boot.
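One way to confirm that the subvols are really mounted before starting any container is to look at the `mounted` property of each dataset. The following is only a sketch: it assumes the datasets live under `rpool/encrypted`, and the function names are hypothetical.

```python
import subprocess

def unmounted_datasets(zfs_list_output):
    """Parse `zfs list -H -o name,mounted` output; return names with mounted=no."""
    missing = []
    for line in zfs_list_output.splitlines():
        if not line.strip():
            continue
        # -H makes zfs print tab-separated fields without a header line.
        name, mounted = line.split("\t")
        # Only filesystems report yes/no; volumes are expected to show "-"
        # and are skipped here.
        if mounted == "no":
            missing.append(name)
    return missing

def check_mounted(parent="rpool/encrypted"):
    """Ask zfs for the mount state of all datasets below parent."""
    out = subprocess.run(
        ["zfs", "list", "-H", "-r", "-o", "name,mounted", parent],
        check=True, capture_output=True, text=True,
    ).stdout
    return unmounted_datasets(out)
```

If `check_mounted()` returns a non-empty list, those filesystems are not mounted yet, and starting a container that uses them would operate on the bare mountpoint directory instead.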
 
Of course I have no autostart ;-). I'm attaching the log-file, thank you!
 

Attachments: journal.zip (221.6 KB)
Hmm, I did not see the problem directly in the log.
Could you post the following (after a reboot, when the container does not start or is corrupted):
* the container's config (`cat /etc/pve/lxc/VMID.conf`)
* debug output from trying to start the container - see https://pve.proxmox.com/pve-docs/chapter-pct.html#_obtaining_debugging_logs
* all properties of the container's datasets (`zfs get all rpool/encrypted/subvol-VMID-disk-X`)
* the output of find on the container root before and after trying to start it (`find /rpool/encrypted/subvol-VMID-disk-X`)

(Same info for the VM as well.)
What are the symptoms of the corrupted VM - how far does it get, and what's the last error you see?
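The before/after `find` comparison requested above can also be done programmatically. This is a minimal sketch with hypothetical helper names; it records the paths under a container root and reports what disappeared or appeared between two listings.

```python
import os

def list_tree(root):
    """Return the set of paths below root, relative to root (similar to `find`)."""
    paths = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.add(os.path.relpath(os.path.join(dirpath, name), root))
    return paths

def diff_trees(before, after):
    """Return (paths that disappeared, paths that appeared), each sorted."""
    return sorted(before - after), sorted(after - before)
```

Taking one listing before the start attempt and one after, then passing both to `diff_trees`, shows whether the start attempt itself changes the contents of the subvol directory.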
 
Ok, I'll post everything after the next reboot - but this can take up to 1 week, as a lot of initial backup tasks are currently running.

The symptoms: PVE doesn't even realize something went wrong. Sometimes it even creates a corrupted backup and reports it as "OK" (e.g. 20 MB instead of 150 GB), and it happily "restores" from such a corrupted backup, also "OK", but of course nothing runs afterwards. I had a really hard time realizing what was going on, as 2 weeks ago I didn't even know there was such a thing as "Proxmox" :D.

After the reboot and remounting the encrypted ZFS storage, the CTs give me an error when I try to start them. Sorry, I don't remember exactly what it was, but it smelled like a corrupted file system (different messages, ranging from "locking error" to "not found" and such). So I guessed it was a screwed-up volume; having played with Proxmox for a few days, I at least knew where to look for the volume. This is another nasty thing: deleting a CT, even with "purge", leaves the old volumes lying around, like disk-2/3/4/..., so I always have to delete them from the command line.

Of course, one day I won't reboot/remount so often, but I guess this can be quite valuable for you, as I'm playing with it a lot now while setting everything up from scratch. I moved away from FreeNAS (mainly because of the dumb, unstable bhyve: one VM can tear down the whole bloody hypervisor and corrupt all the VMs). The main issue while trying all this under Proxmox was that I had to set up my own snapshotting; I was looking for an obvious way in the UI and was quite shocked to find nothing there ;). So I've written a Python script that takes a few params like pool-path, periodicity, prefix and keep-count. The biggest headache was trying to put it into crontab when you guys are using cron.d; it took me quite a few hours to figure out, as the dumb crontab was of course nicely there, just waiting for me to try it endlessly and hopelessly.
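For readers wondering what such a snapshot script might look like, here is a rough sketch. It is not the poster's script: the snapshot naming scheme and all function names are assumptions, and periodicity is left to cron rather than handled in the code.

```python
import subprocess
from datetime import datetime

def snapshot_name(dataset, prefix, now):
    """Build a name like rpool/encrypted@hourly-20200425-120000 (assumed scheme)."""
    return f"{dataset}@{prefix}-{now:%Y%m%d-%H%M%S}"

def snapshots_to_prune(names, dataset, prefix, keep):
    """Return the oldest snapshots with our prefix beyond the newest `keep`."""
    marker = f"{dataset}@{prefix}-"
    # The timestamp format sorts lexicographically in chronological order.
    ours = sorted(n for n in names if n.startswith(marker))
    return ours[:-keep] if keep > 0 else ours

def rotate(dataset, prefix, keep):
    """Take a new snapshot, then destroy the ones that exceed keep."""
    subprocess.run(
        ["zfs", "snapshot", snapshot_name(dataset, prefix, datetime.now())],
        check=True,
    )
    out = subprocess.run(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name", "-d", "1", dataset],
        check=True, capture_output=True, text=True,
    ).stdout
    for snap in snapshots_to_prune(out.split(), dataset, prefix, keep):
        subprocess.run(["zfs", "destroy", snap], check=True)
```

One common stumbling block when moving such a job from a personal crontab to `/etc/cron.d` is that files there use the system crontab format, which has an extra user field between the schedule and the command.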

I think Proxmox is quite a good compromise between pure cmd-line and pure UI, but it lacks a few features. I will probably subscribe once it makes me happy and plays nicely, to support you guys a bit :cool:. As a senior SW architect I understand you quite well!
 
> deleting a CT, even with "purge", keeps the old volumes lying around, like disk-2/3/4/.... so I have to delete it from cmd-line always.

That's odd and should not happen - if it does, please post the task log of the deletion.

> Ok, I'll post everything after the next reboot - but this can take up to 1 week, as currently a lot of backup tasks are initially running.

Thanks!

> I think the Proxmox is quite a good compromise between pure cmd-line and pure-UI; but it lacks a few features. I will subscribe probably when it makes me happy and plays nicely, to support you guys a bit :cool:. As a senior SW architect I understand you quite well!

Glad to hear - always great to get some support :)
 
