Clevis LUKS at boot, intermittent failure to auto-unlock

RubenRat

New Member
Jun 24, 2024
I am trying to auto-unlock non-root volumes at boot on PVE. The system boots from an unencrypted 32 GB ZFS RAID1, but VMs are to be stored on a mirrored zpool built on top of two LUKS partitions. The reason for this (as some of you are no doubt aware) is that you can't replicate VMs stored on natively encrypted ZFS. I'm aware of the caveat that you need to make sure you don't accidentally decrypt something when replicating it, and I'm OK with managing that risk.

For auto-unlocking the LUKS partitions I have installed a tang server and used "clevis luks bind" to bind the LUKS partitions to it. Unlocking via the tang server works, i.e. every single time I run "clevis luks unlock" it succeeds.

The problem is at boot time. SOMETIMES it works and the LUKS volumes are unlocked by clevis-systemd using the _netdev entries I've put in /etc/crypttab. Other times it doesn't work and the boot just waits for the password, so I have to KVM into the server and type the password interactively to decrypt the devices.
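For reference, a crypttab line has four whitespace-separated fields (name, source device, key file, options), and _netdev goes in the comma-separated options field, for example "luks-vm1 UUID=... none luks,_netdev" (the name luks-vm1 is made up; the actual entries aren't shown in this post). When debugging this kind of problem it can help to confirm which entries really carry the option. A rough sketch, where the helper name netdev_entries is hypothetical:

```shell
#!/bin/sh
# Sketch: print the names of crypttab entries whose options include _netdev.
# netdev_entries is a hypothetical helper name, not part of clevis or systemd.
netdev_entries() {
    # $1 = path to a crypttab file (normally /etc/crypttab)
    awk '
        /^[ \t]*(#|$)/ { next }    # skip comments and blank lines
        {
            # field 4 holds the comma-separated options
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "_netdev") { print $1; break }
        }
    ' "$1"
}

# Example: netdev_entries /etc/crypttab
```

Any entry that should be unlocked over the network but is missing from the output would be handled with the local (non-network) devices instead.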

Unfortunately one of the machines that'll be in the cluster hasn't got IPMI, so I need to get this sorted before I can cluster them. Otherwise that machine will get stuck at boot waiting for me to physically attend it, no bueno.

It's obviously (well, maybe I'm wrong, but it SEEMS obvious) some sort of timing issue: sometimes the crypttab is being processed before networking is available. Like I said, the tang server (in fact, there's more than one of them) works perfectly, without fail, every time I interactively use "clevis luks unlock". It only ever fails during the PVE boot process, presumably because at the instant it tries to contact the tang server it can't reach it yet, whereas on another boot it can.

To attempt to resolve it I've added "After=network-online.target" to /lib/systemd/system/remote-cryptsetup.target but that doesn't seem to have fixed it.

I'm really in need of someone more familiar than I am with systemd and the Debian boot process to point me in the right direction. How can I either make it wait until some systemd event, like a target or service being reached, or even just introduce a fixed delay? That would be a kludge, but I strongly suspect that just a few seconds of crude delay before crypttab is processed would be enough to make it work every boot.
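As an alternative to a blind sleep, one option would be to poll the tang server until it answers, with an upper bound on the wait. A minimal sketch, assuming curl is installed and that the server answers on tang's advertisement path /adv; the function names and the URL in the example are placeholders, not anything from my setup:

```shell
#!/bin/sh
# Sketch: wait until a tang server responds, instead of sleeping a fixed time.
# probe_tang and wait_for_tang are hypothetical helper names.

probe_tang() {
    # Succeeds if the advertisement endpoint answers within 2 seconds.
    curl -sf --max-time 2 -o /dev/null "$1/adv"
}

wait_for_tang() {
    # $1 = base URL of the tang server, $2 = max attempts (default 30)
    url=$1
    tries=${2:-30}
    i=0
    while [ "$i" -lt "$tries" ]; do
        probe_tang "$url" && return 0
        i=$((i + 1))
        sleep 1
    done
    return 1    # gave up: server never became reachable
}

# Example: wait_for_tang http://tang.example 30
```

Something like this could in principle be called from an ExecStartPre= line in a drop-in for whichever unit performs the unlock, though I haven't verified how that interacts with the password prompt fallback.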
 
Hmmm, I might have just solved this by adding Wants=network-online.target and After=network-online.target to clevis-luks-askpass.service (instead of remote-cryptsetup.target). If anyone has good insight into the boot process and can tell me whether that should work consistently, I'd still appreciate their input. I figured I'd note it here now because it has successfully completed a few reboots like that, and it might be helpful to someone else.
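In case it helps anyone reproduce this: the two directives can live in a drop-in rather than in the packaged unit file, so a package update doesn't overwrite them. A rough sketch; the drop-in file name 10-network-online.conf and the helper name are arbitrary choices, and only the unit name and the two directives come from what I described above:

```shell
#!/bin/sh
# Sketch: write the Wants=/After= ordering as a systemd drop-in.
# write_askpass_dropin and the file name 10-network-online.conf are hypothetical.
write_askpass_dropin() {
    # $1 = systemd config root, normally /etc/systemd/system
    dir="$1/clevis-luks-askpass.service.d"
    mkdir -p "$dir"
    cat > "$dir/10-network-online.conf" <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
}

# As root, then reload systemd so it picks up the drop-in:
#   write_askpass_dropin /etc/systemd/system && systemctl daemon-reload
```

Note that network-online.target only delays anything if some service actually waits for the network before it is reached, so whether this is sufficient probably depends on how networking is brought up on the host.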

EDIT: That left me mounting the ZFS datasets OK, but pve-storage.target was then being reached too early and autostart VMs were failing to start. To resolve that I added After=remote-cryptsetup.target as an ordering dependency for pve-storage.target, and now everything seems to occur in the correct order.
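The same drop-in approach should work for the target. A sketch under the same assumptions as before (helper name and file name 10-after-remote-cryptsetup.conf are made up; the target names and the After= line are the ones from the edit above):

```shell
#!/bin/sh
# Sketch: order pve-storage.target after remote-cryptsetup.target via a drop-in.
# write_pve_storage_dropin and the drop-in file name are hypothetical.
write_pve_storage_dropin() {
    # $1 = systemd config root, normally /etc/systemd/system
    dir="$1/pve-storage.target.d"
    mkdir -p "$dir"
    cat > "$dir/10-after-remote-cryptsetup.conf" <<'EOF'
[Unit]
After=remote-cryptsetup.target
EOF
}

# As root, then reload systemd:
#   write_pve_storage_dropin /etc/systemd/system && systemctl daemon-reload
```

After= on its own only affects ordering when both units are part of the same boot transaction; it doesn't pull remote-cryptsetup.target in by itself.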
 
