Windows 11 VM with TPM on GlusterFS: swtpm_setup fails

fes-it-admin

May 27, 2023
Hello,

I am currently trying to set up a Windows 11 VM, which requires TPM support. I have a Proxmox 7.4-3 cluster that uses a GlusterFS filesystem to store all VM data and images, so that each of our nodes can take over the VMs as failover. I tried setting this up, but the TPM setup fails with:

Code:
/bin/swtpm exit with status 256:
TASK ERROR: start failed: command 'swtpm_setup --tpmstate file://gluster://vm-host-01.fes/gv_vmdata/images/107/vm-107-disk-0.raw --createek --create-ek-cert --create-platform-cert --lock-nvram --config /etc/swtpm_setup.conf --runas 0 --not-overwrite --tpm2 --ecc' failed: exit code 1

This is my setup for the VM:

[attached screenshot: VM hardware configuration]

I am using a glusterfs replicate setup with this configuration:
Bash:
# gluster volume info
 
Volume Name: gv_vmdata
Type: Replicate
Volume ID: 7085d24d-c7cd-4fc0-9013-10056de09057
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.40.11:/data/proxmox/vm_data
Brick2: vm-host-01.fes:/data/proxmox/vm_data
Brick3: vm-data-01.fes:/data/proxmox/vm_data
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
cluster.granular-entry-heal: on
cluster.self-heal-daemon: enable

I am not sure how to debug this problem properly. Can someone please help me figure out how to proceed in order to find out what is going wrong?
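For reference, one way to surface the underlying error is to re-run the generated setup command by hand and capture its output. This is only a sketch: the state URI is copied from the task error above, and the log path is an arbitrary choice; adjust both to your setup.

```shell
# Re-run the failing setup command manually to capture its real error output.
# State URI copied from the task error above; /tmp/swtpm_debug.log is arbitrary.
swtpm_setup \
    --tpmstate "file://gluster://vm-host-01.fes/gv_vmdata/images/107/vm-107-disk-0.raw" \
    --config /etc/swtpm_setup.conf \
    --tpm2 --ecc --createek --not-overwrite \
    2>&1 | tee /tmp/swtpm_debug.log
# PIPESTATUS[0] holds swtpm_setup's own exit code, not tee's
echo "exit code: ${PIPESTATUS[0]}"
```

Running the exact command outside of the qmstart task usually prints the swtpm/libtpms error that the task log swallows.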
 
Thank you for the reply. Which filesystems that offer network storage are currently supported?
I think it's only GlusterFS that would need to be handled differently. All others (NFS, CIFS, CephFS, shared LVM, ...) should work for TPM drives.
 
Same issue with Proxmox 8.0.4 and Win11 ZFS-over-iSCSI
Putting TPM on CIFS share works.
 
Same issue with Proxmox 8.0.4 and Win11 ZFS-over-iSCSI
Putting TPM on CIFS share works.
So could you elaborate on your solution? I have the same problem with Win 11 and TPM on ZFS over iSCSI.
 
OK, got it. Guess there was a loop in my brain. So SMB/CIFS and NFS storage alike will probably work the same.
 
OK, just to add to the CIFS/NFS solution: this will unfortunately break the possibility to take snapshots (the TPM state is a .raw disk).
Is there any other possibility to add the TPM without losing the snapshot feature?
(removing the TPM after install works, but there won't be a TPM then...)
 
Same problem running TPM from Ceph after cloning (full) from a template.
Moved the TPM to an NFS share and the VM starts normally.

Moved it back to Ceph: again failure.
Moved the VM to another host and now it is working with TPM on Ceph. Could there be an underlying hardware issue?
 
Hi,
Same problem running TPM from Ceph after cloning (full) from a template.
Moved the TPM to an NFS share and the VM starts normally.

Moved it back to Ceph: again failure.
Moved the VM to another host and now it is working with TPM on Ceph. Could there be an underlying hardware issue?
please post the output of pveversion -v, your VM configuration (qm config <ID>), and the exact error message you get. Please also check your system logs/journal for any additional errors.
 
Not working host:

Code:
proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-9
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
pve-kernel-5.15.131-2-pve: 5.15.131-3
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 17.2.7-pve2
ceph-fuse: 17.2.7-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libqb0: 1.0.5-1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.4
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.5-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1

Working host:

Code:
proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-9
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
pve-kernel-5.15.131-2-pve: 5.15.131-3
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 17.2.7-pve2
ceph-fuse: 17.2.7-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libqb0: 1.0.5-1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.4
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.5-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1

VM configuration:
Code:
agent: 1
bios: ovmf
boot: order=scsi0;sata0
cores: 4
cpu: qemu64
description: -----
efidisk0: CephDiskSSD:vm-103-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
hotplug: disk
machine: pc-q35-8.1
memory: 8192
meta: creation-qemu=6.1.0,ctime=1662987504
name: BVMS-Showroom
net0: virtio=BC:24:11:C3:04:3C,bridge=vmbr0,firewall=1,tag=12
numa: 0
onboot: 1
ostype: win11
sata0: none,media=cdrom
scsi0: CephDiskSSD:vm-103-disk-1,cache=writeback,discard=on,iothread=1,size=74G,ssd=1
scsi1: Videodata:103/vm-103-disk-0.qcow2,backup=0,cache=writeback,discard=on,iothread=1,size=4000G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=4024f382-1426-4d01-a616-e0cdfe1af275
sockets: 1
tpmstate0: CephDiskSSD:vm-103-disk-2,size=4M,version=v2.0
vmgenid: 4079e4c9-2e83-4cee-b95d-17f756bf6526

Working on the syslog: I have to take the VM offline and move it back to the faulty node.
Live-migrated the running VM (HA) to the faulty node: it still works.
Shut the VM down on that node.
Trying to start it again fails.

Code:
Feb 15 16:09:38 PM6 pve-ha-lrm[2621119]: start failed: command 'swtpm_setup --tpmstate file:///dev/rbd-pve/4a2d9f75-72f5-40fd-abd3-8537a976ff9f/CephDiskSSD/vm-103-disk-2 --createek --create-ek-cert --create-platform-cert --lock-nvram --config /etc/swtpm_setup.conf --runas 0 --not-overwrite --tpm2 --ecc' failed: exit code 1
Feb 15 16:09:38 PM6 pve-ha-lrm[2621116]: <root@pam> end task UPID:PM6:0027FEBF:033E1B95:65CE2932:qmstart:103:root@pam: start failed: command 'swtpm_setup --tpmstate file:///dev/rbd-pve/4a2d9f75-72f5-40fd-abd3-8537a976ff9f/CephDiskSSD/vm-103-disk-2 --createek --create-ek-cert --create-platform-cert --lock-nvram --config /etc/swtpm_setup.conf --runas 0 --not-overwrite --tpm2 --ecc' failed: exit code 1
Feb 15 16:09:38 PM6 pve-ha-lrm[2621116]: unable to start service vm:103
Feb 15 16:09:38 PM6 pve-ha-lrm[1849]: unable to start service vm:103 on local node after 1 retries

After 2 tries it migrates away (since it is HA) and starts correctly on another node.
I must admit this is happening on my oldest node.

Node info:
[attached screenshot: node summary]
 
Code:
Feb 15 16:09:38 PM6 pve-ha-lrm[2621119]: start failed: command 'swtpm_setup --tpmstate file:///dev/rbd-pve/4a2d9f75-72f5-40fd-abd3-8537a976ff9f/CephDiskSSD/vm-103-disk-2 --createek --create-ek-cert --create-platform-cert --lock-nvram --config /etc/swtpm_setup.conf --runas 0 --not-overwrite --tpm2 --ecc' failed: exit code 1
Feb 15 16:09:38 PM6 pve-ha-lrm[2621116]: <root@pam> end task UPID:PM6:0027FEBF:033E1B95:65CE2932:qmstart:103:root@pam: start failed: command 'swtpm_setup --tpmstate file:///dev/rbd-pve/4a2d9f75-72f5-40fd-abd3-8537a976ff9f/CephDiskSSD/vm-103-disk-2 --createek --create-ek-cert --create-platform-cert --lock-nvram --config /etc/swtpm_setup.conf --runas 0 --not-overwrite --tpm2 --ecc' failed: exit code 1
Feb 15 16:09:38 PM6 pve-ha-lrm[2621116]: unable to start service vm:103
Feb 15 16:09:38 PM6 pve-ha-lrm[1849]: unable to start service vm:103 on local node after 1 retries
Unfortunately, there is no real information about why the swtpm_setup call failed. Can you check the failed "VM 103 - Start" task (double-click to open the log) in the bottom panel of the UI, or your node's Task History? Maybe there is additional output there.
 
Unfortunately, there is no real information about why the swtpm_setup call failed. Can you check the failed "VM 103 - Start" task (double-click to open the log) in the bottom panel of the UI, or your node's Task History? Maybe there is additional output there.

task started by HA resource agent
/dev/rbd0
TPM2_EvictControl failed: 0x14c
create_ek failed: 0x1
An error occurred. Authoring the TPM state failed.
swtpm_setup: Starting vTPM manufacturing as root:root @ Thu 15 Feb 2024 04:09:38 PM CET
swtpm_setup: TPM is listening on Unix socket.
swtpm_setup: Ending vTPM manufacturing @ Thu 15 Feb 2024 04:09:38 PM CET
TASK ERROR: start failed: command 'swtpm_setup --tpmstate file:///dev/rbd-pve/4a2d9f75-72f5-40fd-abd3-8537a976ff9f/CephDiskSSD/vm-103-disk-2 --createek --create-ek-cert --create-platform-cert --lock-nvram --config /etc/swtpm_setup.conf --runas 0 --not-overwrite --tpm2 --ecc' failed: exit code 1
 
task started by HA resource agent
/dev/rbd0
TPM2_EvictControl failed: 0x14c
create_ek failed: 0x1
Was this during a live migration or a fresh start?

From what I could quickly gather from the swtpm/libtpms source code, this indicates that a certain handle for the EK (Endorsement Key) was already present in the TPM image. But the command has a --not-overwrite parameter, so it should return early before performing these steps. Since it only happens on one machine and storage combination, there might be a timing issue somewhere.

What is the output when you run the following
Code:
rbd -p <your pool> map vm-103-disk-2 && swtpm socket --tpm2 --print-states --tpmstate backend-uri=file:///dev/rbd-pve/4a2d9f75-72f5-40fd-abd3-8537a976ff9f/CephDiskSSD/vm-103-disk-2
rbd -p <your pool> unmap vm-103-disk-2
replacing <your pool> with the actual pool name?
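The map/inspect/unmap sequence above can also be wrapped in a small helper so the image is always unmapped again even when swtpm fails. This is only a sketch; the pool, disk name, and backend URI are taken from this thread and would need adjusting.

```shell
#!/bin/bash
# Sketch: map the TPM state image, print swtpm's view of it, then unmap.
# Pool/disk names and the backend URI are the ones from this thread.
POOL="CephDiskSSD"
DISK="vm-103-disk-2"
URI="file:///dev/rbd-pve/4a2d9f75-72f5-40fd-abd3-8537a976ff9f/${POOL}/${DISK}"

inspect_tpm_state() {
    rbd -c /etc/pve/ceph.conf -p "$POOL" map "$DISK" || return 1
    swtpm socket --tpm2 --print-states --tpmstate "backend-uri=${URI}"
    local rc=$?
    # Always unmap again, even if swtpm returned an error.
    rbd -c /etc/pve/ceph.conf -p "$POOL" unmap "$DISK"
    return $rc
}
```

Call `inspect_tpm_state` on the node where the start fails; a healthy state file should list a `permall` blob.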
 
Was this during a live migration or a fresh start?

From what I could quickly gather from the swtpm/libtpms source code, this indicates that a certain handle for the EK (Endorsement Key) was already present in the TPM image. But the command has a --not-overwrite parameter, so it should return early before performing these steps. Since it only happens on one machine and storage combination, there might be a timing issue somewhere.

What is the output when you run the following
Code:
rbd -p <your pool> map vm-103-disk-2 && swtpm socket --tpm2 --print-states --tpmstate backend-uri=file:///dev/rbd-pve/4a2d9f75-72f5-40fd-abd3-8537a976ff9f/CephDiskSSD/vm-103-disk-2
rbd -p <your pool> unmap vm-103-disk-2
replacing <your pool> with the actual pool name?

This was during a fresh start. VM shutdown, moved to faulty node and started again. Right after start it failed, HA tried again (same error) and then HA migrated/moved it (offline) to a non-faulty node where it started normally.

The requested output bothers me; I get errors while running rbd:
Code:
rbd -p CephDiskSSD map vm-103-disk-2
did not load config file, using default settings.
2024-02-19T13:11:22.334+0100 7f93246a7500 -1 Errors while parsing config file!

2024-02-19T13:11:22.334+0100 7f93246a7500 -1 can't open ceph.conf: (2) No such file or directory

2024-02-19T13:11:22.334+0100 7f93246a7500 -1 Errors while parsing config file!

2024-02-19T13:11:22.334+0100 7f93246a7500 -1 can't open ceph.conf: (2) No such file or directory

unable to get monitor info from DNS SRV with service name: 2024-02-19T13:11:22.338+0100 7f93246a7500 -1 failed for service _ceph-mon._tcp

ceph-mon
2024-02-19T13:11:22.338+0100 7f93246a7500 -1 monclient: get_monmap_and_config cannot identify monitors to contact

rbd: couldn't connect to the cluster!

However, the cluster is running fine....

swtpm returns:
Code:
{ "type": "swtpm", "states": [ {"name": "permall", "size": 8699} ] }
 
This was during a fresh start. VM shutdown, moved to faulty node and started again. Right after start it failed, HA tried again (same error) and then HA migrated/moved it (offline) to a non-faulty node where it started normally.

The requested output bothers me; I get errors while running rbd:
Code:
rbd -p CephDiskSSD map vm-103-disk-2
did not load config file, using default settings.
2024-02-19T13:11:22.334+0100 7f93246a7500 -1 Errors while parsing config file!

2024-02-19T13:11:22.334+0100 7f93246a7500 -1 can't open ceph.conf: (2) No such file or directory

2024-02-19T13:11:22.334+0100 7f93246a7500 -1 Errors while parsing config file!

2024-02-19T13:11:22.334+0100 7f93246a7500 -1 can't open ceph.conf: (2) No such file or directory

unable to get monitor info from DNS SRV with service name: 2024-02-19T13:11:22.338+0100 7f93246a7500 -1 failed for service _ceph-mon._tcp

ceph-mon
2024-02-19T13:11:22.338+0100 7f93246a7500 -1 monclient: get_monmap_and_config cannot identify monitors to contact

rbd: couldn't connect to the cluster!
Can you try with rbd -c /etc/pve/ceph.conf -p CephDiskSSD map vm-103-disk-2? Or is this an external cluster?

However, the cluster is running fine....

swtpm returns:
Code:
{ "type": "swtpm", "states": [ {"name": "permall", "size": 8699} ] }
This looks fine and should cause swtpm_setup ... --not-overwrite ... to skip, which it doesn't for some reason. But the command should only work if the disk was already mapped.

What you could still try is mapping the disk before attempting to start the VM. Maybe there is some weird issue when there is not enough time between mapping and accessing.
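If pre-mapping the disk helps, it could be automated with a VM hookscript (attached with something like qm set 103 --hookscript local:snippets/map-tpm.sh). The sketch below only assumes the pool and disk names from this thread; the settle delay is a guess.

```shell
#!/bin/bash
# Hookscript sketch: Proxmox invokes this with $1=vmid and $2=phase.
# On pre-start, map the TPM state image early so swtpm_setup does not
# race against the RBD device becoming ready. Names are assumptions.
map_tpm_disk() {
    local vmid="$1" phase="$2"
    if [ "$phase" = "pre-start" ] && [ "$vmid" = "103" ]; then
        # "|| true" so a failed or already-done mapping does not block the start
        rbd -c /etc/pve/ceph.conf -p CephDiskSSD map "vm-${vmid}-disk-2" || true
        sleep 1   # small settle delay before the VM start proceeds
    fi
}
map_tpm_disk "$1" "$2"
```

This mirrors the manual workaround; if it makes the start reliable, that would support the timing-issue theory.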
 
Can you try with rbd -c /etc/pve/ceph.conf -p CephDiskSSD map vm-103-disk-2? Or is this an external cluster?


This looks fine and should cause swtpm_setup ... --not-overwrite ... to skip, which it doesn't for some reason. But the command should only work if the disk was already mapped.

What you could still try is mapping the disk before attempting to start the VM. Maybe there is some weird issue when there is not enough time between mapping and accessing.
Hi Fiona,

You are correct: if I map the TPM disk first via the shell, the VM starts normally. Shutting it down again (it unmaps automatically) and starting it once more fails again. But only on this node; it works on the 4 other nodes.
 
