Can't start VM with OVMF and UEFI disk on Ceph

papaf76 · Jan 13, 2021

Hi all,
as per title, if I put the UEFI disk required for OVMF on my Ceph shared storage, the VM will not start. Or rather, it doesn't work properly: it shows as started in the web UI and the qemu process is running, but the machine is inaccessible from the console and the resources in use clearly aren't what you would expect from a running VM (0% CPU usage and very low RAM usage).
As soon as I move the UEFI disk to local storage, or if I choose SeaBIOS, everything is fine.
BTW, I'm trying to use OVMF because it's currently the easiest way for me to get a higher resolution in Windows.

Alwin · Jan 13, 2021

papaf76 said:
as per title, if I put the UEFI disk required for OVMF on my Ceph shared storage, the VM will not start. Or rather, it doesn't work properly: it shows as started in the web UI and the qemu process is running, but the machine is inaccessible from the console and the resources in use clearly aren't what you would expect from a running VM (0% CPU usage and very low RAM usage).

I tested this on my cluster and a VM started without issue. What's the output of pveversion -v on your node, and of qm config <vmid>?

papaf76 · Jan 13, 2021

Here's the output of the two commands.
pveversion -v:

Code:

root@hyper3:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph: 15.2.8-pve2
ceph-fuse: 15.2.8-pve2
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

and the vm config:

Code:

root@hyper3:~# qm config 401
balloon: 2048
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 2
efidisk0: shared1:vm-401-disk-1,size=1M
ide0: iso:iso/virtio-win-0.1.189.iso,media=cdrom,size=488766K
ide2: iso:iso/en_windows_server_2019_updated_march_2019_x64_dvd_2ae967ab.iso,media=cdrom
memory: 4096
name: winsvr2019
net0: virtio=72:01:2D:1F:24:57,bridge=vmbr0
numa: 0
ostype: win10
scsi0: shared1:vm-401-disk-0,discard=on,size=50G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=7418dc77-1c83-452c-bc20-7ea1c78d8026
sockets: 1
vga: vmware
vmgenid: 67ce111a-34d1-4804-bcfa-8fa5fe82a7a0
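
To confirm which storage a given disk sits on without reading the whole config, the `qm config` output can be filtered. This is a minimal sketch, not a Proxmox tool: `disk_storage` is a hypothetical helper name, and it assumes the `key: storage:volume,options` line layout shown in the config above.

```shell
#!/bin/sh
# Print the storage ID that backs a given disk key (default: efidisk0),
# reading `qm config` output from stdin.
disk_storage() {
    key="${1:-efidisk0}"
    # Config lines look like "efidisk0: shared1:vm-401-disk-1,size=1M";
    # the storage ID is the text between the first ": " and the next ":".
    awk -v key="$key" -F': ' '$1 == key { split($2, parts, ":"); print parts[1] }'
}

# Usage on a Proxmox node (VM ID 401 is taken from the config above):
#   qm config 401 | disk_storage efidisk0
```

For the config above, both `efidisk0` and `scsi0` resolve to `shared1`, so the EFI disk and the OS disk share the same Ceph-backed storage.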

Alwin · Jan 13, 2021

My test VM starts with the above settings. Or do you mean the OS doesn't boot?

papaf76 · Jan 13, 2021

Alwin said:
My test VM starts with the above settings. Or do you mean the OS doesn't boot?

Yeah, the VM status is started and I can see some memory usage. However, it's not possible to connect to the console, and the memory usage stays very low, when I know that a Windows installation on an empty VM, as in this case, usually spikes pretty high. It's as if the machine isn't able to get past the POST.
Related to this, I'm working with a test setup here, and Ceph is probably running with a very bad configuration. It's using three 120 GB SSDs; they were more than 60% full and the I/O was getting horrible. Could this be the culprit? I have since removed other VMs and Ceph now reports 10% usage, but the I/O is still very bad. Do I need to do anything else? When this setup was freshly installed, I/O was fine.

need2gcm · Jan 13, 2021

Ran into this last night on a Windows 10 VM after I completed a kernel switch reboot and upgraded Ceph to Octopus.

What fixed it for me was deleting and recreating the EFI disk.

EDIT: My package versions:

Code:

proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 15.2.8-pve2
ceph-fuse: 15.2.8-pve2
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

papaf76 · Jan 14, 2021

I have the same Ceph version but unfortunately, even recreating the EFI disk doesn't solve the issue.

The problem seems to be an extreme case of slowness: if I leave the VM sitting there long enough, it does seem to start. For now I had to convert it to SeaBIOS, as I need it working; I will test more later on.

need2gcm · Jan 25, 2021

Just had this occur on another UEFI VM that had not previously had issues on this package version set. I'll poke around for a bit and share if I find anything I think is useful.

~~EDIT 1: Interesting, the host that the UEFI VM is on gets about 6% more I/O delay than is normal for that host, and no other hosts show that behavior during attempted boots of that VM.~~ Root cause was unrelated.

EDIT 2: Even halting the VM takes a while. The syslog:

Code:

Jan 25 09:09:50 pvenode4 pvedaemon[2634330]: VM 105 qmp command failed - VM 105 qmp command 'guest-ping' failed - got timeout
Jan 25 09:09:52 pvenode4 pve-ha-crm[2516]: got crm command: stop vm:105 0
Jan 25 09:09:52 pvenode4 pve-ha-crm[2516]: request immediate service hard-stop for service 'vm:105'
Jan 25 09:09:52 pvenode4 pve-ha-crm[2516]: service 'vm:105': state changed from 'started' to 'request_stop'  (timeout = 0)
Jan 25 09:10:00 pvenode4 systemd[1]: Starting Proxmox VE replication runner...
Jan 25 09:10:00 pvenode4 systemd[1]: pvesr.service: Succeeded.
Jan 25 09:10:00 pvenode4 systemd[1]: Started Proxmox VE replication runner.
Jan 25 09:10:01 pvenode4 pve-ha-lrm[2401273]: stopping service vm:105 (timeout=0)
Jan 25 09:10:01 pvenode4 pve-ha-lrm[2401277]: stop VM 105: UPID:pvenode4:0024A3FD:06801CEB:600EFB69:qmstop:105:root@pam:
Jan 25 09:10:01 pvenode4 pve-ha-lrm[2401273]: <root@pam> starting task UPID:pvenode4:0024A3FD:06801CEB:600EFB69:qmstop:105:root@pam:
Jan 25 09:10:06 pvenode4 pve-ha-lrm[2401273]: Task 'UPID:pvenode4:0024A3FD:06801CEB:600EFB69:qmstop:105:root@pam:' still active, waiting
Jan 25 09:10:10 pvenode4 pvedaemon[1802413]: VM 105 qmp command failed - VM 105 qmp command 'guest-ping' failed - got timeout
Jan 25 09:10:10 pvenode4 pvedaemon[2634330]: VM 105 qmp command failed - VM 105 qmp command 'query-proxmox-support' failed - unable to connect to VM 105 qmp socket - timeout after 31 retries
Jan 25 09:10:11 pvenode4 pve-ha-lrm[2401273]: Task 'UPID:pvenode4:0024A3FD:06801CEB:600EFB69:qmstop:105:root@pam:' still active, waiting
Jan 25 09:10:11 pvenode4 pvestatd[2482]: VM 105 qmp command failed - VM 105 qmp command 'query-proxmox-support' failed - unable to connect to VM 105 qmp socket - timeout after 31 retries
Jan 25 09:10:12 pvenode4 pvestatd[2482]: status update time (6.376 seconds)
Jan 25 09:10:16 pvenode4 pve-ha-lrm[2401273]: Task 'UPID:pvenode4:0024A3FD:06801CEB:600EFB69:qmstop:105:root@pam:' still active, waiting
Jan 25 09:10:21 pvenode4 pvestatd[2482]: VM 105 qmp command failed - VM 105 qmp command 'query-proxmox-support' failed - unable to connect to VM 105 qmp socket - timeout after 31 retries
Jan 25 09:10:21 pvenode4 pve-ha-lrm[2401273]: Task 'UPID:pvenode4:0024A3FD:06801CEB:600EFB69:qmstop:105:root@pam:' still active, waiting
Jan 25 09:10:21 pvenode4 pvestatd[2482]: status update time (6.393 seconds)
Jan 25 09:10:25 pvenode4 pmxcfs[2011]: [status] notice: received log
Jan 25 09:10:26 pvenode4 pve-ha-lrm[2401273]: Task 'UPID:pvenode4:0024A3FD:06801CEB:600EFB69:qmstop:105:root@pam:' still active, waiting
Jan 25 09:10:31 pvenode4 pve-ha-lrm[2401273]: Task 'UPID:pvenode4:0024A3FD:06801CEB:600EFB69:qmstop:105:root@pam:' still active, waiting
Jan 25 09:10:31 pvenode4 pvestatd[2482]: VM 105 qmp command failed - VM 105 qmp command 'query-proxmox-support' failed - unable to connect to VM 105 qmp socket - timeout after 31 retries
Jan 25 09:10:31 pvenode4 pvestatd[2482]: status update time (6.390 seconds)
Jan 25 09:10:36 pvenode4 pve-ha-lrm[2401273]: Task 'UPID:pvenode4:0024A3FD:06801CEB:600EFB69:qmstop:105:root@pam:' still active, waiting
Jan 25 09:10:41 pvenode4 pve-ha-lrm[2401273]: Task 'UPID:pvenode4:0024A3FD:06801CEB:600EFB69:qmstop:105:root@pam:' still active, waiting
Jan 25 09:10:42 pvenode4 pvestatd[2482]: VM 105 qmp command failed - VM 105 qmp command 'query-proxmox-support' failed - unable to connect to VM 105 qmp socket - timeout after 31 retries
Jan 25 09:10:42 pvenode4 pvestatd[2482]: status update time (6.380 seconds)
Jan 25 09:10:46 pvenode4 pve-ha-lrm[2401273]: Task 'UPID:pvenode4:0024A3FD:06801CEB:600EFB69:qmstop:105:root@pam:' still active, waiting
Jan 25 09:10:51 pvenode4 pvestatd[2482]: VM 105 qmp command failed - VM 105 qmp command 'query-proxmox-support' failed - unable to connect to VM 105 qmp socket - timeout after 31 retries
Jan 25 09:10:51 pvenode4 pve-ha-lrm[2401273]: Task 'UPID:pvenode4:0024A3FD:06801CEB:600EFB69:qmstop:105:root@pam:' still active, waiting
Jan 25 09:10:51 pvenode4 pvestatd[2482]: status update time (6.379 seconds)
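
To see at a glance which QMP commands are timing out for a VM, the log can be summarised with standard text tools. This is a minimal sketch: `qmp_failures` is a hypothetical helper name, and it assumes the `VM <id> qmp command '<name>' failed` message format shown in the log above.

```shell
#!/bin/sh
# Summarise failed QMP commands for one VM from syslog text on stdin.
# Prints "<count> <qmp-command>" per distinct command, most frequent first.
qmp_failures() {
    vmid="$1"
    grep "VM $vmid qmp command failed" \
        | sed -n "s/.*qmp command '\([^']*\)' failed.*/\1/p" \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Usage on the node (the log path may differ on your system):
#   qmp_failures 105 < /var/log/syslog
```

Run against the excerpt above, this reports 6 failures for `query-proxmox-support` and 2 for `guest-ping`.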

need2gcm · Jan 25, 2021

Sorry for the double post. Wanted to have this separate:

The only thing that got this VM working was making sure it was stopped, deleting the EFI disk, migrating the VM to a different host, then creating a new EFI disk and re-adding it to the UEFI boot.
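
The sequence above can be sketched with the `qm` CLI. This is a sketch under assumptions, not the exact commands used in this post: `efi_recreate_plan` is a hypothetical helper that only prints the commands, and the target node `pvenode5` and storage `shared1` in the usage line are placeholders. Depending on the setup, `--delete efidisk0` may leave the old volume attached as an unused disk that has to be removed separately, and an HA-managed VM may need to be stopped through the HA stack instead. Re-adding the boot entry inside the OVMF menu is a manual step not covered here.

```shell
#!/bin/sh
# Print (not run) the commands for the stop / delete / migrate / recreate
# sequence. Review the output, then run the lines by hand on the node.
efi_recreate_plan() {
    vmid="$1"; target="$2"; storage="$3"
    echo "qm stop $vmid"                        # make sure the VM is stopped
    echo "qm set $vmid --delete efidisk0"       # drop the old EFI disk from the config
    echo "qm migrate $vmid $target"             # offline-migrate to another node
    echo "qm set $vmid --efidisk0 ${storage}:1" # allocate a fresh EFI disk
}

# Usage (node and storage names are placeholders):
#   efi_recreate_plan 105 pvenode5 shared1
```

Printing the plan instead of executing it keeps the sketch safe to try; each line can be checked against the node's `qm help` output before it is run.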

papaf76 · Jan 27, 2021

Sorry for the late reply.
I appreciate the effort. I tried a couple more times, but it simply isn't working reliably enough for me, so I just switched to a SeaBIOS setup, which works fine.
Again, if I create the EFI disk anywhere else, it works fine.
