Guest hangs when halt / shutdown issued (OpenMediaVault 5.x)?

verulian · Jul 29, 2021

I'm attempting to halt an OpenMediaVault 5 guest running on Proxmox 7, but the guest VM seems to hang after halt is issued directly on the guest.

On the PVE host, qm list shows:

Code:

      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID       
       111 OpenMediaVault       running    24000             32.00 48090

Clearly it's still being reported as running.

I installed qemu-guest-agent and the behavior didn't change, which was a big surprise to me.

Screenshot taken a minute or two after halt issued and with the shell entirely absent and unresponsive - cannot even get sshd to respond:

Left the system like this for quite a while and there was no change, except oddly after a long time the "CPU usage" went up from around 0.038 to 0.23...

Network traffic and disk I/O went to and stayed at zero.

I did "stop" on it from PVE and it went on down. It also restarted without any problem.

What do you all think about this? This is really problematic when you are shutting down PVE and have to wait a very, very, very long time for the system to end up killing the guest(s) that have this problem.

verulian · Aug 2, 2021

This is a real headache and I've found no solution for it yet. I can reboot the system and then just stop it, but that's clearly unnatural and clumsy. There must be a better way to make guest VMs shut down correctly, no?

Stefan_R · Aug 2, 2021

Please post your pveversion -v output and any logs from the time around where you shut down your VM (journalctl -e). You can also try qm status --verbose <vmid> when the VM is in the stuck state.

There can be multiple reasons for this, either on PVE side or in the guest. Does this happen to any other guests as well? Is there perhaps a kernel update or similar available in OMV to try?
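For anyone wanting to gather the three pieces of information requested above in one go, here is a rough sketch. The function name `collect_vm_diag` and the output file names are made up for illustration. Since `journalctl -e` opens a pager, the sketch swaps in `--since "15 min ago" --no-pager` to capture a time window into a file; adjust the window to match when the shutdown was attempted.

```shell
# Sketch only: collect_vm_diag and the file names below are hypothetical,
# not part of Proxmox. Run as root on the PVE host while the VM is stuck.
collect_vm_diag() {
    local vmid="$1" outdir="${2:-./vm-diag}"
    [ -n "$vmid" ] || { echo "usage: collect_vm_diag <vmid> [outdir]" >&2; return 2; }
    mkdir -p "$outdir" || return 1
    # Package versions on the PVE host.
    pveversion -v > "$outdir/pveversion.txt" 2>&1
    # QEMU's view of the VM in its stuck state.
    qm status "$vmid" --verbose > "$outdir/qm-status.txt" 2>&1
    # Host journal for the last 15 minutes (assumed window, adjust as needed).
    journalctl --since "15 min ago" --no-pager > "$outdir/journal.txt" 2>&1
    # Print where the files were written.
    echo "$outdir"
}
```

Calling `collect_vm_diag 111 /root/vm-111-diag` would leave three text files in that directory, ready to attach to a post.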

verulian · Aug 3, 2021

Thank you for looking into this further with me. I will need to wait until some off hours to do the actual shutdown to reproduce the issue, but here is the pveversion -v output in the meantime:

Code:

proxmox-ve: 7.0-2 (running kernel: 5.11.22-2-pve)
pve-manager: 7.0-10 (running version: 7.0-10/d2f465d3)
pve-kernel-5.11: 7.0-5
pve-kernel-helper: 7.0-5
pve-kernel-5.11.22-2-pve: 5.11.22-4
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph-fuse: 15.2.13-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.2.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-5
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-9
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.7-1
proxmox-backup-file-restore: 2.0.7-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-5
pve-cluster: 7.0-3
pve-container: 4.0-8
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.2-4
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-10
smartmontools: 7.2-1
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

No updates for OMV have become available since the hangs started.

verulian · Aug 3, 2021

Again, thank you for your suggestion. There are no other guests running, or even installed, on this server at this time.

During the hang I executed journalctl -e, but there was nothing eventful therein. Only the repeating:

Code:

Aug 03 16:42:00 server systemd[1]: Starting Proxmox VE replication runner...
Aug 03 16:42:01 server systemd[1]: pvesr.service: Succeeded.
Aug 03 16:42:01 server systemd[1]: Finished Proxmox VE replication runner.
Aug 03 16:42:01 server systemd[1]: pvesr.service: Consumed 1.053s CPU time.

I also grabbed the verbose status for review after the system was supposedly halted for about a minute:

Code:

root@server:~# qm status 111 --verbose
balloon: 34359738368
balloon_min: 1073741824
ballooninfo:
    actual: 34359738368
    free_mem: 25126653952
    last_update: 1628023342
    major_page_faults: 578011
    max_mem: 34359738368
    mem_swapped_in: 5638205440
    mem_swapped_out: 7291154432
    minor_page_faults: 111254810
    total_mem: 33667584000
blockstat:
    efidisk0:
        account_failed: 1
        account_invalid: 1
        failed_flush_operations: 0
        failed_rd_operations: 0
        failed_unmap_operations: 0
        failed_wr_operations: 0
        flush_operations: 0
        flush_total_time_ns: 0
        invalid_flush_operations: 0
        invalid_rd_operations: 0
        invalid_unmap_operations: 0
        invalid_wr_operations: 0
        rd_bytes: 0
        rd_merged: 0
        rd_operations: 0
        rd_total_time_ns: 0
        timed_stats:
        unmap_bytes: 0
        unmap_merged: 0
        unmap_operations: 0
        unmap_total_time_ns: 0
        wr_bytes: 0
        wr_highest_offset: 25088
        wr_merged: 0
        wr_operations: 0
        wr_total_time_ns: 0
    ide2:
        account_failed: 0
        account_invalid: 0
        failed_flush_operations: 0
        failed_rd_operations: 0
        failed_unmap_operations: 0
        failed_wr_operations: 0
        flush_operations: 0
        flush_total_time_ns: 0
        idle_time_ns: 253467993683924
        invalid_flush_operations: 0
        invalid_rd_operations: 0
        invalid_unmap_operations: 0
        invalid_wr_operations: 0
        rd_bytes: 4092
        rd_merged: 0
        rd_operations: 216
        rd_total_time_ns: 1515234
        timed_stats:
        unmap_bytes: 0
        unmap_merged: 0
        unmap_operations: 0
        unmap_total_time_ns: 0
        wr_bytes: 0
        wr_highest_offset: 0
        wr_merged: 0
        wr_operations: 0
        wr_total_time_ns: 0
    pflash0:
        account_failed: 1
        account_invalid: 1
        failed_flush_operations: 0
        failed_rd_operations: 0
        failed_unmap_operations: 0
        failed_wr_operations: 0
        flush_operations: 0
        flush_total_time_ns: 0
        invalid_flush_operations: 0
        invalid_rd_operations: 0
        invalid_unmap_operations: 0
        invalid_wr_operations: 0
        rd_bytes: 0
        rd_merged: 0
        rd_operations: 0
        rd_total_time_ns: 0
        timed_stats:
        unmap_bytes: 0
        unmap_merged: 0
        unmap_operations: 0
        unmap_total_time_ns: 0
        wr_bytes: 0
        wr_highest_offset: 0
        wr_merged: 0
        wr_operations: 0
        wr_total_time_ns: 0
    virtio0:
        account_failed: 1
        account_invalid: 1
        failed_flush_operations: 0
        failed_rd_operations: 0
        failed_unmap_operations: 0
        failed_wr_operations: 0
        flush_operations: 44949
        flush_total_time_ns: 96579814727
        idle_time_ns: 66344045848
        invalid_flush_operations: 0
        invalid_rd_operations: 0
        invalid_unmap_operations: 0
        invalid_wr_operations: 0
        rd_bytes: 10493559808
        rd_merged: 933
        rd_operations: 753049
        rd_total_time_ns: 163810431094
        timed_stats:
        unmap_bytes: 0
        unmap_merged: 0
        unmap_operations: 0
        unmap_total_time_ns: 0
        wr_bytes: 11541217792
        wr_highest_offset: 14948868096
        wr_merged: 16888
        wr_operations: 212049
        wr_total_time_ns: 99011488701
cpus: 12
disk: 0
diskread: 10493563900
diskwrite: 11541217792
freemem: 25126653952
maxdisk: 34359738368
maxmem: 34359738368
mem: 8540930048
name: server-omv
netin: 6558920433
netout: 14159546576
nics:
    tap111i0:
        netin: 6558920433
        netout: 14159546576
pid: 2075857
proxmox-support:
    pbs-dirty-bitmap: 1
    pbs-dirty-bitmap-migration: 1
    pbs-dirty-bitmap-savevm: 1
    pbs-library-version: 1.2.0 (6e555bc73a7dcfb4d0b47355b958afd101ad27b5)
    pbs-masterkey: 1
    query-bitmap-info: 1
qmpstatus: running
running-machine: pc-q35-6.0+pve0
running-qemu: 6.0.0
shares: 1000
status: running
uptime: 268362
vmid: 111

Stefan_R · Aug 4, 2021

The 'status' output and the fact there's nothing in the logs would indicate that QEMU is operating normally, i.e. the VM is running as expected, it just never received the command to shut down. Could you also post your VM config ('qm config <vmid>')?

To me this almost looks like ACPI is turned off, or maybe there's a problem in how OMV handles it. Does this happen on any other VMs or just this one? Is there a way to boot an older or alternative kernel inside OMV and check whether it happens there too?
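A quick way to check the PVE side of the ACPI question, as a rough sketch: the `acpi` VM option is on by default as far as I know, so only an explicit `acpi: 0` line in the `qm config` output would mean it has been turned off. The helper name `acpi_setting` is made up for illustration, and it says nothing about how the guest itself handles ACPI events.

```shell
# Sketch only: acpi_setting is a hypothetical helper.
# Reads `qm config <vmid>` output on stdin and prints "disabled" if an
# explicit "acpi: 0" line is present, otherwise "enabled" (assumed default).
acpi_setting() {
    awk -F': *' '
        $1 == "acpi" { v = $2 }
        END { if (v == "0") print "disabled"; else print "enabled" }
    '
}

# Usage on the PVE host:
#   qm config 111 | acpi_setting
```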

verulian · Aug 9, 2021

Code:

$ qm config 111
agent: 1
balloon: 1024
bios: ovmf
boot: order=virtio0;ide2;net0
cores: 6
efidisk0: local-zfs:vm-111-disk-1,size=1M
hostpci0: 0000:04:00.0
hostpci1: 0000:09:00.0
hostpci2: 0000:0c:00.0
ide2: none,media=cdrom
localtime: 0
machine: q35
memory: 32768
name: server-omv
net0: virtio=0A:F8:89:94:EE:33,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
protection: 1
scsihw: virtio-scsi-pci
smbios1: uuid=b7adfbd0-364f-4563-a4b9-7d78390614db
sockets: 2
startup: order=2,up=3,down=60
virtio0: local-zfs:vm-111-disk-0,size=32G
vmgenid: ae90620f-1c40-4a3e-b529-7e0f1fa9e769

And just for reference, on OMV doing dmesg | grep -i acpi shows:

Code:

[    0.000000] BIOS-e820: [mem 0x0000000000800000-0x0000000000807fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000000810000-0x00000000008fffff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000007fb6f000-0x000000007fb7efff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000007fb7f000-0x000000007fbfefff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000007ff20000-0x000000007fffffff] ACPI NVS
[    0.000000] efi: SMBIOS=0x7f9ac000 ACPI=0x7fb7e000 ACPI 2.0=0x7fb7e014 MEMATTR=0x7ec25018
[    0.017123] ACPI: Early table checksum verification disabled
[    0.017141] ACPI: RSDP 0x000000007FB7E014 000024 (v02 BOCHS )
[    0.017148] ACPI: XSDT 0x000000007FB7D0E8 00005C (v01 BOCHS  BXPC     00000001      01000013)
[    0.017164] ACPI: FACP 0x000000007FB79000 0000F4 (v03 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.017173] ACPI: DSDT 0x000000007FB7A000 002251 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.017180] ACPI: FACS 0x000000007FBDC000 000040
[    0.017187] ACPI: APIC 0x000000007FB78000 0000D0 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.017194] ACPI: SSDT 0x000000007FB77000 0000CA (v01 BOCHS  VMGENID  00000001 BXPC 00000001)
[    0.017201] ACPI: HPET 0x000000007FB76000 000038 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.017208] ACPI: MCFG 0x000000007FB75000 00003C (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.017214] ACPI: WAET 0x000000007FB74000 000028 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.017221] ACPI: BGRT 0x000000007FB73000 000038 (v01 INTEL  EDK2     00000002      01000013)
[    0.017225] ACPI: Reserving FACP table memory at [mem 0x7fb79000-0x7fb790f3]
[    0.017226] ACPI: Reserving DSDT table memory at [mem 0x7fb7a000-0x7fb7c250]
[    0.017227] ACPI: Reserving FACS table memory at [mem 0x7fbdc000-0x7fbdc03f]
[    0.017228] ACPI: Reserving APIC table memory at [mem 0x7fb78000-0x7fb780cf]
[    0.017229] ACPI: Reserving SSDT table memory at [mem 0x7fb77000-0x7fb770c9]
[    0.017229] ACPI: Reserving HPET table memory at [mem 0x7fb76000-0x7fb76037]
[    0.017230] ACPI: Reserving MCFG table memory at [mem 0x7fb75000-0x7fb7503b]
[    0.017231] ACPI: Reserving WAET table memory at [mem 0x7fb74000-0x7fb74027]
[    0.017231] ACPI: Reserving BGRT table memory at [mem 0x7fb73000-0x7fb73037]
[    0.017260] ACPI: Local APIC address 0xfee00000
[    0.027247] ACPI: PM-Timer IO Port: 0x608
[    0.027251] ACPI: Local APIC address 0xfee00000
[    0.027275] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[    0.027338] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.027340] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
[    0.027341] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.027345] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
[    0.027346] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
[    0.027348] ACPI: IRQ0 used by override.
[    0.027349] ACPI: IRQ5 used by override.
[    0.027350] ACPI: IRQ9 used by override.
[    0.027351] ACPI: IRQ10 used by override.
[    0.027351] ACPI: IRQ11 used by override.
[    0.027354] Using ACPI (MADT) for SMP configuration information
[    0.027356] ACPI: HPET id: 0x8086a201 base: 0xfed00000
[    0.084941] ACPI: Core revision 20200925
[    0.432362] PM: Registering ACPI NVS region [mem 0x00800000-0x00807fff] (32768 bytes)
[    0.432362] PM: Registering ACPI NVS region [mem 0x00810000-0x008fffff] (983040 bytes)
[    0.432362] PM: Registering ACPI NVS region [mem 0x7fb7f000-0x7fbfefff] (524288 bytes)
[    0.432362] PM: Registering ACPI NVS region [mem 0x7ff20000-0x7fffffff] (917504 bytes)
[    0.435947] ACPI: bus type PCI registered
[    0.435950] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[    0.775735] ACPI: Added _OSI(Module Device)
[    0.775738] ACPI: Added _OSI(Processor Device)
[    0.775739] ACPI: Added _OSI(3.0 _SCP Extensions)
[    0.775741] ACPI: Added _OSI(Processor Aggregator Device)
[    0.775743] ACPI: Added _OSI(Linux-Dell-Video)
[    0.775745] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[    0.775746] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[    0.777809] ACPI: 2 ACPI AML tables successfully acquired and loaded
[    0.779250] ACPI: Interpreter enabled
[    0.779274] ACPI: (supports S0 S3 S4 S5)
[    0.779276] ACPI: Using IOAPIC for interrupt routing
[    0.779317] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    0.779561] ACPI: Enabled 2 GPEs in block 00 to 3F
[    0.784616] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
[    0.784624] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
[    0.784757] acpi PNP0A08:00: _OSC: platform does not support [LTR]
[    0.784874] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME AER PCIeCapability]
[    0.948901] pci 0000:00:1f.0: quirk: [io  0x0600-0x067f] claimed by ICH6 ACPI/GPIO/TCO
[    1.156366] ACPI: PCI Interrupt Link [LNKA] (IRQs 5 *10 11)
[    1.156555] ACPI: PCI Interrupt Link [LNKB] (IRQs 5 *10 11)
[    1.156734] ACPI: PCI Interrupt Link [LNKC] (IRQs 5 10 *11)
[    1.156914] ACPI: PCI Interrupt Link [LNKD] (IRQs 5 10 *11)
[    1.157093] ACPI: PCI Interrupt Link [LNKE] (IRQs 5 *10 11)
[    1.175782] ACPI: PCI Interrupt Link [LNKF] (IRQs 5 *10 11)
[    1.191764] ACPI: PCI Interrupt Link [LNKG] (IRQs 5 10 *11)
[    1.208334] ACPI: PCI Interrupt Link [LNKH] (IRQs 5 10 *11)
[    1.208423] ACPI: PCI Interrupt Link [GSIA] (IRQs *16)
[    1.208444] ACPI: PCI Interrupt Link [GSIB] (IRQs *17)
[    1.208463] ACPI: PCI Interrupt Link [GSIC] (IRQs *18)
[    1.208482] ACPI: PCI Interrupt Link [GSID] (IRQs *19)
[    1.208500] ACPI: PCI Interrupt Link [GSIE] (IRQs *20)
[    1.208518] ACPI: PCI Interrupt Link [GSIF] (IRQs *21)
[    1.208537] ACPI: PCI Interrupt Link [GSIG] (IRQs *22)
[    1.208555] ACPI: PCI Interrupt Link [GSIH] (IRQs *23)
[    1.212192] PCI: Using ACPI for IRQ routing
[    1.372329] pnp: PnP ACPI init
[    1.372456] pnp 00:00: Plug and Play ACPI device, IDs PNP0303 (active)
[    1.372501] pnp 00:01: Plug and Play ACPI device, IDs PNP0f13 (active)
[    1.372537] pnp 00:02: Plug and Play ACPI device, IDs PNP0b00 (active)
[    1.372668] system 00:03: Plug and Play ACPI device, IDs PNP0c01 (active)
[    1.373194] pnp: PnP ACPI: found 4 devices
[    1.380661] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    3.066308] ACPI: bus type USB registered
[   20.445560] ACPI: Power Button [PWRF]

I don't have other VMs on this particular server to test at the moment and I can't do much with the kernel right now on OMV.

verulian · Aug 9, 2021

I ended up having 20 minutes free so I set up another guest as a test to see if I could install it alright. During the process (before install) I decided to stop the system and change it from UEFI to BIOS. Oddly I found I couldn't kill the OS or get it to halt from the UEFI loader even. It was very peculiar.

After many failed attempts to halt the OS (I could issue a restart from UEFI), I followed some of the steps here:
https://bobcares.com/blog/proxmox-cant-stop-vm/

qm stop 112 couldn't acquire a lock (got timeout).

qm unlock 112 couldn't unlock (got timeout).

Got the PID (ps aux | grep "/usr/bin/kvm -id 112") and kill -9'd it. It did appear to try to start back up again, but then I was able to unlock it and kill it with the above commands (qm unlock 112; qm stop 112).

Just thought I'd share this peculiar experience.
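As an aside, the PID does not have to come from grepping `ps`: `qm list` prints it in the PID column, and `qm status <vmid> --verbose` prints it as a `pid:` line. The sketch below pulls it from the verbose status output, assuming `qm status` still responds while the lock is held. `vm_pid_from_status` is a made-up helper name.

```shell
# Sketch only: vm_pid_from_status is a hypothetical helper.
# Reads `qm status <vmid> --verbose` output on stdin and prints the value of
# the top-level "pid:" line, i.e. the PID of the VM's QEMU process.
vm_pid_from_status() {
    awk '/^pid: / { print $2; exit }'
}

# Usage on the PVE host (try a plain SIGTERM before resorting to SIGKILL):
#   pid=$(qm status 112 --verbose | vm_pid_from_status)
#   kill "$pid"
#   sleep 10
#   kill -0 "$pid" 2>/dev/null && kill -9 "$pid"
```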

Stefan_R · Aug 9, 2021

verulian said:
During the process (before install) I decided to stop the system and change it from UEFI to BIOS.

How did you attempt this? Did you request a "stop" or a "shutdown" from the UI? In the latter case, it's clear that it can't work, as the "shutdown" command is equivalent to just pressing the power button, for which SeaBIOS/OVMF simply has no handler installed. As this shutdown task then proceeds to wait for the process to exit, it keeps holding the lock, which would explain the "got timeout" errors you experienced. The correct solution in that scenario would be to "Stop" the shutdown task from the task list at the bottom of the GUI, then issue a "stop" to the VM (e.g. 'qm stop').
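A possible way to avoid ending up with a shutdown task that waits forever, sketched under the assumption that `qm shutdown` on your version accepts `--timeout` and `--forceStop` (check `man qm`): ask for a clean shutdown with a bounded wait and let qm stop the VM itself if the guest never reacts. This is an alternative to the GUI steps described above, not something taken from them, and `shutdown_or_stop` is a made-up helper name.

```shell
# Sketch only: shutdown_or_stop is a hypothetical helper.
# Requests a clean shutdown and falls back to a hard stop if the VM has not
# exited after the given number of seconds (default 60).
shutdown_or_stop() {
    local vmid="$1" timeout="${2:-60}"
    [ -n "$vmid" ] || { echo "usage: shutdown_or_stop <vmid> [seconds]" >&2; return 2; }
    # --forceStop 1 tells qm to stop the VM once --timeout seconds have passed.
    qm shutdown "$vmid" --timeout "$timeout" --forceStop 1
}

# Usage on the PVE host:
#   shutdown_or_stop 112 30
```

A hard stop is the same as pulling the plug, so this only makes sense where losing unflushed guest state is acceptable, such as a VM sitting in the firmware setup screen.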

verulian · Aug 9, 2021

In this latter situation of just getting into the BIOS (UEFI specifically), which is presumably unrelated to the originally reported (and still unresolved) issue, "Stop" was used and failed. I looked for a direct command within the UEFI, but could only find a restart option, which didn't halt the guest.

Can you clarify what you mean by bottom of the GUI? I only see the top of the GUI having the option when I click the down arrow beside Shutdown:

Perhaps I should make a new forum post to address this or we should split this off so I stop conflating the original issue with this new one?

verulian · Aug 10, 2021

Could the OpenMediaVault halt-hang issue be related to the PCI Passthrough HBA PCI card?

Stefan_R · Aug 10, 2021

verulian said:
In this latter situation of just getting into the BIOS (UEFI specifically), which is presumably unrelated to the originally reported (and still unresolved) issue, "Stop" was used and failed.

Hm, a "Stop" should never fail this way; even if it did, it should end up sending a SIGKILL after a bit, which worked once you did it manually... There might be something wrong in general with your setup then. If "Stop" doesn't work, "Shutdown" is almost guaranteed to fail too.

verulian said:
Can you clarify what you mean by bottom of the GUI? I only see the top of the GUI having the option when I click the down arrow beside Shutdown:

I meant the task viewer at the bottom. You can double click the running "Stop" task and "Stop" that one in the popup window. That should release any locks automatically.

verulian said:
Could the OpenMediaVault halt-hang issue be related to the PCI Passthrough HBA PCI card?

Certainly, PCI passthrough is always a tricky beast. I don't suppose you could try the same VM without the passthrough just to see if it fixes it?
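To try the VM without passthrough as suggested above, one rough approach is to save a copy of the config, then temporarily delete the `hostpci*` entries. The sketch below only builds the list of keys; `hostpci_keys` is a made-up helper, the backup path is just an example, and `qm set --delete` is the qm option for removing settings as far as I know.

```shell
# Sketch only: hostpci_keys is a hypothetical helper.
# Reads `qm config <vmid>` output on stdin and prints the hostpci keys as a
# comma-separated list, e.g. "hostpci0,hostpci1,hostpci2".
hostpci_keys() {
    awk -F: '
        /^hostpci[0-9]+:/ { keys = keys (keys ? "," : "") $1 }
        END { print keys }
    '
}

# Usage on the PVE host, with the VM stopped:
#   qm config 111 > /root/vm-111-config.txt    # copy to restore the entries from
#   qm set 111 --delete "$(qm config 111 | hostpci_keys)"
```

The saved copy holds the original `hostpci0`, `hostpci1` and `hostpci2` values, so they can be put back with `qm set` once the test is done.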

verulian · Jun 9, 2022

Just an update on this older issue. On Proxmox 7.2 I'm still noticing that some guests, such as a very minimal Ubuntu 20.04, still hang on halt indefinitely, and I have to issue a Stop to speed up the reboot a little. I'm curious why a halt that seems to succeed should hang as it does for a minute or more.
