Some VM's kvm error after minor upgrade on 7.0

auduny

New Member
Sep 15, 2021
5
0
1
48
Oslo, Norway
audun.ytterdal.net
Hi. Last night I did a dist-upgrade of the my whole home 3 node proxmox cluster. I was allready 7.0 and upgraded a few packages to their latest version
Code:
root@foxtrot:/var/log# egrep " (upgrade) " dpkg.log
2021-09-14 13:35:30 upgrade libpve-storage-perl:all 7.0-10 7.0-11
2021-09-14 13:35:31 upgrade pve-i18n:all 2.4-1 2.5-1
2021-09-14 13:35:31 upgrade pve-qemu-kvm:amd64 6.0.0-3 6.0.0-4

After this an a reboot quite a few (but not all) of VM won't boot properly.

Code:
Sep 15 12:03:13 delta pvedaemon[117654]: <root@pam> starting task UPID:delta:0004C19E:005B283B:6141C4E1:qmstart:106:root@pam:
Sep 15 12:03:13 delta pvedaemon[311710]: start VM 106: UPID:delta:0004C19E:005B283B:6141C4E1:qmstart:106:root@pam:
Sep 15 12:03:13 delta systemd[1]: Started 106.scope.
Sep 15 12:03:14 delta systemd-udevd[311816]: Using default interface naming scheme 'v247'.
Sep 15 12:03:14 delta systemd-udevd[311816]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 15 12:03:14 delta kernel: device tap106i0 entered promiscuous mode
Sep 15 12:03:14 delta kernel: vmbr0: port 7(tap106i0) entered blocking state
Sep 15 12:03:14 delta kernel: vmbr0: port 7(tap106i0) entered disabled state
Sep 15 12:03:14 delta kernel: vmbr0: port 7(tap106i0) entered blocking state
Sep 15 12:03:14 delta kernel: vmbr0: port 7(tap106i0) entered forwarding state
Sep 15 12:03:14 delta pvedaemon[117654]: <root@pam> end task UPID:delta:0004C19E:005B283B:6141C4E1:qmstart:106:root@pam: OK
Sep 15 12:03:17 delta pvedaemon[97590]: VM 106 qmp command failed - VM 106 qmp command 'guest-ping' failed - got timeout
Sep 15 12:03:23 delta QEMU[311772]: KVM internal error. Suberror: 1
Sep 15 12:03:23 delta QEMU[311772]: emulation failure
Sep 15 12:03:23 delta QEMU[311772]: EAX=0000c1c9 EBX=000f000d ECX=00001234 EDX=0000db80
Sep 15 12:03:23 delta QEMU[311772]: ESI=000efd40 EDI=000eb684 EBP=0000fe58 ESP=000eb658
Sep 15 12:03:23 delta QEMU[311772]: EIP=0000ff68 EFL=00000013 [----A-C] CPL=0 II=0 A20=1 SMM=1 HLT=0
Sep 15 12:03:23 delta QEMU[311772]: ES =0000 00000000 ffffffff 00809300
Sep 15 12:03:23 delta QEMU[311772]: CS =a000 000a0000 ffffffff 00809300
Sep 15 12:03:23 delta QEMU[311772]: SS =0000 00000000 ffffffff 00809300
Sep 15 12:03:23 delta QEMU[311772]: DS =0000 00000000 ffffffff 00809300
Sep 15 12:03:23 delta QEMU[311772]: FS =0000 00000000 ffffffff 00809300
Sep 15 12:03:23 delta QEMU[311772]: GS =0000 00000000 ffffffff 00809300
Sep 15 12:03:23 delta QEMU[311772]: LDT=0000 00000000 0000ffff 00008200
Sep 15 12:03:23 delta QEMU[311772]: TR =0000 00000000 0000ffff 00008b00
Sep 15 12:03:23 delta QEMU[311772]: GDT=     00008280 00000027
Sep 15 12:03:23 delta QEMU[311772]: IDT=     00000000 00000000
Sep 15 12:03:23 delta QEMU[311772]: CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
Sep 15 12:03:23 delta QEMU[311772]: DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
Sep 15 12:03:23 delta QEMU[311772]: DR6=00000000ffff0ff0 DR7=0000000000000400
Sep 15 12:03:23 delta QEMU[311772]: EFER=0000000000000000
Sep 15 12:03:23 delta QEMU[311772]: Code=kvm: /build/pve-qemu/pve-qemu-kvm-6.0.0/include/hw/core/cpu.h:627: cpu_asidx_from_attrs: Assertion `ret < cpu->num_ases && ret >= 0' failed.
Sep 15 12:03:23 delta kernel: vmbr0: port 7(tap106i0) entered disabled state
Sep 15 12:03:23 delta kernel: vmbr0: port 7(tap106i0) entered disabled state
Sep 15 12:03:23 delta systemd[1]: 106.scope: Succeeded.
Sep 15 12:03:23 delta systemd[1]: 106.scope: Consumed 6.879s CPU time.
Sep 15 12:03:24 delta qmeventd[311892]: Starting cleanup for 106
Sep 15 12:03:24 delta qmeventd[311892]: Finished cleanup for 106

This goes for quite a few VM's, but not all. I have not managed to understand what the similarity might be between them, but it might have to be something about some of them are self-installed, while others are downloaded and converted typical cloudimages.

The only reference I've seen of something similar is this: https://forum.proxmox.com/threads/qemu-kvm-internal-error-smm-removal-bit-rot.94211

Any ideas?
 
So from the changelog
Code:
pve-qemu-kvm (6.0.0-4) bullseye; urgency=medium

  * drop the ancient workaround that force disabled SMM due to observing VM
    hangs on old kernel versions.

  * monitor/qmp: fix race with clients disconnecting early resulting in other
    clients receiving a message with the (now wrong) ID of the former

 -- Proxmox Support Team <support@proxmox.com>  Mon, 06 Sep 2021 07:30:00 +0200
So i guess the smm thing is still a thing. When they say old kernel version. Is that host or vm i wonder?
 
Code:
root@delta:/var/log# qm config 108
agent: 1
boot: c
bootdisk: scsi0
cipassword: **********
ciuser: root
ide2: ceph:vm-108-cloudinit,media=cdrom,size=4M
ipconfig0: ip=dhcp
memory: 2048
name: web
net0: virtio=36:33:E5:EC:7A:6F,bridge=vmbr0
onboot: 1
scsi0: ceph:base-107-disk-0/vm-108-disk-0,size=12G
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=81651d10-908f-4b09-909f-4aab553293ff
sshkeys: ssh-rsa%20LONGSSHPUBLIC_KEYD%3D%20audun%40ytterdal.net
vga: serial0
vmgenid: 7daa2d58-8352-4f04-90e5-8b574cae02a7

Code:
root@delta:/var/log# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   40 bits physical, 48 bits virtual
CPU(s):                          16
On-line CPU(s) list:             0-15
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           26
Model name:                      Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
Stepping:                        5
CPU MHz:                         2336.441
CPU max MHz:                     2933.0000
CPU min MHz:                     1600.0000
BogoMIPS:                        5867.29
Virtualization:                  VT-x
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        2 MiB
L3 cache:                        16 MiB
NUMA node0 CPU(s):               0,2,4,6,8,10,12,14
NUMA node1 CPU(s):               1,3,5,7,9,11,13,15
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulne
                                 rable
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerab
                                 le
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitizatio
                                 n
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat p
                                 se36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp
                                  lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nons
                                 top_tsc cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse
                                 3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm pti tpr_shadow vn
                                 mi flexpriority ept vpid dtherm ida

And the kernel on the image

Code:
root@web:~# uname -a
Linux web 4.19.0-17-cloud-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18) x86_64 GNU/Linux
root@web:~# cat /etc/issue
Debian GNU/Linux 10 \n \l

And to make things a bit iffi. Some other VM's did not have issues booting...
 
seem to be related (without solution)

https://access.redhat.com/solutions/5749121

"
If you're running a guest on an Intel machine without unrestricted mode
support, the failure can be most likely due to the guest entering an invalid
state for Intel VT. For example, the guest maybe running in big real mode
which is not supported on less recent Intel processors."
 
Cool. Thanks for digging, I found some of the same links while looking into this.

While I wait for the patch going into KVM, what is the different ways to prevent this? Is there anything I can do to disable SSM and/or something on host or VM? (bios/kernel/proxmox/vm-config) - except to stay on 6.0.0-3 :)
 
Last edited:
I'm not sure if it's related, but in the last few days, I"m seeing this during a backup (which had been backing up without issue previously). The VM continues to be usable, just the backup fails.


Code:
INFO: starting new backup job: vzdump 101 --mode snapshot --remove 0 --storage backups --node nuc
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2021-09-17 15:03:43
INFO: status = running
INFO: VM Name: dev
INFO: include disk 'scsi0' 'local-btrfs:101/vm-101-disk-0.raw' 64G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: snapshots found (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/101/2021-09-17T14:03:43Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'a3a3e976-4111-4045-99a5-3e986f82485a'
INFO: resuming VM again
ERROR: VM 101 qmp command 'cont' failed - got timeout
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 101 failed - VM 101 qmp command 'cont' failed - got timeout
INFO: Failed at 2021-09-17 15:03:52
INFO: Backup job finished with errors
TASK ERROR: job errors

VM details as host cpu info:

Code:
$ sudo qm config 101
agent: 1
boot: order=scsi0;ide2;net0
cores: 8
cpu: host
ide2: local-btrfs:iso/debian-11.0.0-amd64-netinst.iso,media=cdrom
memory: 32768
name: dev
net0: virtio=56:D2:86:D7:09:34,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
parent: most_things_setup
scsi0: local-btrfs:101/vm-101-disk-0.raw,size=64G
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=713eb89e-56c8-4649-b2a2-fffbfba87629
sockets: 1
vmgenid: 78827e43-f010-4569-a2a2-2606d2189135


$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           140
Model name:                      11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
Stepping:                        1
CPU MHz:                         2800.000
CPU max MHz:                     4700.0000
CPU min MHz:                     400.0000
BogoMIPS:                        5606.40
Virtualization:                  VT-x
L1d cache:                       192 KiB
L1i cache:                       128 KiB
L2 cache:                        5 MiB
L3 cache:                        12 MiB
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss
                                  ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonsto
                                 p_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtp
                                 r pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowp
                                 refetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexp
                                 riority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed ad
                                 x smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves s
                                 plit_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pk
                                 u ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b
                                 fsrm avx512_vp2intersect md_clear flush_l1d arch_capabilities

and the pve journal contains:


Code:
Sep 17 15:03:43 nuc pvedaemon[2704538]: <nick@pam> starting task UPID:nuc:002945C3:061F5B74:6144A03F:vzdump:101:nick@pam:
Sep 17 15:03:43 nuc pvedaemon[2704835]: INFO: starting new backup job: vzdump 101 --mode snapshot --remove 0 --storage backups --node nuc
Sep 17 15:03:43 nuc pvedaemon[2704835]: INFO: Starting Backup of VM 101 (qemu)
Sep 17 15:03:47 nuc pvedaemon[2704835]: VM 101 qmp command failed - VM 101 qmp command 'cont' failed - got timeout
Sep 17 15:03:49 nuc pvedaemon[2704537]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - got timeout
Sep 17 15:03:52 nuc pvedaemon[2704835]: ERROR: Backup of VM 101 failed - VM 101 qmp command 'cont' failed - got timeout
Sep 17 15:03:52 nuc pvedaemon[2704835]: INFO: Backup job finished with errors
Sep 17 15:03:52 nuc pvedaemon[2704835]: job errors
Sep 17 15:03:52 nuc pvedaemon[2704538]: <nick@pam> end task UPID:nuc:002945C3:061F5B74:6144A03F:vzdump:101:nick@pam: job errors
 
Last edited:
I'm not sure if it's related, but in the last few days, I"m seeing this during a backup (which had been backing up without issue previously). The VM continues to be usable, just the backup fails.
That does not seem related to OPs issue which affects guest boot on an Intel CPU released ~12 years ago and an accompanying error like
Sep 15 12:03:23 delta QEMU[311772]: KVM internal error. Suberror: 1
Sep 15 12:03:23 delta QEMU[311772]: emulation failure
in the syslog, but not some QMP timeout during a backup with a modern CPU released barely a year ago.
That could well just be an overload of some host/storage resources, but in any case better discussed in its own thread.
 
  • Like
Reactions: NickB

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!