[SOLVED] Proxmox 7.0-14+1 crashes VM during migration to another host

hvillemoes

Member
Hi

I get this kernel error during migration to another host. It stops the VM and aborts the migration.
I migrated the very same VM to this host a few days ago.

Any ideas will be appreciated.

syslog:
Nov 14 10:30:21 proxmox3 kernel: [94205.987480] kvm[2288]: segfault at 68 ip 000055f4a3123991 sp 00007f5a200c0eb0 error 6 in qemu-system-x86_64[55f4a2cd8000+545000]
Nov 14 10:30:21 proxmox3 kernel: [94205.987500] Code: 49 89 c1 48 8b 47 38 4c 01 c0 48 01 f0 48 f7 f1 48 39 fd 74 d4 4d 39 e1 77 cf 48 83 e8 01 49 39 c6 77 c6 48 83 7f 68 00 75 bf <48> 89 7d 68 31 f6 48 83 c7 50 e8 a0 0d 0f 00 48 c7 45 68 00 00 00
Nov 14 10:30:21 proxmox3 kernel: [94206.226401] fwbr10013i0: port 2(tap10013i0) entered disabled state
Nov 14 10:30:21 proxmox3 kernel: [94206.226893] fwbr10013i0: port 2(tap10013i0) entered disabled state
Nov 14 10:30:21 proxmox3 systemd[1]: 10013.scope: Succeeded.
Nov 14 10:30:21 proxmox3 systemd[1]: 10013.scope: Consumed 13h 2min 53.425s CPU time.
Nov 14 10:30:21 proxmox3 pvedaemon[1037823]: VM 10013 qmp command failed - VM 10013 not running
Nov 14 10:30:21 proxmox3 pvedaemon[1037823]: VM 10013 qmp command failed - VM 10013 not running
Nov 14 10:30:21 proxmox3 pvedaemon[1037823]: VM 10013 qmp command failed - VM 10013 not running
Nov 14 10:30:21 proxmox3 pvedaemon[1037823]: VM 10013 qmp command failed - VM 10013 not running
Nov 14 10:30:21 proxmox3 pvedaemon[1037823]: VM 10013 qmp command failed - VM 10013 not running
Nov 14 10:30:21 proxmox3 pvedaemon[1037823]: VM 10013 qmp command failed - VM 10013 not running
Nov 14 10:30:21 proxmox3 pvedaemon[1037823]: VM 10013 qmp command failed - VM 10013 not running
Nov 14 10:30:21 proxmox3 pvedaemon[1037823]: VM 10013 qmp command failed - VM 10013 not running

/ Harald
 

Attachments

  • proxmox3_migrate.log (18.6 KB)
I tried the command-line migration with a reduced transfer rate, but with the same result:
qm migrate 10013 proxmox2 --bwlimit 50000 --migration_type insecure --targetstorage data2 --online yes --with-local-disks yes
...
drive-sata0: transferred 31.5 GiB of 100.0 GiB (31.52%) in 11m 4s
drive-sata3: Cancelling block job
drive-sata0: Cancelling block job
drive-sata1: Cancelling block job
drive-sata2: Cancelling block job
2021-11-14 16:24:12 ERROR: online migrate failure - block job (mirror) error: VM 10013 not running
2021-11-14 16:24:12 aborting phase 2 - cleanup resources
2021-11-14 16:24:12 migrate_cancel
2021-11-14 16:24:12 migrate_cancel error: VM 10013 not running
drive-sata3: Cancelling block job
drive-sata0: Cancelling block job
drive-sata1: Cancelling block job
drive-sata2: Cancelling block job
2021-11-14 16:24:12 ERROR: VM 10013 not running
2021-11-14 16:24:18 ERROR: migration finished with problems (duration 03:59:05)
migration problems
 
Could you please post the following:
  • pveversion -v
  • the config of the VM
  • storage.cfg
Thanks!
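
(a quick sketch of how to gather those on the source node - the VM ID is just the one from this thread:)

Code:
pveversion -v
qm config 10013
cat /etc/pve/storage.cfg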
 
root@proxmox3:~# pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-7-pve)
pve-manager: 7.0-14+1 (running version: 7.0-14+1/08975a4c)
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.11.22-4-pve: 5.11.22-9
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph-fuse: 15.2.13-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-12
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-13
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.13-1
proxmox-backup-file-restore: 2.0.13-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.1-1
pve-docs: 7.0-5
pve-edk2-firmware: 3.20210831-1
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.1.0-1
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-18
smartmontools: 7.2-1
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

root@proxmox3:/etc/pve# cat storage.cfg
dir: local
path /var/lib/vz
content iso,backup,vztmpl

lvmthin: local-lvm
thinpool data
vgname pve
content images,rootdir

nfs: QnapProxmox
export /proxmox
path /mnt/pve/QnapProxmox
server 193.162.159.88
content rootdir,images,iso
prune-backups keep-all=1

pbs: pbs
datastore pbs-qnap
server proxmox2.agurk.dk
content backup
fingerprint 14:04:e5:f3:b6:55:ca:06:5b:35:8b:e9:74:42:8f:a4:b8:ec:d4:ff:91:80:34:39:7d:bf:ea:49:54:3d:4c:07
prune-backups keep-all=1
username root@pam

nfs: hp1nas
export /mnt/data1
path /mnt/pve/hp1nas
server hp1nas.agurk.dk
content vztmpl,iso,snippets,rootdir,backup,images
prune-backups keep-all=1

dir: data2
path /mnt/pve/data2
content vztmpl,snippets,iso,rootdir,images,backup
is_mountpoint 1
nodes proxmox2

dir: data1
path /mnt/pve/data1
content iso,snippets,vztmpl,backup,images,rootdir
is_mountpoint 1
nodes proxmox1

dir: data3
path /mnt/pve/data3
content iso,snippets,vztmpl,backup,images,rootdir
is_mountpoint 1
nodes proxmox3

root@proxmox3:/etc/pve#

root@proxmox3:/etc/pve/nodes# cat proxmox3/qemu-server/10013.conf
agent: 1
boot: order=sata0
cores: 8
memory: 32924
name: agurk8-webhotel
net0: e1000=76:23:91:7F:A5:94,bridge=vmbr0,firewall=1
onboot: 1
ostype: l26
sata0: local-lvm:vm-10013-disk-0,format=raw,size=100G
sata1: local-lvm:vm-10013-disk-1,format=raw,size=100G
sata2: local-lvm:vm-10013-disk-2,format=raw,size=300G
sata3: local-lvm:vm-10013-disk-3,format=raw,size=250G
sata4: local-lvm:vm-10013-disk-4,format=raw,size=500G
smbios1: uuid=8a1498e8-bd63-4cc3-a80b-d06d80577bcf
startup: up=5,down=30
vmgenid: ed69a0a9-51c1-413b-afb2-e49484cacd90
root@proxmox3:/etc/pve/nodes#
 
Could you try adding ',aio=native' to the SATA disks in the VM config, then power-cycle the VM (full shutdown, then cold start) and report back whether the issue persists?
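
e.g., one way to apply that (untested sketch, shown for sata0 only - repeat for the other disks, keeping your existing drive options):

Code:
qm set 10013 --sata0 local-lvm:vm-10013-disk-0,format=raw,size=100G,aio=native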
 
Same result, the problem persists.
VM config:
root@proxmox3:/etc/pve/nodes/proxmox3/qemu-server# cat 10013.conf
agent: 1
boot: order=sata0
cores: 8
memory: 32924
name: agurk8-webhotel
net0: e1000=76:23:91:7F:A5:94,bridge=vmbr0,firewall=1
onboot: 1
ostype: l26
sata0: local-lvm:vm-10013-disk-0,format=raw,size=100G,aio=native
sata1: local-lvm:vm-10013-disk-1,format=raw,size=100G,aio=native
sata2: local-lvm:vm-10013-disk-2,format=raw,size=300G,aio=native
sata3: local-lvm:vm-10013-disk-3,format=raw,size=250G,aio=native
sata4: local-lvm:vm-10013-disk-4,format=raw,size=500G,aio=native
smbios1: uuid=8a1498e8-bd63-4cc3-a80b-d06d80577bcf
startup: up=5,down=30
vmgenid: ed69a0a9-51c1-413b-afb2-e49484cacd90
root@proxmox3:/etc/pve/nodes/proxmox3/qemu-server#
 
Okay, then the next step would be to install the QEMU debug symbols, attach a gdb instance to the running source VM, and collect some backtraces to see where the segfault is happening.

e.g., the following should work:

Code:
apt install gdb pve-qemu-kvm-dbg   # do this beforehand

# start your VM if not already running

VMID=100 # change this!
VM_PID=$(cat /var/run/qemu-server/${VMID}.pid)

gdb -p $VM_PID -ex='handle SIGUSR1 nostop noprint pass' -ex='handle SIGPIPE nostop print pass' -ex='set logging on' -ex='set pagination off' -ex='cont'

Leave that running (for example, inside a tmux or screen session) and attempt a migration. Once the migration task output indicates you have triggered the crash, enter 'thread apply all bt' in the gdb terminal (this should print lots of info), followed by 'quit' to exit gdb (allowing the kvm process to finish crashing, so you can restart the VM and resume operations). A file called gdb.txt should be generated in the directory where you started gdb; it should contain more pointers w.r.t. what's going on - please post it here as an attachment!
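
i.e., at the gdb prompt once the crash has been triggered:

Code:
(gdb) thread apply all bt
(gdb) quit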

Edit: removed the erroneous 'attach' and replaced it with '-p'; the results should be the same, but without a warning about the nonexistence of a file called 'attach' ;)
 
Found the issue - it's already fixed upstream, and I just sent a patch to include the fix in our next QEMU package release (>= 6.1.0-2). Once that hits the repos and you have upgraded, you'll need to cold-restart (power off, then start again) your VMs so that they run the updated, fixed binary.

Migration should work again then. Note that this only affects live migration with local disks, so any VMs that don't use local disks should be unaffected (and consequently don't need to be restarted either).
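
(rough sketch of the upgrade + cold-restart sequence once the package is available - adjust the VM ID:)

Code:
apt update
apt full-upgrade                      # should pull in pve-qemu-kvm >= 6.1.0-2 once released
pveversion -v | grep pve-qemu-kvm     # verify the installed version
qm shutdown 10013 && qm start 10013   # cold restart so the VM picks up the new binary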
 
Thank you for your advice.
I tried to "migrate" by restoring from my Proxmox Backup Server. When I powered up the restored server, I got error messages from the /home XFS file system. I had similar problems when I "migrated" the server from ESXi to Proxmox.
My conclusion is that I must either somehow repair these errors or move the apps (VirtualMin) to a new server. I could switch from COS6 to Debian 11 at the same time.
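
(In case it helps anyone else: a rough sketch of what an offline XFS repair attempt could look like - the device path is only a placeholder for the actual /home block device:)

Code:
umount /home
xfs_repair /dev/mapper/vg-home   # placeholder device name - substitute the real one
mount /home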
Great - thank you

/ Harald
 
I can add that pve-qemu-kvm 6.1.0-2 worked flawlessly.

Again, thank you for the fast response and fix.

/ Harald
 
