VM migration caused other VM to crash on the destination node

hradaideh · Aug 10, 2023

A couple of days ago I migrated a VM from one node to another, and the target node went crazy.

The migration was "successful" and the migrated VM kept running for a few minutes,
BUT
all other VMs crashed, including the migrated one. They still appeared as running on the hypervisor, but the QEMU information was missing.
They all had "CPU stuck" messages in their logs.

And the hypervisor was throwing this error

Code:

[ 458.138690] e1000 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xfecce000 flags=0x0000]

The address in the log is for the SATA controller

Code:

# lspci | grep 41
41:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9230 PCIe 2.0 x2 4-port SATA 6 Gb/s RAID Controller (rev 11)
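As a cross-check, the device an event is attributed to can be looked up directly by the PCI address printed in the log line. This is a minimal sketch, assuming the address that appears before `AMD-Vi:` (0000:06:00.0 in the line above) is the device the kernel associates with the event; the helper name `pci_addr_from_log` is made up for this example.

```shell
# Extract the first PCI address (domain:bus:device.function) from kernel log text on stdin.
pci_addr_from_log() {
    grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]' | head -n 1
}

# Log line copied from the post above.
line='[ 458.138690] e1000 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xfecce000 flags=0x0000]'
addr=$(printf '%s\n' "$line" | pci_addr_from_log)
echo "$addr"

# On the hypervisor, the device at that address could then be shown with:
#   lspci -s "$addr"
```

`lspci -s` selects a device by its bus address, which avoids matching unrelated lines the way a plain `grep` on a number can.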

So:

The VMs are still unresponsive; I kept one in the stuck state for investigation
I can't reach it over SSH
The console is stuck, so I can't enter a username/password
On the Zabbix "CPU jumps" chart I noticed a spike at the time of the incident
Some applications hit time-travel problems, where files were written with timestamps in the future (+18 days) on the shared storage
I did not find anything useful in NewRelic other than a network gap (maybe the NewRelic agent crashed with the node?)

Code:

CPU(s)
    256 x AMD EPYC 7763 64-Core Processor (2 Sockets)
Kernel Version
    Linux 5.15.64-1-pve #1 SMP PVE 5.15.64-1 (Thu, 13 Oct 2022 10:30:34 +0200)
PVE Manager Version
    pve-manager/7.2-11/b76d3178

with 2 TB of RAM

These are some interesting metrics from NewRelic

Philipp Hufnagl · Aug 10, 2023

Can I see the output of journalctl -b > journal.txt?

hradaideh · Aug 11, 2023

Philipp Hufnagl said:
Can I see the output of journalctl -b > journal.txt?

Well, it is huge.

Anything in particular you want to see? Maybe a list of services, so I can filter on them for that time period?

Philipp Hufnagl · Aug 11, 2023

I would prefer unfiltered results, since it is easy to overlook something and remove important information. However, if you want to redact something, feel free to do so.
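One possible way to redact while keeping every line of the journal is to mask only the sensitive tokens. The sketch below masks IPv4 addresses; the function name `redact_ips` and the placeholder text are made up for this example, and the pattern is deliberately simple, so it would also mask any other string made of four dot-separated numbers.

```shell
# Replace anything that looks like an IPv4 address with a placeholder.
# Reads log text on stdin and writes the redacted text to stdout.
redact_ips() {
    sed -E 's/([0-9]{1,3}\.){3}[0-9]{1,3}/[REDACTED-IP]/g'
}

# Example on a made-up log line:
printf 'Accepted publickey for root from 10.0.0.5 port 22\n' | redact_ips

# Possible use on the hypervisor (not run here):
#   journalctl -b --no-pager | redact_ips > journal.txt
```

Because nothing is filtered out, timestamps and service names stay intact for whoever reads the log.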

hradaideh · Aug 14, 2023

Philipp Hufnagl said:
I would prefer unfiltered results

here you go

fiona · Aug 14, 2023

Hi,
this issue was fixed a long time ago, in kernels 5.15.74-1 and newer: https://git.proxmox.com/?p=pve-kernel.git;a=commit;h=561d6d84996f903ca384d3541508343ab1462d1d
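For reference, the host in the first post runs 5.15.64-1-pve, which is older than 5.15.74-1. A rough way to check a node is to compare the running kernel release against that version. This is a minimal sketch, assuming GNU `sort -V` is available and that plain version ordering is good enough; the function name `kernel_has_fix` is made up here, and the comparison only means something for kernels of the 5.15 pve series or later.

```shell
# First fixed kernel version, taken from the reply above.
FIXED_VERSION="5.15.74-1"

# Succeeds (exit status 0) if the given kernel release is at or above FIXED_VERSION.
# A trailing "-pve" is stripped before the version-aware comparison.
kernel_has_fix() {
    release="${1%-pve}"
    lowest=$(printf '%s\n%s\n' "$FIXED_VERSION" "$release" | sort -V | head -n 1)
    [ "$lowest" = "$FIXED_VERSION" ]
}

current=$(uname -r)
if kernel_has_fix "$current"; then
    echo "$current is at or above $FIXED_VERSION"
else
    echo "$current is older than $FIXED_VERSION"
fi
```

Note that `uname -r` reports the kernel that is currently booted, so a node that has installed a newer kernel but not rebooted yet would still show the old release.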

hradaideh · Aug 14, 2023

cool

but I see that it was affecting

originally was taken on an Intel CPU on my AMD-based host

and my images are generated on the same machine type, so would this be a different issue?

fiona · Aug 16, 2023

hradaideh said:
cool

but I see that it was affecting
originally was taken on an Intel CPU on my AMD-based host

and my images are generated on the same machine type, would it be another issue?

That was just my reproducer for the issue. The issue itself happened between AMD and AMD as well.

hradaideh · Aug 16, 2023

interesting, thank you
