VM migration caused other VMs to crash on the destination node

hradaideh

New Member
Jun 7, 2022
A couple of days ago I migrated a VM from one node to another, and the target node went haywire.

The migration was "successful" and the migrated VM kept running for a few minutes,
BUT
then all the other VMs crashed, including the migrated one. They still appeared in the running state on the hypervisor, but their QEMU information was missing.
They also all had "CPU stuck" messages in their logs.

1691660822201.png

The hypervisor was also throwing this error:
Code:
[ 458.138690] e1000 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xfecce000 flags=0x0000]

The address in the log is for the SATA controller
Code:
# lspci | grep 41
41:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9230 PCIe 2.0 x2 4-port SATA 6 Gb/s RAID Controller (rev 11)
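
To double-check which device a fault line refers to, the PCI address can be pulled out of the message and handed to lspci. This is only a minimal sketch: it assumes the kernel prints the address in the domain:bus:device.function form seen in the log above, and the function name fault_bdf is made up for illustration.

```shell
# Hypothetical helper: read kernel log lines on stdin and print the PCI
# address (domain:bus:device.function) of every AMD-Vi IO_PAGE_FAULT event.
# Lines that do not contain such an event produce no output.
fault_bdf() {
    sed -n 's/.*\([0-9a-fA-F]\{4\}:[0-9a-fA-F]\{2\}:[0-9a-fA-F]\{2\}\.[0-7]\): AMD-Vi: Event logged \[IO_PAGE_FAULT.*/\1/p'
}

# Possible usage on the affected node (not run automatically here):
#   dmesg | fault_bdf | sort -u | while read -r bdf; do lspci -s "$bdf"; done
```

For the log line quoted above this prints 0000:06:00.0, which can then be passed to lspci -s to see the matching device.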

So:
  1. The VMs are still unresponsive; I kept one in its stuck state for investigation
  2. I can't reach it over SSH
  3. The console is stuck, so I can't enter a username/password
  4. On Zabbix, the CPU jumps chart shows a spike at the time of the incident
  5. Some applications hit time-travel problems: files on the shared storage were written with timestamps in the future (+18 days)
  6. I did not find anything useful in NewRelic other than a network gap (maybe the NewRelic agent crashed with the node?)
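
To get a feel for how many files were affected by the clock jump in point 5, the shared storage can be scanned for files whose modification time is later than "now". A rough sketch, assuming the machine running the check has a correct clock; future_files is a hypothetical name and the path in the usage comment is a placeholder.

```shell
# Hypothetical helper: list regular files under a directory whose modification
# time is later than the moment the check started.
future_files() {
    dir=$1
    ref=$(mktemp) || return 1      # fresh temp file, so its mtime is "now"
    find "$dir" -type f -newer "$ref"
    rm -f "$ref"
}

# Possible usage (placeholder path):
#   future_files /mnt/shared-storage
```

Files that are being written while the scan runs can show up as well, so the output is a starting point for a closer look, not a definitive list.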
Code:
CPU(s)
    256 x AMD EPYC 7763 64-Core Processor (2 Sockets)
Kernel Version
    Linux 5.15.64-1-pve #1 SMP PVE 5.15.64-1 (Thu, 13 Oct 2022 10:30:34 +0200)
PVE Manager Version
    pve-manager/7.2-11/b76d3178
with 2 TB of RAM

These are some interesting metrics from NewRelic
1691661445498.png

1691661525261.png

1691661540981.png
1691661913336.png
 
Can I see the output of journalctl -b > journal.txt?
 
I would prefer unfiltered results, since it is easy to overlook something and remove important information. However, if you want to redact something, feel free to do so.
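
While the full file is the thing to attach, a quick first look for the symptoms mentioned in this thread can be done on a copy without modifying it. A minimal sketch: journal_hits is a hypothetical name, and the "soft lockup" and "stuck for" patterns are assumptions about how the kernel words its CPU-stuck messages.

```shell
# Hypothetical helper: print the lines (with line numbers) of a saved journal
# that mention an IOMMU page fault or a stuck CPU. The file is only read.
journal_hits() {
    grep -n -E 'IO_PAGE_FAULT|soft lockup|stuck for' "$1"
}

# Possible usage:
#   journalctl -b > journal.txt
#   journal_hits journal.txt
```

Note that journalctl -b only covers the current boot; if the node has been rebooted since the incident, journalctl --list-boots shows which earlier boot to export with -b and its offset.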
 
cool

but I see that it was affecting
originally was taken on an Intel CPU on my AMD-based host
and my images are generated on the same machine type, so could this be a different issue?
 
That was just my reproducer for the issue. The issue itself happened between AMD and AMD as well.
 
