A couple of days ago I migrated a VM from one node to another, and the target node went crazy.
The hypervisor has 2 TB of RAM.
The migration was "successful" and the migrated VM kept running for a few minutes
BUT
All the other VMs crashed, and so did the migrated one. They still appeared as running on the hypervisor, but the QEMU information was missing
They all had CPU stuck messages in their logs
And the hypervisor was throwing this error:
Code:
[ 458.138690] e1000 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xfecce000 flags=0x0000]
The address in the log belongs to the SATA controller:
Code:
# lspci | grep 41
41:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9230 PCIe 2.0 x2 4-port SATA 6 Gb/s RAID Controller (rev 11)
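To cross-check which device a fault line names, a minimal sketch like the one below may help. The function name `pci_addr_from_fault` is made up for illustration, and it assumes the log lines look like the one shown above. It pulls the PCI address field (for example `0000:06:00.0`) out of each AMD-Vi event line so it can be looked up with `lspci -s`.

```shell
# Hypothetical helper: read kernel log lines on stdin and print the unique
# PCI addresses (domain:bus:device.function) named in AMD-Vi event lines.
pci_addr_from_fault() {
    grep 'AMD-Vi: Event logged' |
        grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]' |
        sort -u
}

# Possible usage on the hypervisor (not run here):
#   dmesg | pci_addr_from_fault | while read -r addr; do lspci -s "$addr"; done
```

This only looks at the device field at the start of the event. The `address=` value in the brackets is a separate field and is not interpreted by the sketch.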
So
- The VMs are still unresponsive; I kept one in the stuck state for investigation
- I can't reach it over SSH
- The console is stuck, so I can't enter a username/password
- On Zabbix, the CPU jumps chart shows a spike at the time of the incident
- Some applications hit time travel problems: files were written to the shared storage with timestamps in the future (+18 days)
- I did not find anything useful on NewRelic other than a network gap (maybe the NewRelic agent crashed with the node?)
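To see which files on the shared storage ended up with future timestamps, something like this minimal sketch could be used. The function name `find_future_files` and the `/mnt/shared` path in the usage comment are placeholders, not taken from the setup above. It stamps a temporary reference file with the current time and lists every regular file whose modification time is newer than that.

```shell
# Hypothetical helper: list regular files under a directory whose
# modification time is later than the moment the scan started.
find_future_files() {
    dir=$1
    ref=$(mktemp) || return 1        # reference file, mtime = now
    find "$dir" -type f -newer "$ref"
    rm -f "$ref"
}

# Possible usage (path is an assumption, not run here):
#   find_future_files /mnt/shared
```

Files that are written while the scan is running can also show up, so anything only a few seconds "in the future" is probably a false positive rather than a victim of the clock jump.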
Code:
CPU(s)
256 x AMD EPYC 7763 64-Core Processor (2 Sockets)
Kernel Version
Linux 5.15.64-1-pve #1 SMP PVE 5.15.64-1 (Thu, 13 Oct 2022 10:30:34 +0200)
PVE Manager Version
pve-manager/7.2-11/b76d3178
These are some interesting metrics from NewRelic