Hi guys,
I recently added a second host to my environment and have been testing live migration. It doesn't work as expected: about 75% of the time the VM crashes after migration and I have to force a reboot.
I found a few posts about this issue here, but they seem outdated and refer to a previous major version of the Proxmox kernel.
How are things holding up in 2024?
I'm running "Linux pve2 6.5.13-5-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.13-5 (2024-04-05T11:03Z) x86_64" on both nodes.
### 1st node
| H/W path | Device | Class | Description |
|---|---|---|---|
| | | system | System Product Name (SKU) |
| /0 | | bus | Pro WS WRX90E-SAGE SE |
| /0/0 | | memory | 64KiB BIOS |
| /0/13 | | memory | 2MiB L1 cache |
| /0/14 | | memory | 32MiB L2 cache |
| /0/15 | | memory | 128MiB L3 cache |
| /0/16 | | processor | AMD Ryzen Threadripper PRO 7975WX 32-Cores |
| /0/19 | | memory | 192GiB System Memory |
| /0/19/0 | | memory | 96GiB DIMM Synchronous Registered (Buffered) 5600 MHz (0.2 ns) |
| /0/19/1 | | memory | [empty] |
| /0/19/2 | | memory | [empty] |
| /0/19/3 | | memory | [empty] |
| /0/19/4 | | memory | 96GiB DIMM Synchronous Registered (Buffered) 5600 MHz (0.2 ns) |
| /0/19/5 | | memory | [empty] |
| /0/19/6 | | memory | [empty] |
| /0/19/7 | | memory | [empty] |
| /0/100/5.1/0 | eno1 | network | Ethernet Controller X710 for 10GBASE-T |
| /0/100/5.1/0.1 | eno2 | network | Ethernet Controller X710 for 10GBASE-T |
| /0/100/7.1/0.4/0 | usb3 | bus | xHCI Host Controller |
| /0/100/7.1/0.4/1 | usb4 | bus | xHCI Host Controller |
| /0/100/7.1/0.7/0 | input10 | input | HD-Audio Generic Rear Mic |
| /0/100/7.1/0.7/1 | input11 | input | HD-Audio Generic Front Mic |
| /0/100/7.1/0.7/2 | input12 | input | HD-Audio Generic Line Out |
| /0/100/7.1/0.7/3 | input13 | input | HD-Audio Generic Front Headphone |
| /0/100/14 | | bus | FCH SMBus Controller |
| /0/100/14.3 | | bridge | FCH LPC Bridge |
### 2nd node
| H/W path | Device | Class | Description |
|---|---|---|---|
| | | system | MS-7C77 (Default string) |
| /0 | | bus | MEG Z490I UNIFY (MS-7C77) |
| /0/0 | | memory | 64KiB BIOS |
| /0/3a | | memory | 64GiB System Memory |
| /0/3a/0 | | memory | 32GiB DIMM DDR4 Synchronous 3600 MHz (0.3 ns) |
| /0/3a/1 | | memory | [empty] |
| /0/3a/2 | | memory | 32GiB DIMM DDR4 Synchronous 3600 MHz (0.3 ns) |
| /0/3a/3 | | memory | [empty] |
| /0/47 | | memory | 640KiB L1 cache |
| /0/48 | | memory | 2560KiB L2 cache |
| /0/49 | | memory | 20MiB L3 cache |
| /0/4a | | processor | Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz |
| /0/100 | | bridge | Comet Lake-S 6c Host Bridge/DRAM Controller |
| /0/100/1 | | bridge | 6th-10th Gen Core Processor PCIe Controller (x16) |
Could the crashes be caused by the hardware in the two hosts being too different? I've also noticed that a crash seems more likely when I move VMs with large disks over the 2.5G network, which takes around 20 minutes.
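
One way to test the "too different" theory would be to compare the CPU feature flags each node reports, since one node is AMD and the other Intel. If the guest is shown flags that only one host has, it may crash after landing on the other. Below is a rough sketch, not a verified diagnosis. It assumes `/proc/cpuinfo` from each node has been saved to a file, and the file names and function names are just placeholders I made up.

```python
import sys


def parse_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found


def compare_flags(first: set[str], second: set[str]) -> dict[str, list[str]]:
    """Split the flags into those shared by both nodes and those unique to each."""
    return {
        "common": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python compare_flags.py pve1-cpuinfo.txt pve2-cpuinfo.txt
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        result = compare_flags(parse_flags(f1.read()), parse_flags(f2.read()))
    for name, flags in result.items():
        print(f"{name} ({len(flags)}): {' '.join(flags)}")
```

If `only_first` or `only_second` turns out to be long, that would at least suggest that the VM's CPU type should be limited to features both nodes share.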
Looking forward to your thoughts!