I'm facing an issue with a cluster of two Proxmox VE 7.3 machines. They are very different machines, yet both show the same issue.
The issue is: the host freezes under heavy load.
I found a few threads about it, but nothing solved my issue.
Basically, when I try to migrate a VM/CT from one node to another, the node that receives the VM/CT freezes and produces a lot of timeouts. It also happens when a VM/CT runs a heavy task (one that uses a lot of CPU or I/O).
For example, a migration of a powered-off VM is running right now. The first node (the one sending the VM) is working well, but the second node (the one receiving the VM) freezes. All VMs and CTs running on that node freeze as well and become unresponsive.
Screenshots of the Summary page on these different nodes:
It has been happening since I installed PVE 7.2 on these machines (this was my first time using Proxmox).
I tried a lot of things:
- Updating the kernel
- Changing the scheduler governor to a more efficient one
- Updating Intel's microcode
- Limiting ZFS memory cache size (in percentage)
The only thing that helped a bit was limiting the ZFS cache. It greatly reduced the number of freezes, but they still happen during migrations and some heavier workloads. So I think it may be something related to I/O.
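For reference, a minimal sketch of how a percentage-based ZFS cache limit translates into a concrete value. It assumes the limit is applied through the OpenZFS `zfs_arc_max` module parameter, which takes a size in bytes; the post does not say which mechanism was used. The 25% share and the helper names `arc_max_bytes` and `modprobe_line` are illustrative only.

```python
GIB = 1024 ** 3  # bytes per GiB


def arc_max_bytes(total_ram_gib: float, percent: float) -> int:
    """Return an ARC size cap in bytes for a given share of total RAM."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return int(total_ram_gib * GIB * percent / 100)


def modprobe_line(total_ram_gib: float, percent: float) -> str:
    """Format the cap as a modprobe options line for the zfs module."""
    return f"options zfs zfs_arc_max={arc_max_bytes(total_ram_gib, percent)}"


# The two hosts have 32 GB and 64 GB of RAM; 25% is an example share,
# not the value actually used on these machines.
for ram_gib in (32, 64):
    print(modprobe_line(ram_gib, 25))
```

A line like this would typically go into a file such as `/etc/modprobe.d/zfs.conf`. On a system that boots from ZFS, the initramfs probably needs to be refreshed and the host rebooted before the new limit takes effect.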
What the two servers have in common:
- Both use Intel CPUs
- Both use consumer SATA SSDs with ZFS for the OS and the main storage pool
- Both run Proxmox VE 7.3 with kernel 6.1 (I updated from 5.15 trying to fix this issue)
Main differences:
- CPU: Core i7-8700T (OptiPlex 3060 Micro) vs Xeon E5-2650 v4 (on a Lenovo server motherboard)
- Storage: one uses a single Samsung 840 EVO 250GB and the other uses two Kingston A400 250GB in RAIDZ-1
- RAM: one uses 32GB SODIMM DDR4 (2666), the other uses 64GB DIMM DDR4 with ECC
I installed the OS on both machines on the same day. I don't think it's a hardware issue, as the OptiPlex machine was working very well with Ubuntu Server previously.
It's not a network issue, as I tried different network ports on the second machine.
I tried every solution I could find on the internet and nothing solved my issue.
As I'm not very familiar with PVE, maybe I did something wrong during the installation (it was my first time doing it, but I tried to follow a few tutorials).
By the way, this migration attempt failed with the following error message:
Code:
2023-03-14 19:57:26 19:57:26 15.2G rpool/data/vm-106-disk-0@__migration__
2023-03-14 19:57:27 19:57:27 15.3G rpool/data/vm-106-disk-0@__migration__
2023-03-14 19:57:28 19:57:28 15.5G rpool/data/vm-106-disk-0@__migration__
2023-03-14 20:06:02 command 'zfs destroy rpool/data/vm-106-disk-0@__migration__' failed: got timeout
send/receive failed, cleaning up snapshot(s)..
2023-03-14 20:06:02 ERROR: storage migration for 'local-zfs:vm-106-disk-0' to storage 'local-zfs' failed - command 'set -o pipefail && pvesm export local-zfs:vm-106-disk-0 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Nikita' root@10.0.0.11 -- pvesm import local-zfs:vm-106-disk-0 zfs - -with-snapshots 0 -snapshot __migration__ -delete-snapshot __migration__ -allow-rename 1' failed: exit code 4
2023-03-14 20:06:02 aborting phase 1 - cleanup resources
2023-03-14 20:06:02 ERROR: migration aborted (duration 00:35:52): storage migration for 'local-zfs:vm-106-disk-0' to storage 'local-zfs' failed - command 'set -o pipefail && pvesm export local-zfs:vm-106-disk-0 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Nikita' root@10.0.0.11 -- pvesm import local-zfs:vm-106-disk-0 zfs - -with-snapshots 0 -snapshot __migration__ -delete-snapshot __migration__ -allow-rename 1' failed: exit code 4
TASK ERROR: migration aborted
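The log above shows progress lines about one second apart, then a long silence before the `zfs destroy` timeout. A small sketch for spotting such gaps in a saved task log, assuming every relevant line starts with a `YYYY-MM-DD HH:MM:SS` timestamp as in the excerpt; the function name `find_stalls` and the 60-second threshold are made up for illustration.

```python
import re
from datetime import datetime

# Leading task-log timestamp, e.g. "2023-03-14 19:57:26 "
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ")


def find_stalls(log_text, threshold_s=60):
    """Return (previous_ts, next_ts, gap_seconds) for gaps longer than threshold_s."""
    stamps = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:  # lines without a timestamp are skipped
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    stalls = []
    for prev, cur in zip(stamps, stamps[1:]):
        gap = (cur - prev).total_seconds()
        if gap > threshold_s:
            stalls.append((prev, cur, gap))
    return stalls


# Excerpt copied from the migration log above.
SAMPLE = """\
2023-03-14 19:57:26 19:57:26 15.2G rpool/data/vm-106-disk-0@__migration__
2023-03-14 19:57:27 19:57:27 15.3G rpool/data/vm-106-disk-0@__migration__
2023-03-14 19:57:28 19:57:28 15.5G rpool/data/vm-106-disk-0@__migration__
2023-03-14 20:06:02 command 'zfs destroy rpool/data/vm-106-disk-0@__migration__' failed: got timeout
"""

for prev, cur, gap in find_stalls(SAMPLE):
    print(f"no output for {gap:.0f}s between {prev:%H:%M:%S} and {cur:%H:%M:%S}")
```

On this excerpt it reports a single gap of 514 seconds (19:57:28 to 20:06:02), which matches the period during which the receiving node was frozen.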