Proxmox VE 7.3 freezes during heavy load

GabrielLando · Mar 15, 2023

I'm facing an issue with a cluster of two Proxmox VE 7.3 machines. The two machines are very different, yet both show the same issue.

The issue is: the host freezes during a heavy load.
I found a few threads about it, but nothing solved my issue.

Basically, when I try to migrate a VM/CT from one node to another, the node that receives the VM/CT freezes and hits a lot of timeouts. It also happens when a VM/CT runs a heavy task (one that uses a lot of CPU or I/O).

For example, a migration of a powered-off VM is happening right now. The first node (the one sending the VM) is working well, but the second node, the one receiving the VM, freezes. All VMs and CTs running on that node freeze as well and become unresponsive.

Screenshots of the Summary page on these different nodes:

It has been happening since I installed PVE 7.2 on these machines (this was my first time using Proxmox).
I tried a lot of things:
- Updating Kernel
- Changing the scheduler governor to a more efficient one
- Updating Intel's microcode
- Limiting ZFS memory cache size (in percentage)

The only thing that helped a bit was limiting the ZFS cache. It greatly reduced the number of freezes, but they still happen during migrations and some heavier workloads. So I think it can be something related with I/O.
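For readers trying the same mitigation, here is a minimal sketch of how a percentage-based ARC limit translates into the `zfs_arc_max` module option, which takes a value in bytes. The 25% share is an assumption for illustration only; the post does not say which percentage was used. The RAM sizes are the ones listed for the two machines in this thread.

```python
def arc_max_bytes(total_ram_gib: float, percent: float) -> int:
    """Return a zfs_arc_max value in bytes for a given share of RAM."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return int(total_ram_gib * 1024**3 * percent / 100)


def modprobe_line(total_ram_gib: float, percent: float) -> str:
    """Format the line as it would appear in a modprobe config file."""
    return f"options zfs zfs_arc_max={arc_max_bytes(total_ram_gib, percent)}"


# 25% is a hypothetical share, not a value taken from the thread.
print(modprobe_line(32, 25))  # 32 GB machine
print(modprobe_line(64, 25))  # 64 GB machine
```

On a typical Proxmox VE install the resulting line goes into `/etc/modprobe.d/zfs.conf`; when the root filesystem is on ZFS, the initramfs usually has to be refreshed and the host rebooted before the limit applies.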

The common things between these 2 servers:
- Both use Intel CPU
- Both use consumer SATA SSDs with ZFS for the OS and the main storage pool
- Both run Proxmox VE 7.3 with kernel 6.1 (I updated from 5.15 trying to fix this issue)

Main differences:
- CPU: Core i7 8700T (Optiplex 3060 Micro) vs Xeon E5 2650 v4 (on a Lenovo server MB)
- Storage: one uses a single Samsung 840 EVO 250GB and the other uses a dual Kingston A400 250GB in RAIDZ-1
- RAM: one uses 32GB SODIMM DDR4 (2666), other uses 64GB DIMM DDR4 with ECC

I installed the OS on both machines on the same day. I don't think it's a hardware issue, as the Optiplex machine was working very well with Ubuntu Server previously.
It's not a network issue, as I tried with different network ports on the second machine.

I tried every possible solution I found on the internet and nothing solved my issue.

As I'm not very familiar with PVE, maybe I did something wrong during the installation (it was my first time doing it, but I tried to follow a few tutorials).

By the way, this migration attempt failed with this error message:

Code:

2023-03-14 19:57:26 19:57:26   15.2G   rpool/data/vm-106-disk-0@__migration__
2023-03-14 19:57:27 19:57:27   15.3G   rpool/data/vm-106-disk-0@__migration__
2023-03-14 19:57:28 19:57:28   15.5G   rpool/data/vm-106-disk-0@__migration__
2023-03-14 20:06:02 command 'zfs destroy rpool/data/vm-106-disk-0@__migration__' failed: got timeout
send/receive failed, cleaning up snapshot(s)..
2023-03-14 20:06:02 ERROR: storage migration for 'local-zfs:vm-106-disk-0' to storage 'local-zfs' failed - command 'set -o pipefail && pvesm export local-zfs:vm-106-disk-0 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Nikita' root@10.0.0.11 -- pvesm import local-zfs:vm-106-disk-0 zfs - -with-snapshots 0 -snapshot __migration__ -delete-snapshot __migration__ -allow-rename 1' failed: exit code 4
2023-03-14 20:06:02 aborting phase 1 - cleanup resources
2023-03-14 20:06:02 ERROR: migration aborted (duration 00:35:52): storage migration for 'local-zfs:vm-106-disk-0' to storage 'local-zfs' failed - command 'set -o pipefail && pvesm export local-zfs:vm-106-disk-0 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Nikita' root@10.0.0.11 -- pvesm import local-zfs:vm-106-disk-0 zfs - -with-snapshots 0 -snapshot __migration__ -delete-snapshot __migration__ -allow-rename 1' failed: exit code 4
TASK ERROR: migration aborted
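A small sketch for spotting where a task log like the one above stalls: it parses the leading timestamp of each line and reports the largest gap between consecutive entries. The `largest_gap` helper is hypothetical (it is not part of any Proxmox tooling), and the sample only includes the first four log lines from the post.

```python
from datetime import datetime

LOG = """\
2023-03-14 19:57:26 19:57:26   15.2G   rpool/data/vm-106-disk-0@__migration__
2023-03-14 19:57:27 19:57:27   15.3G   rpool/data/vm-106-disk-0@__migration__
2023-03-14 19:57:28 19:57:28   15.5G   rpool/data/vm-106-disk-0@__migration__
2023-03-14 20:06:02 command 'zfs destroy rpool/data/vm-106-disk-0@__migration__' failed: got timeout
"""


def largest_gap(log: str):
    """Return (gap, start, end) for the longest pause between log entries."""
    stamps = []
    for line in log.splitlines():
        try:
            # The first 19 characters hold "YYYY-MM-DD HH:MM:SS".
            stamps.append(datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S"))
        except ValueError:
            continue  # skip lines without a leading timestamp
    gaps = [(b - a, a, b) for a, b in zip(stamps, stamps[1:])]
    return max(gaps, key=lambda g: g[0]) if gaps else None


gap, start, end = largest_gap(LOG)
print(f"longest stall: {gap} (from {start.time()} to {end.time()})")
# prints: longest stall: 0:08:34 (from 19:57:28 to 20:06:02)
```

In this excerpt the transfer progress stops at 19:57:28 and the next entry is the `zfs destroy` timeout more than eight minutes later, which is consistent with the receiving node being unresponsive during that window.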

Falk R. · Mar 15, 2023

Hi,
You can't expect much performance from consumer SSDs. The high I/O delay on the source machine points to time spent waiting on the disk.
In addition, I would be interested to see how heavily the CPU cores are overcommitted.
When slow disks/SSDs meet heavily overcommitted CPUs, exactly these effects can occur.
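As a rough illustration of what "overbooked" means here, the sketch below computes the ratio of vCPUs assigned to guests versus hardware threads on the host. The guest list and thread count are hypothetical values, not figures from this thread, and the helper name is made up for the example.

```python
def overcommit_ratio(vcpus_per_guest: list[int], host_threads: int) -> float:
    """Ratio of allocated vCPUs to host hardware threads (1.0 = fully booked)."""
    if host_threads <= 0:
        raise ValueError("host_threads must be positive")
    return sum(vcpus_per_guest) / host_threads


# Hypothetical host: 12 hardware threads, six guests with these vCPU counts.
guests = [4, 4, 2, 2, 8, 4]
ratio = overcommit_ratio(guests, host_threads=12)
print(f"{sum(guests)} vCPUs on 12 threads -> ratio {ratio:.1f}")
```

A ratio above 1.0 is not a problem in itself, since guests rarely use all their vCPUs at once, but the higher it is, the more a storage stall is amplified by guests competing for CPU time.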

Max Carrara · Mar 15, 2023

gabriellando said:
So I think it can be something related with I/O.

Your guess would be correct.

I've dug around a little bit and found some interesting information regarding your Samsung 840 EVO which confirms my suspicions:

[...] a buffer up to 12GB in size to which data is first written before later being transferred to the SSD when there is idle time. The buffer varies by capacity at 3GB for 120 and 250GB, 6GB at 500GB, 9GB at 750GB and 12GB at the top-tier 1TB model.

(Source: https://www.storagereview.com/review/samsung-840-evo-ssd-review)

You can expect something similar from your two Kingston A400s.

The vast majority of consumer SSDs have an internal cache, e.g. a DRAM cache or a flash cache running in a faster pseudo-SLC mode. The cache is faster than the actual storage cells and will buffer all write operations behind the scenes. You won't even notice it exists until, well, the cache is full. Then your drives' performance will degrade rapidly. If you wanna look for similar issues, just search for "Samsung QVO" here on the forums. A lot of similar threads will pop up; it's unfortunately a rather common issue.
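To make the effect concrete, here is a toy model of a sustained write that first fills the cache at full speed and then continues at the slower native speed. The 3 GB cache size comes from the review quoted above; the two throughput figures are made-up assumptions, not measured values for any drive in this thread, and the model ignores the cache draining in the background.

```python
def write_time_seconds(total_gb: float, cache_gb: float,
                       cached_mb_s: float, sustained_mb_s: float) -> float:
    """Estimate the time for one continuous write of total_gb gigabytes."""
    cached_part = min(total_gb, cache_gb)   # absorbed by the fast cache
    remainder = total_gb - cached_part      # written at the slower native speed
    return cached_part * 1000 / cached_mb_s + remainder * 1000 / sustained_mb_s


# Assumed speeds (hypothetical): 500 MB/s while cached, 100 MB/s once it is full.
with_cliff = write_time_seconds(15.5, cache_gb=3, cached_mb_s=500, sustained_mb_s=100)
no_cliff = write_time_seconds(15.5, cache_gb=15.5, cached_mb_s=500, sustained_mb_s=100)
print(f"with 3 GB cache: {with_cliff:.0f} s, if everything fit in cache: {no_cliff:.0f} s")
```

Under these assumed numbers a 15.5 GB transfer takes roughly four times longer than the headline speed suggests, and on a ZFS pool that also hosts running guests, every other write queues behind it during that time.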

In short: Don't use consumer SSDs for your professional workloads. They are, to put it mildly, garbage and often mislead the customer.

gabriellando said:
a dual Kingston A400 250GB in RAIDZ-1

By the way, do you mean in a mirror? RAID-Z1 requires at least three drives.

Max Carrara · Mar 15, 2023

Max Carrara said:
RAID-Z1 requires at least three drives.

Mea culpa; it seems that it's indeed possible to have a pool in RAID-Z1 with only two drives.

GabrielLando · Mar 15, 2023

Max Carrara said:
Mea culpa; it seems that it's indeed possible to have a pool in RAID-Z1 with only two drives.

Sorry, my mistake, I'm using RAID1 with ZFS. I was taking a look at my zpool and it appears as "mirror-0".

Max Carrara said:
In short: Don't use consumer SSDs for your professional workloads. They are, to put it mildly, garbage and often mislead the customer.

Thanks, I'll replace my SSDs with enterprise ones, but since they are very expensive here in Brazil (even second-hand), I'll do it in the next month or two.

I didn't know that a consumer SSD could cause this type of issue. I thought they were fast enough, but I was wrong.

Thanks a lot for your patience.

If the issue persists, I'll reopen this thread.

GabrielLando · Jun 8, 2023

Hi guys, thanks for the tips.

I was finally able to replace my SSDs. I bought some second-hand Samsung SM863 drives with 1% wearout. I replaced the consumer SSDs in both servers and now everything is awesome.
