Hi
We are running a three-node PVE 5.1 cluster: two nodes run virtual machines, and a backup node stores VM backups and can host virtual machines in an emergency.
All servers use locally attached storage. The two primary VM hosts each have 2x 4TB SATA drives in a ZFS mirror plus 2x 450GB NVMe drives. The backup node has 12x 4TB SAS drives in a ZFS RAID-Z2 and no SSDs.
Within the last few weeks (shortly after Microsoft released their Spectre/Meltdown patches), we patched some Windows Server 2012 R2 VMs on one of the hosts, and users then started reporting random hangs. We removed the patches from the VMs again, but that made no difference.
The hangs are described as lasting between 5 and 30 seconds, after which everything runs fine again.
Our attention then turned to the physical PVE hosts, which have also been patched. No BIOS/UEFI patches have been installed yet.
Until a few days ago, we had one SSD for L2ARC and one SSD for ZIL. This had been performing well so far, but as an experiment we changed the configuration: we partitioned both NVMe drives and assigned one partition from each to a mirrored SLOG and the other to the L2ARC.
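For completeness, the change was made with commands along these lines (reconstructed here as a sketch rather than an exact transcript; device names match the zpool status output below):
Code:
# remove the single-disk SLOG and L2ARC devices
zpool remove rpool nvme1n1
zpool remove rpool nvme0n1
# after repartitioning both NVMe drives (small partition for SLOG, rest for L2ARC),
# add a mirrored SLOG and both cache partitions
zpool add rpool log mirror nvme0n1p1 nvme1n1p1
zpool add rpool cache nvme0n1p2 nvme1n1p2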
This was done based on the user feedback: it looked like random cache misses, where data is no longer in the L2ARC and has to be fetched from the slower SATA drives. Because of the access times on those disks, when multiple users read at the same time we see random hangs ("Application not responding").
The node where we see the issues primarily runs Windows Server guests (due to Windows Server licensing), while the other node runs everything else (primarily CentOS 6, CentOS 7 and Debian 9).
Code:
# zpool status
  pool: rpool
 state: ONLINE
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            sda2       ONLINE       0     0     0
            sdb2       ONLINE       0     0     0
        logs
          mirror-1     ONLINE       0     0     0
            nvme0n1p1  ONLINE       0     0     0
            nvme1n1p1  ONLINE       0     0     0
        cache
          nvme0n1p2    ONLINE       0     0     0
          nvme1n1p2    ONLINE       0     0     0

errors: No known data errors
We applied the same configuration on both primary hosts, as they are physically the same type of server. After the change we gave the servers a few days to rebuild the cache; when we checked again a few hours ago, performance had degraded further on both nodes.
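To see where the reads actually land, something like the following should show whether they are being served by the cache partitions or falling through to the SATA mirror (a rough sketch, nothing exhaustive):
Code:
# per-vdev read/write ops and bandwidth, refreshed every 5 seconds
zpool iostat -v rpool 5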
The original configuration looked like this:
Code:
# zpool status
  pool: rpool
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        rpool        ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            sda2     ONLINE       0     0     0
            sdb2     ONLINE       0     0     0
        logs
          nvme1n1    ONLINE       0     0     0
        cache
          nvme0n1    ONLINE       0     0     0

errors: No known data errors
Both servers are running this version:
Code:
pve-manager/5.1-49/1e427a54 (running kernel: 4.13.13-6-pve)
Tonight we will revert to the old configuration to regain the desired performance.

Does anyone else see similar issues?
Does anyone have any tips for how to find out whether the cache misses are really the cause?
Can we get anything interesting from monitoring with arcstat?
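For context, this is roughly the kind of output we were hoping to interpret, assuming the arcstat that ships with ZFS on Linux (field names may differ between versions):
Code:
# sample ARC and L2ARC hit/miss rates every 5 seconds
arcstat -f time,read,hit%,miss%,l2read,l2hit%,l2miss%,arcsz,l2size 5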