HP ProLiant DL380 Gen8+9 and SAS issues: extremely slow on PVE 7.0-11

I have two HP servers here with the same problem (three more DL380 servers with older PVE versions are working fine):

* SATA and SSD disks work fine
* The SAS RAID is extremely slow: restoring a 30 GB disk with qmrestore takes 5 hours.


Gen8: Smart Array P420i; Gen9: Smart Array P840. Boot: SSD; 8x 600 GB SAS in RAID 6 -> /dev/sdb, and on the other server 4x 4 TB SAS, also in RAID 6.
-> The SAS RAID carries LVM-thin (see the storage.cfg sketch below)
-> When restoring a VM to the SAS RAID it starts OK, but after about 50% the speed drops extremely - instead of 1 minute I needed 5-8 hours to restore one VM.
-> When restoring to a local SATA disk, performance is OK.
-> No error messages in dmesg or anywhere else.
-> CPU, memory etc. are completely OK.
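For reference, the SAS RAID is set up as an LVM-thin storage roughly like this in /etc/pve/storage.cfg (storage, VG and pool names here are only placeholders, not my exact config):
Code:
lvmthin: sas-thin
        vgname sasvg
        thinpool data
        content images,rootdir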

Please help! The only solution I see is to drop the SAS drives and buy SSDs... but could this be a problem of PVE 7?

Code:
root@taurus:# pveperf
CPU BOGOMIPS: 207865.20
REGEX/SECOND: 2778674
HD SIZE: 93.93 GB (/dev/mapper/pve-root)
BUFFERED READS: 154.11 MB/sec
AVERAGE SEEK TIME: 13.19 ms
FSYNCS/SECOND: 18.91
DNS EXT: 24.50 ms
DNS INT: 29.17 ms (vipweb.at)
 
Additional information: I have reinstalled 2 more new servers and had the same problem: qmrestore to a single HDD or single SSD works without problems. But using a RAID on the server, e.g. RAID 5 with 3x SAS or 3x SSD (lvm-thin), slows the restore down considerably. For a 30 GB VM it takes about 30 min to reach 100%, then about 10 min of waiting, then it is restored. But with an 80 GB VM it gets stuck after about 25% and would need weeks. After stopping that task, nothing works correctly any more: there is a lock on the lvm-thin, and no backup etc. will work any more.
 
That is some curious behavior. Have you checked the kernel messages in these situations? (dmesg, /var/log/kern.log)

Can you try to copy some large file to that storage? See if it also stalls at some point?
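
For example, something along these lines - /dev/sdb is from your earlier post, but the pool name (pve/data is only the default local-lvm pool) and the sizes are examples, so adjust them to your setup:
Code:
# non-destructive sequential read test of the RAID device
dd if=/dev/sdb of=/dev/null bs=1M count=10240 status=progress

# write test: temporary thin volume in the pool, written once, then removed
lvcreate -V 20G -T pve/data -n speedtest
dd if=/dev/zero of=/dev/pve/speedtest bs=1M count=10240 conv=fdatasync status=progress
lvremove -y pve/speedtest
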

Do the RAID controllers have a BBU? If so, is it still okay?
Is the latest firmware installed?
 
Hi. No, there is no BBU, and the write cache is disabled. Anyhow, the drives are really fast, but only when accessed directly, not via LVM or LVM-thin. Yes, I have updated to the latest version. I have tested with 3 different HP DL380 servers (Gen8, Gen9, different configs). With Proxmox below roughly 6.4 I never had these problems (and still have running VMs that don't show this trouble). But after the upgrade, qmrestore gets stuck at about 20-30 GB and the remaining time just keeps growing! Therefore I aborted the process (Ctrl-C).
The main problem for now is that
* if I reboot the server, all VMs are lost - I can't recover them; they are still in the LVM-thin pool but not accessible
* if I want to make a backup, there is a lock on the LVM-thin, so I can't access it.
So for now I am in a deadlock.
 
No kernel entries. In the vzdump log files there is only the notice that the backup started, then nothing more.
I can reproduce this behaviour: freshly installed server, current PVE. Use 3 disks (e.g. SAS) and one boot disk with spare space for backups. Copy an 80 GB VM backup from a running PVE to this server and qmrestore it to the lvm-thin. The restore gets stuck at about 25% ..
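The steps look roughly like this (archive name and storage ID are only examples):
Code:
# copy the backup archive from the old PVE node
scp root@oldpve:/var/lib/vz/dump/vzdump-qemu-405-2022_04_13-11_12_39.vma.zst /BACKUP/dump/

# restore it onto the lvm-thin storage
qmrestore /BACKUP/dump/vzdump-qemu-405-2022_04_13-11_12_39.vma.zst 405 --storage local-lvm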
 
Please never use RAID 5 or 6 without a BBU - you will average at most around 30 MB/s.
With a BBU you get between 300 and 2000 MB/s, depending on the disks.
With RAID controllers the disk cache should always be off (except for RAID 0).
If you don't have a BBU at hand, software RAID is the better and faster option.
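A rough sketch of what that could look like (device names are examples, the disks must be exposed individually by the controller, and their contents are destroyed):
Code:
# software RAID 5 over three disks
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

# use it as an LVM physical volume and create a thin pool on it
pvcreate /dev/md0
vgcreate vmdata /dev/md0
lvcreate -l 95%FREE --thinpool data vmdata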
 
Maybe you are right. But this does not explain why the first 20 GB restore at normal speed and then the restore takes exponentially longer. It also does not explain why this worked correctly with PVE before 6.4 (e.g. 5.4 worked like a charm). It does not explain why aborting it locks the whole server so that I can't make a backup. And it does not explain why it is FASTER (normal speed) without RAID. And it does not explain why, after aborting qmrestore, the whole lvm-thin is "nearly" destroyed, so that no backup is possible, no disk move is possible, no reboot is possible. What can I do for now?
 
FSYNCS/SECOND: 18.91
Your storage is VERY slow; you will never get acceptable results with this RAID setup. Fix your RAID controller cache settings (BBU).
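If the HPE ssacli (formerly hpssacli) tool is installed, you can check the cache and battery state roughly like this (the slot number is an example):
Code:
# controller, cache and battery/capacitor status
ssacli ctrl all show status

# detailed configuration incl. per-logical-drive cache settings
ssacli ctrl slot=0 show config detail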
 
I will fix this - but first I want to make a backup. Due to a software problem with qmrestore there is a lock on the lvm-thin, so no backup is possible. Do you have an idea how to resolve this? (Is there an fsck for LVM-thin that clears locks and repairs the volumes inside?) Thanks
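From what I have read there is thin_check / lvconvert --repair for the pool metadata - would something along these lines be the right direction (pool name is only the local-lvm default, and I have not run this yet)?
Code:
# all thin LVs in the pool (and the VMs on them) have to be offline first
lvchange -an pve/data
# runs a check/repair on the thin pool metadata
lvconvert --repair pve/data
# re-activate and check the state
lvchange -ay pve/data
lvs -a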
 
there is a lock on the lvm-thin - and no backup is possible
How do you want to make the backup, and what do you mean by a lock on the lvm-thin?
If you run into some errors, please show the command you try to run and the error you get :)
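It would also help to see the state of the thin pool itself, for example the output of (assuming the default local-lvm pool; adjust if your SAS pool has a different name):
Code:
lvs -a
vgs
dmsetup status | grep -i thin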
 
Perfect idea! But there is no error message. It just gets stuck and does nothing. In future I want to use Proxmox Backup Server, but for that I first have to make a backup to a local disk:
Code:
root@taurus:/BACKUP/dump# vzdump 405
INFO: starting new backup job: vzdump 405
INFO: Starting Backup of VM 405 (qemu)
INFO: Backup started at 2022-04-13 11:12:39
INFO: status = running
INFO: VM Name: TeamSpeak
INFO: include disk 'scsi0' 'local-lvm:vm-405-disk-0' 32G

-> and then nothing (I waited 24 hours) ... nothing happened. So it seems that vzdump can't access the lvm-thin volume where the VM is located.
 
Hmm, I do suspect that there could be more of an issue here with the underlying storage not responding, causing the processes to be stuck waiting for IO responses.

What is the output of ps auxwf | grep vzdump? Please post it inside [code][/code] tags.
 
Here you are. No further output, no entries in the logs, no errors.
The VMs are still running, and backups of VMs that are NOT on the lvm-thin are working.

Code:
root@taurus:/var/log# ps auxwf | grep vzdump
root     3693050  0.1  0.0 317116 116752 pts/0   S    11:12   0:01  |       \_ /usr/bin/perl -T /usr/bin/vzdump 405
root     3693053  0.0  0.0 324380 99016 pts/0    S+   11:12   0:00  |           \_ task UPID:taurus:003859FD:54CDEC94:62569407:vzdump:405:root@pam:
root     3695953  0.0  0.0   6180   728 pts/1    S+   11:30   0:00              \_ grep vzdump
 
Hmm okay, so the process is in S state and not stuck in D state.

How is the IO Delay in the GUI in the summary panel of the Node?

You should see a temporary file in the directory where you store the backup. Is it growing in size?

Writing or reading anything in that VM works as expected, right?

Anything from the kernel that hints at some IO problems? You can check it either via dmesg or by taking a look at /var/log/kern.log
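You can also check whether anything is stuck waiting on IO and whether the dump file grows at all, for example (the backup path is taken from your earlier output):
Code:
# processes in uninterruptible IO wait (state D)
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

# kernel hung-task messages, if any
dmesg | grep -i 'hung task'

# does the temporary vzdump file grow?
watch -n 5 ls -lh /BACKUP/dump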
 
In the GUI almost everything is greyed out.
kern.log shows nothing about disks.
The backup directory contains neither a log nor a zst file for this job.
dmesg does not show any disk device problems.
Code:
dmesg 
...
[14051133.396868] vmbr0: port 5(fwpr401p0) entered disabled state
[14092732.624156] fwbr402i0: port 2(tap402i0) entered disabled state
[14092732.647546] fwbr402i0: port 1(fwln402i0) entered disabled state
[14092732.647703] vmbr0: port 6(fwpr402p0) entered disabled state
[14092732.648171] device fwln402i0 left promiscuous mode
[14092732.648176] fwbr402i0: port 1(fwln402i0) entered disabled state
[14092732.670682] device fwpr402p0 left promiscuous mode
[14092732.670689] vmbr0: port 6(fwpr402p0) entered disabled state

kern.log
..
Apr 11 21:39:43 taurus kernel: [14092732.648171] device fwln402i0 left promiscuous mode
Apr 11 21:39:43 taurus kernel: [14092732.648176] fwbr402i0: port 1(fwln402i0) entered disabled state
Apr 11 21:39:43 taurus kernel: [14092732.670682] device fwpr402p0 left promiscuous mode
Apr 11 21:39:43 taurus kernel: [14092732.670689] vmbr0: port 6(fwpr402p0) entered disabled state
Sorry, I can't see the fault - but it is reproducible with a newly installed server. And with other servers.
 
In the GUI almost everything is greyed out.
But what if you click on the node and then on the "Summary" menu item?
You should see the CPU and IO Delay graph, similar to the attached screenshot (1650459565940.png).
How is the IO Delay?


Sorry, I can't see the fault - but it is reproducible with a newly installed server. And with other servers.
With similar hardware? Hardware RAID in RAID 5 with (thin) LVM on it?

Do you still have some similar server that does not have this problem? For example because it is still on Proxmox VE 6?

How is the FSYNC/Second result there if you run pveperf?
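pveperf can also be pointed at a specific directory, so you can compare the boot SSD with a path that lives on the RAID-backed storage (the path below is only an example):
Code:
# default: tests the filesystem of /
pveperf

# test a specific mounted path instead
pveperf /mnt/sas-test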
 
On the described server, where I want to make the backup, everything in the GUI is greyed out; CPU/server load and IO are not displayed at all.
I have 2 similar servers (HP DL380) but with different configurations: one with RAID 5 and 3x SSD, one with mirrored SAS, etc. I upgraded one from 6.1 to one of the last 6.x releases - before that, qmrestore and disk moves ran at normal speed. After the upgrade I had the first disaster. Then I bought a new HP DL380 as a second server to move the VMs to. There, the same problem. With a single SSD, speed and everything is OK. But as soon as any RAID is used, qmrestore, disk moves etc. no longer work properly.
For now I see a big bug: if qmrestore is aborted while it is writing to an lvm-thin on a RAID, the LVM volume becomes inaccessible.
 
This sounds like RAID without BBWC/FBWC.
HPE also says that you should only run VMs with battery-backed cache.
Without BBWC the RAID controller is only supported for the OS; if you set the BBWC cache to write-back, performance is great and all the other phenomena should be gone.
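With the HPE ssacli tool that looks roughly like this - slot and logical drive numbers are examples, and the exact syntax can differ between ssacli/hpssacli versions:
Code:
# enable the array accelerator (controller cache) for a logical drive
ssacli ctrl slot=0 logicaldrive 2 modify arrayaccelerator=enable

# example cache ratio: 10% read / 90% write
ssacli ctrl slot=0 modify cacheratio=10/90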
 
