Problems running Proxmox on a ThinkSystem SR635

crepitacion

New Member
Mar 16, 2022
Hello everyone.

I'm new to Proxmox and I'm having a difficult experience right now.
A few months ago I started working in an environment with 7 HP ProLiant DL360 Gen10 servers (Intel processors, SAS disks), all of them with Proxmox 7.1-7 running Windows and Linux VMs and also some containers.
6 of the servers make their backups to an NFS server virtualized on the 7th server. So far so good: backups and restores were running fine and fast for a 1G network.
The story starts now.
My boss bought a new server, a Lenovo ThinkSystem SR635, very similar to the HPE ones but in this case with an AMD EPYC processor.
When I tried to install Proxmox 7.1-2 on this server, the installation froze on the "loading initial ramdisk" screen. After some searching it looked like a kernel problem, so I tried to install Proxmox 7.0-2 and voilà, the installation ran fine. I created a test Windows Server VM and it ran great.
The problem came when I tried to restore another VM. This VM has less than 180GB in the backup and took less than 30 minutes to restore on the original host, but on the Lenovo server it had only reached 45% progress after 3 hours, and at the same time the brand new test Windows Server VM was frozen. I stopped the restore task and after several minutes it closed, but the test Windows Server VM was inaccessible.
Some recommendations pointed me to an update/upgrade and I tried that, but the newest kernels bring me back to the initial "loading initial ramdisk" failure.
If someone could help me, or needs more info, please let me know.
Thanks in advance.
 
Default kernels used:
PVE 7.2: 5.15
PVE 7.1: 5.13
PVE 7.0: 5.11
PVE 6.4: 5.4

So if 7.0 and 6.x work for you, you could try the 5.11 or 5.4 kernel with PVE 7.2.
Also keep in mind that PVE 6.x goes End-of-Life this month.
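If you go that route, the rough steps on PVE 7.x would look something like this (a sketch only; check 'apt search pve-kernel' for the exact meta-package names, and 'proxmox-boot-tool kernel pin' needs a reasonably recent proxmox-boot-tool):

apt update
apt install pve-kernel-5.11
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin <one of the 5.11 versions shown by 'kernel list'>
reboot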
 
With PVE 7.2 and the 5.11 kernel it works; with a higher kernel it doesn't. I'll try to restore the VM with this combination and report back later.
Thanks!
 
Is the CPU type of any of the VMs set to "host"? That can also be problematic when migrating VMs between nodes with Intel and AMD CPUs.
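You can check that quickly with qm, e.g. for VM ID 100 (the ID is just an example):

qm config 100 | grep -i cpu

If there is no "cpu:" line in the output, the VM uses the default type (kvm64 on PVE 7).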
 
In this case, the CPU type is kvm64.

And the restore doesn't look good:

progress 1% (read 13035241472 bytes, duration 0 sec)
progress 2% (read 26070482944 bytes, duration 0 sec)
progress 3% (read 39105724416 bytes, duration 1 sec)
progress 4% (read 52140965888 bytes, duration 1 sec)
progress 5% (read 65176141824 bytes, duration 1 sec)
progress 6% (read 78211383296 bytes, duration 1 sec)
progress 7% (read 91246624768 bytes, duration 1 sec)
progress 8% (read 104281866240 bytes, duration 2 sec)
progress 9% (read 117317042176 bytes, duration 2 sec)
progress 10% (read 130352283648 bytes, duration 2 sec)
progress 11% (read 143387525120 bytes, duration 14 sec)
progress 12% (read 156422766592 bytes, duration 48 sec)
progress 13% (read 169457942528 bytes, duration 83 sec)
progress 14% (read 182493184000 bytes, duration 120 sec)
progress 15% (read 195528425472 bytes, duration 949 sec)
progress 16% (read 208563666944 bytes, duration 1944 sec)

What can I do?
 
Are the disk models and HW RAID/HBAs the same on all nodes? Maybe the new node just got slower storage?
 
HP servers:
Smart Array P408i-a NC with SAS HDD 12G Enterprise 10K SFF

The Lenovo server:
RAID 730-8i 1GB Cache PCIe 12Gb Adapter with 10K SAS 12Gb Hot Swap 512n

According to the documentation they are almost the same. The firmware is up to date.
 
Are you restoring from NFS? If yes: are there signs of network problems in "netstat -ni"? Is a simple "cp" from the NFS server just as slow? What about "dd bs=32k if=nfs:/path/file of=/dev/null"?
Try iperf to baseline the network.
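For example, something along these lines, assuming the backup storage is mounted by PVE under /mnt/pve/<storage> (placeholder names, adjust to your setup):

netstat -ni                                    # look for RX/TX errors or drops
ls -lh /mnt/pve/<storage>/dump/                # pick a large existing vzdump file
dd if=/mnt/pve/<storage>/dump/<backupfile> of=/dev/null bs=32k status=progress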


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Yes, it is NFS. No network problems. All other servers using this storage for backups run as fast as possible over a 1 Gbps network. I tested restoring the same VM from two other servers and the entire task finished in less than 30 minutes.
 
An iperf3 test still won't hurt. Maybe the Lenovo's NICs don't work well with the drivers PVE ships, so network problems might not affect the other nodes.
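For reference, a minimal iperf3 run would be (package "iperf3"; the IP is a placeholder):

iperf3 -s                       # on the NFS/backup server
iperf3 -c <backup-server-ip>    # on the SR635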
 
Yes, it is NFS. No network problems.
IMHO it's a little early to definitively state this. You've done an application test (vzdump) and it indicated slowness.
You should do a basic NFS test (cp/dd), followed by a network-layer test (iperf). The goal is to isolate the problem: is it vzdump, the disk, the NIC, the network, etc.
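To take NFS and the NIC out of the picture entirely, you could also baseline the local array with a plain sequential read, e.g. (device name is an assumption, check with lsblk; the read is non-destructive):

lsblk
dd if=/dev/sda of=/dev/null bs=1M count=4096 iflag=direct status=progress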


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Sorry for the delay, but I had to deal with other issues. Here is the update:
When I checked the restore task I found this surprise:
progress 100% (read 1303522574336 bytes, duration 58969 sec)
total bytes read 1303522574336, sparse bytes 505612627968 (38.8%)
space reduction due to 4K zero blocks 4.8%
rescan volumes...
TASK OK
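For reference: roughly 1.3 TB read in 58969 seconds (about 16.4 hours) works out to around 22 MB/s on average, far below the ~110 MB/s a saturated 1 Gbps link can deliver.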

As for the network test, I'm running it right now. I'll update this ASAP.

Thanks.
 
I'm running an SR635 with EPYC too and I don't have any problems (kernel 5.13, 2x M.2 SSD in RAID1, Mellanox NIC).

Maybe it is related to your RAID controller? LSI controllers are known to have some regressions with the 5.15 kernel (I have seen bugs with my Dell servers).
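If you want to check which driver the controller uses and whether the kernel complains about it, something like this should show it (the driver names are just common LSI ones, the output differs per system):

lspci -nnk | grep -iA3 raid
dmesg | grep -iE 'megaraid|mpt3sas'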
 
An iperf3 test still won't hurt. Maybe the Lenovo's NICs don't work well with the drivers PVE ships, so network problems might not affect the other nodes.
Here it is, and it looks OK considering the 1 Gb network:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 114 MBytes 958 Mbits/sec 0 419 KBytes
[ 5] 1.00-2.00 sec 112 MBytes 944 Mbits/sec 0 419 KBytes
[ 5] 2.00-3.00 sec 112 MBytes 938 Mbits/sec 0 419 KBytes
[ 5] 3.00-4.00 sec 113 MBytes 946 Mbits/sec 0 419 KBytes
[ 5] 4.00-5.00 sec 112 MBytes 936 Mbits/sec 0 437 KBytes
[ 5] 5.00-6.00 sec 112 MBytes 942 Mbits/sec 0 437 KBytes
[ 5] 6.00-7.00 sec 113 MBytes 946 Mbits/sec 0 437 KBytes
[ 5] 7.00-8.00 sec 112 MBytes 937 Mbits/sec 0 437 KBytes
[ 5] 8.00-9.00 sec 113 MBytes 948 Mbits/sec 0 437 KBytes
[ 5] 9.00-10.00 sec 112 MBytes 940 Mbits/sec 0 530 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.10 GBytes 943 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec receiver

iperf Done.
 
Maybe it is related to your RAID controller? LSI controllers are known to have some regressions with the 5.15 kernel (I have seen bugs with my Dell servers).
Others reported problems too, for example with the Lenovo 530-8i and the 5.15 kernel in the 'Proxmox 7.2 Upgrade Broke My RAID' thread: https://forum.proxmox.com/threads/proxmox-7-2-upgrade-broke-my-raid.109368/

Not sure how similar the 730-8i and 530-8i are.
 
