Problems running Proxmox on a ThinkSystem SR635

crepitacion

New Member
Mar 16, 2022
Hello everyone.

I'm new to Proxmox and I'm having a difficult experience right now.
A few months ago I started working in an environment with 7 HP ProLiant DL360 Gen10 servers (Intel processors, SAS disks), all of them with Proxmox 7.1-7 running Windows and Linux VMs and also some containers.
6 of the servers make their backups to an NFS server virtualized on the 7th server. So far so good: backups and restores were running fine and fast for a 1G network.
The story starts now.
My boss bought a new server, a Lenovo ThinkSystem SR635, very similar to the HPE ones but in this case with an AMD EPYC processor.
When I tried to install Proxmox 7.1-2 on this server, the installation froze on the "loading initial ramdisk" screen. After some searching it looked like a kernel problem, so I tried to install Proxmox 7.0-2 and voilà, the installation ran fine. I created a test Windows Server VM and it ran great.
The problem came when I tried to restore another VM. This VM has less than 180GB in the backup and took less than 30 minutes to restore on the original host, but on the Lenovo server it had only reached 45% progress after 3 hours, and at the same time the brand new test Windows Server VM was frozen. I stopped the restore task and after several minutes it closed, but the test Windows Server VM was inaccessible.
Some recommendations pointed me to an update/upgrade and I tried that, but the newest kernels bring me back to the initial "loading initial ramdisk" failure.
If someone could help me, or needs more info, please let me know.
Thanks in advance.
 
Default kernels used:
PVE 7.2: 5.15
PVE 7.1: 5.13
PVE 7.0: 5.11
PVE 6.4: 5.4

So if 7.0 and 6.x work for you, you could try the 5.11 or 5.4 kernel with PVE 7.2.
Also keep in mind that PVE 6.x goes End-of-Life this month.
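If you go that route, the rough steps on PVE 7.x would look something like this (a sketch only; check 'apt search pve-kernel' for the exact meta-package names, and 'proxmox-boot-tool kernel pin' needs a reasonably recent proxmox-boot-tool):

apt update
apt install pve-kernel-5.11
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin <one of the 5.11 versions shown by 'kernel list'>
reboot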
 
With PVE 7.2 and the 5.11 kernel it works; with a higher kernel it doesn't. I'll try to restore the VM with this combination and report back later.
Thanks!
 
Is the CPU type of any of the VMs set to "host"? That can also be problematic when migrating VMs between nodes with Intel and AMD CPUs.
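You can check that quickly with qm, e.g. for VM ID 100 (the ID is just an example):

qm config 100 | grep -i cpu

If there is no "cpu:" line in the output, the VM uses the default type (kvm64 on PVE 7).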
 
In this case, the CPU type is kvm64.

And the restore doesn't look good:

progress 1% (read 13035241472 bytes, duration 0 sec)
progress 2% (read 26070482944 bytes, duration 0 sec)
progress 3% (read 39105724416 bytes, duration 1 sec)
progress 4% (read 52140965888 bytes, duration 1 sec)
progress 5% (read 65176141824 bytes, duration 1 sec)
progress 6% (read 78211383296 bytes, duration 1 sec)
progress 7% (read 91246624768 bytes, duration 1 sec)
progress 8% (read 104281866240 bytes, duration 2 sec)
progress 9% (read 117317042176 bytes, duration 2 sec)
progress 10% (read 130352283648 bytes, duration 2 sec)
progress 11% (read 143387525120 bytes, duration 14 sec)
progress 12% (read 156422766592 bytes, duration 48 sec)
progress 13% (read 169457942528 bytes, duration 83 sec)
progress 14% (read 182493184000 bytes, duration 120 sec)
progress 15% (read 195528425472 bytes, duration 949 sec)
progress 16% (read 208563666944 bytes, duration 1944 sec)

What can I do?
 
Are the disk models and HW RAID/HBAs the same on all nodes? Maybe the new node just got slower storage?
 
HP servers:
Smart Array P408i-a NC with SAS HDD 12G Enterprise 10K SFF

The Lenovo server:
RAID 730-8i 1GB Cache PCIe 12Gb Adapter with 10K SAS 12Gb Hot Swap 512n

According to the documentation they are almost the same. The firmware is up to date.
 
Are you restoring from NFS? If yes: are there signs of network problems in "netstat -ni"? Is a simple "cp" from the NFS server just as slow? What about "dd bs=32k if=nfs:/path/file of=/dev/null"?
Try iperf to baseline the network.
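For example, something along these lines, assuming the backup storage is mounted by PVE under /mnt/pve/<storage> (placeholder names, adjust to your setup):

netstat -ni                                    # look for RX/TX errors or drops
ls -lh /mnt/pve/<storage>/dump/                # pick a large existing vzdump file
dd if=/mnt/pve/<storage>/dump/<backupfile> of=/dev/null bs=32k status=progress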


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Yes, it is NFS. No network problems. All other servers using this storage for backups run as fast as possible over a 1 Gbps network. I tested restoring the same VM from two other servers and the entire task finished in less than 30 minutes.
 
An iperf3 test still won't hurt. Maybe the Lenovo's NICs don't work well with the drivers PVE ships, so network problems might not affect the other nodes.
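For reference, a minimal iperf3 run would be (package "iperf3"; the IP is a placeholder):

iperf3 -s                       # on the NFS/backup server
iperf3 -c <backup-server-ip>    # on the SR635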
 
Yes, it is NFS. No network problems.
IMHO it's a little early to definitively state this. You've done an application test (vzdump) and it indicated slowness.
You should do a basic NFS test (cp/dd), followed by a network-layer test (iperf). The goal is to isolate the problem: is it vzdump, the disk, the NIC, the network, etc.
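To take NFS and the NIC out of the picture entirely, you could also baseline the local array with a plain sequential read, e.g. (device name is an assumption, check with lsblk; the read is non-destructive):

lsblk
dd if=/dev/sda of=/dev/null bs=1M count=4096 iflag=direct status=progress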


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Sorry for the delay, but I had to deal with other issues. Here is the update:
When I checked the restore task I found this surprise:
progress 100% (read 1303522574336 bytes, duration 58969 sec)
total bytes read 1303522574336, sparse bytes 505612627968 (38.8%)
space reduction due to 4K zero blocks 4.8%
rescan volumes...
TASK OK
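For reference: roughly 1.3 TB read in 58969 seconds (about 16.4 hours) works out to around 22 MB/s on average, far below the ~110 MB/s a saturated 1 Gbps link can deliver.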

As for the network test, I'm running it right now. I'll update this ASAP.

Thanks.
 
I'm running an SR635 with EPYC too and I don't have any problems (kernel 5.13, 2x M.2 SSD in RAID1, Mellanox NIC).

Maybe it is related to your RAID controller? LSI controllers are known to have some regressions with the 5.15 kernel (I have seen bugs with my Dell servers).
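If you want to check which driver the controller uses and whether the kernel complains about it, something like this should show it (the driver names are just common LSI ones, the output differs per system):

lspci -nnk | grep -iA3 raid
dmesg | grep -iE 'megaraid|mpt3sas'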
 
An iperf3 test still won't hurt. Maybe the Lenovo's NICs don't work well with the drivers PVE ships, so network problems might not affect the other nodes.
Here it is, and it looks OK considering the 1 Gb network:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 114 MBytes 958 Mbits/sec 0 419 KBytes
[ 5] 1.00-2.00 sec 112 MBytes 944 Mbits/sec 0 419 KBytes
[ 5] 2.00-3.00 sec 112 MBytes 938 Mbits/sec 0 419 KBytes
[ 5] 3.00-4.00 sec 113 MBytes 946 Mbits/sec 0 419 KBytes
[ 5] 4.00-5.00 sec 112 MBytes 936 Mbits/sec 0 437 KBytes
[ 5] 5.00-6.00 sec 112 MBytes 942 Mbits/sec 0 437 KBytes
[ 5] 6.00-7.00 sec 113 MBytes 946 Mbits/sec 0 437 KBytes
[ 5] 7.00-8.00 sec 112 MBytes 937 Mbits/sec 0 437 KBytes
[ 5] 8.00-9.00 sec 113 MBytes 948 Mbits/sec 0 437 KBytes
[ 5] 9.00-10.00 sec 112 MBytes 940 Mbits/sec 0 530 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.10 GBytes 943 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec receiver

iperf Done.
 
Maybe it is related to your RAID controller? LSI controllers are known to have some regressions with the 5.15 kernel (I have seen bugs with my Dell servers).
Others reported problems too, for example with the Lenovo 530-8i and the 5.15 kernel in the 'Proxmox 7.2 Upgrade Broke My RAID' thread: https://forum.proxmox.com/threads/proxmox-7-2-upgrade-broke-my-raid.109368/

Not sure how similar the 730-8i and 530-8i are.
 
