Hi folks,
I've been using Proxmox for years now and never had any problems ... until recently.
I rent a root server from a hosting provider and installed Proxmox on it: a vanilla Debian installation sits on top of software RAID, and I installed Proxmox on top of that. Everything was fine until last week, when one of the two hard drives failed. I had it replaced and shut everything down to give the RAID rebuild top priority, but even then it was quite slow and took a couple of days. After the rebuild completed I restarted my VMs and containers, and everything was really slow; iowait jumped to 80%. That was odd. I tried tuning some parameters, but nothing helped. With a simple dd I read from both disks and could see that the new disk was quite slow at reading.
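For reference, the read test was roughly this (block size and count are from memory; adding iflag=direct would bypass the page cache):
Bash:
# sequential read from each member disk (reads only; nothing is written)
dd if=/dev/sda of=/dev/null bs=1M count=1024
dd if=/dev/sdb of=/dev/null bs=1M count=1024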
To verify that, I booted into the rescue system, but I could not reproduce the problem there (it was an Ubuntu system, though I don't remember the exact kernel version). So I booted back into my Proxmox, and the problem was still there. I tried a couple of other things, but nothing helped. So I gave up and reinstalled the whole system, just to be sure. Even after a complete reinstallation the problem persists: with the new hard disk in the RAID-1 the system is terribly slow; when I mark it as faulty, the system is back to normal. I checked SMART, but there are no failures. Of course I also waited for the RAID sync to complete before trying to measure anything.
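In case it helps, this is roughly how I toggled the disk and checked it (assuming sdb3 is the md2 member; the actual partition name may differ):
Bash:
# confirm the resync has finished before measuring
cat /proc/mdstat
# take the new disk out of the array ...
mdadm /dev/md2 --fail /dev/sdb3    # sdb3 assumed; adjust to the real md2 member
mdadm /dev/md2 --remove /dev/sdb3
# ... and later re-add it
mdadm /dev/md2 --add /dev/sdb3
# SMART health on both drives
smartctl -a /dev/sda
smartctl -a /dev/sdb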
Long story short: with both drives in the RAID, the system is very slow. With the new drive disabled, the system behaves normally. Both drives show no SMART errors. I can't reproduce the behaviour with the Ubuntu rescue system. I tried the no-subscription kernel and the new 5.4 kernel, but the behaviour is the same.
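For completeness, the 5.4 kernel is the opt-in one; switching was roughly (assuming the PVE 6.x package name):
Bash:
# check which kernel is currently running
uname -r
# install the opt-in 5.4 kernel series, then reboot
apt install pve-kernel-5.4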
Some measurements:
# both drives in the raid => r_await and w_await > 100 for the loop device (my one running container)
Bash:
eholtz@titan719:~$ iostat -xyz 30 -c 1
Linux 5.4.24-1-pve (titan719) 03/18/2020 _x86_64_ (8 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.09   0.00     0.05     1.57    0.00   1.69

Device    r/s    w/s    rkB/s    wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
loop0   25.33   8.67   370.80    48.13    0.00    0.00   0.00   0.00   362.46   153.33   10.47     14.64      5.55   1.44   4.89
sda     15.93  29.77   375.20   205.87    0.03    7.83   0.21  20.83     4.04    14.13    0.43     23.55      6.92   1.49   6.79
md2     23.83  51.10   560.13   286.40    0.00    0.00   0.00   0.00     0.00     0.00    0.00     23.50      5.60   0.00   0.00
sdb      7.83  29.03   184.93   202.53    0.03    7.87   0.42  21.32    90.83    84.23    3.10     23.61      6.98   1.36   5.03
# new disk disabled => r_await and w_await < 10, same load situation as above
Bash:
eholtz@titan719:~$ iostat -xyz 30 -c 1
Linux 5.4.24-1-pve (titan719) 03/18/2020 _x86_64_ (8 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.72   0.00     0.19     0.15    0.00   2.19

Device    r/s    w/s    rkB/s    wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
loop0   87.90  27.60  1283.60   131.20    0.00    0.00   0.00   0.00     5.54     5.22    0.56     14.60      4.75   1.72  19.91
sda     82.17  61.63  1854.27   381.48    0.07   11.20   0.08  15.38     3.45    12.08    0.87     22.57      6.19   1.65  23.69
md2     82.20  69.97  1857.60   376.80    0.00    0.00   0.00   0.00     0.00     0.00    0.00     22.60      5.39   0.00   0.00
Do you have any idea what this could be?
Best regards,
Eike