Bug IOWait on Proxmox 7.2.11

Perzival

Oct 3, 2022
Hello,

I have a problem on my Proxmox 7.2.11 host.


I can no longer run a backup on my dedicated server, not even locally to a disk, because IOwait climbs to 100% and I then completely lose access to Proxmox for about 5 minutes.

On the other hand, I still have access to the VPSes hosted on this dedicated server, but they show almost 100% CPU usage with no software running on them.

I have no idea how to fix this. The problem started on September 27 and I thought it would pass, but nothing has changed.

This thread includes a view of the machine's status, and you will see that even when I am not running a backup, IOwait still climbs (not to 100%, but it usually sits below 1% when no backup is running).


Status page: https://status.winheberg.fr/report/uptime/77bb271c36e2af9cee6a40838ab60c31/

Machine technical specifications:

- Ryzen 9 3950X
- 128 GB DDR4 RAM
- 2x 2 TB Crucial MX500 in RAID 1 (the RAID 1 is managed by Proxmox)
- 2 TB HDD

Even after restarting the machine, nothing changes.

Also, most of the VPSes I host are LXC containers, and I have one KVM VM.

All the LXC containers run Debian 10 and the KVM VM runs Debian 11.



I would like to know whether this is a hardware or a software problem, and if it is software, where I can look to fix it (see the monitoring sketch further below).
 
Here is what is happening on the machine at the moment: IOwait is running at full speed while actual CPU usage is ridiculously low.
 

Attachment: Desktop Screenshot 2022.10.03 - 12.49.09.61 (2).png
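As a starting point, watching per-device and per-process I/O while the spike is happening usually shows where the wait is coming from. This is only a sketch and assumes the sysstat and iotop packages are installed:

Code:
# per-device utilisation, queue size and wait times, refreshed every second
iostat -x 1
# per-process read/write rates, refreshed every second
pidstat -d 1
# only show processes that are actually doing I/O right now
iotop -o -d 1

If the %util and await columns explode on the MX500 mirror while a vzdump or restore process dominates the write column, the problem is most likely on the disk side rather than the CPU.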
Maybe you hit an internal SSD limit based on writes. Those SSDs are known to be dead slow and not recommended for any non-desktop use.
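If you want to check whether the drives have actually been written to death, the SMART data will show it. A sketch only (smartmontools must be installed, /dev/sda is a placeholder for one of the MX500s, and the exact attribute names vary by vendor); the Disks panel in the Proxmox GUI also shows a wearout percentage:

Code:
# full SMART report for the drive, including lifetime writes and wear attributes
smartctl -a /dev/sda
# quick filter for the usual wear-related attributes (names differ per vendor)
smartctl -a /dev/sda | grep -Ei 'wear|lifetime|lbas_written'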
Most of the time I am able to write 4k at 60 MBps+ sustained with fio, or 300 MBps+ with sequential writes.

Attachment: Screenshot 2022-10-02 at 11.23.55.png

However, vzdump is doing something weird and making the SSD crawl at 300 KBps. I can't really tell whether it's due to the SSD, to the system, or to some unstable behaviour vzdump is triggering.
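One thing that might at least soften the impact while the root cause is being hunted down (a sketch, with example values only): vzdump can be throttled via /etc/vzdump.conf so a backup doesn't saturate the disk:

Code:
# /etc/vzdump.conf -- example values, tune to your hardware
# cap backup I/O bandwidth (value in KiB/s, so 100000 is roughly 100 MB/s)
bwlimit: 100000
# lower the I/O priority of the backup job (0-8, higher value = lower priority)
ionice: 8

This doesn't make the SSD any faster, it just keeps the backup from starving everything else on the host.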
 
If you want to see worst-case write performance, try:

Bash:
fio --filename=/dev/sdb --rw=write --direct=1 --bs=1M --numjobs=4 --ioengine=libaio --refill_buffers --size=75G --name=async_seq_write --group_reporting --iodepth=16 && fio --filename=/dev/sdb --rw=randwrite --direct=1 --bs=4K --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --time_based --runtime=120 --name=sync_rand_write && fio --filename=/dev/sdb --rw=write --direct=1 --bs=1M --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --time_based --runtime=120 --name=sync_seq_write
But keep in mind that this will directly write to the device, destroying data on it, so you want to run it on an empty SSD. And you should write a lot of data so the RAM and SLC cache get full, so the above test will first do 300GB of async sequential writes to fill up the caches and directly after that 2 minutes of sync random writes and 2 minutes of sync sequential writes.
Performance will most likely fall down into the low MB/s or even KB/s range.

And keep in mind that this test writes directly to the SSDs without a filesystem or a storage solution like ZFS, which would add additional overhead, crippling the performance even more. Also, a full SSD is usually slower than a new, empty one: the NAND gets fragmented over time, and the more data is on the SSD, the less space can be used for SLC caching (SSDs don't have dedicated SLC NAND chips, they just write to the TLC/QLC NAND in SLC mode).
 
I've never tested my drives with this specific kind of parameter set. It is brutal! A very, very cheap Intenso SSD (in a test/dummy machine) gets down to a single IOPS. My expensive Intel enterprise SSDs (mirrored) stay at 480 IOPS.
 
(quoting the worst-case fio write test from the post above)

My MX500 ran this OK in my opinion...

1st fio - 180 MBps, 180 IOPS
2nd fio - 1.3 MBps, 326 IOPS
3rd fio - 122 MBps, 122 IOPS

Bash:
root@pve:~# fio --filename=/dev/sdb --rw=write --direct=1 --bs=1M --numjobs=4 --ioengine=libaio --refill_buffers --size=75G --name=async_seq_write --group_reporting --iodepth=16 && fio --filename=/dev/sdb --rw=randwrite --direct=1 --bs=4K --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --time_based --runtime=120 --name=sync_rand_write && fio --filename=/dev/sdb --rw=write --direct=1 --bs=1M --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --time_based --runtime=120 --name=sync_seq_write
async_seq_write: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=16
...
fio-3.25
Starting 4 processes
Jobs: 3 (f=3): [W(3),_(1)][100.0%][w=138MiB/s][w=138 IOPS][eta 00m:00s]
async_seq_write: (groupid=0, jobs=4): err= 0: pid=357967: Mon Oct  3 21:22:18 2022
  write: IOPS=181, BW=181MiB/s (190MB/s)(300GiB/1694298msec); 0 zone resets
    slat (usec): min=9, max=619461, avg=21925.18, stdev=41263.19
    clat (msec): min=47, max=1436, avg=330.82, stdev=121.32
     lat (msec): min=54, max=1537, avg=352.75, stdev=122.83
    clat percentiles (msec):
     |  1.00th=[  188],  5.00th=[  188], 10.00th=[  190], 20.00th=[  251],
     | 30.00th=[  251], 40.00th=[  253], 50.00th=[  253], 60.00th=[  359],
     | 70.00th=[  477], 80.00th=[  485], 90.00th=[  493], 95.00th=[  502],
     | 99.00th=[  550], 99.50th=[  567], 99.90th=[  902], 99.95th=[  944],
     | 99.99th=[ 1334]
   bw (  KiB/s): min=32768, max=321847, per=100.00%, avg=185900.65, stdev=15922.07, samples=13542
   iops        : min=   32, max=  314, avg=181.43, stdev=15.54, samples=13542
  lat (msec)   : 50=0.01%, 100=0.01%, 250=21.13%, 500=73.38%, 750=5.28%
  lat (msec)   : 1000=0.19%, 2000=0.02%
  cpu          : usr=0.56%, sys=0.06%, ctx=77980, majf=0, minf=56
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,307200,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=181MiB/s (190MB/s), 181MiB/s-181MiB/s (190MB/s-190MB/s), io=300GiB (322GB), run=1694298-1694298msec

Disk stats (read/write):
  sdb: ios=402/614360, merge=0/0, ticks=44680/102269033, in_queue=102313714, util=100.00%
sync_rand_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=1296KiB/s][w=324 IOPS][eta 00m:00s]
sync_rand_write: (groupid=0, jobs=1): err= 0: pid=362637: Mon Oct  3 21:24:18 2022
  write: IOPS=326, BW=1304KiB/s (1335kB/s)(153MiB/120003msec); 0 zone resets
    clat (usec): min=1518, max=40827, avg=3055.96, stdev=1703.04
     lat (usec): min=1518, max=40827, avg=3056.15, stdev=1703.04
    clat percentiles (usec):
     |  1.00th=[ 1598],  5.00th=[ 1680], 10.00th=[ 1696], 20.00th=[ 1696],
     | 30.00th=[ 1713], 40.00th=[ 1778], 50.00th=[ 3687], 60.00th=[ 4015],
     | 70.00th=[ 4113], 80.00th=[ 4178], 90.00th=[ 4228], 95.00th=[ 4359],
     | 99.00th=[ 5211], 99.50th=[13304], 99.90th=[13829], 99.95th=[19006],
     | 99.99th=[40633]
   bw (  KiB/s): min= 1008, max= 1384, per=100.00%, avg=1305.37, stdev=59.40, samples=239
   iops        : min=  252, max=  346, avg=326.34, stdev=14.85, samples=239
  lat (msec)   : 2=46.34%, 4=9.64%, 10=43.06%, 20=0.91%, 50=0.04%
  cpu          : usr=0.72%, sys=0.57%, ctx=39124, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,39123,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1304KiB/s (1335kB/s), 1304KiB/s-1304KiB/s (1335kB/s-1335kB/s), io=153MiB (160MB), run=120003-120003msec

Disk stats (read/write):
  sdb: ios=75/39084, merge=0/0, ticks=55/118422, in_queue=118477, util=100.00%
sync_seq_write: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=128MiB/s][w=128 IOPS][eta 00m:00s]
sync_seq_write: (groupid=0, jobs=1): err= 0: pid=362984: Mon Oct  3 21:26:18 2022
  write: IOPS=122, BW=122MiB/s (128MB/s)(14.3GiB/120002msec); 0 zone resets
    clat (msec): min=4, max=466, avg= 8.08, stdev= 8.88
     lat (msec): min=4, max=466, avg= 8.08, stdev= 8.88
    clat percentiles (msec):
     |  1.00th=[    6],  5.00th=[    8], 10.00th=[    8], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[    8], 50.00th=[    8], 60.00th=[    8],
     | 70.00th=[    8], 80.00th=[    8], 90.00th=[    9], 95.00th=[   10],
     | 99.00th=[   18], 99.50th=[   18], 99.90th=[   47], 99.95th=[   64],
     | 99.99th=[  460]
   bw (  KiB/s): min=32768, max=137216, per=100.00%, avg=125022.26, stdev=14446.51, samples=239
   iops        : min=   32, max=  134, avg=122.09, stdev=14.11, samples=239
  lat (msec)   : 10=96.74%, 20=3.12%, 50=0.07%, 100=0.03%, 250=0.01%
  lat (msec)   : 500=0.04%
  cpu          : usr=1.57%, sys=0.77%, ctx=14649, majf=0, minf=14
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,14641,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=122MiB/s (128MB/s), 122MiB/s-122MiB/s (128MB/s-128MB/s), io=14.3GiB (15.4GB), run=120002-120002msec

Disk stats (read/write):
  sdb: ios=75/29254, merge=0/0, ticks=105/180192, in_queue=180298, util=100.00%

Still not as hard as a backup restore...

I'm still trying to understand why VM backups just hang the SSD...
Can't figure it out..

300 KBps, 50 IOPS.
I have been unable to restore VMs, since it would take more than a couple of weeks!
 
AFAIK, Proxmox backup uses 64k blocks, with iodepth=1, sequentially.

(But Crucial SSDs are pretty slow: around 300-400 IOPS with iodepth=1 at 4k, vs. 10,000-20,000 IOPS for a datacenter SSD, thanks to their supercapacitors and write cache.)
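To get a feel for what that access pattern does to a disk, one can roughly mimic it with fio. This is only a sketch of the pattern described above (64k blocks, queue depth 1, sequential writes), not what vzdump actually does internally, and like the earlier tests it writes straight to the raw device and destroys whatever is on it:

Code:
# WARNING: overwrites data on /dev/sdb -- only run this against an empty disk
fio --filename=/dev/sdb --name=backup_like --rw=write --bs=64k \
    --ioengine=psync --iodepth=1 --numjobs=1 --direct=1 \
    --refill_buffers --time_based --runtime=60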
 
Well, I think my problem comes from my Proxmox configuration, which makes any big operation on the disks impossible.

It's ZFS (the disks were put in RAID 1 by the Proxmox installer).

I'm going to have my entire dedicated server reset at the datacenter this weekend to see if I find a solution. I'll come back here if I still have a problem! :)
 
Note that with ZFS you shouldn't use consumer SSDs, because ZFS does synchronous writes for its journal.
I'm still learning how to use Proxmox, so I'm not immune to a few beginner mistakes, but thank you for that remark, because I didn't realize a specific kind of disk was needed to use ZFS.
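To see whether those sync writes on the ZFS pool are really what is hurting, a couple of things can be watched on the host while a backup or restore runs. A sketch only; 'rpool' is the default pool name created by the Proxmox installer, so adjust it if yours differs:

Code:
# live per-vdev bandwidth and IOPS for the pool, refreshed every second
zpool iostat -v rpool 1
# how the pool handles synchronous writes (standard / always / disabled)
zfs get sync rpool
# overall pool health and layout (confirms the two-disk mirror)
zpool status rpool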
 
(quoting the note above about Proxmox backup using 64k blocks at iodepth=1 and Crucial SSDs managing only ~300-400 IOPS)

Code:
root@pve:~# fio --filename=/dev/sdb --name=randfile --ioengine=libaio --iodepth=1 --rw=write --bs=4k --direct=1 --size=1G --numjobs=1 --group_reporting
randfile: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
^Cbs: 1 (f=1): [W(1)][6.4%][w=6866KiB/s][w=1716 IOPS][eta 02m:27s]
fio: terminating on signal 2

randfile: (groupid=0, jobs=1): err= 0: pid=381444: Mon Oct  3 23:16:35 2022
  write: IOPS=1685, BW=6743KiB/s (6905kB/s)(65.3MiB/9913msec); 0 zone resets
    slat (nsec): min=4043, max=40764, avg=7533.39, stdev=1793.83
    clat (usec): min=390, max=29329, avg=584.23, stdev=240.11
     lat (usec): min=399, max=29337, avg=591.93, stdev=240.09
    clat percentiles (usec):
     |  1.00th=[  437],  5.00th=[  469], 10.00th=[  490], 20.00th=[  562],
     | 30.00th=[  578], 40.00th=[  578], 50.00th=[  578], 60.00th=[  578],
     | 70.00th=[  578], 80.00th=[  578], 90.00th=[  586], 95.00th=[  898],
     | 99.00th=[  930], 99.50th=[  955], 99.90th=[ 1004], 99.95th=[ 1012],
     | 99.99th=[ 1909]
   bw (  KiB/s): min= 6344, max= 6928, per=100.00%, avg=6744.42, stdev=128.79, samples=19
   iops        : min= 1586, max= 1732, avg=1686.11, stdev=32.20, samples=19
  lat (usec)   : 500=10.48%, 750=83.95%, 1000=5.45%
  lat (msec)   : 2=0.13%, 50=0.01%
  cpu          : usr=2.06%, sys=4.53%, ctx=16712, majf=0, minf=14
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,16712,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=6743KiB/s (6905kB/s), 6743KiB/s-6743KiB/s (6905kB/s-6905kB/s), io=65.3MiB (68.5MB), run=9913-9913msec

Disk stats (read/write):
  sdb: ios=53/16686, merge=0/0, ticks=27/9349, in_queue=9375, util=99.35%

Must be something else.
Still getting 1,700 IOPS / 6 MBps+ with iodepth=1 at 4k.
Not the best result in the world, but even the worst SSD should be able to handle a backup under those conditions, right?


This is what happens with a backup...
How long will it take to restore a 1 TB VM...?
Attachment: KVqn1x1.png
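For rough scale, assuming the ~300 KBps figure above holds for a whole restore: 1 TB is about 10^12 bytes, and 10^12 bytes / 300,000 bytes per second is roughly 3.3 million seconds, i.e. around 38 days, which lines up with the "more than a couple of weeks" estimate earlier in the thread.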
 
