Bug IOWait on Proxmox 7.2.11

Perzival

Oct 3, 2022
Hello,

I have a problem on my Proxmox 7.2.11 host.


I can no longer run a backup on my dedicated server, not even locally to a disk, because IOwait climbs to 100% and I then completely lose access to Proxmox for about 5 minutes.

On the other hand, I still have access to the VPSes hosted on this dedicated server, but they show almost 100% CPU usage with no software running on them.

I have no idea how to fix this. The problem started on September 27 and I thought it would pass, but nothing has changed.

This thread includes a view of the machine's status, and you will see that even when I am not running a backup, IOwait still climbs (not to 100%, but it usually sits below 1% when no backup is running).


Status page: https://status.winheberg.fr/report/uptime/77bb271c36e2af9cee6a40838ab60c31/

Machine technical specifications:

- Ryzen 9 3950X
- 128 GB DDR4 RAM
- 2x 2 TB Crucial MX500 in RAID 1 (the RAID 1 is managed by Proxmox)
- 2 TB HDD

Even after restarting the machine, nothing changes.

Also, most of the VPSes I host are LXC containers, and I have one KVM VM.

All the LXC containers run Debian 10 and the KVM VM runs Debian 11.



I would like to know whether this is a hardware or a software problem, and if it is software, where I can look to fix it (see the monitoring sketch further below).
 
Here is what is happening on the machine at the moment: IOwait is running at full speed while actual CPU usage is ridiculously low.
 

Attachment: Desktop Screenshot 2022.10.03 - 12.49.09.61 (2).png
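As a starting point, watching per-device and per-process I/O while the spike is happening usually shows where the wait is coming from. This is only a sketch and assumes the sysstat and iotop packages are installed:

Code:
# per-device utilisation, queue size and wait times, refreshed every second
iostat -x 1
# per-process read/write rates, refreshed every second
pidstat -d 1
# only show processes that are actually doing I/O right now
iotop -o -d 1

If the %util and await columns explode on the MX500 mirror while a vzdump or restore process dominates the write column, the problem is most likely on the disk side rather than the CPU.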
Maybe you hit an internal SSD limit based on writes. Those SSDs are known to be dead slow and not recommended for any non-desktop use.
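If you want to check whether the drives have actually been written to death, the SMART data will show it. A sketch only (smartmontools must be installed, /dev/sda is a placeholder for one of the MX500s, and the exact attribute names vary by vendor); the Disks panel in the Proxmox GUI also shows a wearout percentage:

Code:
# full SMART report for the drive, including lifetime writes and wear attributes
smartctl -a /dev/sda
# quick filter for the usual wear-related attributes (names differ per vendor)
smartctl -a /dev/sda | grep -Ei 'wear|lifetime|lbas_written'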
Most of the time I am able to write 4k at 60 MBps+ sustained with fio, or 300 MBps+ with sequential writes.

Attachment: Screenshot 2022-10-02 at 11.23.55.png

However, vzdump is doing something weird and making the SSD crawl at 300 KBps. I can't really tell whether it's due to the SSD, to the system, or to some unstable behaviour vzdump is triggering.
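One thing that might at least soften the impact while the root cause is being hunted down (a sketch, with example values only): vzdump can be throttled via /etc/vzdump.conf so a backup doesn't saturate the disk:

Code:
# /etc/vzdump.conf -- example values, tune to your hardware
# cap backup I/O bandwidth (value in KiB/s, so 100000 is roughly 100 MB/s)
bwlimit: 100000
# lower the I/O priority of the backup job (0-8, higher value = lower priority)
ionice: 8

This doesn't make the SSD any faster, it just keeps the backup from starving everything else on the host.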
 
If you want to see worst-case write performance, try:

Bash:
fio --filename=/dev/sdb --rw=write --direct=1 --bs=1M --numjobs=4 --ioengine=libaio --refill_buffers --size=75G --name=async_seq_write --group_reporting --iodepth=16 && fio --filename=/dev/sdb --rw=randwrite --direct=1 --bs=4K --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --time_based --runtime=120 --name=sync_rand_write && fio --filename=/dev/sdb --rw=write --direct=1 --bs=1M --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --time_based --runtime=120 --name=sync_seq_write
But keep in mind that this will directly write to the device, destroying data on it, so you want to run it on an empty SSD. And you should write a lot of data so the RAM and SLC cache get full, so the above test will first do 300GB of async sequential writes to fill up the caches and directly after that 2 minutes of sync random writes and 2 minutes of sync sequential writes.
Performance will most likely fall down into the low MB/s or even KB/s range.

And keep in mind that this test writes directly to the SSDs without a filesystem or a storage solution like ZFS, which would add additional overhead, crippling the performance even more. Also, a full SSD is usually slower than a new, empty one: the NAND gets fragmented over time, and the more data is on the SSD, the less space can be used for SLC caching (SSDs don't have dedicated SLC NAND chips, they just write to the TLC/QLC NAND in SLC mode).
 
I've never tested my drives with this specific kind of parameter set. It is brutal! A very, very cheap Intenso SSD (in a test/dummy machine) gets down to a single IOPS. My expensive Intel enterprise SSDs (mirrored) stay at 480 IOPS.
 
(quoting the worst-case fio write test from the post above)

My MX500 ran this OK in my opinion...

1st fio - 180 MBps, 180 IOPS
2nd fio - 1.3 MBps, 326 IOPS
3rd fio - 122 MBps, 122 IOPS

Bash:
root@pve:~# fio --filename=/dev/sdb --rw=write --direct=1 --bs=1M --numjobs=4 --ioengine=libaio --refill_buffers --size=75G --name=async_seq_write --group_reporting --iodepth=16 && fio --filename=/dev/sdb --rw=randwrite --direct=1 --bs=4K --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --time_based --runtime=120 --name=sync_rand_write && fio --filename=/dev/sdb --rw=write --direct=1 --bs=1M --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --time_based --runtime=120 --name=sync_seq_write
async_seq_write: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=16
...
fio-3.25
Starting 4 processes
Jobs: 3 (f=3): [W(3),_(1)][100.0%][w=138MiB/s][w=138 IOPS][eta 00m:00s]
async_seq_write: (groupid=0, jobs=4): err= 0: pid=357967: Mon Oct  3 21:22:18 2022
  write: IOPS=181, BW=181MiB/s (190MB/s)(300GiB/1694298msec); 0 zone resets
    slat (usec): min=9, max=619461, avg=21925.18, stdev=41263.19
    clat (msec): min=47, max=1436, avg=330.82, stdev=121.32
     lat (msec): min=54, max=1537, avg=352.75, stdev=122.83
    clat percentiles (msec):
     |  1.00th=[  188],  5.00th=[  188], 10.00th=[  190], 20.00th=[  251],
     | 30.00th=[  251], 40.00th=[  253], 50.00th=[  253], 60.00th=[  359],
     | 70.00th=[  477], 80.00th=[  485], 90.00th=[  493], 95.00th=[  502],
     | 99.00th=[  550], 99.50th=[  567], 99.90th=[  902], 99.95th=[  944],
     | 99.99th=[ 1334]
   bw (  KiB/s): min=32768, max=321847, per=100.00%, avg=185900.65, stdev=15922.07, samples=13542
   iops        : min=   32, max=  314, avg=181.43, stdev=15.54, samples=13542
  lat (msec)   : 50=0.01%, 100=0.01%, 250=21.13%, 500=73.38%, 750=5.28%
  lat (msec)   : 1000=0.19%, 2000=0.02%
  cpu          : usr=0.56%, sys=0.06%, ctx=77980, majf=0, minf=56
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,307200,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=181MiB/s (190MB/s), 181MiB/s-181MiB/s (190MB/s-190MB/s), io=300GiB (322GB), run=1694298-1694298msec

Disk stats (read/write):
  sdb: ios=402/614360, merge=0/0, ticks=44680/102269033, in_queue=102313714, util=100.00%
sync_rand_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=1296KiB/s][w=324 IOPS][eta 00m:00s]
sync_rand_write: (groupid=0, jobs=1): err= 0: pid=362637: Mon Oct  3 21:24:18 2022
  write: IOPS=326, BW=1304KiB/s (1335kB/s)(153MiB/120003msec); 0 zone resets
    clat (usec): min=1518, max=40827, avg=3055.96, stdev=1703.04
     lat (usec): min=1518, max=40827, avg=3056.15, stdev=1703.04
    clat percentiles (usec):
     |  1.00th=[ 1598],  5.00th=[ 1680], 10.00th=[ 1696], 20.00th=[ 1696],
     | 30.00th=[ 1713], 40.00th=[ 1778], 50.00th=[ 3687], 60.00th=[ 4015],
     | 70.00th=[ 4113], 80.00th=[ 4178], 90.00th=[ 4228], 95.00th=[ 4359],
     | 99.00th=[ 5211], 99.50th=[13304], 99.90th=[13829], 99.95th=[19006],
     | 99.99th=[40633]
   bw (  KiB/s): min= 1008, max= 1384, per=100.00%, avg=1305.37, stdev=59.40, samples=239
   iops        : min=  252, max=  346, avg=326.34, stdev=14.85, samples=239
  lat (msec)   : 2=46.34%, 4=9.64%, 10=43.06%, 20=0.91%, 50=0.04%
  cpu          : usr=0.72%, sys=0.57%, ctx=39124, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,39123,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1304KiB/s (1335kB/s), 1304KiB/s-1304KiB/s (1335kB/s-1335kB/s), io=153MiB (160MB), run=120003-120003msec

Disk stats (read/write):
  sdb: ios=75/39084, merge=0/0, ticks=55/118422, in_queue=118477, util=100.00%
sync_seq_write: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=128MiB/s][w=128 IOPS][eta 00m:00s]
sync_seq_write: (groupid=0, jobs=1): err= 0: pid=362984: Mon Oct  3 21:26:18 2022
  write: IOPS=122, BW=122MiB/s (128MB/s)(14.3GiB/120002msec); 0 zone resets
    clat (msec): min=4, max=466, avg= 8.08, stdev= 8.88
     lat (msec): min=4, max=466, avg= 8.08, stdev= 8.88
    clat percentiles (msec):
     |  1.00th=[    6],  5.00th=[    8], 10.00th=[    8], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[    8], 50.00th=[    8], 60.00th=[    8],
     | 70.00th=[    8], 80.00th=[    8], 90.00th=[    9], 95.00th=[   10],
     | 99.00th=[   18], 99.50th=[   18], 99.90th=[   47], 99.95th=[   64],
     | 99.99th=[  460]
   bw (  KiB/s): min=32768, max=137216, per=100.00%, avg=125022.26, stdev=14446.51, samples=239
   iops        : min=   32, max=  134, avg=122.09, stdev=14.11, samples=239
  lat (msec)   : 10=96.74%, 20=3.12%, 50=0.07%, 100=0.03%, 250=0.01%
  lat (msec)   : 500=0.04%
  cpu          : usr=1.57%, sys=0.77%, ctx=14649, majf=0, minf=14
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,14641,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=122MiB/s (128MB/s), 122MiB/s-122MiB/s (128MB/s-128MB/s), io=14.3GiB (15.4GB), run=120002-120002msec

Disk stats (read/write):
  sdb: ios=75/29254, merge=0/0, ticks=105/180192, in_queue=180298, util=100.00%

Still not as hard as a backup restore...

I'm still trying to understand why VM backups just hang the SSD...
Can't figure it out..

300 KBps, 50 IOPS.
I have been unable to restore VMs, since it would take more than a couple of weeks!
 
AFAIK, Proxmox backup uses 64k blocks, with iodepth=1, sequentially.

(But Crucial SSDs are pretty slow: around 300-400 IOPS with iodepth=1 at 4k, vs. 10,000-20,000 IOPS for a datacenter SSD, thanks to their supercapacitors and write cache.)
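To get a feel for what that access pattern does to a disk, one can roughly mimic it with fio. This is only a sketch of the pattern described above (64k blocks, queue depth 1, sequential writes), not what vzdump actually does internally, and like the earlier tests it writes straight to the raw device and destroys whatever is on it:

Code:
# WARNING: overwrites data on /dev/sdb -- only run this against an empty disk
fio --filename=/dev/sdb --name=backup_like --rw=write --bs=64k \
    --ioengine=psync --iodepth=1 --numjobs=1 --direct=1 \
    --refill_buffers --time_based --runtime=60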
 
Well, I think my problem comes from my Proxmox configuration, which makes any big operation on the disks impossible.

It's ZFS (the disks were put in RAID 1 by the Proxmox installer).

I'm going to have my entire dedicated server reset at the datacenter this weekend to see if I find a solution. I'll come back here if I still have a problem! :)
 
Note that with ZFS you shouldn't use consumer SSDs, because ZFS does synchronous writes for its journal.
I'm still learning how to use Proxmox, so I'm not immune to a few beginner mistakes, but thank you for that remark, because I didn't realize a specific kind of disk was needed to use ZFS.
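To see whether those sync writes on the ZFS pool are really what is hurting, a couple of things can be watched on the host while a backup or restore runs. A sketch only; 'rpool' is the default pool name created by the Proxmox installer, so adjust it if yours differs:

Code:
# live per-vdev bandwidth and IOPS for the pool, refreshed every second
zpool iostat -v rpool 1
# how the pool handles synchronous writes (standard / always / disabled)
zfs get sync rpool
# overall pool health and layout (confirms the two-disk mirror)
zpool status rpool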
 
(quoting the note above about Proxmox backup using 64k blocks at iodepth=1 and Crucial SSDs managing only ~300-400 IOPS)

Code:
root@pve:~# fio --filename=/dev/sdb --name=randfile --ioengine=libaio --iodepth=1 --rw=write --bs=4k --direct=1 --size=1G --numjobs=1 --group_reporting
randfile: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
^Cbs: 1 (f=1): [W(1)][6.4%][w=6866KiB/s][w=1716 IOPS][eta 02m:27s]
fio: terminating on signal 2

randfile: (groupid=0, jobs=1): err= 0: pid=381444: Mon Oct  3 23:16:35 2022
  write: IOPS=1685, BW=6743KiB/s (6905kB/s)(65.3MiB/9913msec); 0 zone resets
    slat (nsec): min=4043, max=40764, avg=7533.39, stdev=1793.83
    clat (usec): min=390, max=29329, avg=584.23, stdev=240.11
     lat (usec): min=399, max=29337, avg=591.93, stdev=240.09
    clat percentiles (usec):
     |  1.00th=[  437],  5.00th=[  469], 10.00th=[  490], 20.00th=[  562],
     | 30.00th=[  578], 40.00th=[  578], 50.00th=[  578], 60.00th=[  578],
     | 70.00th=[  578], 80.00th=[  578], 90.00th=[  586], 95.00th=[  898],
     | 99.00th=[  930], 99.50th=[  955], 99.90th=[ 1004], 99.95th=[ 1012],
     | 99.99th=[ 1909]
   bw (  KiB/s): min= 6344, max= 6928, per=100.00%, avg=6744.42, stdev=128.79, samples=19
   iops        : min= 1586, max= 1732, avg=1686.11, stdev=32.20, samples=19
  lat (usec)   : 500=10.48%, 750=83.95%, 1000=5.45%
  lat (msec)   : 2=0.13%, 50=0.01%
  cpu          : usr=2.06%, sys=4.53%, ctx=16712, majf=0, minf=14
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,16712,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=6743KiB/s (6905kB/s), 6743KiB/s-6743KiB/s (6905kB/s-6905kB/s), io=65.3MiB (68.5MB), run=9913-9913msec

Disk stats (read/write):
  sdb: ios=53/16686, merge=0/0, ticks=27/9349, in_queue=9375, util=99.35%

Must be something else.
Still getting 1,700 IOPS / 6 MBps+ with iodepth=1 at 4k.
Not the best result in the world, but even the worst SSD should be able to handle a backup under those conditions, right?


This is what happens with a backup...
How long will it take to restore a 1 TB VM...?
Attachment: KVqn1x1.png
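For rough scale, assuming the ~300 KBps figure above holds for a whole restore: 1 TB is about 10^12 bytes, and 10^12 bytes / 300,000 bytes per second is roughly 3.3 million seconds, i.e. around 38 days, which lines up with the "more than a couple of weeks" estimate earlier in the thread.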
 
