[SOLVED] High IO delay and high load average by backup job

NeODarK

Member
Dec 3, 2020
Hi, I have Proxmox with an mSATA drive for the system and an SSD for the VMs... I also have a 250 GB HDD for backup tasks. Two days ago I started experiencing high IO delay and a load average of about 4.0. I run the backups at 4:30 am, and now it's 2 pm and a VM is still locked. Yesterday it was a different machine, which I deleted thinking it could be the cause, but it wasn't... today yet another VM is locked.

Could it be a faulty hard disk?

Below are the smartctl results, but I don't know how to interpret them...

Is any value on this disk faulty? If not, where could the issue be?

I hope to get some help from a user or staff member.

This is a copy/paste of the last lines of the task viewer for the backup job:
---------------------
Code:
INFO:  88% (88.0 GiB of 100.0 GiB) in 4m 49s, read: 2.6 GiB/s, write: 0 B/s
INFO:  91% (91.8 GiB of 100.0 GiB) in 4m 52s, read: 1.3 GiB/s, write: 230.7 KiB/s
INFO:  97% (97.1 GiB of 100.0 GiB) in 4m 55s, read: 1.8 GiB/s, write: 0 B/s
INFO: 100% (100.0 GiB of 100.0 GiB) in 4m 57s, read: 1.4 GiB/s, write: 0 B/s
INFO: backup is sparse: 95.06 GiB (95%) total zero data
INFO: transferred 100.00 GiB in 297 seconds (344.8 MiB/s)
INFO: stopping kvm after backup task
INFO: archive file size: 1.86GB
INFO: prune older backups with retention: keep-last=2
INFO: removing backup 'backup:backup/vzdump-qemu-108-2022_02_07-09_21_59.vma.zst'
INFO: pruned 1 backup(s) not covered by keep-retention policy
INFO: Finished Backup of VM 108 (00:04:59)
INFO: Backup finished at 2022-02-10 11:29:31
INFO: Starting Backup of VM 109 (lxc)
INFO: Backup started at 2022-02-10 11:29:31
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: CT Name: 109-Ubuntu-Pruebas
INFO: including mount point rootfs ('/') in backup
INFO: creating vzdump archive '/mnt/Disco250Gb/backup/dump/vzdump-lxc-109-2022_02_10-11_29_31.tar.zst'
--------------------------------------

Thanks!!


Here is the smartctl report of the HDD (the hard disk where I store the backups):
-------------------
Code:
root@pve1:~# smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.13.19-4-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Travelstar 5K320
Device Model:     Hitachi HTS543225L9A300
Serial Number:    090807FB8D00LJHBR1KA
LU WWN Device Id: 5 000cca 55ed36a5c
Firmware Version: FBEOC40C
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 3f
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Thu Feb 10 13:56:19 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  645) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 102) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   253   253   033    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   065   065   000    Old_age   Always       -       55611
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   076   076   000    Old_age   Always       -       10561
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       2002
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       138
193 Load_Cycle_Count        0x0012   089   089   000    Old_age   Always       -       114215
194 Temperature_Celsius     0x0002   203   203   000    Old_age   Always       -       27 (Min/Max 13/52)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
----------------------------------
 

Attachments

  • pve1-io-delay-issue.png
Hi, finally solved! I shut down the VMs one by one, and a Windows 10 machine turned out to be causing the backup issues and the IO delay... I deleted the Windows 10 machine and installed a Linux Mint machine in its place. Issue solved!!
 
Hi, the issue is back... it's not the Windows 10 machine after all. Again I have high IO delay and an unfinished backup task... can anyone help me with this?
 
It would be way easier to read your console output if you put it between CODE tags, so the formatting isn't lost; otherwise tables become unreadable.

The HDD has no uncorrectable/pending errors, so I guess the disk is fine. But you could start a long self-test using smartctl -t long /dev/sdb and then check some hours later whether that test finished successfully using smartctl -a /dev/sdb.

If "INFO: creating vzdump archive '/mnt/Disco250Gb/backup/dump/vzdump-lxc-109-2022_02_10-11_29_31.tar.zst'" was the last line shown by the backup task, that's quite normal: this step can take very long and won't give any feedback until the LXC is fully archived. Do you maybe have a very slow CPU?
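For reference, the self-test sequence suggested above looks like this (a command sketch; /dev/sdb is the backup disk in this thread, so adjust the device path to yours and run as root):

```shell
# Start a long (extended) SMART self-test; it runs in the background on the
# drive itself (~102 minutes for this disk, per the report above).
smartctl -t long /dev/sdb

# Some hours later, check the self-test log for the result:
smartctl -l selftest /dev/sdb

# ...or print the full report again, as suggested above:
smartctl -a /dev/sdb
```

A completed run shows up in the "SMART Self-test log" section, which in the report above still says "No self-tests have been logged."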
 
Hi, thanks for your prompt reply. I've fixed the thread and added the CODE tags... no, the CPU is not slow; this issue started on the 10th of February this year, and the machine had been working for two years without trouble, so the CPU can't be the problem... and the backup configuration is the same as always.

Regards!!
 
You should consider disabling energy saving for that HDD (you can do that using hdparm, for example setting the APM to 192). Looking at your SMART attributes, it parks the heads about 11 times per hour and spins the HDD up/down about 5 times per hour. This causes mechanical stress, and the disk may wear out faster. A HDD is only rated for a limited number of head parks/spinups, and yours is already down from 100% to 65% and 89% on those attributes.
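The hdparm change mentioned above could look like this (a sketch; /dev/sdb is the backup disk in this thread, so adjust it to yours; note the setting does not survive a reboot unless you also configure it in /etc/hdparm.conf on Debian-based systems like PVE):

```shell
# Show the drive's current Advanced Power Management (APM) level
hdparm -B /dev/sdb

# Set APM to 192: performance-leaning; values above 127 should discourage
# aggressive head parking and spindown (firmware permitting)
hdparm -B 192 /dev/sdb
```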
 
Hi, the current value is 254 (max performance)... I think that means it's always on, right?
 
In general, an APM value above 127 should prevent a HDD from spinning down. So it looks like your HDD's firmware is completely ignoring it, with 55611 spindowns in 10561 hours. My NAS HDDs with an APM of 192, for example, have only 81 spindowns in 21825 hours, and only 1611 head parks in 21825 hours, while you have 114215 head parks in half that time. So your disk is effectively working in a super power-save mode, maybe because it is a laptop disk that isn't supposed to run 24/7 in a server.
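Those per-hour figures can be sanity-checked from the raw SMART values quoted in this thread (problem disk vs. my NAS disk):

```shell
# Spindowns and head parks per power-on hour, from the SMART raw values above:
#   problem disk:        55611 spindowns, 114215 head parks, 10561 hours
#   NAS disk (APM 192):     81 spindowns,   1611 head parks, 21825 hours
awk 'BEGIN {
  printf "problem disk: %.1f spindowns/h, %.1f parks/h\n", 55611/10561, 114215/10561
  printf "nas disk:     %.3f spindowns/h, %.3f parks/h\n", 81/21825, 1611/21825
}'
```

That is roughly the 11 parks and 5 spinups per hour mentioned earlier, versus well under one per hour on the APM-192 disk.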
 
Yes, maybe this disk came from a laptop... I only use it to store backups, templates and ISOs. So you think changing the APM to 192 could solve this issue? I'll change it...

I have another Proxmox server with the same configuration and the same disk drive, and it has double the hours and half the head parks...

The strangest thing is that this behaviour started this month... without changing anything.

Thanks for your support!!
 
254 should be even less power-saving than 192, so changing it to 192 won't help much.
I don't think the backup problems are related to the power saving of your disk. I just noticed that the disk probably won't live that long with so many spindowns.

What does your node's summary report as the "IO delay" while you do backups? If the HDD has problems or is bottlenecking, the IO delay should be high.
 
Yes, I know, the hard disk is at minimum power saving... maybe the hard disk could be failing? Would it be a good idea to swap it for another one?
 
There is no standard for what a HDD will do with a low or high APM value. The general consensus is that APM 1 means highest energy savings and APM 255 highest performance. But how the HDD actually responds to the APM value is entirely defined by the firmware the manufacturer flashed onto it. If the manufacturer doesn't want you to disable spindown, the drive will just ignore high APM values.
 
Finally, I think I found the issue... on Saturday I changed the "backup" hard disk (the HDD that was spinning up and down so many times), and the next day I discovered that an SSD was failing (in this Proxmox box I have the system on an mSATA unit, an SSD for the VMs and CTs, and the HDD only for backups, templates and ISOs). An SSD with only 9 days of use... I swapped the SSD (for now) for a HDD and the system seems to be working fine. Maybe the "stuck" backups came from the failing SSD: Proxmox couldn't back up because the unit was failing, and that's why the CT stayed locked for so long.

Tomorrow I'll receive a new SSD...

Thanks @Dunuin for your support!!
 
Yeah, after some days running with the new SSD for the VMs and CTs and another HDD for the backups, there are no more delays and no more stuck machines...

The fix: a new SSD.

Regards!!
 
