Excessively high server load during backup, server unresponsive

George Michalopoulos

hello all,

the title says it all.
during backup, i get excessively high server load, which makes the server unresponsive.
the server is new at Hetzner, and the backup goes to a storage box.
i tried zstd, lzo, and even no compression, with the same result..

these are two screenshots from Zabbix, which i use to monitor the server.
 

Attachments:

  • Screenshot 2024-05-14 at 5.14.49 PM.png
  • Screenshot 2024-05-14 at 5.14.59 PM.png
i installed the server yesterday. it's decent, but i cannot take backups..
it has 2x 1TB NVMe in a mirror and a 6TB hard disk..
the storage box is attached via CIFS.
 

Attachments:

  • Screenshot 2024-05-14 at 5.24.07 PM.png
Your graph indicates high IO wait time:

I/O wait time is a metric used to measure the amount of time the CPU waits for disk I/O operations to complete. A high I/O wait time indicates an idle CPU and outstanding I/O requests—while it might not make a system unhealthy, it will limit the performance of the CPU.

The CPU’s I/O wait signifies that while no processes were in a runnable state, at least one I/O operation was in progress. In simple terms, I/O wait is the time spent by the CPU waiting for I/O completion.

I/O wait simply indicates the state of the CPU or CPU cores. High I/O wait means the CPU is idle while I/O requests are outstanding, but further investigation is needed to confirm the source and the effect.
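For anyone who wants to measure this without Zabbix, here is a minimal Python sketch that derives an I/O wait percentage from two samples of the aggregate `cpu` line in `/proc/stat`. The function names are made up for illustration, and the column order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice) is assumed to match the Linux proc documentation.

```python
def parse_cpu_line(line):
    """Return (iowait_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Column order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted inside user and nice,
    # so only the first eight columns are summed for the total.
    return values[4], sum(values[:8])


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    io_before, total_before = parse_cpu_line(before)
    io_after, total_after = parse_cpu_line(after)
    elapsed = total_after - total_before
    if elapsed <= 0:
        return 0.0
    return 100.0 * (io_after - io_before) / elapsed


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()
```

To use it, call `read_cpu_line()` twice a few seconds apart while the backup is running and pass both results to `iowait_percent()`. The value is an average over all cores, so a single core stuck waiting on I/O looks small on a machine with many cores.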

Here are a few possible causes of high I/O wait time:

  • Bottlenecks in the storage layer that cause the drive to take more time to respond to I/O requests
  • A queue of I/O requests in the storage layer that leads to an increase in latency
  • Block devices (such as physical disks) that are too slow or simply at saturation point
  • Processes that are in an uninterruptible sleep state
  • Processes that perform heavy read and write operations to the disk
  • Swapping to a partition or file caused by a RAM shortage on the host or guest operating system
  • Disk and network I/O operations, which are among the most common causes of system slowness
  • A slow disk or degraded RAID array that delays read and write operations
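The "uninterruptible sleep" item is worth checking directly, because on Linux the load average counts tasks in that state (shown as `D`) together with runnable ones. That is how a backup stuck on storage can produce a very high load average while the CPU is mostly idle. Below is a rough Python sketch that lists such processes by reading `/proc/<pid>/stat`. The helper names are hypothetical, and it only looks at each process's main thread, not at individual threads under `/proc/<pid>/task`.

```python
import os


def parse_stat(stat_text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split around the last ')'.
    left, right = stat_text.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    return int(pid), comm, right.split()[0]


def uninterruptible(stat_texts):
    """Keep only the tasks whose state is D (uninterruptible sleep)."""
    parsed = (parse_stat(text) for text in stat_texts)
    return [(pid, comm) for pid, comm, state in parsed if state == "D"]


def read_all_stats(proc="/proc"):
    """Read /proc/<pid>/stat for every process that still exists."""
    texts = []
    for entry in os.listdir(proc):
        if entry.isdigit():
            try:
                with open(os.path.join(proc, entry, "stat")) as f:
                    texts.append(f.read())
            except OSError:
                pass  # the process exited between listdir() and open()
    return texts
```

Running `uninterruptible(read_all_stats())` during a backup shows which processes are blocked. If the same names stay in the list for a long time, those are the ones waiting on the slow device or mount.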



https://www.site24x7.com/learn/linux/troubleshoot-high-io-wait.html

More info needed to troubleshoot this further, but assuming you are not running on a dedicated server, it might be some sort of IOPS limit. Contact Hetzner about this.
 
this is very strange..
three times during backup the load average went to 200+ and i had to do a hardware reset..
then i backed up all VMs one by one, and everything went fine.
then i ran the backup script for all of them at once, and everything went fine as well..
io delay stayed at 5-6% at most..
 

Attachments:

  • Screenshot 2024-05-15 at 2.22.01 AM.png
here you can see the graphs from Zabbix during the last VM-by-VM backup and the backup of all VMs..
 

Attachments:

  • Screenshot 2024-05-15 at 2.24.56 AM.png
  • Screenshot 2024-05-15 at 2.24.48 AM.png
Interesting. I read somewhere that this kind of issue can be caused by the RAID controller doing some sort of initialization after a virtual disk is created, which can take a while with bigger disks. If that was the cause here, hopefully the initialization has finished by now and the problem is gone.
 
well, i finished my debugging.. i got this server because it had 2x 1TB NVMe (to use for web servers, etc.) and 1x 6TB HDD for storage.
i created a VM on the mirrored NVMe drives, but after hours of debugging i found out that the problem was not with Proxmox itself, but with the VM running from NVMe.
i stopped the VM, moved it to the 6TB HDD, and have had no problems so far.
everything that was freezing the system yesterday can now be done without any problem.
in case anyone else has the same problem, here is some info on the hardware:
Code:
# lspci
00:00.0 Host bridge: Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:02.0 VGA compatible controller: Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630]
00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
00:17.0 SATA controller: Intel Corporation Cannon Lake PCH SATA AHCI Controller (rev 10)
00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #21 (rev f0)
00:1c.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #5 (rev f0)
00:1c.5 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #6 (rev f0)
00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0)
00:1f.0 ISA bridge: Intel Corporation Device a308 (rev 10)
00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
00:1f.5 Serial bus controller: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-V (rev 10)
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
03:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 04)
05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983

the disks:
Code:
Disk /dev/sda: 5.46 TiB, 6001175126016 bytes, 11721045168 sectors
Disk model: TOSHIBA MG04ACA6
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 12EB3767-7D92-4D1D-A4A4-F35D1F875400

Disk /dev/nvme0n1: 953.87 GiB, 1024209543168 bytes, 2000409264 sectors
Disk model: SAMSUNG MZVLB1T0HBLR-00000             
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x9336de4d

Disk /dev/nvme1n1: 953.87 GiB, 1024209543168 bytes, 2000409264 sectors
Disk model: SAMSUNG MZVLB1T0HBLR-00000             
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xe42de8e4
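To see which of the devices listed above is actually busy during a backup, two samples of `/proc/diskstats` can be compared. This is only a sketch with hypothetical function names. It assumes the standard column layout in which the tenth counter after the device name is the time spent doing I/O in milliseconds. A value near 100% is a strong hint on an HDD, but on NVMe drives, which serve many requests in parallel, it does not by itself prove saturation.

```python
def parse_diskstats(text):
    """Map device name -> milliseconds spent doing I/O, from /proc/diskstats text."""
    ticks = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 13:
            # Columns: major, minor, name, then the per-device counters.
            # The tenth counter (index 12) is time spent doing I/O, in ms.
            ticks[fields[2]] = int(fields[12])
    return ticks


def utilization(before, after, interval_s):
    """Percent of the interval each device was busy between two samples."""
    first = parse_diskstats(before)
    second = parse_diskstats(after)
    interval_ms = interval_s * 1000.0
    return {
        dev: 100.0 * (second[dev] - first[dev]) / interval_ms
        for dev in second
        if dev in first
    }


def read_diskstats(path="/proc/diskstats"):
    """Return the raw contents of /proc/diskstats."""
    with open(path) as f:
        return f.read()
```

Take one `read_diskstats()` sample, wait a known number of seconds, take a second one, and pass both to `utilization()` along with the interval. Comparing `sda` with `nvme0n1` and `nvme1n1` while a backup runs shows which side of the transfer is holding things up.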

and some screen captures of how high the load average can go..
 

Attachments:

  • Screenshot 2024-05-15 at 1.34.07 PM.png
  • Screenshot 2024-05-15 at 1.34.17 PM.png
