Proxmox server blocked by heavy I/O

holckj

Feb 14, 2021
Denmark
This morning, the VMs on my Proxmox server weren't available. I remember seeing a message like "No space left on device", but I'm not quite sure. I've tried several reboots. The machine boots, but the VMs and the GUI don't start.

df -h shows

Code:
Filesystem        Size  Used Avail Use% Mounted on
udev               16G     0   16G   0% /dev
tmpfs             3.2G  9.0M  3.2G   1% /run
rpool/ROOT/pve-1  512G  512G  4.8M 100% /
tmpfs              16G   34M   16G   1% /dev/shm
tmpfs             5.0M     0  5.0M   0% /run/lock
rpool             4.9M  128K  4.8M   3% /rpool
rpool/data        4.9M  128K  4.8M   3% /rpool/data
rpool/ROOT        4.9M  128K  4.8M   3% /rpool/ROOT
tmpfs             3.2G     0  3.2G   0% /run/user/0
/dev/fuse         128M   24K  128M   1% /etc/pve

When I start the hardware, the load is very high, and even simple commands like "ls" take a long time. There seems to be a lot of disk activity.

Here is what top shows

Code:
top - 09:32:55 up 35 min,  2 users,  load average: 4.36, 3.36, 3.53
Tasks: 213 total,   1 running, 212 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  0.2 sy,  0.0 ni, 29.6 id, 70.2 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31957.5 total,  30641.5 free,   1169.0 used,    147.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  30425.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
      2 root      20   0       0      0      0 S   0.3   0.0   0:04.74 kthreadd
     49 root      20   0       0      0      0 I   0.3   0.0   0:01.42 kworker/1:1-events
    274 root       0 -20       0      0      0 S   0.3   0.0   0:05.95 spl_dynamic_tas
    349 root       1 -19       0      0      0 S   0.3   0.0   0:02.27 z_wr_iss
    351 root       0 -20       0      0      0 S   0.3   0.0   0:14.49 z_wr_int
    476 root      20   0       0      0      0 D   0.3   0.0   0:01.61 txg_sync
  11522 root      20   0   10224   3656   2892 S   0.3   0.0   0:02.12 top
  21378 root      20   0  271232  86048   4076 S   0.3   0.3   0:01.36 pve-firewall
 114265 root      20   0   10228   3664   2888 R   0.3   0.0   0:00.01 top
      1 root      20   0  311828   8560   5472 S   0.0   0.0   0:03.94 systemd
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-events_highpri
      9 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq
     10 root      20   0       0      0      0 S   0.0   0.0   0:00.00 rcu_tasks_rude_
     11 root      20   0       0      0      0 S   0.0   0.0   0:00.00 rcu_tasks_trace
     12 root      20   0       0      0      0 S   0.0   0.0   0:00.03 ksoftirqd/0

Any help appreciated :)

Jesper, Denmark
 
OK, I managed to fix it, I hope. Unfortunately, the external disk I use for nightly backups wasn't mounted, so last night Proxmox placed the backup on the local filesystem, filling it completely. Removing the local backup file solved the problem.
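In case someone else ends up here with a full root filesystem: a rough sketch of how a stray backup file like this could be tracked down. The function name largest_files is made up for illustration, and /var/lib/vz/dump is only the usual dump directory of the default "local" storage, so treat that path as an assumption and adjust it to your setup.

```shell
#!/usr/bin/env bash
# List the largest files below a directory, biggest first.
# Output format: size in bytes, a tab, then the path.
# Assumptions: GNU find (for -printf); the default directory is a guess.
largest_files() {
  local dir="${1:-/var/lib/vz/dump}" count="${2:-10}"
  # -xdev keeps find on one filesystem, so other mounts are not scanned
  find "$dir" -xdev -type f -printf '%s\t%p\n' 2>/dev/null \
    | sort -rn \
    | head -n "$count"
}
```

Something like `largest_files /var/lib/vz/dump 5` would then show the five biggest files there.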
 
Great! Here's what I've now done:
Put this script in /usr/local/bin/backup_hook.sh (my external disk is mounted as /mnt/intenso)
Code:
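#!/usr/bin/env bash
# vzdump hook: abort the backup job if the external disk is not mounted.
# The spaces match the mount point as a whole field, so a path such as
# /mnt/intenso2 does not count as a match.
if ! grep -qs ' /mnt/intenso ' /proc/mounts; then
  # skip backup
  echo "Skipping backup - disk not mounted"
  exit 1
fi
# continue
echo "All ok"
exit 0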

Added a "script" line to /etc/pve/jobs.cfg

Code:
vzdump: 5d63c803ba05596f83d3f6e93851fcbcdfd51e03:1
        schedule 02:15
        compress zstd
        enabled 1
        mailnotification always
        mailto jesper.holck@xsxsxsxsxs.dk
        mode snapshot
        node sonja
        quiet 1
        script /usr/local/bin/backup_hook.sh
        storage intenso
        vmid 102

It seems to work :) It's probably also a good idea to check that the destination is writable and has enough free space, as VictorSTS suggests.
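A minimal sketch of what such a stricter check might look like, not something I have run in production. The names check_target, MOUNT_DIR and MIN_FREE_KB are my own, the 100 GiB threshold is arbitrary, and checking only in the "job-start" phase is an assumption about when it makes sense to abort.

```shell
#!/usr/bin/env bash
# Stricter pre-backup check (sketch). MOUNT_DIR and MIN_FREE_KB are
# illustrative values, not Proxmox defaults.
MOUNT_DIR="${MOUNT_DIR:-/mnt/intenso}"
MIN_FREE_KB="${MIN_FREE_KB:-104857600}"   # 100 GiB in KiB, arbitrary threshold

# Succeeds only if $1 is a mount point, is writable,
# and has at least $2 KiB of free space.
check_target() {
  local dir="$1" min_kb="$2" free_kb
  mountpoint -q "$dir" || { echo "Skipping backup - $dir not mounted"; return 1; }
  [ -w "$dir" ] || { echo "Skipping backup - $dir not writable"; return 1; }
  # df -P prints one line per filesystem; column 4 is the available space in KiB
  free_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$free_kb" -ge "$min_kb" ] || { echo "Skipping backup - only ${free_kb} KiB free on $dir"; return 1; }
  echo "All ok"
}

# vzdump passes the current phase as the first argument.
if [ "${1:-}" = "job-start" ]; then
  check_target "$MOUNT_DIR" "$MIN_FREE_KB" || exit 1
fi
```

The free-space threshold would have to be at least the size of a typical backup of the VMs in the job, otherwise the check passes and the disk still fills up.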
 