I/O Issue during backup

Inferno68

New Member
Apr 23, 2024
Hello all,

Recently I started having an issue, mainly during backups: Proxmox hangs, the management page becomes unavailable, the backup never completes, and everything basically stops. A hard reboot is the only way out.

The machine is a Minisforum MS-01 with the i9-13900H.

From what I saw, I had high I/O of 45% and then nothing... I guess it is storage related. The boot drive is a cheap 1 TB NVMe from Lexar; it also hosts the storage for a few low-use LXCs (Pi-hole, cloudflared, WireGuard server, ...) and one low-use VM.

The storage NVMe for all other VMs is a 4 TB Samsung 990 Pro (Docker, Plex, a Windows VM, Roon, Home Assistant, TrueNAS for testing, a Minecraft server).

The backup ran in snapshot mode for all the VMs and LXCs.

Do you know, or can you point me to, where the issue may be? Thanks!

See the relevant part of the log; there is nothing after it, till it reboots (nvme1 is the Lexar drive):

May 11 02:00:06 pve pvescheduler[350646]: INFO: starting new backup job: vzdump 100 101 102 104 105 106 107 999 --quiet 1 --notes-template '{{guestname}}' --mailnotification failure --bwlimit 204800 --compress zstd --storage NAS --node pve --fleecing 0 --mode snapshot --prune-backups 'keep-daily=7,keep-monthly=2,keep-yearly=1'
May 11 02:00:06 pve pvescheduler[350646]: INFO: Starting Backup of VM 100 (qemu)
May 11 02:12:57 pve pvescheduler[350646]: INFO: Finished Backup of VM 100 (00:12:51)
May 11 02:12:57 pve pvescheduler[350646]: INFO: Starting Backup of VM 101 (qemu)
May 11 02:17:01 pve CRON[358679]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
May 11 02:17:01 pve CRON[358680]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
May 11 02:17:01 pve CRON[358679]: pam_unix(cron:session): session closed for user root
May 11 02:21:30 pve pvescheduler[350646]: INFO: Finished Backup of VM 101 (00:08:33)
May 11 02:21:30 pve pvescheduler[350646]: INFO: Starting Backup of VM 102 (qemu)
May 11 02:23:43 pve pve-firewall[1946]: firewall update time (7.329 seconds)
May 11 02:23:44 pve pvestatd[1954]: status update time (7.112 seconds)
May 11 02:24:25 pve pve-firewall[1946]: firewall update time (19.170 seconds)
May 11 02:24:26 pve pvestatd[1954]: status update time (19.386 seconds)
May 11 02:24:30 pve pve-ha-lrm[1990]: loop take too long (32 seconds)
May 11 02:24:44 pve pvescheduler[362733]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
May 11 02:24:44 pve pve-firewall[1946]: firewall update time (9.160 seconds)
May 11 02:24:45 pve pvestatd[1954]: status update time (8.960 seconds)
May 11 02:29:11 pve pvescheduler[350646]: INFO: Finished Backup of VM 102 (00:07:41)
May 11 02:29:12 pve pvescheduler[350646]: INFO: Starting Backup of VM 104 (qemu)
May 11 02:29:35 pve pve-firewall[1946]: firewall update time (9.897 seconds)
May 11 02:29:35 pve systemd[1]: Starting pve-daily-update.service - Daily PVE download activities...
May 11 02:30:03 pve pve-firewall[1946]: firewall update time (18.145 seconds)
May 11 02:30:06 pve pvestatd[1954]: status update time (39.916 seconds)
May 11 02:30:07 pve pveupdate[367530]: <root@pam> starting task UPID:pve:00059C41:004FA0CC:663EBC0F:aptupdate::root@pam:
May 11 02:30:21 pve pvestatd[1954]: status update time (5.479 seconds)
May 11 02:30:46 pve pveproxy[303815]: detected empty handle
May 11 02:30:46 pve pve-firewall[1946]: firewall update time (13.551 seconds)
May 11 02:30:52 pve kernel: nvme nvme1: I/O tag 109 (006d) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:524288
-- Reboot --
 
Looks like Proxmox cannot write to the NVMe (where it is installed) at least once and then the drive probably disappears (or simply refuses all writes). When something like that happens, no more logs can be written and Proxmox then usually crashes/freezes quite quickly.
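To see how the stall builds up before the NVMe timeout, the slow-update warnings and the kernel timeout can be pulled out of the journal text. This is only a rough sketch: the function name `summarize` and both regular expressions are my own and are written against the exact line shapes in the log above, so other message formats may not match.

```python
import re

# Matches lines such as:
#   May 11 02:30:06 pve pvestatd[1954]: status update time (39.916 seconds)
SLOW_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ (?P<unit>[\w-]+)\[\d+\]: "
    r"(?:status|firewall) update time \((?P<secs>[\d.]+) seconds\)"
)

# Matches the kernel's NVMe command timeout message, e.g.:
#   May 11 02:30:52 pve kernel: nvme nvme1: I/O tag 109 ... timeout, aborting ...
NVME_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: nvme (?P<dev>nvme\d+): "
    r".*timeout"
)


def summarize(lines):
    """Return (slow_updates, nvme_timeouts) found in an iterable of log lines.

    slow_updates  -> list of (timestamp, daemon, seconds)
    nvme_timeouts -> list of (timestamp, device)
    """
    slow = []
    timeouts = []
    for line in lines:
        m = SLOW_RE.match(line)
        if m:
            slow.append((m["ts"], m["unit"], float(m["secs"])))
            continue
        m = NVME_RE.match(line)
        if m:
            timeouts.append((m["ts"], m["dev"]))
    return slow, timeouts
```

One way to use it: save the previous boot's journal to a text file (for example with `journalctl -b -1 > journal.txt`), then call `summarize(open("journal.txt"))` and check whether the update times climb in the minutes before the first timeout, and which device the timeout names.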
Either the NVMe drive is bad (media or controller) or it cannot handle the many writes that Proxmox does (and maybe overheats?), or there is an underlying problem with the PCIe bus or M.2 connector.
Maybe check for a firmware update for the drive (and/or the motherboard), and/or test with a different (enterprise) NVMe drive. Make sure not to use a QLC drive!
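Before swapping hardware, it may be worth looking at what the drive reports about itself (temperature, wear, media errors). The sketch below checks a parsed JSON report of the kind `smartctl -j -a /dev/nvme1` produces. The key names under `nvme_smart_health_information_log` are what I expect from smartmontools 7.x and should be verified against your own output; the function name and both thresholds (70 °C, 90 % used) are arbitrary assumptions of mine, not vendor limits.

```python
def nvme_health_warnings(report, max_temp_c=70, max_percent_used=90):
    """Return a list of warning strings for a parsed smartctl JSON report.

    `report` is the dict obtained from json.load() on smartctl's JSON output.
    Missing keys are treated as "no data" and produce no warning.
    """
    # Assumed key name; check it against your own smartctl output.
    log = report.get("nvme_smart_health_information_log", {})
    warnings = []

    if log.get("critical_warning", 0) != 0:
        warnings.append("critical_warning flag is set")

    if log.get("media_errors", 0) > 0:
        warnings.append(f"media_errors = {log['media_errors']}")

    temp = log.get("temperature")
    if temp is not None and temp >= max_temp_c:
        warnings.append(f"temperature = {temp} C")

    used = log.get("percentage_used")
    if used is not None and used >= max_percent_used:
        warnings.append(f"percentage_used = {used}")

    spare = log.get("available_spare")
    threshold = log.get("available_spare_threshold")
    if spare is not None and threshold is not None and spare <= threshold:
        warnings.append(f"available_spare = {spare} (threshold {threshold})")

    return warnings
```

For example, after `smartctl -j -a /dev/nvme1 > smart.json`, load the file with `json.load(open("smart.json"))` and pass the result in. An empty list does not prove the drive is fine (a controller that drops off the bus under load can still report clean SMART data), but a non-empty one is a strong hint.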
 
Thanks for the hints. Do you have a brand/model of drive I could use (affordable, ~300€)? It's for a homelab only. Unfortunately, refurbished ones are not common on eBay in Europe.
 
I was very impressed by the Micron 7450 (it has PLP and therefore much better sync writes/sec) for ZFS, but I also have a Samsung 970 EVO (a consumer TLC drive) that is still working. No guarantees though, as every brand/model can have an unlucky production run.

EDIT: All this has been discussed before on this forum, but I don't know the magic keyword to find those posts. Maybe search for QLC, as those threads always contain an answer to this same question.
 
I finally found 2x Samsung PM9A3 960 GB on eBay; it looks like Samsung drives are the only enterprise NVMe there... They should arrive in a few weeks. Let's see!

Ah yeah, I googled my current drive and it is QLC...
 