Woke up to 3 VMs with "running (io-error)" this morning

jaxjexjox

Anyone seen this before?
Disk isn't out of space.

Code:
root@proxmox:~# pveversion -v
proxmox-ve: 8.0.2 (running kernel: 6.2.16-8-pve)
pve-manager: 8.0.4 (running version: 8.0.4/d258a813cfa6b390)
proxmox-kernel-helper: 8.0.3
pve-kernel-5.15: 7.4-4
pve-kernel-5.13: 7.1-9
proxmox-kernel-6.2.16-8-pve: 6.2.16-8
proxmox-kernel-6.2: 6.2.16-8
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.39-3-pve: 5.15.39-3
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.4
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.7
libpve-guest-common-perl: 5.0.4
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.5
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.2-1
proxmox-backup-file-restore: 3.0.2-1
proxmox-kernel-helper: 8.0.3
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.2
proxmox-widget-toolkit: 4.0.6
pve-cluster: 8.0.3
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.5
pve-qemu-kvm: 8.0.2-4
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1




Here's some stuff from the syslog.


Code:
2023-09-01T03:56:44.361173+10:00 proxmox pve-firewall[1754]: firewall update time (11.019 seconds)
2023-09-01T03:56:44.519520+10:00 proxmox pvestatd[1758]: status update time (11.188 seconds)
2023-09-01T03:56:44.574173+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/103: -1
2023-09-01T03:56:44.574808+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
2023-09-01T03:56:44.601931+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/proxmox/local-zfs: -1
2023-09-01T03:56:44.602535+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/proxmox/local: -1
2023-09-01T04:01:09.591368+10:00 proxmox pvestatd[1758]: status update time (15.293 seconds)
2023-09-01T04:01:09.754211+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/103: -1
2023-09-01T04:01:09.778434+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
2023-09-01T04:01:09.878180+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/proxmox/local: -1
2023-09-01T04:01:09.878548+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/proxmox/local-zfs: -1
2023-09-01T04:01:09.916901+10:00 proxmox pve-firewall[1754]: firewall update time (15.578 seconds)
2023-09-01T04:01:40.283266+10:00 proxmox pvestatd[1758]: status update time (10.838 seconds)
2023-09-01T04:01:40.284540+10:00 proxmox pve-firewall[1754]: firewall update time (11.018 seconds)
2023-09-01T04:03:07.712582+10:00 proxmox pvestatd[1758]: status update time (7.272 seconds)
2023-09-01T04:03:07.809692+10:00 proxmox pve-firewall[1754]: firewall update time (7.515 seconds)
2023-09-01T04:03:47.358090+10:00 proxmox pve-firewall[1754]: firewall update time (7.342 seconds)
2023-09-01T04:03:47.990995+10:00 proxmox pvestatd[1758]: status update time (7.049 seconds)
2023-09-01T04:04:56.866241+10:00 proxmox smartd[1276]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 60
2023-09-01T04:05:38.030596+10:00 proxmox pvestatd[1758]: status update time (17.180 seconds)
2023-09-01T04:05:38.069165+10:00 proxmox pve-firewall[1754]: firewall update time (7.290 seconds)
2023-09-01T04:06:15.407806+10:00 proxmox pvestatd[1758]: status update time (17.284 seconds)
2023-09-01T04:06:15.464802+10:00 proxmox pve-firewall[1754]: firewall update time (15.169 seconds)
2023-09-01T04:06:56.413818+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:01.414634+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:06.415211+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:11.416507+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:14.878861+10:00 proxmox pvescheduler[3352245]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
2023-09-01T04:07:14.879160+10:00 proxmox pvescheduler[3352214]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
2023-09-01T04:07:16.417971+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:21.419449+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:26.420746+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:31.422229+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:36.423726+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:41.425171+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:46.426690+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:51.428221+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:56.429633+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:01.431113+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:06.432402+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:11.433900+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:14.895149+10:00 proxmox pvescheduler[3377383]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
2023-09-01T04:08:14.897000+10:00 proxmox pvescheduler[3377382]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
2023-09-01T04:08:16.435202+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:21.436638+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:26.437885+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:31.439163+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:36.440527+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
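To pin down when the storage started failing, it can help to pull the first `Input/output error` line out of the log. A minimal sketch, assuming the log text is fed in on stdin with the timestamp as the first field, as in the excerpt above; the function name `first_io_error` is made up for illustration:

```shell
#!/bin/sh
# Print the timestamp (first whitespace-separated field) of the first
# log line that mentions "Input/output error". Reads log text from stdin.
first_io_error() {
    awk '/Input\/output error/ { print $1; exit }'
}
```

Usage would be something like `first_io_error < syslog.txt`, where `syslog.txt` is a hypothetical saved copy of the log. For the excerpt above it would point at 04:06:56, a few minutes after the status updates started slowing down.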

I've rebooted one of the VMs and it's back up and working fine, but this obviously isn't ideal.
 
I also get this error when I search for nvme using dmesg | grep nvme.
This may not be a major issue though. I can't be sure whether it has always done that.



Code:
[27082.940642] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[27082.941333] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[27082.942037] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[27573.689975] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[27573.690708] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[27573.691502] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[27660.889033] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[27660.889759] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[27660.890477] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[28194.510301] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[28194.511066] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[28194.511839] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[81324.699374] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[81324.700126] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[81324.700882] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[113714.604486] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[113714.605275] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[113714.606104] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[113916.412812] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[113916.413648] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[113916.414508] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[114828.372176] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[114828.372990] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[114828.373819] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[116044.666102] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[116044.666950] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[116044.667811] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[116082.128428] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[116082.129275] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[116082.130055] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
root@proxmox:~#
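The output above shows ten corrected errors over roughly 116000 seconds of uptime. To keep an eye on whether that rate changes, the errors can be counted per PCI address. A rough sketch, assuming kernel log text in the same format on stdin; `count_corrected_aer` is a hypothetical helper name:

```shell
#!/bin/sh
# Count corrected PCIe bus errors per PCI address in kernel log text on stdin.
# Prints one "<address> <count>" line per device.
count_corrected_aer() {
    awk '/PCIe Bus Error: severity=Corrected/ {
             # the PCI address is the field right before "PCIe", minus the trailing colon
             for (i = 2; i <= NF; i++)
                 if ($i == "PCIe") {
                     addr = $(i - 1)
                     sub(/:$/, "", addr)
                     n[addr]++
                     break
                 }
         }
         END { for (a in n) print a, n[a] }'
}
```

It could be run as `dmesg | count_corrected_aer`. Corrected errors are recovered by the link itself, so the count is mainly useful as a trend.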
 
If it's an SSD, check the wear level / SMART values.
ZFS wears out SSDs quickly if they are not PLP/datacenter SSDs.
 
Maybe remove and reconnect the NVMe drive a few times and check the connector for corrosion (reseating can help remove it)? Maybe dial back the PCIe generation/speed on the M.2 slot via the motherboard BIOS? Check the SMART values for wear and errors. The drive may not be one of the better ones they manufactured, and/or it may already be dying, so make sure your backups are working.
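For the SMART check, the wear and error fields are the interesting part of the report. A small sketch that filters them out, assuming the field names smartmontools normally prints for NVMe drives and the report text on stdin; `nvme_health_summary` is a made-up name:

```shell
#!/bin/sh
# Keep only the wear- and error-related lines of an NVMe SMART report on stdin.
# The field names are assumed to match smartctl's NVMe health output.
nvme_health_summary() {
    grep -E '^(Critical Warning|Available Spare|Percentage Used|Media and Data Integrity Errors|Error Information Log Entries):'
}
```

Something like `smartctl -a /dev/nvme0 | nvme_health_summary` would feed it, where `/dev/nvme0` is an assumption about the device name. A high "Percentage Used" or a non-zero "Media and Data Integrity Errors" would point at the drive rather than the slot.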
 