Woke up to 3 VMs with "running (io-error)" this morning

jaxjexjox

Member
Mar 25, 2022
Anyone seen this before?
Disk isn't out of space.
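
For reference, the free-space claim can be double-checked directly on the ZFS-backed storage. A minimal sketch assuming the installer-default pool name rpool; adjust the pool name to match your setup:

Code:
# pool capacity plus read/write/checksum error counters
zpool list rpool
zpool status rpool
# per-dataset usage, including the VM disk zvols
zfs list -o name,used,avail,refer -r rpool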

Code:
root@proxmox:~# pveversion -v
proxmox-ve: 8.0.2 (running kernel: 6.2.16-8-pve)
pve-manager: 8.0.4 (running version: 8.0.4/d258a813cfa6b390)
proxmox-kernel-helper: 8.0.3
pve-kernel-5.15: 7.4-4
pve-kernel-5.13: 7.1-9
proxmox-kernel-6.2.16-8-pve: 6.2.16-8
proxmox-kernel-6.2: 6.2.16-8
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.39-3-pve: 5.15.39-3
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.4
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.7
libpve-guest-common-perl: 5.0.4
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.5
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.2-1
proxmox-backup-file-restore: 3.0.2-1
proxmox-kernel-helper: 8.0.3
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.2
proxmox-widget-toolkit: 4.0.6
pve-cluster: 8.0.3
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.5
pve-qemu-kvm: 8.0.2-4
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1

Here's the relevant part of the syslog:

Code:
2023-09-01T03:56:44.361173+10:00 proxmox pve-firewall[1754]: firewall update time (11.019 seconds)
2023-09-01T03:56:44.519520+10:00 proxmox pvestatd[1758]: status update time (11.188 seconds)
2023-09-01T03:56:44.574173+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/103: -1
2023-09-01T03:56:44.574808+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
2023-09-01T03:56:44.601931+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/proxmox/local-zfs: -1
2023-09-01T03:56:44.602535+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/proxmox/local: -1
2023-09-01T04:01:09.591368+10:00 proxmox pvestatd[1758]: status update time (15.293 seconds)
2023-09-01T04:01:09.754211+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/103: -1
2023-09-01T04:01:09.778434+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
2023-09-01T04:01:09.878180+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/proxmox/local: -1
2023-09-01T04:01:09.878548+10:00 proxmox pmxcfs[1618]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/proxmox/local-zfs: -1
2023-09-01T04:01:09.916901+10:00 proxmox pve-firewall[1754]: firewall update time (15.578 seconds)
2023-09-01T04:01:40.283266+10:00 proxmox pvestatd[1758]: status update time (10.838 seconds)
2023-09-01T04:01:40.284540+10:00 proxmox pve-firewall[1754]: firewall update time (11.018 seconds)
2023-09-01T04:03:07.712582+10:00 proxmox pvestatd[1758]: status update time (7.272 seconds)
2023-09-01T04:03:07.809692+10:00 proxmox pve-firewall[1754]: firewall update time (7.515 seconds)
2023-09-01T04:03:47.358090+10:00 proxmox pve-firewall[1754]: firewall update time (7.342 seconds)
2023-09-01T04:03:47.990995+10:00 proxmox pvestatd[1758]: status update time (7.049 seconds)
2023-09-01T04:04:56.866241+10:00 proxmox smartd[1276]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 60
2023-09-01T04:05:38.030596+10:00 proxmox pvestatd[1758]: status update time (17.180 seconds)
2023-09-01T04:05:38.069165+10:00 proxmox pve-firewall[1754]: firewall update time (7.290 seconds)
2023-09-01T04:06:15.407806+10:00 proxmox pvestatd[1758]: status update time (17.284 seconds)
2023-09-01T04:06:15.464802+10:00 proxmox pve-firewall[1754]: firewall update time (15.169 seconds)
2023-09-01T04:06:56.413818+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:01.414634+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:06.415211+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:11.416507+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:14.878861+10:00 proxmox pvescheduler[3352245]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
2023-09-01T04:07:14.879160+10:00 proxmox pvescheduler[3352214]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
2023-09-01T04:07:16.417971+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:21.419449+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:26.420746+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:31.422229+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:36.423726+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:41.425171+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:46.426690+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:51.428221+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:07:56.429633+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:01.431113+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:06.432402+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:11.433900+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:14.895149+10:00 proxmox pvescheduler[3377383]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
2023-09-01T04:08:14.897000+10:00 proxmox pvescheduler[3377382]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
2023-09-01T04:08:16.435202+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:21.436638+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:26.437885+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:31.439163+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error
2023-09-01T04:08:36.440527+10:00 proxmox pve-ha-lrm[1825]: unable to write lrm status file - unable to delete old temp file: Input/output error

I've rebooted one of the VMs and it's back up and working fine, but this obviously isn't ideal.
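
As a side note, a VM showing "running (io-error)" has typically been paused by QEMU after a failed disk operation rather than crashed, so once the underlying storage issue is cleared it can often be resumed without a full reboot. A sketch, using VM ID 102 from the logs above as an example:

Code:
# check whether the guest is paused on an I/O error
qm status 102 --verbose
# resume it once the storage problem is resolved
qm resume 102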
 
I also see these errors when I search for nvme with dmesg | grep nvme.
This may not be a major issue, though; I can't be sure whether it has always done that.

Code:
[27082.940642] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[27082.941333] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[27082.942037] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[27573.689975] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[27573.690708] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[27573.691502] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[27660.889033] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[27660.889759] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[27660.890477] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[28194.510301] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[28194.511066] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[28194.511839] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[81324.699374] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[81324.700126] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[81324.700882] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[113714.604486] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[113714.605275] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[113714.606104] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[113916.412812] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[113916.413648] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[113916.414508] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[114828.372176] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[114828.372990] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[114828.373819] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[116044.666102] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[116044.666950] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[116044.667811] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
[116082.128428] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[116082.129275] nvme 0000:01:00.0:   device [1c5c:174a] error status/mask=00000001/00000000
[116082.130055] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
root@proxmox:~#
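
For reference, corrected Physical Layer errors like these are retried in hardware, but frequent bursts of them can point at a marginal link. The negotiated PCIe speed and width for the drive can be read from lspci (the 01:00.0 address is taken from the dmesg output above); a link that has trained down from its rated speed would suggest a signalling problem:

Code:
lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkSta:'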
 
Maybe remove and reseat the NVMe drive a few times (to rub off any oxidation) and check the connector for corrosion? Maybe dial back the PCIe generation/speed for the M.2 slot in the motherboard BIOS? Check the SMART values for wear and errors. Maybe the drive is not the best one they ever manufactured and/or it is already dying, so better make sure your backups are working.
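
In concrete terms, the SMART check suggested above could look something like this; the /dev/nvme0 device path is an assumption, so pick the right one from ls /dev/nvme*:

Code:
# wear level, media errors and the overall health assessment
smartctl -a /dev/nvme0
# optionally run a short self-test, if the drive supports it
smartctl -t short /dev/nvme0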
 