Crash VM with io-error

COndor2k

New Member
Feb 19, 2025
Hi,

Sorry, I'm very new to Proxmox. I've tried to find similar posts on the forum about my problem, but I'm getting nowhere.
Can someone please help me out before I reinstall everything, only to find that the problem reappears?
Everything has been running smoothly for 2-3 months, and now my one and only VM crashes on startup and I get an io-error triangle in the UI.

root@proxmox:~# pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-8-pve)
pve-manager: 8.3.4 (running version: 8.3.4/65224a0f9cd294a3)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-8
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-7-pve-signed: 6.8.12-7
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.2.0
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.3-1
proxmox-backup-file-restore: 3.3.3-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.4
pve-cluster: 8.0.10
pve-container: 5.2.4
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.3.3
pve-qemu-kvm: 9.0.2-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.8
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve1

root@proxmox:~# qm config 100
agent: 1
bios: ovmf
boot: order=virtio0;net0;ide2
cores: 6
cpu: host
efidisk0: local:100/vm-100-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: local:iso/virtio-win-0.1.266.iso,media=cdrom,size=707456K
machine: pc-q35-9.0
memory: 32768
meta: creation-qemu=9.0.2,ctime=1733526923
name: SRV25
net0: virtio=BC:24:11:D6:94:1E,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
scsihw: virtio-scsi-pci
smbios1: uuid=b43e87ef-ece5-4237-8d01-95238e597092
sockets: 1
usb0: host=2-8,usb3=1
virtio0: local:100/vm-100-disk-1.raw,backup=0,iothread=1,size=50G
virtio1: Storage1:100/vm-100-disk-0.qcow2,backup=0,iothread=1,size=1870G
virtio2: Storage2:100/vm-100-disk-0.qcow2,backup=0,iothread=1,size=1910G
virtio3: Storage3:100/vm-100-disk-0.qcow2,backup=0,iothread=1,size=1910G
vmgenid: 61789e8e-a1f4-49f7-9b2c-9ccc08d84c51

The only errors I see in the syslog are these:
Feb 19 16:01:51 proxmox smartd[795]: Device: /dev/nvme2, number of Error Log entries increased from 608 to 611
Feb 19 16:01:51 proxmox smartd[795]: Device: /dev/nvme3, number of Error Log entries increased from 635 to 638
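
To see how quickly these counters grow, lines like the two above can be reduced to a per-device increase. This is only a sketch, assuming the smartd message format shown here; the function name smartd_error_deltas is made up for illustration.

```shell
# Read syslog/journal lines on stdin and print "<device> <increase>" for each
# smartd message about a growing NVMe error log.
smartd_error_deltas() {
    awk '/number of Error Log entries increased from/ {
        dev = ""; from = ""; to = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "Device:") { dev = $(i + 1); sub(/,$/, "", dev) }
            if ($i == "from")    { from = $(i + 1) }
            if ($i == "to")      { to = $(i + 1) }
        }
        if (dev != "") print dev, to - from
    }'
}

# Possible usage on the host (not run here):
#   journalctl -t smartd --no-pager | smartd_error_deltas
```

For the two lines above this would report an increase of 3 for each of /dev/nvme2 and /dev/nvme3.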

Thanks,
 
Hello COndor2k! At first sight, this looks like a failing disk, or at least bad sectors on the disk. Be careful and, if you have important data, make sure to have backups ready, just in case.

I/O errors can also happen due to other reasons (e.g. bad cables). Do you have them directly connected to the motherboard?

Please provide us with some more information:
  1. Could you please provide us with a longer syslog? Especially around the time when you have issues, and some time (e.g. 30 minutes) before that.
  2. Could you please run the following command for all disks: smartctl -a /dev/DISK_NAME (at least for nvme2 and nvme3, since they report errors, but ideally for the other ones too, just in case).
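
The two requests above could be scripted roughly as follows. This is a sketch under assumptions: the function names and the output directory are made up, and the time window has to be adjusted to when the crash actually happened.

```shell
# Save one SMART report per NVMe controller found on this host.
collect_smart_reports() {
    outdir=${1:-smart-reports}
    mkdir -p "$outdir" || return 1
    for dev in /dev/nvme[0-9]*; do
        # Keep controller nodes such as /dev/nvme2; skip namespaces
        # (nvme2n1) and the literal pattern left by an unmatched glob.
        case "${dev#/dev/nvme}" in *[!0-9]*) continue ;; esac
        [ -e "$dev" ] || continue
        smartctl -a "$dev" > "$outdir/${dev##*/}.txt" 2>&1 \
            || echo "smartctl reported a problem for $dev" >&2
    done
    return 0
}

# Print the journal for a time window around the crash.
# $1 = start, $2 = end, e.g. "2025-02-19 15:30" and "2025-02-19 16:05".
collect_journal() {
    journalctl --since "$1" --until "$2" --no-pager
}
```

Running collect_smart_reports as root and attaching the resulting text files, together with the output of collect_journal, would cover both points.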
 
Hi l.leahu-vladucu,
Thank you so much for getting back to me.
All disks are brand new, as is the whole system, including cables.
The full log from a reboot until the crash is attached. The crash happens within 1 minute.
The smartctl -a output for all 4 disks is below.

smartctl -a /dev/nvme0
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-8-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: WD_BLACK SN850X 1000GB
Serial Number: 22370W800301
Firmware Version: 620281WD
PCI Vendor/Subsystem ID: 0x15b7
IEEE OUI Identifier: 0x001b44
Total NVM Capacity: 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 8224
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 001b44 8b4e4c2332
Local Time is: Thu Feb 20 17:57:46 2025 CET
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00df): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 90 Celsius
Critical Comp. Temp. Threshold: 94 Celsius
Namespace 1 Features (0x02): NA_Fields

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 9.00W 9.00W - 0 0 0 0 0 0
1 + 6.00W 6.00W - 0 0 0 0 0 0
2 + 4.50W 4.50W - 0 0 0 0 0 0
3 - 0.0250W - - 3 3 3 3 5000 10000
4 - 0.0050W - - 4 4 4 4 3900 45700

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 4096 0 1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 45 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 1,541,692 [789 GB]
Data Units Written: 11,041,041 [5.65 TB]
Host Read Commands: 21,305,280
Host Write Commands: 173,847,133
Controller Busy Time: 79
Power Cycles: 73
Power On Hours: 2,391
Unsafe Shutdowns: 29
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

smartctl -a /dev/nvme1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-8-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: Netac NVMe SSD 2TB
Serial Number: RN202212162TB058413
Firmware Version: 3.S.F.C
PCI Vendor ID: 0x1f40
PCI Vendor Subsystem ID: 0x5236
IEEE OUI Identifier: 0xa84397
Controller ID: 0
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: a84397 0000058413
Local Time is: Thu Feb 20 17:54:31 2025 CET
Firmware Updates (0x0e): 7 Slots
Optional Admin Commands (0x0007): Security Format Frmw_DL
Optional NVM Commands (0x0014): DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x0e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 110 Celsius
Critical Comp. Temp. Threshold: 120 Celsius

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 3.50W - - 0 0 0 0 5 5
1 + 3.30W - - 1 1 1 1 50 100
2 + 2.80W - - 2 2 2 2 50 200
3 - 0.1700W - - 3 3 3 3 500 7500
4 - 0.0200W - - 4 4 4 4 2000 70000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 54 Celsius
Available Spare: 100%
Available Spare Threshold: 25%
Percentage Used: 3%
Data Units Read: 55,405,953 [28.3 TB]
Data Units Written: 127,960,440 [65.5 TB]
Host Read Commands: 541,847,370
Host Write Commands: 558,384,474
Controller Busy Time: 1
Power Cycles: 123
Power On Hours: 2,223
Unsafe Shutdowns: 2
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 55 Celsius
Temperature Sensor 2: 27 Celsius
Thermal Temp. 1 Transition Count: 1
Thermal Temp. 1 Total Time: 38

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

free(): invalid pointer
Aborted

smartctl -a /dev/nvme2
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-8-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: KINGSTON SKC3000D2048G
Serial Number: 50026B76862C9DE3
Firmware Version: EIFK31.6
PCI Vendor/Subsystem ID: 0x2646
IEEE OUI Identifier: 0x0026b7
Total NVM Capacity: 2,048,408,248,320 [2.04 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 0026b7 6862c9de35
Local Time is: Thu Feb 20 17:56:16 2025 CET
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d): Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x08): Telmtry_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 84 Celsius
Critical Comp. Temp. Threshold: 89 Celsius

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 8.80W - - 0 0 0 0 0 0
1 + 7.10W - - 1 1 1 1 0 0
2 + 5.20W - - 2 2 2 2 0 0
3 - 0.0620W - - 3 3 3 3 2500 7500
4 - 0.0620W - - 4 4 4 4 2500 7500

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 4096 0 1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 26 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 4%
Data Units Read: 774,063 [396 GB]
Data Units Written: 166,798,744 [85.4 TB]
Host Read Commands: 17,724,199
Host Write Commands: 664,825,566
Controller Busy Time: 300
Power Cycles: 123
Power On Hours: 2,407
Unsafe Shutdowns: 74
Media and Data Integrity Errors: 0
Error Information Log Entries: 614
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 2: 60 Celsius

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 614 0 0x0014 0x4004 0x028 0 0 -
1 613 0 0x0003 0x4004 - 0 0 -


smartctl -a /dev/nvme3
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-8-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: KINGSTON SKC3000D2048G
Serial Number: 50026B76862C9CB2
Firmware Version: EIFK31.6
PCI Vendor/Subsystem ID: 0x2646
IEEE OUI Identifier: 0x0026b7
Total NVM Capacity: 2,048,408,248,320 [2.04 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 0026b7 6862c9cb25
Local Time is: Thu Feb 20 17:56:56 2025 CET
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d): Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x08): Telmtry_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 84 Celsius
Critical Comp. Temp. Threshold: 89 Celsius

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 8.80W - - 0 0 0 0 0 0
1 + 7.10W - - 1 1 1 1 0 0
2 + 5.20W - - 2 2 2 2 0 0
3 - 0.0620W - - 3 3 3 3 2500 7500
4 - 0.0620W - - 4 4 4 4 2500 7500

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 4096 0 1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 23 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 1%
Data Units Read: 81,616 [41.7 GB]
Data Units Written: 54,738,314 [28.0 TB]
Host Read Commands: 8,780,410
Host Write Commands: 218,100,095
Controller Busy Time: 120
Power Cycles: 122
Power On Hours: 2,407
Unsafe Shutdowns: 73
Media and Data Integrity Errors: 0
Error Information Log Entries: 641
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 2: 64 Celsius

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 641 0 0x300c 0x4004 0x028 0 0 -
1 640 0 0x000f 0x4004 - 0 0 -
 


Thanks for the information! As we noticed already, there are multiple errors related to /dev/nvme2 and /dev/nvme3, even if the S.M.A.R.T. tests pass.
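
The Status column in the two error tables can be unpacked by hand. This is a minimal sketch, assuming the column holds the raw 16-bit status field of the NVMe error log entry, with the phase tag in bit 0 and the status code in the bits above it; the function name is made up for illustration.

```shell
# Split a raw NVMe status value (as shown in the Status column) into its parts.
# Assumed layout: bit 0 = phase tag, bits 8:1 = status code (SC),
# bits 11:9 = status code type (SCT), bit 14 = More, bit 15 = Do Not Retry.
decode_nvme_status() {
    raw=$(( $1 ))
    status=$(( raw >> 1 ))            # drop the phase tag
    sc=$(( status & 0xff ))
    sct=$(( (status >> 8) & 0x7 ))
    more=$(( (status >> 13) & 0x1 ))
    dnr=$(( (status >> 14) & 0x1 ))
    printf 'SCT=0x%x SC=0x%02x More=%d DNR=%d\n' "$sct" "$sc" "$more" "$dnr"
}

decode_nvme_status 0x4004   # prints: SCT=0x0 SC=0x02 More=1 DNR=0
```

Under that assumption, 0x4004 gives status code type 0 (generic) and status code 0x02, which the NVMe base specification names Invalid Field in Command. If that reading is right, these entries record a command the drive rejected rather than a media error, which would fit the Media and Data Integrity Errors counter staying at 0. Treat this as a hypothesis to check, not a diagnosis.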

Some questions:
  1. How did you connect the NVMe drives to the motherboard?
  2. Can you please try to run memtest86+ (included in Proxmox VE) for some time to see if there are any memory errors?
    • I'm also asking this because the output of smartctl -a /dev/nvme1 you posted above shows free(): invalid pointer Aborted. While this could be a bug in smartctl, this could also be related to general system instability.
 
Thanks for getting back to me once again. :)
I'm running memtest86+ as I write this. 25% done, no errors so far. I'll keep you posted.

Two of my NVMe drives are connected via a JEYI 4 SSD M.2 X16 PCIe 4.0 X4 Expansion Card. The other two are directly on the board's M.2 slots.
 
Thanks for getting back to me once again. :)
You're welcome! :)
Two of my NVMe drives are connected via a JEYI 4 SSD M.2 X16 PCIe 4.0 X4 Expansion Card. The other two are directly on the board's M.2 slots.
Are the ones with issues connected to the motherboard, or to the expansion card? It would be useful to know which SSDs are connected to which slots, because maybe there's some connection between that and the errors.
 
The memory check has now passed twice without any errors.
/dev/nvme0 and /dev/nvme1 are on the expansion card, whereas /dev/nvme2 and /dev/nvme3 are on the motherboard.

I noticed that if I turn off the network for the VM, it doesn't crash, and I can copy and paste files around without any problems. The minute I reboot, it crashes, and if I turn the network on again, it still crashes.
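
The network observation above can also be reproduced from the host shell without touching the guest. This is a rough sketch, assuming the net0 line from the qm config output earlier in the thread; link_down is an option of the netN setting, while the helper name net0_with_link is made up for illustration.

```shell
# Build a net0 value with the virtual link forced down (1) or up (0).
# $1 = current net0 value from "qm config", $2 = 0 or 1
net0_with_link() {
    # Strip any existing link_down setting, then append the requested one.
    base=$(printf '%s' "$1" | sed -E 's/,link_down=[01]//g')
    printf '%s,link_down=%s\n' "$base" "$2"
}

# Possible usage on the host (not run here):
#   current=$(qm config 100 | sed -n 's/^net0: //p')
#   qm set 100 --net0 "$(net0_with_link "$current" 1)"   # disconnect
#   qm set 100 --net0 "$(net0_with_link "$current" 0)"   # reconnect
```

Toggling the link this way while watching the host journal may help narrow down what the guest does over the network right before the io-error appears.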
 

Attachment: Disks.png