Hi guys,
Please bear with me on this one, as it's a bit of a weird one. I know there have been issues in the past with NVMe and SMART reporting non-error entries it shouldn't, but from my digging nothing matches my issue specifically, and the reports I can find are all 18+ months old with a resolution marked against them, so I'm at a loss as to how to get this to stop erroring.
Environment:
Proxmox: 9.0.11 running on a Crucial P2 NVMe (1 additional VM running on this same NVMe drive)
VMs: 5 running on a WD Red SSD
Errors:
Code:
2025-10-26T08:11:21.386327+00:00 vm smartd[36415]: Device: /dev/nvme0, NVMe error count increased from 80214 to 80226 (0 new, 12 ignored, 0 unknown)
2025-10-26T08:41:21.416179+00:00 vm smartd[36415]: Device: /dev/nvme0, NVMe error count increased from 80226 to 80238 (0 new, 12 ignored, 0 unknown)
2025-10-26T09:11:21.446245+00:00 vm smartd[36415]: Device: /dev/nvme0, NVMe error count increased from 80238 to 80250 (0 new, 12 ignored, 0 unknown)
2025-10-26T09:41:21.476697+00:00 vm smartd[36415]: Device: /dev/nvme0, NVMe error count increased from 80250 to 80264 (0 new, 12 ignored, 0 unknown)
This repeats every 30 minutes as smartd checks the drive.
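In case it helps, this is how I've been pulling the raw entries behind those "ignored" counts on the host (a rough sketch, assuming smartmontools and nvme-cli are installed; /dev/nvme0 is my Crucial P2):
Code:
# Error Information Log as smartctl reads it (this is the counter smartd tracks)
smartctl -l error /dev/nvme0
# Same log straight from the controller via nvme-cli, limited to the first 16 entries
nvme error-log /dev/nvme0 --log-entries=16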
Troubleshooting:
Stage 1:
Turned off all VMs, including the one on the same NVMe drive as the Proxmox host (this additional VM will be moved later).
Issue goes away.
By methodically turning each VM back on and waiting, VM 106 turns out to be the culprit. This is strange to me, as 106 is on the SSD, not the NVMe; however, when I turn it off the errors go away, so it's 100% this VM.
The VM is HAOS running the latest build with a lot of integrations.
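For reference, this is roughly how I confirmed that 106's disks live on the SSD-backed storage and not the NVMe (a quick sketch; the storage names shown are whatever your setup uses):
Code:
# Show which storage each of VM 106's disks is attached to
qm config 106 | grep -Ei 'scsi|sata|virtio|ide|efidisk'
# List the storages and confirm which physical disk backs each one
pvesm status
lsblk -o NAME,SIZE,MODEL,MOUNTPOINT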
Stage 2:
With all other VMs powered on except 106, I ran the test for 24 hours with no issue.
Power on 106 and the issue appears within 30 minutes.
Deactivate all integrations except the basic ones. Issue goes away.
Again, by methodically turning on each and every integration, the issue is traced to the Pi-hole V6 integration. 100% confirmed.
Next I changed the integration's poll interval from 300 seconds to 600 seconds, and the error count decreases from 12 per 30 minutes (as in the logs above) to 6 per 30 minutes, i.e. roughly 2 new log entries per poll either way.
So the issue is with polling. Turn off polling but leave the integration on (it can't be doing much) and the errors go away.
Next I ran a thorough set of curl commands from the HAOS terminal to the Pi-hole instance (which is on the same SSD, in a different VM) and the errors do not increase.
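This is roughly the shape of the curl test I ran (just a sketch; the host and endpoint path are placeholders, since I don't want to misquote the Pi-hole v6 API here):
Code:
# Hammer the Pi-hole host from the HAOS terminal to mimic the integration's polling
# <pihole-host> and <stats-endpoint> are placeholders from my setup
for i in $(seq 1 50); do
  curl -sk "https://<pihole-host>/api/<stats-endpoint>" > /dev/null
  sleep 5
done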
So while all these non-errors or "ignored" entries are only cosmetic, they are annoying, they also make my NUC's LED turn amber, and I would prefer not to have them.
I can keep the integration turned off, sure, but I would prefer not to.
I engaged with the Pi-hole V6 integration's author, and in fairness, given that HAOS is not on the NVMe drive and he believes the integration isn't technically writing to that drive, there isn't much he can do...
This leaves me trying to understand the correlation between the SSD where the VM sits and the NVMe errors, how Proxmox is seeing that, and whether there is a way to get rid of them. In SMART all I can see is the error warning.
Disk I/O behavior is not my bag as a network engineer, so I'm just trying to understand it and maybe even fix it.
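The only correlation test I can think of myself is to snapshot the controller's error-log entry counter right before and after one poll interval and see whether the jump lines up with the integration. Something like this (a sketch, assuming smartmontools on the Proxmox host; the field name may vary slightly between smartctl versions):
Code:
# Capture the NVMe "Error Information Log Entries" counter, wait one poll interval, capture again
before=$(smartctl -a /dev/nvme0 | awk -F: '/Error Information Log Entries/ {gsub(/ /,"",$2); print $2}')
sleep 300
after=$(smartctl -a /dev/nvme0 | awk -F: '/Error Information Log Entries/ {gsub(/ /,"",$2); print $2}')
echo "entries before: $before  after: $after"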