Proxmox NVMe error count increase - HA integration related

Rattlehead.ie

Member
Dec 30, 2021
Hi guys,
Please bear with me on this one, as it's a bit of a weird one. I know there have been issues in the past with NVMe and SMART reporting non-error entries it shouldn't, but from my digging nothing matches my issue specifically, and all the reports I can find are 18+ months old and marked resolved, so I'm at a loss as to how to get this to stop erroring.
Environment:
Proxmox : 9.0.11 running on Crucial P2 NVMe (1 additional VM running on this same NVMe drive)
VMs : 5 running on WD Red SSD
Errors:
Code:
2025-10-26T08:11:21.386327+00:00 vm smartd[36415]: Device: /dev/nvme0, NVMe error count increased from 80214 to 80226 (0 new, 12 ignored, 0 unknown)
2025-10-26T08:41:21.416179+00:00 vm smartd[36415]: Device: /dev/nvme0, NVMe error count increased from 80226 to 80238 (0 new, 12 ignored, 0 unknown)
2025-10-26T09:11:21.446245+00:00 vm smartd[36415]: Device: /dev/nvme0, NVMe error count increased from 80238 to 80250 (0 new, 12 ignored, 0 unknown)
2025-10-26T09:41:21.476697+00:00 vm smartd[36415]: Device: /dev/nvme0, NVMe error count increased from 80250 to 80264 (0 new, 12 ignored, 0 unknown)

This repeats every 30 minutes as smartd checks the drive.
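For reference, this is how I've been watching the counter outside of smartd (a quick sketch; the device node matches the logs above, adjust if yours differs, and nvme-cli may need to be installed separately):
Code:
# Overall SMART/health summary for the NVMe drive
smartctl -a /dev/nvme0

# The NVMe Error Information Log entries that the smartd counter is based on
smartctl -l error /dev/nvme0

# Same log via nvme-cli, if installed
nvme error-log /dev/nvme0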

Troubleshooting:
Stage1:
Turned off all VMs, including the one on the same NVMe drive as the Proxmox host (this additional VM will be moved later).
Issue goes away.
By methodically turning on each VM and waiting, VM 106 turns out to be the culprit. To my mind this is strange, as 106 is on the SSD, not the NVMe; however, the moment I turn it off the errors go away, so it's 100% this VM.
The VM is HAOS running the latest build with a lot of integrations.

Stage2:
With all other VMs powered on except 106, ran the test for 24 hours: no issue.
Power on 106 and the issue appears within 30 minutes.
Deactivate all integrations except the basic ones: issue goes away.
Again, by methodically re-enabling each and every integration, the issue is pinned to the Pi-hole V6 integration. 100% confirmed.
Next, change the poll interval of the integration from 300 seconds to 600 seconds and the error count drops from 12 per 30 minutes (as in the logs above) to 6 per 30 minutes, i.e. roughly two new log entries per poll either way.
So the issue is tied to polling. Turn off polling but leave the integration enabled (it can't be doing much then) and the errors go away.
Finally, run repeated curl commands from the HAOS terminal to the Pi-hole instance (which is on the same SSD, in a different VM) and the errors do not increase; see the sketch below.
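For clarity, the curl test was essentially this: a manual stand-in for the integration's 300-second poll while watching the drive counter from another shell (the Pi-hole address and path are placeholders, use whatever your instance answers on):
Code:
# stand-in for the integration's default 300 second poll
while true; do
    curl -sk https://pihole.example/admin/ > /dev/null
    sleep 300
done

# in a second shell, watch the counter smartd reads
watch -n 60 'smartctl -a /dev/nvme0 | grep -i "error information"'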

So all these non-errors or "ignored" entries, while cosmetic, are annoying; they also make my NUC's LED turn amber, and I would prefer not to have them.
I can keep the integration turned off, sure, but I would prefer not to.
I engaged with the author of the Pi-hole V6 integration and, in fairness, given that HAOS is not on the NVMe drive and he believes the integration isn't technically writing to that drive, there isn't much he can do.
This leaves me trying to understand the correlation between the SSD where the VM sits and the NVMe errors: how Proxmox is seeing that, and whether there is a way to get rid of them (see the smartd.conf sketch below), because in SMART all I can see is the error warning.
Disk I/O behavior is not my bag as a network engineer, so I'm just trying to understand it and maybe even fix it.
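One workaround I'm considering, but have not tested yet, is telling smartd to keep checking the drive's health while skipping the error-log counter. As I understand it, the implicit "-a" monitoring in the default DEVICESCAN line includes "-l error", which is exactly the check that produces the "error count increased" messages. A sketch of /etc/smartd.conf with explicit device lines instead of DEVICESCAN (the device names for the other drives are assumptions, list your own):
Code:
# /etc/smartd.conf (sketch - replaces the default DEVICESCAN line)
# Crucial NVMe: health check only, no "-l error", so the
# "error count increased" messages stop.
/dev/nvme0 -d nvme -H -m root

# Other drives keep full default monitoring
/dev/sda -a -m root
Restart the smartd service afterwards (smartmontools.service on Debian-based systems) for the change to take effect.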
 
I have no helpful input, but I just wanted to say good job on identifying the cause; that has got to be one of the weirdest causes I've ever seen.
 
I know, it's taken me all bank holiday weekend to figure it out and I'm tearing out whatever little hair I have left, haha. Thanks for the response, at least I'm not going crazy!
 
I have the same message for (only) one of my NVMe drives every time I boot the Proxmox host (which does not run continuously). I always assumed the NVMe is writing to the wrong counter or smartd is interpreting it wrongly. It's also the only consumer drive (Samsung 970 EVO Plus) in the system; the enterprise SSDs don't seem to do this.
 
OK, that's another interesting data point, and yes, my NVMe is consumer grade rather than NAS/enterprise, so I take that on the chin. The issue I have is that it only started in the last month, somewhere between upgrades of HAOS and Proxmox; god knows which one triggered it. I've even been in contact with Crucial.
I believe it's something to do with "waking up" from low-power mode, as I saw in one of the threads I read, but all those issues have supposedly been fixed/resolved, so again I don't know why I'm seeing it. Thank you for the reply, though. The strange thing for me is that it's related to the poll of the Pi-hole integration (see the power-state check below).
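For what it's worth, if the low-power theory is right, you can at least check what power states the drive exposes and whether autonomous power state transitions (APST) are enabled. A sketch only, assuming nvme-cli is installed and the device is /dev/nvme0; disabling APST via the kernel parameter is a diagnostic step, not a recommendation:
Code:
# List the power states the controller reports
nvme id-ctrl /dev/nvme0 | grep -i "^ps"

# Show the current APST (Autonomous Power State Transition) setting
nvme get-feature /dev/nvme0 -f 0x0c -H

# To rule APST out entirely, boot with this kernel parameter
# (e.g. add it to the kernel command line in /etc/default/grub, then update-grub):
#   nvme_core.default_ps_max_latency_us=0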
 