VM drops randomly every day, drive shows Unknown

herotm

I am running my VM raw disks on an NVMe drive, but every day the drive goes offline and the VMs go offline with it. I tested the drive and it is not bad; a restart fixes the issue.

When attempting to start the VM manually, this is the error:

Command failed with status code 5.
command '/sbin/vgscan --ignorelockingfailure --mknodes' failed: exit code 5
Volume group "nvme" not found
TASK ERROR: can't activate LV '/dev/nvme/vm-101-disk-0': Cannot process volume group nvme
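
For reference, the commands one would normally use to rescan and re-activate the volume group by hand (the VG name nvme comes from my storage.cfg below) would be something like:

pvscan
vgscan
vgchange -ay nvme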

lvs (I don't see my nvme VG)
root@pm:/dev/nvme# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
data pve twi-aotz-- <320.10g 0.00 0.52
root pve -wi-ao---- 96.00g
swap pve -wi-ao---- 8.00g

pvs (I don't see my nvme PV either)
PV VG Fmt Attr PSize PFree
/dev/sda3 pve lvm2 a-- <446.63g 15.99g

/etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content backup,iso,vztmpl

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

lvm: nvme2tb
        vgname nvme
        content rootdir,images
        nodes pm
        shared 0

syslog
Jul 19 20:18:47 pm pvedaemon[541147]: <root@pam> end task UPID:pm:00084741:00A0AA69:60F56DA7:qmstart:101:root@pam: can't activate LV '/dev/nvme/vm-101-disk-0': Cannot process volume group nvme
Jul 19 20:18:52 pm pvestatd[1433]: command '/sbin/vgscan --ignorelockingfailure --mknodes' failed: exit code 5
Jul 19 20:19:00 pm systemd[1]: Starting Proxmox VE replication runner...
Jul 19 20:19:00 pm systemd[1]: pvesr.service: Succeeded.
Jul 19 20:19:00 pm systemd[1]: Finished Proxmox VE replication runner.
Jul 19 20:19:02 pm pvestatd[1433]: command '/sbin/vgscan --ignorelockingfailure --mknodes' failed: exit code 5
Jul 19 20:19:12 pm pvestatd[1433]: command '/sbin/vgscan --ignorelockingfailure --mknodes' failed: exit code 5
Jul 19 20:22:00 pm systemd[1]: Starting Proxmox VE replication runner...
Jul 19 20:22:00 pm systemd[1]: pvesr.service: Succeeded.
Jul 19 20:22:00 pm systemd[1]: Finished Proxmox VE replication runner.
Jul 19 20:22:02 pm pvestatd[1433]: command '/sbin/vgscan --ignorelockingfailure --mknodes' failed: exit code 5

I did an ls on /dev/nvme; all the LV device links are still there:
root@pm:/dev/nvme# ls -l
total 0
lrwxrwxrwx 1 root root 7 Jul 19 19:43 vm-101-disk-0 -> ../dm-0
lrwxrwxrwx 1 root root 7 Jul 19 19:43 vm-101-disk-1 -> ../dm-1
lrwxrwxrwx 1 root root 7 Jul 18 15:22 vm-801-disk-0 -> ../dm-2
root@pm:/dev/nvme#

Any advice? Thanks.
 
Do you see any errors regarding that drive when you run dmesg?
How are the SMART values?
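
For example, something like this (nvme0 is only a placeholder, use whatever device node your controller shows up as):

dmesg | grep -i nvme
smartctl -a /dev/nvme0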
 
dmesg
[13155.469951] Buffer I/O error on dev dm-0, logical block 9124736, async page read
[13155.469956] Buffer I/O error on dev dm-0, logical block 9124736, async page read
[13155.470129] Buffer I/O error on dev dm-0, logical block 9124736, async page read
[13155.470140] Buffer I/O error on dev dm-0, logical block 9124736, async page read
[13186.091448] buffer_io_error: 277 callbacks suppressed
[13186.091454] Buffer I/O error on dev dm-0, logical block 5264, lost async page write
[13186.091479] Buffer I/O error on dev dm-0, logical block 36770, lost async page write
[13186.091490] Buffer I/O error on dev dm-0, logical block 36771, lost async page write
[13216.088257] Buffer I/O error on dev dm-1, logical block 1008, async page read

What I also realized is that lspci -v doesn't show the NVMe. I installed nvme-cli and nvme list doesn't show any of my NVMe drives either. But every time I reboot, the drive works for a while, then it disappears and the VMs go down. This is strange.
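
For reference, the checks were roughly these (nvme0 is just the name the controller normally gets; the /sys/class/nvme listing is simply one more place to look):

lspci -v | grep -i -A2 'non-volatile'
nvme list
ls -l /sys/class/nvme/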

I performed a reboot to get the smartctl output for the NVMe:

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 37 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 23%
Data Units Read: 1,176,013,214 [602 TB]
Data Units Written: 1,088,610,037 [557 TB]
Host Read Commands: 3,206,140,615
Host Write Commands: 1,694,310,967
Controller Busy Time: 15,251
Power Cycles: 118
Power On Hours: 1,342
Unsafe Shutdowns: 59
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 230
Critical Comp. Temperature Time: 31

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
 
Okay, so dm-0 and dm-1 do report some I/O errors. If you do an ls -l /dev/mapper you should be able to see what is linked to dm-0 and dm-1.
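
For example (dmsetup ships with the standard LVM tooling, so it should already be installed):

ls -l /dev/mapper
dmsetup info -c    # lists device-mapper names with major:minor; minor N corresponds to dm-N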
herotm said:
What I also realized is that lspci -v doesn't show the NVMe. I installed nvme-cli and nvme list doesn't show any of my NVMe drives either. But every time I reboot, the drive works for a while, then it disappears and the VMs go down. This is strange.
Do you have any mentions of the NVMe disk in the dmesg output? Alternatively, you could also search the kernel log /var/log/kern.log.
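
For example:

grep -i nvme /var/log/kern.log
journalctl -k | grep -i nvme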

But yeah, this sounds like a hardware problem. I would recommend that you try to create a backup of the VM; hopefully it will finish before the NVMe fails again. Then I would check whether there are firmware updates available for the NVMe; those might help. Other than that, if there is another slot where you can place the NVMe, move it there and see whether the behaviour changes.
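
From the command line, a backup would look something like this (storage name and options are only examples; stop mode shuts the VM down first, so no snapshot of the failing disk is needed):

vzdump 101 --storage local --mode stop --compress zstd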

Maybe test another NVMe, and you could also try to disable any power-saving options in the BIOS.
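
On the kernel side there is a related power-saving mechanism, APST, that is sometimes involved when NVMe drives drop off the bus. Limiting it is a generic workaround rather than anything specific to this drive, and is done with a kernel command-line parameter:

# append to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
nvme_core.default_ps_max_latency_us=0
# then run: update-grub && reboot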