Hello everybody,
At the beginning of this week I deployed a new PVE 4.4-13 cluster on a DELL VRTX with 3 blades, with HA enabled for almost all VMs, and everything was working flawlessly. Storage is LVM, of course.
Just one hour ago, for some unknown reason, one cluster node (or maybe more, I couldn't tell) rebooted and this happened:
1)
On the third node I found all VMs powered off, so I started them again. One booted fine, but the other refused to boot, complaining that it could not mount a filesystem larger than the underlying disk, and e2fsck was throwing errors as well. After a while I managed to get everything back to work and figured out the reason:
The Proxmox UI was reporting a disk size of 710 GB while the corresponding LV size was 700 GB.
This sounded strange to me, so I tried to add 1 GB to that disk and... the reported disk size went from 710 GB to 701 GB!!
I added the missing 9 GB, and after that I was able to successfully complete the filesystem check and bring the VM back to life... phew!
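For reference, this is roughly the CLI equivalent of what I did through the GUI; the VM ID and disk names below are placeholders, not the real ones:
Code:
# compare the size PVE has in the config with the actual LV size
qm config <vmid> | grep virtio
lvs raid10-sas
# grow the virtual disk by the missing 9G (on LVM storage this also resizes the LV)
qm resize <vmid> virtio0 +9G
# then re-run the filesystem check (the exact path depends on how the guest uses the disk)
e2fsck -f /dev/raid10-sas/vm-<vmid>-disk-1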
2)
Another VM on another node refused to start because of a missing LV:
Code:
TASK ERROR: can't activate LV '/dev/raid10-sas/vm-114-disk-4':
Failed to find logical volume "raid10-sas/vm-114-disk-4"
Looking at the VM disk configuration I found this:
Code:
virtio0: raid10-sas:vm-114-disk-1,size=60G
virtio1: raid10-sas:vm-114-disk-2,size=25G
virtio2: raid10-sas:vm-114-disk-3,size=50G
virtio3: raid10-sas:vm-114-disk-4,size=100G
while a "lvscan" was reporting this:
Code:
ACTIVE '/dev/raid10-sas/vm-114-disk-1' [60,00 GiB] inherit
ACTIVE '/dev/raid10-sas/vm-114-disk-2' [25,00 GiB] inherit
ACTIVE '/dev/raid10-sas/vm-114-disk-3' [100,00 GiB] inherit
ACTIVE '/dev/raid10-ssd/vm-114-disk-1' [50,00 GiB] inherit
Please note that there are two different datastores: RAID10-SAS and RAID10-SSD.
In short, PVE was trying to attach the wrong LVs!
I manually corrected the 114.conf config file with the right disks and was able to recover this VM as well.
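For completeness, after matching the sizes against the lvscan output above, the corrected entries in 114.conf ended up looking roughly like this (reconstructed from memory, so the exact mapping may be slightly off):
Code:
virtio0: raid10-sas:vm-114-disk-1,size=60G
virtio1: raid10-sas:vm-114-disk-2,size=25G
virtio2: raid10-ssd:vm-114-disk-1,size=50G
virtio3: raid10-sas:vm-114-disk-3,size=100G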
---
At this point I suspect HA is doing something nasty, so I disabled it.
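(For the record, I removed the resources from the HA manager roughly like this; the VM IDs are just examples:)
Code:
ha-manager remove vm:113
ha-manager remove vm:114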
Honestly, I'm now scared to reboot the nodes for any reason!
I would like to understand why this happened and how to make sure it won't happen again!
Any help is appreciated.
Thanks.