VM IO errors and crash when source NAS rebooted post migration to another NAS

UTEL-CT

Feb 17, 2023
Hi,
Hoping I might be able to get some advice about an unusual issue, please.

I have a 3-host Proxmox cluster (7.3-6) with two TrueNAS Core 13 servers available as shared storage.
I wanted to update both NAS servers, so I migrated all the virtual hard drives from one TrueNAS server to the other. There were no errors during the migration and I ticked the option to delete the source disks.

There have been two issues.
Issue 1 - Both NASes show the VMs' disks, so it looks like the source disks were not deleted. The UI and the terminal will not let me delete the source ones, failing with an "in use" error. The VM pages show the hard disks as being on the target NAS, so the source disks should not be in use.
I've tried the command to get the source disks to re-appear as unused disks, which did nothing, and I've also checked the VMs' config files and there are no references to the source disks in them.

Issue 2 - When I upgraded the source NAS and rebooted it, all the VMs that had drives migrated off it got IO errors and crashed, even though the GUI shows that their disks are no longer on it.

Do you know what might be causing both issues, please?
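The command mentioned above for getting leftover volumes to re-appear as unused disks is presumably qm rescan; a minimal invocation for a single VM might look like this (the VMID 100 is just a placeholder):
Code:
# scan all storages and add any volumes owned by VM 100 that are not
# referenced in its config as "unused" disk entries
qm rescan --vmid 100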
 
Hi, how is the TrueNAS attached to the PVE host as storage? Did you stop the VMs before migrating the disks to the other storage?
 
Hi Chris,

Thank you for your response.

Both of the TrueNAS systems are attached via iSCSI, with an LVM layer on top.

I did not stop the VMs first; is this a requirement for migrating disks?
 
Is the storage on the PVE side marked as shared storage? If so, and the iSCSI setup is correct, then it should not be necessary to stop the VMs.
I wanted to update both NAS servers, so I migrated all the virtual hard drives from one TrueNAS server to the other. There were no errors during the migration and I ticked the option to delete the source disks.
Can you provide the full log of the migration task, as well as your current /etc/pve/storage.cfg?
Issue 2 - When I upgraded the source NAS and rebooted it, all the VMs that had drives migrated off it got IO errors and crashed, even though the GUI shows that their disks are no longer on it.
The issue here seems to be that the VM still uses the old disks and not the new ones.
 
Is the storage on the PVE side marked as shared storage? If so, and the iSCSI setup is correct, then it should not be necessary to stop the VMs.
Yes, the storage has been set up as shared storage.

Can you provide the full log of the migration task, as well as your current /etc/pve/storage.cfg?
Please see the attached zip for the requested files.

The issue here seems to be that the VM still uses the old disks and not the new ones.
That does seem to be the issue; I'm not sure why though, as I cannot see any obvious errors that would result in this behaviour.
 

OK, the config and task log seem fine. Can you also provide the journal from around the time of the migration? journalctl --since <starttime> --until <endtime>
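For example, with placeholder timestamps bracketing the migration window, that would be something like:
Code:
journalctl --since "2023-02-17 09:00" --until "2023-02-17 12:00"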
 
I've noticed the hardware error in the journalctl log. I've not found any issues with the host on its iLO page.
I've had a look at both TrueNAS systems: the source is fine, but the target did have a dead disk and a degraded RAID 6 state. It is still functional, though, so I'm not sure it would cause this issue, as the volume is still available for use and the RAID is able to carry on without the dead disk.
 
Apologies for the delay in getting back to you; I've not been well.
Hi, I'm sorry to hear that and hope you are doing better again.

I have tried to reproduce the issue on my local setup but here everything works as expected. Could you therefore share some more information:
  • pveversion -v
  • qm config <VMID> for the VM in question
  • Is this issue reproducible?
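For reference, the first two items can be collected on a host like this (100 below is only a placeholder for the affected VM's ID):
Code:
# package versions of the Proxmox VE host
pveversion -v
# configuration of the affected VM
qm config 100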
 
Thank you, I'm on the mend.

Please see the attachment for the pveversion info of the hosts and an example config of one of the VMs. It happened to all the VMs across all the hosts when the source NAS was rebooted.

I'm not sure how reproducible it is, as I do not know what caused this state.
I think it's probably a fairly common setup.
We've got three hosts and two NAS servers with about 20 VMs spread evenly amongst the resources.
The only steps I'm aware of that led to the issue were migrating all the VMs' disks from one NAS to the other so the NAS could be updated and rebooted.
During the process, for some reason, the old disks were left behind even though I ticked the delete option and the migration logs all came back as successful for both the move and the deletion of the source disks.
The VMs that were part of the disk migration crashed, but the ones already located on the target NAS were fine.

I'm not really sure how to get Proxmox to let go of the source NAS when all the config seems to suggest it should be using the target one.
I was considering migrating the VMs back to the source NAS and then back again to see if that helps it get back on track, but I'm concerned I will back myself into a corner by using up all the free space if it ends up leaving lots of cloned disks that I cannot delete...
 

Hi,
it seems like both of your LVM storages reference the same volume group:
Code:
iscsi: nas-00139
        portal 172.17.12.139
        target iqn.2005-10.org.freenas.ctl:pve-cls-01
        content images

lvm: nas-00139-lvm-01
        vgname nas-00139-lvm-01
        base nas-00139:0.0.1.scsi-36589cfc0000000115194791e440d3a74
        content images,rootdir
        shared 1

iscsi: nas-00056
        portal 172.17.12.10
        target iqn.2005-10.org.freenas.ctl:pve-cls-01
        content images

lvm: nas-00056-lvm-01
        vgname nas-00139-lvm-01
        content rootdir,images
        shared 1
i.e. the second iSCSI storage is not referenced by any LVM storage. Did you move the disks between the two LVM storages (which are actually the same storage)?
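For illustration only, a corrected definition for the second storage might look roughly like the sketch below; the volume group name and the LUN identifier are placeholders, not values taken from the attached config:
Code:
lvm: nas-00056-lvm-01
        vgname <vg-created-on-the-nas-00056-lun>
        base nas-00056:0.0.0.scsi-<lun-id>
        content images,rootdir
        shared 1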
 
Hi Fiona,

Nice find, that looks like the culprit. I'm really not sure how it got set up like that...

Would the simplest/safest fix be to move the disks back to lvm: nas-00139-lvm-01, then delete and recreate lvm: nas-00056-lvm-01 with the correct volume group?
 
Yes, that should work. Since the backing storage is the same, to avoid moving the data, you could also just manually change the storage name for the disks in the VM configuration files while the VMs are down.
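For illustration, that manual change amounts to swapping the storage prefix on each disk line of the VM's file under /etc/pve/qemu-server/; the VMID, volume name and size below are made-up examples, and the # lines are annotations for this sketch rather than config content:
Code:
# before: the disk is referenced through the misconfigured storage
scsi0: nas-00056-lvm-01:vm-100-disk-0,size=32G
# after: the same logical volume, referenced through the storage that owns the volume group
scsi0: nas-00139-lvm-01:vm-100-disk-0,size=32G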
 
Hi Fiona,

I went with the first option in the end, as I wanted to avoid powering off the VMs.
This worked great, and I've since been able to carry out further maintenance on the NASes, including the reboots they needed, without any issues.

Thank you for your assistance with this. Much appreciated.
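For reference, the storage part of that first option (after moving the disks back in the GUI) roughly corresponds to removing and re-adding the second storage definition. The sketch below assumes a volume group has already been created on the nas-00056 LUN; the VG name and LUN identifier are placeholders:
Code:
# remove the misconfigured storage definition (this does not touch any data)
pvesm remove nas-00056-lvm-01
# re-add it, pointing at its own volume group on the nas-00056 LUN
pvesm add lvm nas-00056-lvm-01 --vgname <vg-on-nas-00056> \
        --base nas-00056:0.0.0.scsi-<lun-id> --content images,rootdir --shared 1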