VM IO errors and crash when source NAS rebooted post migration to another NAS

UTEL-CT

Feb 17, 2023
Hi,
Hoping I might be able to get some advice about an unusual issue, please.

I have a 3-host Proxmox cluster (7.3-6) with two TrueNAS Core 13 servers available as shared storage.
I wanted to update both NAS servers, so I migrated all the virtual hard drives from one TrueNAS server to the other. There were no errors during the migration and I ticked the option to delete the source disks.

There have been two issues.
Issue 1 - Both NASes show the VMs' disks, so it looks like the source disks were not deleted. The UI and the terminal will not let me delete the source ones, failing with an "in use" error. The VM pages show the hard disks as being on the target NAS, so the source disks should not be in use.
I've tried the command to get the source disks to re-appear as unused disks, which did nothing, and I've also checked the VMs' config files and there are no references to the source disks in them.

Issue 2 - When I upgraded the source NAS and rebooted it, all the VMs that had drives migrated off it got IO errors and crashed, even though the GUI shows that their disks are no longer on it.

Do you know what might be causing both issues, please?
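The command mentioned above for getting leftover volumes to re-appear as unused disks is presumably qm rescan; a minimal invocation for a single VM might look like this (the VMID 100 is just a placeholder):
Code:
# scan all storages and add any volumes owned by VM 100 that are not
# referenced in its config as "unused" disk entries
qm rescan --vmid 100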
 
Hi, how is the TrueNAS attached to the PVE host as storage? Did you stop the VMs before migrating the disks to the other storage?
 
Hi Chris,

Thank you for your response.

Both of the TrueNAS systems are attached via iSCSI, with an LVM layer on top.

I did not stop the VMs first; is this a requirement for migrating disks?
 
Is the storage on the PVE side marked as shared storage? If so, and the iSCSI setup is correct, then it should not be necessary to stop the VMs.
I wanted to update both NAS servers, so I migrated all the virtual hard drives from one TrueNAS server to the other. There were no errors during the migration and I ticked the option to delete the source disks.
Can you provide the full log of the migration task, as well as your current /etc/pve/storage.cfg?
Issue 2 - When I upgraded the source NAS and rebooted it, all the VMs that had drives migrated off it got IO errors and crashed, even though the GUI shows that their disks are no longer on it.
The issue here seems to be that the VM still uses the old disks and not the new ones.
 
Is the storage on the PVE side marked as shared storage? If so, and the iSCSI setup is correct, then it should not be necessary to stop the VMs.
Yes, the storage has been set up as shared storage.

Can you provide the full log of the migration task, as well as your current /etc/pve/storage.cfg?
Please see the attached zip for the requested files.

The issue here seems to be that the VM still uses the old disks and not the new ones.
That does seem to be the issue; I'm not sure why though, as I cannot see any obvious errors that would result in this behaviour.
 

OK, the config and task log seem fine. Can you also provide the journal from around the time of the migration? journalctl --since <starttime> --until <endtime>
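For example, with placeholder timestamps bracketing the migration window, that would be something like:
Code:
journalctl --since "2023-02-17 09:00" --until "2023-02-17 12:00"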
 
I've noticed the hardware error in the journalctl log. I've not found any issues with the host on its iLO page.
I've had a look at both TrueNAS systems: the source is fine, but the target did have a dead disk and a degraded RAID 6 state. It is still functional, though, so I'm not sure it would cause this issue, as the volume is still available for use and the RAID is able to carry on without the dead disk.
 
Apologies for the delay in getting back to you; I've not been well.
Hi, I'm sorry to hear that and hope you are doing better again.

I have tried to reproduce the issue on my local setup but here everything works as expected. Could you therefore share some more information:
  • pveversion -v
  • qm config <VMID> for the VM in question
  • Is this issue reproducible?
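For reference, the first two items can be collected on a host like this (100 below is only a placeholder for the affected VM's ID):
Code:
# package versions of the Proxmox VE host
pveversion -v
# configuration of the affected VM
qm config 100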
 
Thank you, I'm on the mend.

Please see the attachment for the pveversion info of the hosts and an example config of one of the VMs. It happened to all the VMs across all the hosts when the source NAS was rebooted.

I'm not sure how reproducible it is, as I do not know what caused this state.
I think it's probably a fairly common setup.
We've got three hosts and two NAS servers with about 20 VMs spread evenly amongst the resources.
The only steps I'm aware of that led to the issue were migrating all the VMs' disks from one NAS to the other so the NAS could be updated and rebooted.
During the process, for some reason, the old disks were left behind even though I ticked the delete option and the migration logs all came back as successful for both the move and the deletion of the source disks.
The VMs that were part of the disk migration crashed, but the ones already located on the target NAS were fine.

I'm not really sure how to get Proxmox to let go of the source NAS when all the config seems to suggest it should be using the target one.
I was considering migrating the VMs back to the source NAS and then back again to see if that helps it get back on track, but I'm concerned I will back myself into a corner by using up all the free space if it ends up leaving lots of cloned disks that I cannot delete...
 

Hi,
it seems like both of your LVM storages reference the same volume group:
Code:
iscsi: nas-00139
        portal 172.17.12.139
        target iqn.2005-10.org.freenas.ctl:pve-cls-01
        content images

lvm: nas-00139-lvm-01
        vgname nas-00139-lvm-01
        base nas-00139:0.0.1.scsi-36589cfc0000000115194791e440d3a74
        content images,rootdir
        shared 1

iscsi: nas-00056
        portal 172.17.12.10
        target iqn.2005-10.org.freenas.ctl:pve-cls-01
        content images

lvm: nas-00056-lvm-01
        vgname nas-00139-lvm-01
        content rootdir,images
        shared 1
i.e. the second iSCSI storage is not referenced by any LVM storage. Did you move the disks between the two LVM storages (which are actually the same storage)?
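For illustration only, a corrected definition for the second storage might look roughly like the sketch below; the volume group name and the LUN identifier are placeholders, not values taken from the attached config:
Code:
lvm: nas-00056-lvm-01
        vgname <vg-created-on-the-nas-00056-lun>
        base nas-00056:0.0.0.scsi-<lun-id>
        content images,rootdir
        shared 1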
 
Hi Fiona,

Nice find, that looks like the culprit. I'm really not sure how it got set up like that...

Would the simplest/safest fix be to move the disks back to lvm: nas-00139-lvm-01, then delete and recreate lvm: nas-00056-lvm-01 with the correct volume group?
 
Yes, that should work. Since the backing storage is the same, to avoid moving the data, you could also just manually change the storage name for the disks in the VM configuration files while the VMs are down.
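For illustration, that manual change amounts to swapping the storage prefix on each disk line of the VM's file under /etc/pve/qemu-server/; the VMID, volume name and size below are made-up examples, and the # lines are annotations for this sketch rather than config content:
Code:
# before: the disk is referenced through the misconfigured storage
scsi0: nas-00056-lvm-01:vm-100-disk-0,size=32G
# after: the same logical volume, referenced through the storage that owns the volume group
scsi0: nas-00139-lvm-01:vm-100-disk-0,size=32G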
 
Hi Fiona,

I went with the first option in the end, as I wanted to avoid powering off the VMs.
This worked great, and I've since been able to carry out further maintenance on the NASes, including the reboots they needed, without any issues.

Thank you for your assistance with this. Much appreciated.
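For reference, the storage part of that first option (after moving the disks back in the GUI) roughly corresponds to removing and re-adding the second storage definition. The sketch below assumes a volume group has already been created on the nas-00056 LUN; the VG name and LUN identifier are placeholders:
Code:
# remove the misconfigured storage definition (this does not touch any data)
pvesm remove nas-00056-lvm-01
# re-add it, pointing at its own volume group on the nas-00056 LUN
pvesm add lvm nas-00056-lvm-01 --vgname <vg-on-nas-00056> \
        --base nas-00056:0.0.0.scsi-<lun-id> --content images,rootdir --shared 1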