[SOLVED] VM has two disk images after node-failure in cluster

MH_MUC

Active Member
May 24, 2019
Hi,
I am running a two-node cluster with several VMs on ZFS.
I am using storage replication at short intervals, which seems to be working fine.
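For anyone setting this up: the jobs can be created in the GUI or roughly like this on the CLI (job ID, target node and rate match my setup; the schedule is just an example):
Code:
# create a replication job for VM 103 to the other node, running every 5 minutes
# (schedule is an example; rate limit in MByte/s)
pvesr create-local-job 103-0 server40 --schedule "*/5" --rate 50000
# list all configured replication jobs
pvesr list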

I was testing a node failure with ifdown/ifup and a sleep command.
I only had one VM (VM 103) on the failing node during the test.
The VM migrated to the second server correctly and migrated back to the priority-1 server when it was up again, which happened quicker than I expected.
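Roughly, the failure test looked like this (interface name is just an example; adjust to whatever carries your cluster traffic):
Code:
# example only: take the cluster network down long enough for HA to react,
# then bring it back up; run detached so it survives the SSH session dropping
nohup sh -c 'ifdown vmbr0; sleep 180; ifup vmbr0' &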

However, there are now two raw disk images in my data-pool: vm-103-disk-0 and vm-103-disk-1.
Only disk-1 shows up in the hardware tab of the VM. I can't delete disk-0 in the storage view, because it is assigned to VM 103.

The replication is using disk-1.
Code:
2023-06-21 19:58:05 103-0: start replication job
2023-06-21 19:58:05 103-0: guest => VM 103, running => 2430540
2023-06-21 19:58:05 103-0: volumes => data-pool:vm-103-disk-1
2023-06-21 19:58:06 103-0: freeze guest filesystem
2023-06-21 19:58:06 103-0: create snapshot '__replicate_103-0_1687370285__' on data-pool:vm-103-disk-1
2023-06-21 19:58:06 103-0: thaw guest filesystem
2023-06-21 19:58:06 103-0: using secure transmission, rate limit: 50000 MByte/s
2023-06-21 19:58:06 103-0: incremental sync 'data-pool:vm-103-disk-1' (__replicate_103-0_1687369685__ => __replicate_103-0_1687370285__)
2023-06-21 19:58:06 103-0: using a bandwidth limit of 50000000000 bps for transferring 'data-pool:vm-103-disk-1'
2023-06-21 19:58:07 103-0: send from @__replicate_103-0_1687369685__ to data-pool/vm-103-disk-1@__replicate_103-0_1687370285__ estimated size is 4.59M
2023-06-21 19:58:07 103-0: total estimated size is 4.59M
2023-06-21 19:58:07 103-0: successfully imported 'data-pool:vm-103-disk-1'
2023-06-21 19:58:07 103-0: delete previous replication snapshot '__replicate_103-0_1687369685__' on data-pool:vm-103-disk-1
2023-06-21 19:58:07 103-0: (remote_finalize_local_job) delete stale replication snapshot '__replicate_103-0_1687369685__' on data-pool:vm-103-disk-1
2023-06-21 19:58:07 103-0: end replication job
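For completeness, this is roughly how I compared what exists on the pool with what the VM config references (output omitted):
Code:
# all datasets and snapshots belonging to VM 103 on the ZFS pool
zfs list -t all -r data-pool | grep vm-103
# what the storage layer reports for this VM
pvesm list data-pool --vmid 103
# what the VM config actually references
qm config 103 | grep -Ei 'scsi|unused'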

When migrating the VM manually, I get:
Code:
task started by HA resource agent
2023-06-21 20:00:02 starting migration of VM 103 to node 'server40' (91.XXX.XXX.25)
2023-06-21 20:00:02 found local disk 'data-pool:vm-103-disk-0' (via storage)
2023-06-21 20:00:02 found local, replicated disk 'data-pool:vm-103-disk-1' (in current VM config)
2023-06-21 20:00:02 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2023-06-21 20:00:02 replicating disk images
2023-06-21 20:00:02 start replication job
2023-06-21 20:00:02 guest => VM 103, running => 2430540
2023-06-21 20:00:02 volumes => data-pool:vm-103-disk-1
2023-06-21 20:00:02 freeze guest filesystem
2023-06-21 20:00:02 create snapshot '__replicate_103-0_1687370402__' on data-pool:vm-103-disk-1
2023-06-21 20:00:02 thaw guest filesystem
2023-06-21 20:00:03 using secure transmission, rate limit: 50000 MByte/s
2023-06-21 20:00:03 incremental sync 'data-pool:vm-103-disk-1' (__replicate_103-0_1687370285__ => __replicate_103-0_1687370402__)
2023-06-21 20:00:03 using a bandwidth limit of 50000000000 bps for transferring 'data-pool:vm-103-disk-1'
2023-06-21 20:00:03 send from @__replicate_103-0_1687370285__ to data-pool/vm-103-disk-1@__replicate_103-0_1687370402__ estimated size is 1.74M
2023-06-21 20:00:03 total estimated size is 1.74M
2023-06-21 20:00:03 successfully imported 'data-pool:vm-103-disk-1'
2023-06-21 20:00:03 delete previous replication snapshot '__replicate_103-0_1687370285__' on data-pool:vm-103-disk-1
2023-06-21 20:00:04 (remote_finalize_local_job) delete stale replication snapshot '__replicate_103-0_1687370285__' on data-pool:vm-103-disk-1
2023-06-21 20:00:04 end replication job
2023-06-21 20:00:04 copying local disk images
2023-06-21 20:00:04 full send of data-pool/vm-103-disk-0@__replicate_103-0_1687357201__ estimated size is 13.1G
2023-06-21 20:00:04 send from @__replicate_103-0_1687357201__ to data-pool/vm-103-disk-0@__migration__ estimated size is 5.52M
2023-06-21 20:00:04 total estimated size is 13.1G
2023-06-21 20:00:05 TIME        SENT   SNAPSHOT data-pool/vm-103-disk-0@__replicate_103-0_1687357201__
2023-06-21 20:00:05 20:00:05   73.9M   data-pool/vm-103-disk-0@__replicate_103-0_1687357201__
(shortened by thread opener)
2023-06-21 20:02:48 20:02:48   13.2G   data-pool/vm-103-disk-0@__replicate_103-0_1687357201__
2023-06-21 20:02:49 successfully imported 'data-pool:vm-103-disk-0'
2023-06-21 20:02:49 volume 'data-pool:vm-103-disk-0' is 'data-pool:vm-103-disk-0' on the target
2023-06-21 20:02:49 starting VM 103 on remote node 'server40'
2023-06-21 20:02:50 volume 'data-pool:vm-103-disk-1' is 'data-pool:vm-103-disk-1' on the target
2023-06-21 20:02:50 start remote tunnel
2023-06-21 20:02:51 ssh tunnel ver 1
2023-06-21 20:02:51 starting storage migration
2023-06-21 20:02:51 scsi0: start migration to nbd:unix:/run/qemu-server/103_nbd.migrate:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 0.0 B of 4.2 MiB (0.00%) in 0s
drive-scsi0: transferred 4.2 MiB of 4.2 MiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2023-06-21 20:02:52 starting online/live migration on unix:/run/qemu-server/103.migrate
2023-06-21 20:02:52 set migration capabilities
2023-06-21 20:02:52 migration downtime limit: 100 ms
2023-06-21 20:02:52 migration cachesize: 256.0 MiB
2023-06-21 20:02:52 set migration parameters
2023-06-21 20:02:52 start migrate command to unix:/run/qemu-server/103.migrate
2023-06-21 20:02:53 migration active, transferred 79.5 MiB of 2.0 GiB VM-state, 94.7 MiB/s
2023-06-21 20:02:54 migration active, transferred 157.1 MiB of 2.0 GiB VM-state, 78.8 MiB/s
2023-06-21 20:02:55 migration active, transferred 234.3 MiB of 2.0 GiB VM-state, 79.5 MiB/s
2023-06-21 20:02:56 migration active, transferred 311.2 MiB of 2.0 GiB VM-state, 81.8 MiB/s
2023-06-21 20:02:57 migration active, transferred 387.3 MiB of 2.0 GiB VM-state, 67.3 MiB/s
2023-06-21 20:02:58 migration active, transferred 465.1 MiB of 2.0 GiB VM-state, 81.9 MiB/s
2023-06-21 20:02:59 migration active, transferred 542.0 MiB of 2.0 GiB VM-state, 77.9 MiB/s
2023-06-21 20:03:00 migration active, transferred 619.2 MiB of 2.0 GiB VM-state, 83.4 MiB/s
2023-06-21 20:03:01 migration active, transferred 696.7 MiB of 2.0 GiB VM-state, 135.6 MiB/s
2023-06-21 20:03:02 migration active, transferred 773.5 MiB of 2.0 GiB VM-state, 79.0 MiB/s
2023-06-21 20:03:03 migration active, transferred 851.3 MiB of 2.0 GiB VM-state, 78.1 MiB/s
2023-06-21 20:03:04 migration active, transferred 928.5 MiB of 2.0 GiB VM-state, 80.3 MiB/s
2023-06-21 20:03:05 migration active, transferred 1005.8 MiB of 2.0 GiB VM-state, 82.6 MiB/s
2023-06-21 20:03:06 migration active, transferred 1.1 GiB of 2.0 GiB VM-state, 78.8 MiB/s
2023-06-21 20:03:07 migration active, transferred 1.2 GiB of 2.0 GiB VM-state, 81.0 MiB/s
2023-06-21 20:03:08 average migration speed: 129.1 MiB/s - downtime 65 ms
2023-06-21 20:03:08 migration status: completed
all 'mirror' jobs are ready
drive-scsi0: Completing block job_id...
drive-scsi0: Completed successfully.
drive-scsi0: mirror-job finished
2023-06-21 20:03:09 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=server40' root@91.XXX.XXX.25 pvesr set-state 103 \''{"local/server35":{"last_try":1687370402,"duration":1.872984,"last_node":"server35","storeid_list":["data-pool"],"fail_count":0,"last_sync":1687370402,"last_iteration":1687370402}}'\'
2023-06-21 20:03:09 stopping NBD storage migration server on target.
2023-06-21 20:03:13 migration finished successfully (duration 00:03:11)
TASK OK
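For reference, this migration was requested through HA; the rough CLI equivalent would be (the qm flags are my assumption for a guest not managed by HA):
Code:
# with HA managing the VM, request the migration via the HA stack
ha-manager migrate vm:103 server40
# for a guest not managed by HA, roughly the direct equivalent
qm migrate 103 server40 --online --with-local-disks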

How do I remove the second disk? I guess that if a node failed now, recovery wouldn't work, because one disk that is assigned to the VM would be missing on the other node.
Any idea? Thank you!

Linux 5.15.107-2-pve #1 SMP PVE 5.15.107-2
pve-manager/7.4-13/46c37d9c
 
"qm rescan" will likely help you - it should bring the disk into config as unused device which you can delete. Although if disk-0 is not referenced in VM config I am not sure why you cant delete it, but I may be missing something.
You can also, probably, utilize "pvesm free [storage:disk]" to delete it
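A sketch of both options (volume name taken from your output; double-check before deleting anything):
Code:
# bring unreferenced volumes back into the VM config as unusedN entries
qm rescan --vmid 103
# or remove the orphaned volume directly at the storage layer
pvesm free data-pool:vm-103-disk-0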


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thank you so much.
For future readers: The correct command is "qm disk rescan".

Code:
qm disk rescan
rescan volumes...
VM 103 add unreferenced volume 'data-pool:vm-103-disk-0' as 'unused0' to config

Thereafter I was able to remove the disk in the hardware tab of the VM, and the migration is working again as expected.
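In case it helps others: the rest of the cleanup can also be done on the CLI. I used the GUI, so treat this as a sketch:
Code:
# detach the now-unused entry from the VM config ...
qm set 103 --delete unused0
# ... and destroy the orphaned volume on the storage
pvesm free data-pool:vm-103-disk-0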
 
