[SOLVED] VM has two disk images after node-failure in cluster

MH_MUC

Active Member
May 24, 2019
Hi,
I am running a two-node cluster with several VMs on ZFS.
I am using storage replication at short intervals, which seems to be working fine.
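For anyone setting this up: the jobs can be created in the GUI or roughly like this on the CLI (job ID, target node and rate match my setup; the schedule is just an example):
Code:
# create a replication job for VM 103 to the other node, running every 5 minutes
# (schedule is an example; rate limit in MByte/s)
pvesr create-local-job 103-0 server40 --schedule "*/5" --rate 50000
# list all configured replication jobs
pvesr list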

I was testing a node failure with ifdown/ifup and a sleep command.
I only had one VM (VM 103) on the failing node during the test.
The VM migrated to the second server correctly and migrated back to the priority-1 server when it was up again, which happened quicker than I expected.
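Roughly, the failure test looked like this (interface name is just an example; adjust to whatever carries your cluster traffic):
Code:
# example only: take the cluster network down long enough for HA to react,
# then bring it back up; run detached so it survives the SSH session dropping
nohup sh -c 'ifdown vmbr0; sleep 180; ifup vmbr0' &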

However, there are now two raw disk images in my data-pool: vm-103-disk-0 and vm-103-disk-1.
Only disk-1 shows up in the hardware tab of the VM. I can't delete disk-0 in the storage view, because it is assigned to VM 103.

The replication is using disk-1.
Code:
2023-06-21 19:58:05 103-0: start replication job
2023-06-21 19:58:05 103-0: guest => VM 103, running => 2430540
2023-06-21 19:58:05 103-0: volumes => data-pool:vm-103-disk-1
2023-06-21 19:58:06 103-0: freeze guest filesystem
2023-06-21 19:58:06 103-0: create snapshot '__replicate_103-0_1687370285__' on data-pool:vm-103-disk-1
2023-06-21 19:58:06 103-0: thaw guest filesystem
2023-06-21 19:58:06 103-0: using secure transmission, rate limit: 50000 MByte/s
2023-06-21 19:58:06 103-0: incremental sync 'data-pool:vm-103-disk-1' (__replicate_103-0_1687369685__ => __replicate_103-0_1687370285__)
2023-06-21 19:58:06 103-0: using a bandwidth limit of 50000000000 bps for transferring 'data-pool:vm-103-disk-1'
2023-06-21 19:58:07 103-0: send from @__replicate_103-0_1687369685__ to data-pool/vm-103-disk-1@__replicate_103-0_1687370285__ estimated size is 4.59M
2023-06-21 19:58:07 103-0: total estimated size is 4.59M
2023-06-21 19:58:07 103-0: successfully imported 'data-pool:vm-103-disk-1'
2023-06-21 19:58:07 103-0: delete previous replication snapshot '__replicate_103-0_1687369685__' on data-pool:vm-103-disk-1
2023-06-21 19:58:07 103-0: (remote_finalize_local_job) delete stale replication snapshot '__replicate_103-0_1687369685__' on data-pool:vm-103-disk-1
2023-06-21 19:58:07 103-0: end replication job
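For completeness, this is roughly how I compared what exists on the pool with what the VM config references (output omitted):
Code:
# all datasets and snapshots belonging to VM 103 on the ZFS pool
zfs list -t all -r data-pool | grep vm-103
# what the storage layer reports for this VM
pvesm list data-pool --vmid 103
# what the VM config actually references
qm config 103 | grep -Ei 'scsi|unused'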

When migrating the VM manually, I get:
Code:
task started by HA resource agent
2023-06-21 20:00:02 starting migration of VM 103 to node 'server40' (91.XXX.XXX.25)
2023-06-21 20:00:02 found local disk 'data-pool:vm-103-disk-0' (via storage)
2023-06-21 20:00:02 found local, replicated disk 'data-pool:vm-103-disk-1' (in current VM config)
2023-06-21 20:00:02 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2023-06-21 20:00:02 replicating disk images
2023-06-21 20:00:02 start replication job
2023-06-21 20:00:02 guest => VM 103, running => 2430540
2023-06-21 20:00:02 volumes => data-pool:vm-103-disk-1
2023-06-21 20:00:02 freeze guest filesystem
2023-06-21 20:00:02 create snapshot '__replicate_103-0_1687370402__' on data-pool:vm-103-disk-1
2023-06-21 20:00:02 thaw guest filesystem
2023-06-21 20:00:03 using secure transmission, rate limit: 50000 MByte/s
2023-06-21 20:00:03 incremental sync 'data-pool:vm-103-disk-1' (__replicate_103-0_1687370285__ => __replicate_103-0_1687370402__)
2023-06-21 20:00:03 using a bandwidth limit of 50000000000 bps for transferring 'data-pool:vm-103-disk-1'
2023-06-21 20:00:03 send from @__replicate_103-0_1687370285__ to data-pool/vm-103-disk-1@__replicate_103-0_1687370402__ estimated size is 1.74M
2023-06-21 20:00:03 total estimated size is 1.74M
2023-06-21 20:00:03 successfully imported 'data-pool:vm-103-disk-1'
2023-06-21 20:00:03 delete previous replication snapshot '__replicate_103-0_1687370285__' on data-pool:vm-103-disk-1
2023-06-21 20:00:04 (remote_finalize_local_job) delete stale replication snapshot '__replicate_103-0_1687370285__' on data-pool:vm-103-disk-1
2023-06-21 20:00:04 end replication job
2023-06-21 20:00:04 copying local disk images
2023-06-21 20:00:04 full send of data-pool/vm-103-disk-0@__replicate_103-0_1687357201__ estimated size is 13.1G
2023-06-21 20:00:04 send from @__replicate_103-0_1687357201__ to data-pool/vm-103-disk-0@__migration__ estimated size is 5.52M
2023-06-21 20:00:04 total estimated size is 13.1G
2023-06-21 20:00:05 TIME        SENT   SNAPSHOT data-pool/vm-103-disk-0@__replicate_103-0_1687357201__
2023-06-21 20:00:05 20:00:05   73.9M   data-pool/vm-103-disk-0@__replicate_103-0_1687357201__
(shortened by thread opener)
2023-06-21 20:02:48 20:02:48   13.2G   data-pool/vm-103-disk-0@__replicate_103-0_1687357201__
2023-06-21 20:02:49 successfully imported 'data-pool:vm-103-disk-0'
2023-06-21 20:02:49 volume 'data-pool:vm-103-disk-0' is 'data-pool:vm-103-disk-0' on the target
2023-06-21 20:02:49 starting VM 103 on remote node 'server40'
2023-06-21 20:02:50 volume 'data-pool:vm-103-disk-1' is 'data-pool:vm-103-disk-1' on the target
2023-06-21 20:02:50 start remote tunnel
2023-06-21 20:02:51 ssh tunnel ver 1
2023-06-21 20:02:51 starting storage migration
2023-06-21 20:02:51 scsi0: start migration to nbd:unix:/run/qemu-server/103_nbd.migrate:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 0.0 B of 4.2 MiB (0.00%) in 0s
drive-scsi0: transferred 4.2 MiB of 4.2 MiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2023-06-21 20:02:52 starting online/live migration on unix:/run/qemu-server/103.migrate
2023-06-21 20:02:52 set migration capabilities
2023-06-21 20:02:52 migration downtime limit: 100 ms
2023-06-21 20:02:52 migration cachesize: 256.0 MiB
2023-06-21 20:02:52 set migration parameters
2023-06-21 20:02:52 start migrate command to unix:/run/qemu-server/103.migrate
2023-06-21 20:02:53 migration active, transferred 79.5 MiB of 2.0 GiB VM-state, 94.7 MiB/s
2023-06-21 20:02:54 migration active, transferred 157.1 MiB of 2.0 GiB VM-state, 78.8 MiB/s
2023-06-21 20:02:55 migration active, transferred 234.3 MiB of 2.0 GiB VM-state, 79.5 MiB/s
2023-06-21 20:02:56 migration active, transferred 311.2 MiB of 2.0 GiB VM-state, 81.8 MiB/s
2023-06-21 20:02:57 migration active, transferred 387.3 MiB of 2.0 GiB VM-state, 67.3 MiB/s
2023-06-21 20:02:58 migration active, transferred 465.1 MiB of 2.0 GiB VM-state, 81.9 MiB/s
2023-06-21 20:02:59 migration active, transferred 542.0 MiB of 2.0 GiB VM-state, 77.9 MiB/s
2023-06-21 20:03:00 migration active, transferred 619.2 MiB of 2.0 GiB VM-state, 83.4 MiB/s
2023-06-21 20:03:01 migration active, transferred 696.7 MiB of 2.0 GiB VM-state, 135.6 MiB/s
2023-06-21 20:03:02 migration active, transferred 773.5 MiB of 2.0 GiB VM-state, 79.0 MiB/s
2023-06-21 20:03:03 migration active, transferred 851.3 MiB of 2.0 GiB VM-state, 78.1 MiB/s
2023-06-21 20:03:04 migration active, transferred 928.5 MiB of 2.0 GiB VM-state, 80.3 MiB/s
2023-06-21 20:03:05 migration active, transferred 1005.8 MiB of 2.0 GiB VM-state, 82.6 MiB/s
2023-06-21 20:03:06 migration active, transferred 1.1 GiB of 2.0 GiB VM-state, 78.8 MiB/s
2023-06-21 20:03:07 migration active, transferred 1.2 GiB of 2.0 GiB VM-state, 81.0 MiB/s
2023-06-21 20:03:08 average migration speed: 129.1 MiB/s - downtime 65 ms
2023-06-21 20:03:08 migration status: completed
all 'mirror' jobs are ready
drive-scsi0: Completing block job_id...
drive-scsi0: Completed successfully.
drive-scsi0: mirror-job finished
2023-06-21 20:03:09 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=server40' root@91.XXX.XXX.25 pvesr set-state 103 \''{"local/server35":{"last_try":1687370402,"duration":1.872984,"last_node":"server35","storeid_list":["data-pool"],"fail_count":0,"last_sync":1687370402,"last_iteration":1687370402}}'\'
2023-06-21 20:03:09 stopping NBD storage migration server on target.
2023-06-21 20:03:13 migration finished successfully (duration 00:03:11)
TASK OK
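For reference, this migration was requested through HA; the rough CLI equivalent would be (the qm flags are my assumption for a guest not managed by HA):
Code:
# with HA managing the VM, request the migration via the HA stack
ha-manager migrate vm:103 server40
# for a guest not managed by HA, roughly the direct equivalent
qm migrate 103 server40 --online --with-local-disks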

How do I remove the second disk? I guess that if a node failed now, recovery wouldn't work, because one disk that is assigned to the VM would be missing on the other node.
Any idea? Thank you!

Linux 5.15.107-2-pve #1 SMP PVE 5.15.107-2
pve-manager/7.4-13/46c37d9c
 
"qm rescan" will likely help you - it should bring the disk into config as unused device which you can delete. Although if disk-0 is not referenced in VM config I am not sure why you cant delete it, but I may be missing something.
You can also, probably, utilize "pvesm free [storage:disk]" to delete it
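A sketch of both options (volume name taken from your output; double-check before deleting anything):
Code:
# bring unreferenced volumes back into the VM config as unusedN entries
qm rescan --vmid 103
# or remove the orphaned volume directly at the storage layer
pvesm free data-pool:vm-103-disk-0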


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thank you so much.
For future readers: The correct command is "qm disk rescan".

Code:
qm disk rescan
rescan volumes...
VM 103 add unreferenced volume 'data-pool:vm-103-disk-0' as 'unused0' to config

Thereafter I was able to remove the disk in the hardware tab of the VM, and the migration is working again as expected.
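In case it helps others: the rest of the cleanup can also be done on the CLI. I used the GUI, so treat this as a sketch:
Code:
# detach the now-unused entry from the VM config ...
qm set 103 --delete unused0
# ... and destroy the orphaned volume on the storage
pvesm free data-pool:vm-103-disk-0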
 
