HEEELOOOOOU!
This Plugin is perfect (now hehehehe). Thanks. BUT....
Problem Description
The Proxmox plugin manages VM storage by assigning a Pure Storage volume (disk) to the host where the VM is running. This mechanism works fine under normal conditions.
- When a VM is powered off, migration between hosts works correctly.
- When a VM is running, live migration fails most of the time.
Observed Behavior During Migration
- The VM is running on HOST1, and its disk (volume) is assigned to it in Pure Storage.
- When a live migration is initiated:
- The plugin maps the volume to HOST2 (destination) while keeping it mapped to HOST1.
- The migration process transfers CPU and memory to HOST2.
- After migration completes, the plugin removes the volume from HOST1.
Error Encountered
The migration process fails due to a timeout when mapping the volume:
2025-03-07 14:10:10 starting migration of VM 100 to node 'proxmox01' (10.200.1.201)
2025-03-07 14:10:10 starting VM 100 on remote node 'proxmox01'
2025-03-07 14:10:18 [proxmox01] Error :: Timeout while waiting for volume "vm-100-disk-1" to map
2025-03-07 14:10:18 ERROR: online migrate failure - remote command failed with exit code 255
2025-03-07 14:10:18 aborting phase 2 - cleanup resources
2025-03-07 14:10:18 migrate_cancel
2025-03-07 14:10:22 ERROR: migration finished with problems (duration 00:00:13)
TASK ERROR: migration problems
The issue suggests that the migration process starts before ensuring that HOST2 has successfully mapped the volume and has access to it.
Workaround
A possible temporary fix is to introduce a short delay (e.g., 5 seconds) before proceeding with the migration. This would allow enough time for the storage volume to be properly mapped and detected by HOST2.
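Just to make the stop-gap concrete, here is a minimal sketch. Where exactly it belongs is an assumption on my side: somewhere on the destination node, right after the plugin connects the volume and before it looks for the block device.
Perl:
# Temporary workaround only: give the kernel a few seconds to surface the
# newly mapped block device before the plugin checks for it.
sleep 5;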
Task: Fix the Issue in the Plugin
To resolve this issue properly, the plugin should implement a validation step before continuing the migration (see the generic poll-until-ready sketch right after this list):
- Verify that the destination host (HOST2) can access the volume before starting the migration.
- Add a check to confirm the volume is fully visible and accessible from HOST2.
- If necessary, introduce a short pause (e.g., 5 seconds) to allow time for the mapping process to complete before continuing.
- Modify the plugin logic to only proceed with migration when HOST2 has successfully mapped the volume.
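For context on the snippets further down: they call a wait_for( $check, $description, $timeout, $interval ) helper. I'm assuming the plugin already ships one (the original timeout message suggests it does); if not, a generic poll-until-ready routine along these lines would do. The name, signature and error text here are my guess, not the plugin's actual code.
Perl:
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Poll $check every $interval seconds until it returns true, or give up
# with an error mentioning $what after $timeout seconds.
sub wait_for {
    my ( $check, $what, $timeout, $interval ) = @_;
    my $deadline = time() + $timeout;
    while ( time() < $deadline ) {
        return 1 if $check->();
        sleep($interval);
    }
    die "Error :: Timeout while waiting for $what\n";
}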
I apologize in advance because I believe this might indeed be an issue with my infrastructure. I'm running Proxmox on top of ESXi (things we love to do in a lab, hahaha).
I also apologize because I'm not a developer. BUT here are the code modifications:
Perl:
# Wait for the device to appear with increased timeout for live migration
wait_for( $path_exists, "volume \"$volname\" to map", 30, 0.5 );
# Additional validation to ensure device is fully accessible
my $device_accessible = sub {
    return -b $path && -r $path && -w $path;
};
# Wait for device to be fully accessible
wait_for( $device_accessible, "volume \"$volname\" to be fully accessible", 30, 0.5 );
and
Perl:
if ( !multipath_check( $wwid ) ) {
    print "Debug :: Adding multipath map for device \"$wwid\"\n" if $DEBUG;
    exec_command( [ 'multipathd', 'add', 'map', $wwid ] );

    # Wait for multipath to be fully established
    my $multipath_ready = sub {
        return multipath_check( $wwid );
    };
    wait_for( $multipath_ready, "multipath map for volume \"$volname\" to be ready", 30, 0.5 );
}
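For completeness, the same readiness can be verified by hand on the destination node with multipath -ll <wwid>. I don't know how the plugin's multipath_check() is actually implemented, but a helper in that spirit could look roughly like this (pure guesswork on my side):
Perl:
# Hypothetical sketch, not the plugin's real multipath_check(): treat the
# map as present once `multipath -ll <wwid>` prints a topology for it.
sub multipath_map_present {
    my ($wwid) = @_;
    my $out = qx(multipath -ll $wwid 2>/dev/null);
    return $out =~ /\Q$wwid\E/ ? 1 : 0;
}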
Once again, thank you very much for sharing the plugin, and here is my contribution.
Best regards,
Rafael Carvalho.