Hello,
I'm new to Proxmox, so I'm not sure exactly how to go about troubleshooting this. I'm also using a custom storage plugin, which could well be the cause, but I don't know how to troubleshoot that either, because the documentation for storage plugins is... non-existent.
Environment:
- 3 node cluster
- PVE 8.1, freshly installed ~3 weeks ago, latest updates (non-subscription) installed just as I'm writing this, to ensure I'm not encountering an already-fixed bug
- Shared storage using pve-moosefs, which I updated to add support for snapshots of raw files
- HA settings: shutdown_policy=migrate (confirmed via the commands just below this list)
- Migration settings: default
- CRS: default
- Several VMs running successfully on pve1
- 1 VM running on pve2, with HA mode set to running
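For reference, this is how I'm checking those settings from the shell (the shutdown policy is the only thing I've changed in the datacenter config; migration and CRS entries are left at their defaults):
Code:
cat /etc/pve/datacenter.cfg   # contains: ha: shutdown_policy=migrate
ha-manager config             # HA resource configuration (requested states)
ha-manager status             # current HA manager / LRM state per node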
When I click the "reboot" button in the web UI for
pve2
, the running VM should be migrated to one of the remaining two nodes. The VM is using only shared storage, so it should be fairly quick. Once all HA VMs are migrated, shared storage should be taken offline, and the reboot should proceed.Observed behavior:
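As far as I can tell, the HA agent requests essentially the same live migration you could trigger by hand (VM ID and target node taken from the task log further down), so in principle this should be reproducible outside of a reboot:
Code:
# live-migrate VM 105 to pve1 over shared storage
qm migrate 105 pve1 --online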
Observed behavior:
Sometimes, but not always (seems to be around 50% of the time), the VM begins showing I/O errors on the console, and the migration fails (error message below). Migration is attempted repeatedly, but never succeeds. pvesm sometimes shows the shared storage as inactive, despite the fact that there's still a VM using it. However, on a couple of attempts I've seen pvesm show the storage as still enabled and active, along with the web UI, but with the "usage" graph empty (indicating that the web UI can't get the size of the storage).
When this occurs, the only way to "fix" it is to set the HA mode of the VM to ignored, which allows the node to restart. My assumption is that shared storage is being taken offline before the VM migration is complete, which causes the failure. This seems like something PVE should understand, given that the VM's disk lives on shared storage: it should wait to take the storage offline until the VM has migrated. The behavior is also not consistent. Sometimes pvesm shows the storage as enabled and active, but in my most recent test (see below), pvesm says the storage is inactive. However, it's still mounted and I can read/write to it via the web shell.

Any suggestions?
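For reference, the "set to ignored" workaround amounts to changing the requested HA state for the VM (105, per the log below), and setting it back once the node is up again:
Code:
# lets pve2 finish its reboot without waiting on the failed migration
ha-manager set vm:105 --state ignored
# restore HA management afterwards
ha-manager set vm:105 --state started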
Code:
VM Migration output:
task started by HA resource agent
2024-01-18 16:55:35 starting migration of VM 105 to node 'pve1' (192.168.11.11)
2024-01-18 16:55:35 starting VM 105 on remote node 'pve1'
2024-01-18 16:55:37 start remote tunnel
2024-01-18 16:55:38 ssh tunnel ver 1
2024-01-18 16:55:38 starting online/live migration on unix:/run/qemu-server/105.migrate
2024-01-18 16:55:38 set migration capabilities
2024-01-18 16:55:38 migration downtime limit: 100 ms
2024-01-18 16:55:38 migration cachesize: 512.0 MiB
2024-01-18 16:55:38 set migration parameters
2024-01-18 16:55:38 start migrate command to unix:/run/qemu-server/105.migrate
2024-01-18 16:55:39 migration active, transferred 307.5 MiB of 4.0 GiB VM-state, 2.7 GiB/s
2024-01-18 16:55:40 migration status error: failed
2024-01-18 16:55:40 ERROR: online migrate failure - aborting
2024-01-18 16:55:40 aborting phase 2 - cleanup resources
2024-01-18 16:55:40 migrate_cancel
2024-01-18 16:55:42 ERROR: migration finished with problems (duration 00:00:08)
TASK ERROR: migration problems
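In case more context helps, these are the logs I can pull from both nodes around the time of the failure (the window matches the timestamps in the task log above):
Code:
# HA LRM/CRM activity on the rebooting node and the target node
journalctl -u pve-ha-lrm -u pve-ha-crm --since "2024-01-18 16:55" --until "2024-01-18 16:57"
# anything QEMU/migration/MooseFS related in the journal during that window
journalctl --since "2024-01-18 16:55" --until "2024-01-18 16:57" | grep -Ei 'qemu|migrat|mfs'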
Code:
# pvesm status
mfsmaster accepted connection with parameters: read-write,restricted_ip ; root mapped to root:root
Name          Type    Status           Total         Used    Available        %
local          dir    active        63413948      4427048     55733244    6.98%
local-lvm  lvmthin    active       124596224            0    124596224    0.00%
mfs-main   moosefs  inactive               0            0            0    0.00%
zfs-1      zfspool    active       552730624         1236    552729388    0.00%
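Despite pvesm reporting the storage as inactive, the MooseFS mount is still there and writable from pve2's shell. Roughly what I'm checking (the mount point depends on the plugin; /mnt/pve/mfs-main below is an assumption based on the usual /mnt/pve/<storage> convention):
Code:
mount | grep -i moosefs                                            # the mfs mount is still listed
df -h /mnt/pve/mfs-main                                            # assumed mount point
touch /mnt/pve/mfs-main/.rwtest && rm /mnt/pve/mfs-main/.rwtest    # read/write still works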
Web UI on the rebooting node: [screenshot]