VM migration fails with message "nfs volume is not online"

Mobin

Member
Mar 15, 2019
We have a three-node Proxmox cluster. An NFS volume (VMstore) is used to store the VMs, and an NFS ISO volume is also added to the cluster.
Currently there are three VMs on this cluster, and while migrating a VM from one node to another it sometimes fails, saying that the nfsiso or vmstore volume is not online. But when we check with df -h there is no issue with the NFS volumes. If we retry the migration it works. I am wondering how the migration worked the second time without anything being changed.

Proxmox version is 6.1-3

Logs are attached here. Can anyone help with this?


2020-05-13 05:06:26 starting migration of VM 101 to node 'ascchypsrv3' (172.22.176.53)
2020-05-13 05:06:26 starting VM 101 on remote node 'ascchypsrv3'
2020-05-13 05:06:28 storage 'nfsiso' is not online
2020-05-13 05:06:28 ERROR: online migrate failure - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=ascchypsrv3' root@172.22.176.53 qm start 101 --skiplock --migratedfrom ascchypsrv1 --migration_type secure --stateuri unix --machine pc-i440fx-4.1+pve1' failed: exit code 255
2020-05-13 05:06:28 aborting phase 2 - cleanup resources
2020-05-13 05:06:28 migrate_cancel
2020-05-13 05:06:29 ERROR: migration finished with problems (duration 00:00:04)
TASK ERROR: migration problems

ping vigyaan-scn.issdc.gov.in

root@ascchypsrv1:~# qm migrate 102 ascchypsrv2
can't migrate running VM without --online
root@ascchypsrv1:~# qm migrate 102 ascchypsrv2 --online
2020-05-13 05:27:55 starting migration of VM 102 to node 'ascchypsrv2' (172.22.176.52)
2020-05-13 05:27:55 starting VM 102 on remote node 'ascchypsrv2'
2020-05-13 05:27:58 storage 'nfsiso' is not online
2020-05-13 05:27:58 ERROR: online migrate failure - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=ascchypsrv2' root@172.22.176.52 qm start 102 --skiplock --migratedfrom ascchypsrv1 --migration_type secure --stateuri unix --machine pc-i440fx-4.1+pve1' failed: exit code 255
2020-05-13 05:27:58 aborting phase 2 - cleanup resources
2020-05-13 05:27:58 migrate_cancel
2020-05-13 05:27:58 ERROR: migration finished with problems (duration 00:00:04)
migration problems
root@ascchypsrv1:~# qm migrate 102 ascchypsrv2 --online
2020-05-13 05:28:09 starting migration of VM 102 to node 'ascchypsrv2' (172.22.176.52)
2020-05-13 05:28:10 starting VM 102 on remote node 'ascchypsrv2'
2020-05-13 05:28:11 start remote tunnel
2020-05-13 05:28:12 ssh tunnel ver 1
2020-05-13 05:28:12 starting online/live migration on unix:/run/qemu-server/102.migrate
2020-05-13 05:28:12 migrate_set_speed: 8589934592
2020-05-13 05:28:12 migrate_set_downtime: 0.1
2020-05-13 05:28:12 set migration_caps
2020-05-13 05:28:12 set cachesize: 2147483648
2020-05-13 05:28:12 start migrate command to unix:/run/qemu-server/102.migrate
2020-05-13 05:28:13 migration status: active (transferred 100662522, remaining 14132170752), total 17197506560)
2020-05-13 05:28:13 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:14 migration status: active (transferred 187400409, remaining 9360424960), total 17197506560)
2020-05-13 05:28:14 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:15 migration status: active (transferred 256946683, remaining 1051828224), total 17197506560)
2020-05-13 05:28:15 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:16 migration status: active (transferred 374636235, remaining 925507584), total 17197506560)
2020-05-13 05:28:16 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:17 migration status: active (transferred 492379103, remaining 786104320), total 17197506560)
2020-05-13 05:28:17 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:18 migration status: active (transferred 610169860, remaining 662179840), total 17197506560)
2020-05-13 05:28:18 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:19 migration status: active (transferred 728040474, remaining 542912512), total 17197506560)
2020-05-13 05:28:19 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:20 migration status: active (transferred 845673488, remaining 410640384), total 17197506560)
2020-05-13 05:28:20 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:21 migration status: active (transferred 962757250, remaining 224559104), total 17197506560)
2020-05-13 05:28:21 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:22 migration status: active (transferred 1080373938, remaining 107171840), total 17197506560)
2020-05-13 05:28:22 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:23 migration speed: 1489.45 MB/s - downtime 98 ms
2020-05-13 05:28:23 migration status: completed
2020-05-13 05:28:25 migration finished successfully (duration 00:00:16)




auto lo
iface lo inet loopback

iface ens3f0 inet manual

iface ens3f1 inet manual

iface ens2f0 inet manual

iface ens2f1 inet manual

iface eno1 inet manual

iface eno2 inet manual

auto bond0
iface bond0 inet manual
bond-slaves ens2f0 ens3f0
bond-miimon 100
bond-mode active-backup
#1Gig bond

auto bond1
iface bond1 inet manual
bond-slaves eno1 eno2
bond-miimon 100
bond-mode active-backup
#10Gig bond

auto vmbr0
iface vmbr0 inet static
address 172.22.176.51
netmask 255.255.255.0
gateway 172.22.176.3
bridge-ports bond0
bridge-stp off
bridge-fd 0
#1Gig for management

auto vmbr1
iface vmbr1 inet manual
bridge-ports bond1
bridge-stp off
bridge-fd 0
#10Gig for data/vm



root@ascchypsrv1:~# cat /etc/pve/storage.cfg
dir: local
path /var/lib/vz
content vztmpl,backup,iso

lvmthin: local-lvm
thinpool data
vgname pve
content images,rootdir

nfs: idsnasccVMStore
export /ifs/data/sonas/idsnascc-vmstore
path /mnt/pve/idsnasccVMStore
server vigyaan-scn.issdc.gov.in
content images

nfs: nfsiso
export /ifs/data/sonas/nfsiso
path /mnt/pve/nfsiso
server vigyaan-scn.issdc.gov.in
content iso



root@ascchypsrv1:~# pveversion
pve-manager/6.1-3/37248ce6 (running kernel: 5.3.10-1-pve)
root@ascchypsrv1:~# corosync
May 13 08:32:03 notice [MAIN ] Corosync Cluster Engine 3.0.2 starting up
May 13 08:32:03 info [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
root@ascchypsrv1:~# corosync -version
Corosync Cluster Engine, version '3.0.2'
Copyright (c) 2006-2018 Red Hat, Inc.
root@ascchypsrv1:~#
root@ascchypsrv1:~# pvesm status
Name             Type     Status       Total       Used   Available       %
idsnasccVMStore  nfs      active  2147483648  278856704  1868626944  12.99%
local            dir      active    98559220    1957156    91552516   1.99%
local-lvm        lvmthin  active   449990656          0   449990656   0.00%
nfsiso           nfs      active   104857600   49415168    55442432  47.13%
root@ascchypsrv1:~# pvecm status
Cluster information
-------------------
Name: idsnascc
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Wed May 13 08:33:08 2020
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.54
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.22.176.51 (local)
0x00000002 1 172.22.176.52
0x00000003 1 172.22.176.53
root@ascchypsrv1:~#
 
Sounds like access to the NFS server is flaky? Check the system logs on both nodes around the time of the failed migrations.
 
No. The NFS volume is available on all three nodes and we are able to write to it. We are using EMC Isilon storage with all the export policies in place.
 
Being able to write over an existing connection and the check of whether the storage is available & online are two different things. The check runs the following command:

showmount --no-headers --exports $SERVER

which is what fails here.
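
As a quick sanity check you can run the same probe by hand from a node. This is just a sketch: the server name is taken from the storage.cfg posted above, and the 10 second timeout is an arbitrary choice, not necessarily what PVE uses internally:

SERVER=vigyaan-scn.issdc.gov.in
# time the export listing and surface the exit code if it fails
time timeout 10 showmount --no-headers --exports "$SERVER" || echo "check failed, exit code $?"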
 
showmount --no-headers --exports $SERVER
The above command shows all the exports without fail.

Once in a while a migration fails, but if I try again it works. How can this be checked further?
 
That means that once in a while that command fails. Check the NFS server logs around the timestamps of the failed migrations and set up continuous monitoring, e.g. along the lines of the sketch below.
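
A minimal monitoring sketch (my own illustration, not a PVE tool): it repeats the same export probe and logs every failure with a timestamp. The server name is from storage.cfg; the interval, timeout, and log path are arbitrary assumptions.

#!/bin/bash
# probe the NFS server's export list every 10 seconds and log failures
SERVER=vigyaan-scn.issdc.gov.in
LOG=/var/log/nfs-export-check.log      # arbitrary log location
while true; do
    timeout 10 showmount --no-headers --exports "$SERVER" >/dev/null 2>&1
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "$(date -Is) showmount against $SERVER failed (exit $rc)" >> "$LOG"
    fi
    sleep 10
done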
 
Dear Fabian,

Thanks for your response. So far I couldn't find any logs related to this issue.

However, I will observe it for a few more days.

Is our network configuration fine?

Also, the storage mount options are at their defaults as of now. Do I need to change to NFS version 3?
 
The problem is not with mounting it - that check just verifies whether any exports are visible via NFS RPC. If it fails, it usually indicates some sort of temporary connectivity or load problem.
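
For reference, the same RPC layer can be looked at by hand with rpcinfo (a sketch; rpcinfo ships with the rpcbind package, and the hostname is again the one from storage.cfg):

# list the RPC programs the server advertises (mountd and nfs should show up)
rpcinfo -p vigyaan-scn.issdc.gov.in

# probe the NFSv3 service specifically over TCP
rpcinfo -T tcp vigyaan-scn.issdc.gov.in nfs 3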
 
As of now we have only one VM per hypervisor and they are in the initial stage, not being used much. Also, from the storage side, root-level permission is given to these hypervisors.
How can I check the load/connectivity from the hypervisor?

Also, as per our plan we are going to host 21 VMs on 9 physical servers; the VM store will be on an NFS volume over 1 Gig, and for data access we have a separate 10 Gig network. Will that be sufficient, or do we need to upgrade the network?
 
That depends on your workload and what exactly you mean by 'NFS volume over 1G' and 'data access over 10G'. Test with a somewhat realistic scenario and extrapolate (e.g., test a real workload on one VM with X clients, then multiply by the number of real production VMs and clients, and add some headroom ;).
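
One possible way to get such a baseline from a node is a synthetic I/O test against the NFS-backed store. fio is my suggestion here (it is not mentioned in the thread), and the job parameters below are placeholders to be tuned toward your real workload:

# mixed 4k random read/write against the VM store mount, 2 jobs, 2 minutes
fio --name=nfs-baseline \
    --directory=/mnt/pve/idsnasccVMStore \
    --size=2G --rw=randrw --rwmixread=70 --bs=4k \
    --ioengine=libaio --direct=1 --iodepth=16 --numjobs=2 \
    --runtime=120 --time_based --group_reporting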
 
Okay, will try that.
What I meant is that the OS disks are stored on an NFS file system over the 1 Gig network and we are not storing anything locally; i.e., we have a separate NFS file system that is mounted inside the OS over the 10 Gig network.

Can you please explain how we can do NFS diagnostics from Proxmox?
 
I have noticed that the NFS storage is showing offline intermittently in pvesm status.
We have mounted the NFS storage through DNS. How can it be checked further?
This issue is only there in Proxmox 6.1; we have one more setup running Proxmox 5.2 where everything works fine.
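
Since the storage is addressed by DNS name, one additional thing worth watching (my own sketch, not something suggested earlier in the thread) is whether name resolution on the nodes is occasionally slow or failing:

# time 20 lookups of the NFS server's name through the node's resolver
for i in $(seq 1 20); do
    start=$(date +%s%N)
    getent hosts vigyaan-scn.issdc.gov.in >/dev/null || echo "$(date -Is) lookup failed"
    end=$(date +%s%N)
    echo "lookup took $(( (end - start) / 1000000 )) ms"
    sleep 1
done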
 
