VM migrate fails with message nfs volume is not online

Mobin · May 13, 2020

We have a three-node Proxmox cluster. An NFS volume (VMstore) is used to store the VMs, and an NFS ISO volume has also been added to the cluster.
There are currently three VMs on this cluster. When we migrate a VM from one node to another, it sometimes fails, saying that the nfsiso or vmstore volume is not online. But when we check df -h, there is no issue with the NFS volume. If we try the migration again, it works. I am wondering how the migration worked the second time without anything being changed.

Proxmox version is 6.1-3

Logs are attached here. Can anyone help with this?

2020-05-13 05:06:26 starting migration of VM 101 to node 'ascchypsrv3' (172.22.176.53)
2020-05-13 05:06:26 starting VM 101 on remote node 'ascchypsrv3'
2020-05-13 05:06:28 storage 'nfsiso' is not online
2020-05-13 05:06:28 ERROR: online migrate failure - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=ascchypsrv3' root@172.22.176.53 qm start 101 --skiplock --migratedfrom ascchypsrv1 --migration_type secure --stateuri unix --machine pc-i440fx-4.1+pve1' failed: exit code 255
2020-05-13 05:06:28 aborting phase 2 - cleanup resources
2020-05-13 05:06:28 migrate_cancel
2020-05-13 05:06:29 ERROR: migration finished with problems (duration 00:00:04)
TASK ERROR: migration problems

ping vigyaan-scn.issdc.gov.in

root@ascchypsrv1:~# qm migrate 102 ascchypsrv2
can't migrate running VM without --online
root@ascchypsrv1:~# qm migrate 102 ascchypsrv2 --online
2020-05-13 05:27:55 starting migration of VM 102 to node 'ascchypsrv2' (172.22.176.52)
2020-05-13 05:27:55 starting VM 102 on remote node 'ascchypsrv2'
2020-05-13 05:27:58 storage 'nfsiso' is not online
2020-05-13 05:27:58 ERROR: online migrate failure - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=ascchypsrv2' root@172.22.176.52 qm start 102 --skiplock --migratedfrom ascchypsrv1 --migration_type secure --stateuri unix --machine pc-i440fx-4.1+pve1' failed: exit code 255
2020-05-13 05:27:58 aborting phase 2 - cleanup resources
2020-05-13 05:27:58 migrate_cancel
2020-05-13 05:27:58 ERROR: migration finished with problems (duration 00:00:04)
migration problems
root@ascchypsrv1:~# qm migrate 102 ascchypsrv2 --online
2020-05-13 05:28:09 starting migration of VM 102 to node 'ascchypsrv2' (172.22.176.52)
2020-05-13 05:28:10 starting VM 102 on remote node 'ascchypsrv2'
2020-05-13 05:28:11 start remote tunnel
2020-05-13 05:28:12 ssh tunnel ver 1
2020-05-13 05:28:12 starting online/live migration on unix:/run/qemu-server/102.migrate
2020-05-13 05:28:12 migrate_set_speed: 8589934592
2020-05-13 05:28:12 migrate_set_downtime: 0.1
2020-05-13 05:28:12 set migration_caps
2020-05-13 05:28:12 set cachesize: 2147483648
2020-05-13 05:28:12 start migrate command to unix:/run/qemu-server/102.migrate
2020-05-13 05:28:13 migration status: active (transferred 100662522, remaining 14132170752), total 17197506560)
2020-05-13 05:28:13 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:14 migration status: active (transferred 187400409, remaining 9360424960), total 17197506560)
2020-05-13 05:28:14 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:15 migration status: active (transferred 256946683, remaining 1051828224), total 17197506560)
2020-05-13 05:28:15 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:16 migration status: active (transferred 374636235, remaining 925507584), total 17197506560)
2020-05-13 05:28:16 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:17 migration status: active (transferred 492379103, remaining 786104320), total 17197506560)
2020-05-13 05:28:17 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:18 migration status: active (transferred 610169860, remaining 662179840), total 17197506560)
2020-05-13 05:28:18 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:19 migration status: active (transferred 728040474, remaining 542912512), total 17197506560)
2020-05-13 05:28:19 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:20 migration status: active (transferred 845673488, remaining 410640384), total 17197506560)
2020-05-13 05:28:20 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:21 migration status: active (transferred 962757250, remaining 224559104), total 17197506560)
2020-05-13 05:28:21 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:22 migration status: active (transferred 1080373938, remaining 107171840), total 17197506560)
2020-05-13 05:28:22 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-13 05:28:23 migration speed: 1489.45 MB/s - downtime 98 ms
2020-05-13 05:28:23 migration status: completed
2020-05-13 05:28:25 migration finished successfully (duration 00:00:16)

auto lo
iface lo inet loopback

iface ens3f0 inet manual

iface ens3f1 inet manual

iface ens2f0 inet manual

iface ens2f1 inet manual

iface eno1 inet manual

iface eno2 inet manual

auto bond0
iface bond0 inet manual
bond-slaves ens2f0 ens3f0
bond-miimon 100
bond-mode active-backup
#1Gig bond

auto bond1
iface bond1 inet manual
bond-slaves eno1 eno2
bond-miimon 100
bond-mode active-backup
#10Gig bond

auto vmbr0
iface vmbr0 inet static
address 172.22.176.51
netmask 255.255.255.0
gateway 172.22.176.3
bridge-ports bond0
bridge-stp off
bridge-fd 0
#1Gig for managment

auto vmbr1
iface vmbr1 inet manual
bridge-ports bond1
bridge-stp off
bridge-fd 0
#10Gig for data/vm

root@ascchypsrv1:~# cat /etc/pve/storage.cfg
dir: local
path /var/lib/vz
content vztmpl,backup,iso

lvmthin: local-lvm
thinpool data
vgname pve
content images,rootdir

nfs: idsnasccVMStore
export /ifs/data/sonas/idsnascc-vmstore
path /mnt/pve/idsnasccVMStore
server vigyaan-scn.issdc.gov.in
content images

nfs: nfsiso
export /ifs/data/sonas/nfsiso
path /mnt/pve/nfsiso
server vigyaan-scn.issdc.gov.in
content iso

root@ascchypsrv1:~# pveversion
pve-manager/6.1-3/37248ce6 (running kernel: 5.3.10-1-pve)
root@ascchypsrv1:~# corosync
May 13 08:32:03 notice [MAIN ] Corosync Cluster Engine 3.0.2 starting up
May 13 08:32:03 info [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
root@ascchypsrv1:~# corosync -version
Corosync Cluster Engine, version '3.0.2'
Copyright (c) 2006-2018 Red Hat, Inc.
root@ascchypsrv1:~#
root@ascchypsrv1:~# pvesm status
Name Type Status Total Used Available %
idsnasccVMStore nfs active 2147483648 278856704 1868626944 12.99%
local dir active 98559220 1957156 91552516 1.99%
local-lvm lvmthin active 449990656 0 449990656 0.00%
nfsiso nfs active 104857600 49415168 55442432 47.13%
root@ascchypsrv1:~# pvecm status
Cluster information
-------------------
Name: idsnascc
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Wed May 13 08:33:08 2020
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.54
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.22.176.51 (local)
0x00000002 1 172.22.176.52
0x00000003 1 172.22.176.53
root@ascchypsrv1:~#

fabian · May 13, 2020

Sounds like access to the NFS server is flaky? Check the system logs on both nodes around the time of the failed migration.

Mobin · May 13, 2020

No. The NFS volume is available on all three nodes and we are able to write to it. We are using EMC Isilon storage with all the export policies.

fabian · May 13, 2020

Being able to write over an existing connection and the check for whether the storage is available and online are two different things. The check runs the following command:

showmount --no-headers --exports $SERVER

which fails.

Mobin · May 13, 2020

showmount --no-headers --exports $SERVER
The above command shows all the exports without fail.

Once in a while the migration fails, but if I try again it works. How can this be checked further?

fabian · May 13, 2020

That means that once in a while that command fails. Check the NFS server logs around the timestamps of the failed migrations, and set up continuous monitoring.
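
For the continuous monitoring, a minimal sketch could look like the following. It repeats the same showmount call at a fixed interval and appends a timestamped line to a log whenever the call fails, so that failures can be matched against the migration timestamps. The function names, the log path, the 10-second interval, and the timeout wrapper are assumptions made for illustration; none of this is shipped by Proxmox.

```shell
#!/bin/sh
# Sketch only: names, paths and intervals below are illustrative assumptions.
SERVER="${SERVER:-vigyaan-scn.issdc.gov.in}"          # NFS server from storage.cfg
INTERVAL="${INTERVAL:-10}"                            # seconds between checks (assumed)
LOGFILE="${LOGFILE:-/var/log/nfs-export-check.log}"   # hypothetical log path

# Run the export check quoted in this thread. 'timeout' is added so a hung
# RPC call cannot block the loop (an assumption, not part of the quoted command).
check_exports() {
    timeout 10 showmount --no-headers --exports "$1" >/dev/null 2>&1
}

# Run one check; on failure append a timestamped line to LOGFILE and
# return the exit code of the failed check.
check_once() {
    if check_exports "$1"; then
        return 0
    else
        rc=$?
        printf '%s showmount failed for %s (exit %s)\n' \
            "$(date '+%Y-%m-%d %H:%M:%S')" "$1" "$rc" >> "$LOGFILE"
        return "$rc"
    fi
}

# Loop forever. Start it with: monitor "$SERVER"
monitor() {
    while true; do
        check_once "$1" || true
        sleep "$INTERVAL"
    done
}
```

Running this on each node for a few days, and comparing the log against the times of failed migrations, may show whether the failures are limited to one node or happen cluster-wide.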

Mobin · May 14, 2020

Dear Fabian,

Thanks for your response. So far I couldn't find any logs related to this issue.

However, I will observe it for a few more days.

Is our network configuration fine?

Also, the storage mount options are at their defaults as of now. Do I need to change to NFS version 3?

Mobin · May 15, 2020

Is there any problem if we use version 3 to mount the vmstore?

fabian · May 15, 2020

The problem is not with mounting it. That check just verifies whether any exports are visible via NFS RPC; if it fails, it usually indicates some sort of temporary connectivity or load problem.

Mobin · May 16, 2020

As of now we have only one VM per hypervisor, and those are at an initial stage and not using much. Also, root-level permission is given to these hypervisors on the storage side.
How can I check the load/connectivity from the hypervisor?

Also, we are planning to host 21 VMs on 9 physical servers. The vmstore will be on an NFS volume over 1 Gig, and for data access we have a separate 10 Gig network. Will that be sufficient, or do we need to upgrade the network?

fabian · May 18, 2020

that depends on your workload and what exactly you mean with 'nfs volume over 1G' and 'data access over 10G'. test with a somewhat realistic scenario and extrapolate (e.g., test real workload on one VM with X clients, then multiply by number of real production VMs and clients, add some extra
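
As a rough illustration of that extrapolation, the arithmetic can be sketched in shell. The per-VM figure and the headroom percentage are made-up placeholders; only the VM count of 21 comes from this thread. They should be replaced with numbers measured on a test VM under a realistic load.

```shell
#!/bin/sh
# Sketch only: estimate aggregate bandwidth from a per-VM measurement.
# Arguments: measured Mbit/s for one VM, number of VMs, headroom in percent.
estimate_total_mbit() {
    per_vm_mbit="$1"
    vm_count="$2"
    headroom_pct="$3"
    echo $(( per_vm_mbit * vm_count * (100 + headroom_pct) / 100 ))
}

# Hypothetical example: 40 Mbit/s per VM, 21 VMs, 30% headroom.
estimate_total_mbit 40 21 30
```

With these placeholder numbers the estimate is 1092 Mbit/s, which would be more than a single 1 Gbit/s link can carry. Real measurements may look very different.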

Mobin · May 18, 2020

fabian said:
that depends on your workload and what exactly you mean with 'nfs volume over 1G' and 'data access over 10G'. test with a somewhat realistic scenario and extrapolate (e.g., test real workload on one VM with X clients, then multiply by number of real production VMs and clients, add some extra

Okay. Will try that.
What I meant is that the OS is stored on an NFS file system over the 1 Gig network, and we do not store anything locally. I.e., we have a separate NFS file system that is mounted on the OS over the 10 Gig network.

Can you please tell us how we can do NFS diagnostics from Proxmox?

Mobin · May 20, 2020

I have noticed that the NFS storage intermittently shows as offline in pvesm status.
We have mounted the NFS storage via its DNS name. How can this be checked further?
This issue only occurs on Proxmox 6.1. We have one more setup running Proxmox 5.2, and there everything works fine.
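
Since the storage is addressed by DNS name, one thing that could be checked from each hypervisor is whether name resolution and the RPC port mapper on the server respond reliably. The sketch below is only a starting point. The function names and defaults are assumptions, and it relies on the standard getent and rpcinfo tools rather than anything Proxmox-specific.

```shell
#!/bin/sh
# Sketch only: basic name-resolution and RPC checks for an NFS server.
SERVER="${SERVER:-vigyaan-scn.issdc.gov.in}"   # server name from storage.cfg

# Print the first address the system resolver returns for a host name;
# returns non-zero if resolution fails.
resolve_host() {
    out="$(getent hosts "$1")" || return 1
    echo "$out" | awk 'NR == 1 { print $1 }'
}

# List the RPC services registered on the server. mountd should be among
# them, since showmount queries it.
list_rpc_services() {
    rpcinfo -p "$1"
}

# Run both checks once and report which step failed.
# Returns 1 on a failed lookup, 2 if the port mapper does not answer.
diagnose() {
    if addr="$(resolve_host "$1")"; then
        echo "resolved $1 -> $addr"
    else
        echo "DNS lookup failed for $1"
        return 1
    fi
    if list_rpc_services "$1" >/dev/null 2>&1; then
        echo "rpcbind on $1 answered"
    else
        echo "rpcbind on $1 did not answer"
        return 2
    fi
}
```

Calling diagnose "$SERVER" periodically on all three nodes might show whether the intermittent offline status coincides with failed lookups or with unanswered RPC calls.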
