How to remove NFS share while unavailable

mocanub

Hi all,

I had a FreeNAS share mounted in my Proxmox environment as NFS storage. Due to some hardware issues we had to take the FreeNAS appliance offline.

Since we restarted the first Proxmox VE node, the entire cluster has been acting strangely:
- when trying to start a VM (on any node) it fails with:
Code:
TASK ERROR: start failed: command '/usr/bin/kvm -id 202 -name VM-001 -chardev 'socket,id=qmp,path=/var/run/qemu-server/202.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qemu-server/202-event.qmp,server,nowait' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/202.pid -daemonize -smbios 'type=1,uuid=d6f39b00-43a3-4cd5-9f6f-33e10f09124e' -smp '4,sockets=1,cores=4,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/202.vnc,x509,password -no-hpet -cpu 'kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer,enforce' -m 16384 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:411bb56e5518' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' -drive 'file=/mnt/SSD_NODE_22/images/202/vm-202-disk-1.qcow2,if=none,id=drive-sata0,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'ide-drive,bus=ahci0.0,drive=drive-sata0,id=sata0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap202i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' -device 'e1000,mac=EE:E9:CA:97:07:57,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -rtc 'driftfix=slew,base=localtime' -global 'kvm-pit.lost_tick_policy=discard'' failed: got timeout

- I can't temporarily remove the NFS storage: removing it in the web console errors out with Connection Timed Out, and pvesm remove NAS_SHARE just freezes.

- the Proxmox-specific services show as running, but pvestatd keeps reporting: storage 'NAS_SHARE' is not online

Any ideas if the unavailable storage is causing this?
- if YES, how can I remove it temporarily so that the cluster can resume normal operation?
- if NO, then what exactly is causing this behavior and how can I fix it?

Here is the info on my Proxmox VE environment:
Code:
root@is-node-22:~# pveversion -v                     
proxmox-ve: 5.2-2 (running kernel: 4.15.18-8-pve)     
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-11                               
pve-kernel-4.15.18-8-pve: 4.15.18-28                 
corosync: 2.4.2-pve5                                 
criu: 2.11.1-1~bpo90                                 
glusterfs-client: 3.8.8-1                             
ksm-control-daemon: not correctly installed           
libjs-extjs: 6.0.1-2                                 
libpve-access-control: 5.0-8                         
libpve-apiclient-perl: 2.0-5                         
libpve-common-perl: 5.0-41                           
libpve-guest-common-perl: 2.0-18                     
libpve-http-server-perl: 2.0-11                       
libpve-storage-perl: 5.0-30                           
libqb0: 1.0.1-1                                       
lvm2: 2.02.168-pve6                                   
lxc-pve: 3.0.2+pve1-3                                 
lxcfs: 3.0.2-2                                       
novnc-pve: 1.0.0-2                                   
proxmox-widget-toolkit: 1.0-20                       
pve-cluster: 5.0-30                                   
pve-container: 2.0-29                                 
pve-docs: 5.2-9                                       
pve-firewall: 3.0-14                                 
pve-firmware: 2.0-6                                   
pve-ha-manager: 2.0-5                                 
pve-i18n: 1.0-6                                       
pve-libspice-server1: 0.14.1-1                       
pve-qemu-kvm: 2.12.1-1                               
pve-xtermjs: 1.0-5                                   
qemu-server: 5.0-38                                   
smartmontools: 6.5+svn4324-1                         
spiceterm: 3.0-5                                     
vncterm: 1.5-3

Thanks in advance,
B.
 
Step 1: remove or disable the share by editing storage.cfg directly.
Step 2: unmount your phantom shares like so:

Code:
umount -l /mnt/pve/yournfsshare

You may need to repeat this step or add the -f switch.

That should remove it.
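The two steps above can be sketched as a small shell helper. This is a minimal sketch, not a PVE tool: strip_storage is a hypothetical name, and the awk simply relies on the storage.cfg stanza format (a "type: name" header line followed by option lines). It is demonstrated on an inline sample; on a node you would filter /etc/pve/storage.cfg itself (assuming pmxcfs is still writable).

```shell
# Hypothetical helper: drop one storage stanza from a storage.cfg
# stream. A stanza starts at a "type: name" header and runs until
# the next header line.
strip_storage() {   # usage: strip_storage NAME < storage.cfg
    awk -v name="$1" '
        /^[a-zA-Z]+: / { skip = ($0 ~ (": " name "$")) }  # header opens/closes a stanza
        !skip
    '
}

# Demo on an inline sample; only the dir stanza survives.
strip_storage NAS_SHARE <<'EOF'
dir: local
	path /var/lib/vz
	content iso,backup

nfs: NAS_SHARE
	server 10.20.30.83
	export /mnt/tank
EOF
```

After writing the filtered output to a temporary file and checking it, move it over /etc/pve/storage.cfg, then lazy-unmount the mountpoint as in step 2.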
 
@alexskysilk:
- I can read the contents of the storage.cfg file, but if I try to edit it with nano/vim the console freezes as soon as I enter the command.
- NAS_SHARE shows as a folder in /mnt/pve/, but it is not mounted, so the output of your suggested command is: NAS_SHARE not mounted.
 
These are the lines from /etc/pve/storage.cfg that refer to the NFS share I want to remove:

Code:
nfs: NAS_03_HDD
	export /mnt/IS_NAS_03_HDD
	path /mnt/pve/NAS_03_HDD
	server 10.20.30.83
	content rootdir,backup,iso,vztmpl,images
	maxfiles 1
	options vers=3
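If /etc/pve/storage.cfg were still writable, a lighter-touch alternative to deleting the whole stanza would be flagging the storage disabled so pvestatd stops probing it. A sketch against the stanza above (disable is a standard storage option; the 0/1 boolean style matches the krbd flag seen elsewhere in this thread):

```
nfs: NAS_03_HDD
	disable 1
	export /mnt/IS_NAS_03_HDD
	path /mnt/pve/NAS_03_HDD
	server 10.20.30.83
	content rootdir,backup,iso,vztmpl,images
	maxfiles 1
	options vers=3
```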

Also, don't get me wrong, but this is not necessarily my main task... I just want to get my Proxmox cluster operational again. My guess is that the current behavior has something to do with this unreachable NFS share.

B.
 
Also don't get me wrong but this is not necessarily my main task
I hear you, but since a storage outage locks up the stats collector and leaves the API daemon in an unknown state, those two things are one and the same...

I see you have that share used for CT/VM storage; do you have any containers or VMs that are currently "running" with their disk on that storage? If so, you will need to kill them if you can (you may not be able to). If you can't, you'll need to kill any impacted nodes one by one to restore operation, and by kill I mean hard reset, since they will not shut down gracefully.
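The reason a hard reset is often the only way out: processes blocked on a dead NFS mount sit in uninterruptible sleep ("D" state) and ignore SIGKILL. A minimal sketch for spotting them, reading process state straight from /proc (dstate_procs is a hypothetical helper name, not a PVE command):

```shell
# List processes in uninterruptible sleep ("D" state); these are
# typically the ones wedged on the dead NFS mount, and they cannot
# be killed, not even with SIGKILL.
dstate_procs() {
    for pid in /proc/[0-9]*; do
        state=$(awk '/^State:/ {print $2}' "$pid/status" 2>/dev/null)
        if [ "$state" = "D" ]; then
            printf '%s %s\n' "${pid#/proc/}" "$(cat "$pid/comm" 2>/dev/null)"
        fi
    done
}

dstate_procs
```

If kvm processes for VMs on the dead storage show up here, that node will not shut down cleanly.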
 
I've figured out that my first Proxmox node has a template with 2 disks on that NAS. Unfortunately I can't move the VM config or alter it in any way:
- if I stop the corosync service, it tells me I will not be able to alter a read-only file (even as root)
- if I try to alter/delete the config, the console freezes.

Any suggestions ?

B.
 
You can easily unmount a frozen NFS share with:
Code:
fusermount -uz /mnt/pve/mountpoint
 
@fireon
Code:
# ls -la /mnt/pve/NAS_03_
NAS_03_HDD/ NAS_03_SSD/
# fusermount -uz /mnt/pve/NAS_03_HDD
fusermount: failed to unmount /mnt/pve/NAS_03_HDD: Invalid argument

@alexskysilk: That's the thing. I cannot migrate anything / alter anything. Cluster seems unresponsive.
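One plausible explanation for the Invalid argument above (an assumption, not confirmed in the thread): fusermount only operates on FUSE filesystems, while a PVE NFS storage is a plain kernel NFS mount, which has to be detached with umount instead. A sketch (detach_dead_nfs is a hypothetical helper name; -f forces the unmount, -l detaches lazily):

```shell
# Force + lazy detach of a mountpoint whose NFS server is gone.
# fusermount refuses this because it is not a FUSE mount.
detach_dead_nfs() {   # usage: detach_dead_nfs /mnt/pve/NAME
    umount -f -l "$1"
}

# e.g. detach_dead_nfs /mnt/pve/NAS_03_HDD
```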
 
Invalid argument... strange. This has always worked here, even on really stubborn hangs.
 
Unfortunately no luck. I've reconfigured some machines that used that NFS, but the nodes still can't see each other and I can't turn any VMs on.

Here are the latest logs from my services:

Code:
root@is-node-22:~# systemctl status corosync.service pvestatd.service pvedaemon.service pve-cluster.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2019-02-05 17:40:58 EET; 3min 12s ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 9952 (corosync)
Tasks: 2 (limit: 4915)
Memory: 51.7M
CPU: 3.580s
CGroup: /system.slice/corosync.service
└─9952 /usr/sbin/corosync -f

Feb 05 17:41:25 is-node-22 corosync[9952]: notice [TOTEM ] Retransmit List: 14c42 14c69 14c6a 14cb6 14cb7 14c1e 14c43 14cb8 14cb9 14cba 14cbb 14c6b 14c6c 14c6d 14c6e 14ce0 14ce1 14c44 14c45 14c46 14c47 14c90 14c91 14c1f 14c20 14c21 14c92 14c93 14c94
Feb 05 17:41:25 is-node-22 corosync[9952]: [TOTEM ] Retransmit List: 14c42 14c69 14c6a 14cb6 14cb7 14c1e 14c43 14cb8 14cb9 14cba 14cbb 14c6b 14c6c 14c6d 14c6e 14ce0 14ce1 14c44 14c45 14c46 14c47 14c90 14c91 14c1f 14c20 14c21 14c92 14c93 14c94
Feb 05 17:41:25 is-node-22 corosync[9952]: notice [TOTEM ] Retransmit List: 14c43 14cb8 14cb9 14cba 14cbb 14c6b 14c6c 14c6d 14c6e 14ce0 14ce1 14c44 14c45 14c46 14c47 14c90 14c91 14c1f 14c20 14c21 14c92 14c93 14c94 14c1e 14c42 14c69 14c6a 14cb6 14cb7
Feb 05 17:41:25 is-node-22 corosync[9952]: [TOTEM ] Retransmit List: 14c43 14cb8 14cb9 14cba 14cbb 14c6b 14c6c 14c6d 14c6e 14ce0 14ce1 14c44 14c45 14c46 14c47 14c90 14c91 14c1f 14c20 14c21 14c92 14c93 14c94 14c1e 14c42 14c69 14c6a 14cb6 14cb7
Feb 05 17:41:25 is-node-22 corosync[9952]: notice [TOTEM ] Retransmit List: 14c6b 14c6c 14c6d 14c6e 14ce0 14ce1 14c44 14c45 14c46 14c47 14c90 14c91 14c1f 14c20 14c21 14c92 14c93 14c94 14c1e 14c42 14c69 14c6a 14cb6 14cb7 14c43 14cb8 14cb9 14cba 14cbb
Feb 05 17:41:25 is-node-22 corosync[9952]: [TOTEM ] Retransmit List: 14c6b 14c6c 14c6d 14c6e 14ce0 14ce1 14c44 14c45 14c46 14c47 14c90 14c91 14c1f 14c20 14c21 14c92 14c93 14c94 14c1e 14c42 14c69 14c6a 14cb6 14cb7 14c43 14cb8 14cb9 14cba 14cbb
Feb 05 17:41:25 is-node-22 corosync[9952]: notice [TOTEM ] Retransmit List: 14c44 14c45 14c46 14c47 14c90 14c91 14c1f 14c20 14c21 14c92 14c93 14c94 14c1e 14c42 14c69 14c6a 14cb6 14cb7 14c43 14cb8 14cb9 14cba 14cbb 14c6b 14c6c 14c6d 14c6e 14ce0 14ce1
Feb 05 17:41:25 is-node-22 corosync[9952]: [TOTEM ] Retransmit List: 14c44 14c45 14c46 14c47 14c90 14c91 14c1f 14c20 14c21 14c92 14c93 14c94 14c1e 14c42 14c69 14c6a 14cb6 14cb7 14c43 14cb8 14cb9 14cba 14cbb 14c6b 14c6c 14c6d 14c6e 14ce0 14ce1
Feb 05 17:41:29 is-node-22 corosync[9952]: notice [TOTEM ] Retransmit List: 14c6d 14c6e 14ce0 14ce1 14e11 14e12 14e13
Feb 05 17:41:29 is-node-22 corosync[9952]: [TOTEM ] Retransmit List: 14c6d 14c6e 14ce0 14ce1 14e11 14e12 14e13

● pvestatd.service - PVE Status Daemon
Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2019-02-05 17:40:58 EET; 3min 12s ago
Process: 9891 ExecStop=/usr/bin/pvestatd stop (code=exited, status=0/SUCCESS)
Process: 9954 ExecStart=/usr/bin/pvestatd start (code=exited, status=0/SUCCESS)
Main PID: 9985 (pvestatd)
Tasks: 1 (limit: 4915)
Memory: 69.8M
CPU: 1.292s
CGroup: /system.slice/pvestatd.service
└─9985 pvestatd

Feb 05 17:40:58 is-node-22 systemd[1]: Starting PVE Status Daemon...
Feb 05 17:40:58 is-node-22 pvestatd[9985]: starting server
Feb 05 17:40:58 is-node-22 systemd[1]: Started PVE Status Daemon.
● pvedaemon.service - PVE API Daemon
Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2019-02-05 17:40:58 EET; 3min 12s ago
Process: 9892 ExecStop=/usr/bin/pvedaemon stop (code=exited, status=0/SUCCESS)
Process: 9971 ExecStart=/usr/bin/pvedaemon start (code=exited, status=0/SUCCESS)
Main PID: 9994 (pvedaemon)
Tasks: 4 (limit: 4915)
Memory: 130.0M
CPU: 9.410s
CGroup: /system.slice/pvedaemon.service
├─9994 pvedaemon
├─9997 pvedaemon worker
├─9998 pvedaemon worker
└─9999 pvedaemon worker

Feb 05 17:40:58 is-node-22 systemd[1]: Starting PVE API Daemon...
Feb 05 17:40:58 is-node-22 pvedaemon[9994]: starting server
Feb 05 17:40:58 is-node-22 pvedaemon[9994]: starting 3 worker(s)
Feb 05 17:40:58 is-node-22 pvedaemon[9994]: worker 9997 started
Feb 05 17:40:58 is-node-22 pvedaemon[9994]: worker 9998 started
Feb 05 17:40:58 is-node-22 pvedaemon[9994]: worker 9999 started
Feb 05 17:40:58 is-node-22 systemd[1]: Started PVE API Daemon.
Feb 05 17:41:34 is-node-22 pvedaemon[9999]: <root@pam> successful auth for user 'root@pam'

● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2019-02-05 17:40:58 EET; 3min 12s ago
Process: 9946 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
Process: 9919 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 9929 (pmxcfs)
Tasks: 6 (limit: 4915)
Memory: 116.1M
CPU: 2.245s
CGroup: /system.slice/pve-cluster.service
└─9929 /usr/bin/pmxcfs

Feb 05 17:41:03 is-node-22 pmxcfs[9929]: [dcdb] notice: starting data syncronisation
Feb 05 17:41:03 is-node-22 pmxcfs[9929]: [status] notice: members: 1/9929, 2/1462, 3/1194, 4/1296, 5/1359, 6/1232, 7/1328, 10/1380, 13/1296, 15/17494, 16/4436, 17/1246, 18/1240, 19/1169, 21/1257, 22/1251, 23/1342, 24/7094
Feb 05 17:41:03 is-node-22 pmxcfs[9929]: [status] notice: starting data syncronisation
Feb 05 17:41:03 is-node-22 pmxcfs[9929]: [dcdb] notice: received sync request (epoch 1/9929/00000001)
Feb 05 17:41:03 is-node-22 pmxcfs[9929]: [status] notice: received sync request (epoch 1/9929/00000001)
Feb 05 17:41:23 is-node-22 pmxcfs[9929]: [status] notice: members: 1/9929, 2/1462, 3/1194, 4/1296, 5/1359, 6/1232, 7/1328, 10/1380, 13/1296, 15/17494, 16/4436, 17/1246, 18/1240, 19/1169, 21/1257, 22/1251, 23/1342
Feb 05 17:41:23 is-node-22 pmxcfs[9929]: [status] notice: received sync request (epoch 1/9929/00000002)
Feb 05 17:41:23 is-node-22 pmxcfs[9929]: [status] notice: members: 1/9929, 2/1462, 3/1194, 4/1296, 5/1359, 6/1232, 7/1328, 10/1380, 13/1296, 15/17494, 16/4436, 17/1246, 18/1240, 19/1169, 21/1257, 22/1251, 23/1342, 24/7094
Feb 05 17:41:23 is-node-22 pmxcfs[9929]: [status] notice: queue not emtpy - resening 81332 messages
Feb 05 17:41:24 is-node-22 pmxcfs[9929]: [status] notice: received sync request (epoch 1/9929/00000003)

Any further advice is highly appreciated.

Thanks,
B.
 
I remember removing the NFS configuration, but the output of journalctl still shows PVE looking for the NFS storage...

Code:
Mar 31 09:30:49 host4 pvestatd[1309]: status update time (6.387 seconds)
Mar 31 09:30:59 host4 pvestatd[1309]: storage 'nfs-ndt' is not online
Mar 31 09:30:59 host4 pvestatd[1309]: storage 'nfsproxmox' is not online
Mar 31 09:31:00 host4 pvestatd[1309]: status update time (6.323 seconds)
Mar 31 09:31:09 host4 pvestatd[1309]: storage 'nfs-ndt' is not online
Mar 31 09:31:09 host4 pvestatd[1309]: storage 'nfsproxmox' is not online
Mar 31 09:31:09 host4 pvestatd[1309]: status update time (6.359 seconds)
Mar 31 09:31:19 host4 pvestatd[1309]: storage 'nfs-ndt' is not online
Mar 31 09:31:19 host4 pvestatd[1309]: storage 'nfsproxmox' is not online
Mar 31 09:31:19 host4 pvestatd[1309]: status update time (6.332 seconds)
Mar 31 09:31:23 host4 pvestatd[1309]: storage 'nfsproxmox' is not online
Mar 31 09:31:29 host4 pvestatd[1309]: storage 'nfs-ndt' is not online
Mar 31 09:31:30 host4 pvestatd[1309]: status update time (6.355 seconds)
Mar 31 09:31:39 host4 pvestatd[1309]: storage 'nfs-ndt' is not online
Mar 31 09:31:39 host4 pvestatd[1309]: storage 'nfsproxmox' is not online
Mar 31 09:31:39 host4 pvestatd[1309]: status update time (6.343 seconds)
Mar 31 09:31:43 host4 pvestatd[1309]: storage 'nfsproxmox' is not online
Mar 31 09:31:49 host4 pvestatd[1309]: storage 'nfs-ndt' is not online
Mar 31 09:31:49 host4 pvestatd[1309]: status update time (6.351 seconds)
Mar 31 09:31:53 host4 pvestatd[1309]: storage 'nfsproxmox' is not online
Mar 31 09:32:00 host4 pvestatd[1309]: storage 'nfs-ndt' is not online
Mar 31 09:32:00 host4 pvestatd[1309]: status update time (6.363 seconds)
Mar 31 09:32:03 host4 pvestatd[1309]: storage 'nfsproxmox' is not online
Mar 31 09:32:09 host4 pvestatd[1309]: storage 'nfs-ndt' is not online
Mar 31 09:32:09 host4 pvestatd[1309]: status update time (6.348 seconds)
Mar 31 09:32:13 host4 pvestatd[1309]: storage 'nfsproxmox' is not online
Mar 31 09:32:19 host4 pvestatd[1309]: storage 'nfs-ndt' is not online

I've checked storage.cfg and /etc/pve/storage; neither has any NFS configured:

Code:
root@host4:~# more /etc/pve/storage.cfg
dir: local
	path /var/lib/vz
	content backup,vztmpl,iso

lvmthin: local-lvm
	thinpool data
	vgname pve
	content images,rootdir

rbd: SSD_Storage
	content rootdir,images
	krbd 0
	pool SSD_Storage

rbd: HDD_Storage
	content rootdir,images
	krbd 0
	pool HDD_Storage

cephfs: cephfs
	path /mnt/pve/cephfs
	content backup,vztmpl,iso
	fs-name cephfs


Is there any other cfg file I need to modify?
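A sketch answering that question, under the assumption that storage definitions live only in /etc/pve/storage.cfg (the normal case): check the file for leftover nfs stanzas, and if it is clean, restart pvestatd, which is a common way to make it drop any cached storage state. check_no_nfs is a hypothetical helper name, demonstrated on an inline sample:

```shell
# Check a storage.cfg-style file for leftover nfs stanzas.
check_no_nfs() { ! grep -q '^nfs:' "$1"; }

# Demo on an inline sample (on the node, point it at /etc/pve/storage.cfg):
cat > /tmp/storage.cfg.sample <<'EOF'
dir: local
	path /var/lib/vz
	content backup,vztmpl,iso
EOF
check_no_nfs /tmp/storage.cfg.sample && echo "no nfs stanzas left"

# If the real file is clean but the messages persist, restart the
# status daemon so it rereads the storage configuration:
#   systemctl restart pvestatd
```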
 
