Shared storage across nodes not working - See comment #56.

Proximate

What in the world does this mean?

Error 500: can't activate storage 'nfs-iso' on node 'pro07':
command '/usr/bin/ssh -o 'BatchMode=yes' 10.0.0.76 -- /usr/sbin/pvesm status --storage nfs-iso' failed: exit code 255

I recently lost three guests because something happened to the storage in the cluster, so I'm a bit nervous about what seems to be happening here.

As far as I understand, when you join a node to the cluster, each node gets access to the storage set up on any other node, no?
I don't see any obvious problem in the GUI, and I can see the share and the files from the command line, so what's going on?
 
As far as I understand, when you join a node to the cluster, each node gets access to the storage set up on any other node, no?
PVE will not magically share storages across nodes. Ticking the "shared storage" checkbox when creating a storage won't make it a shared storage. You have to set up a shared storage yourself (create a Ceph, NFS or SMB storage), and checking the "shared storage" checkbox will then only tell PVE to treat that storage as shared. Otherwise, your nodes only have local storages that other nodes can't access.
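For reference, a minimal sketch of setting up an NFS storage from the CLI; the storage ID, server IP and export path below are hypothetical, purely for illustration:

Code:
# run on any cluster node; the entry lands in /etc/pve/storage.cfg,
# which is on the cluster filesystem and therefore visible to every node
pvesm add nfs shared-iso --server 192.168.1.50 --export /export/iso --content iso,vztmpl
pvesm status --storage shared-iso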

Or do you mean something different?
 
Yes, I mean something different, but since I'm not sure what's going on, I'm also not sure what info or details to include.
The storage was created at the cluster level and is shared across all nodes.
 
All nodes can see the files in the NFS storage. The only problem is that I suddenly cannot upload anything; I'm getting the above error.
 
I'd check the syslog (/var/log/syslog or journalctl) to see what might be going on here.
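For example, something along these lines (a rough sketch, assuming the standard PVE service names):

Code:
# journal entries from the PVE daemons around the time of the failed upload
journalctl --since "1 hour ago" -u pvedaemon -u pveproxy -u pvestatd
# or grep the classic syslog for the storage name
grep -i 'nfs-iso' /var/log/syslog | tail -n 50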
 
What would I be looking for?
Start with making sure your /etc/pve/storage.cfg is OK on pro07.
Continue by running directly on pro07: /usr/sbin/pvesm status --storage nfs-iso
Check your log file to see whether a more descriptive error is recorded there.
Follow with: showmount -e "nfs-host/ip"

It's also possible that the error in your first post is not from pvesm but from SSH itself, in which case you need to make sure you can ssh between all nodes without a password.

Finally, if everything probes OK, navigate to your mounted storage (see the "df" output) and try to create a file: "touch file"
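To separate those two failure modes, something along these lines should tell you whether it is pvesm or ssh itself that returns exit code 255 (the IP below is taken from the error in your first post):

Code:
# on pro07 itself - no ssh involved, this tests pvesm/NFS directly
/usr/sbin/pvesm status --storage nfs-iso

# from a different node - this is the same command shown in the error;
# if it fails while the local run works, the problem is ssh, not the storage
/usr/bin/ssh -o 'BatchMode=yes' 10.0.0.76 -- /usr/sbin/pvesm status --storage nfs-iso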


 
Seems ok. Like I shared earlier, all of the NFS shares are there.

Code:
# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content backup,iso,vztmpl

zfspool: local-zfs
        pool rpool/data
        content rootdir,images
        sparse 1

nfs: nfs_vmstore
        export /mnt/tn01pool/vmstore
        path /mnt/pve/nfs_vmstore
        server 10.0.0.12
        content iso,snippets,rootdir
        prune-backups keep-all=1

nfs: nfs-iso
        export /mnt/tn01pool/iso
        path /mnt/pve/nfs-iso
        server 10.0.0.12
        content images,rootdir,iso
        prune-backups keep-all=1

Code:
~# /usr/sbin/pvesm status --storage nfs-iso
Name           Type     Status           Total            Used       Available        %
nfs-iso         nfs     active      3879897728         6373760      3873523968    0.16%

Code:
# showmount -e "10.0.0.12"
Export list for 10.0.0.12:
/mnt/tn01pool/backups  10.1.1.1,10.0.0.1
/mnt/tn01pool/wwwshare 10.0.0.1
/mnt/tn01pool/iso      (everyone)
/mnt/tn01pool/vmstore  (everyone)

Code:
root@pro07:~# touch /mnt/pve/nfs-iso/template/iso/test
root@pro07:~# ls -la /mnt/pve/nfs-iso/template/iso/
total 6373523
drwxr-xr-x 2 nobody root         10 Sep  5 13:44 .
drwxr-xr-x 3 nobody root          3 Feb 14  2022 ..
-rw-r--r-- 1 nobody root 1020264448 Feb 14  2022 CentOS-7.9-x86_64-Minimal-2009.iso
-rw-r--r-- 1 nobody root    2428568 Mar 21 02:02 cp040002.exe.iso
-rw-r--r-- 1 nobody root 3320903680 Feb 14  2022 en_windows_7_ultimate_with_sp1_x64_dvd_u_677332.iso
-rw-r--r-- 1 nobody root  971974656 May  6 18:47 FreeBSD-13.0-RELEASE-amd64-disc1.iso
-rw-r--r-- 1 nobody root          0 Sep  5 13:44 test
-rw-r--r-- 1 nobody root 1269377024 Feb 14  2022 ubuntu-21.10-live-server-amd64.iso
-rw-r--r-- 1 nobody root  541001728 Feb 17  2022 virtio-win-0.1.215.iso
-rw-r--r-- 1 nobody root   40439098 Mar 19 16:17 VMware-ovftool-4.4.0-15722219-lin.x86_64.bundle.iso

All seems to work from the command line.

Now, the one thing I found is that the upload fails from only one node. All of the other nodes can upload; just this one node gets the error above.
Stranger still, I can upload a file from any other node and the file can be seen from the node that gives the error above.
 
the upload fails from only one node
"upload" is a wide encompassing term. There are a few areas in PVE that use that. It would be much easier if you described step by step what is working and whats not.

command '/usr/bin/ssh -o 'BatchMode=yes' 10.0.0.76 -- /usr/sbin/pvesm status --storage nfs-iso' failed: exit code 255
What happens when you run this ^^ command from a node other than 10.0.0.76?
What does "pvecm status" show?



 
Did you read what I've posted? I've been sharing a LOT of information :).

I explained that all nodes but this one are able to see the storage, and I have since confirmed that all nodes except this one can upload to the ISO storage above. I was able to use the uploaded ISO to install a guest on node 07.

So the storage is fine and can even be used by all nodes, including the problem node 07, but I cannot upload a new operating system file (ISO) into the NFS storage from node 07.

I'm not able to run the ssh command and I'm not sure what it's for anyhow. I'll have to dig into that since I have strict ssh policies.

Sure, I can show the status, but as I've said above, everything is as it should be. The cluster is up and everything works; only uploads to the NFS from node07 don't work, which is odd since all the other nodes can.

From another node;

Code:
# pvecm status
Cluster information
-------------------
Name:             clust01
Config Version:   22
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Sep  7 08:04:40 2022
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1.4b1
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.0.71 (local)
0x00000002          1 10.0.0.72
0x00000003          1 10.0.0.76
0x00000004          1 10.0.0.73

From node07
Code:
~# pvecm status
Cluster information
-------------------
Name:             clust01
Config Version:   22
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Sep  7 08:01:28 2022
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000003
Ring ID:          1.4b1
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.0.71
0x00000002          1 10.0.0.72
0x00000003          1 10.0.0.76 (local)
0x00000004          1 10.0.0.73
 
Is ssh from every node to every other node working? I also have the impression that this is not really a storage issue.
 
It seems ssh is now not working between nodes, and other problems are starting to creep in.
I tried installing a Rocky Linux instance but it would not start on any node, so I gave up.

This, for example, makes no sense. I even tried adding a known_hosts file, but it would not even see the file.
However, I can ssh from the pro07 node to the node I'm trying to ssh in from.

Seriously nervous about what's going on. It just seems to keep getting worse.

Code:
~# ssh 10.0.0.76
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Offending RSA key in /etc/ssh/ssh_known_hosts:10
  remove with:
  ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "10.0.0.76"
RSA host key for 10.0.0.76 has changed and you have requested strict checking.
Host key verification failed.
 
I also have the impression that this is not really a storage issue.
It's really unclear. The only concrete evidence of any error message over the last 4 days is the one shown in the opening post. That error message could be an ssh command failing or the actual pvesm failing at the time of the post. We've managed to establish that pvesm is working now, based on the output provided since.
Despite multiple requests for log information during the failing operation, none has been provided so far. The OP declined to run the actual ssh command that was failing to confirm that the ssh operations intra-cluster are working. I was trying to establish whether the cluster and IP addresses are in fact valid, and they seem to be.
Based on the limited data: 3 out of 4 nodes can write to NFS, one can't. It's possible that this is a permission issue, but we have not received enough data to say.

It would be great to have some sort of log information, as requested by @dcsapak on Monday. Until then it's just guessing.
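For example, something like this on pro07 while reproducing the failure (a sketch, assuming the standard PVE daemon names):

Code:
# follow the logs live, then retry the ISO upload in the GUI
journalctl -f -u pvedaemon -u pveproxy
# copy the lines that appear at the moment the upload fails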


 
Seriously nervous about what's going on. It just seems to keep getting worse.
Technically, you are exactly where you were at the beginning of this thread. You've now confirmed that the original error is due to ssh authentication and not directly storage related.
Before you start hacking on known_hosts etc., make sure you don't have duplicate IPs on the network.
Then confirm that .ssh/authorized_keys is a link into /etc/pve, which should be a cluster filesystem shared across all nodes.
Confirm that its location and contents are identical across all nodes.
Check that you can ssh from the problem node into itself, on both its public and loopback IP.
If everything above checks out, remove the offending line from known_hosts and perform a manual ssh from each node into each node.
Ensure that the ssh mesh connectivity is working, then repeat your operation.
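A rough sketch of that mesh check, using the node IPs from the pvecm output above; run it on each node in turn:

Code:
# non-interactive ssh to every cluster node; any prompt or failure points at the broken pair
for ip in 10.0.0.71 10.0.0.72 10.0.0.73 10.0.0.76; do
    echo "== $ip =="
    ssh -o BatchMode=yes -o ConnectTimeout=5 root@"$ip" hostname || echo "FAILED: $ip"
done

# the authorized_keys file is normally a symlink into the cluster filesystem
ls -l /root/.ssh/authorized_keys    # expected: -> /etc/pve/priv/authorized_keys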


 
To be clear, I've not declined to do anything. I simply have an incredible load of work, this is just adding to it, and I'm trying to make the time to fix Proxmox instead of suggesting we move away from it. We've never had any such issues with VMware, and we're not ready to buy a license right now as we've not had enough time to trial this.

I tried running the ssh command you gave me and showed the output. I cannot reach that server using ssh.
Is it something I can run locally on it? If so, happy to run it.

I didn't simply start hacking on ssh :). I tried looking online first, found nothing that helped, so I decided to try some basic things like adding a known_hosts file to see if it would help. It didn't, so I deleted it.

Here's the ssh link you're asking about, shown both from another node (pro02) and from the problem node (pro07).

[Screenshot: the ssh link (authorized_keys) as shown on pro02 and pro07]


IPs are fine; routing, IP issues, etc. are among the first things I would look at anyway.

>Despite multiple requests for log information during the failing operation, none has been provided so far. The OP declined

I asked what I should be looking for, since those are very large logs which should not be posted in forums.

>was failing to confirm that the ssh operations intra-cluster are working.

I mentioned that ssh is working across nodes but not to pro07. Happy to provide whatever I may be missing.

>Based on the limited data: 3 out of 4 nodes can write to NFS, one can't.
>It's possible that this is a permission issue, but we have not received enough data to say.

Correct. However, pro07 can read/use a file that was uploaded into NFS storage from any other node as mentioned.

If it is a permissions issue, it happened out of the blue, as these hosts don't get played with without reason.
It came to light because one node lost its guests, again out of the blue. I posted about that as well and ended up having to rebuild.
That caused me to need to move things around, which is how I found out pro07 was also having a problem.

Whatever is happening, it seems to be slowly breaking the cluster or some of the nodes.
 
If you rebuilt the node, then the ssh keys probably did change. Enter this on one of the other nodes:
Code:
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R pro07
Code:
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "10.0.0.76"
and that on pro07:
Code:
cat ~/.ssh/id_rsa.pub >> /etc/pve/priv/authorized_keys
in order to get the new key accepted.
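Once that is done, a quick check from one of the other nodes should confirm the key is accepted again (PVE also ships pvecm updatecerts, which refreshes the cluster-wide known_hosts entries and may be worth a look here):

Code:
# should run without any password or host-key prompt
ssh -o BatchMode=yes root@10.0.0.76 -- /usr/sbin/pvesm status --storage nfs-iso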

How often did you reinstall a node in VMware? Did you not encounter problems with it? The problem is not Proxmox (in fact, this is not even a Proxmox problem); the problem is that you don't know what implications your actions have.
 
Let's see if we can get in sync.
The initial node that started being a problem was pro04. That was eventually rebuilt.

This is a new problem with pro07 after rebuilding that other node.

Sure, I admit I'm new to Proxmox, just trying it out. More specifically, I'm new to this OS and not always sure where certain files live until I have to look into it. In this case, I thought the known_hosts file was in /root/.ssh, but it's obviously not.

By running the above, yes, that updated the known_hosts file and I am now able to ssh from pro02 to pro07.

Now I guess I should test the NFS storage?
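A quick retest from pro07 before going back to the GUI could look like this (the test filename is arbitrary):

Code:
# the storage should report active, and a write into the ISO directory should succeed
pvesm status --storage nfs-iso
touch /mnt/pve/nfs-iso/template/iso/upload-retest && rm /mnt/pve/nfs-iso/template/iso/upload-retest
# then retry the ISO upload from the GUI while connected to pro07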
 
Some stuff in the GUI actually uses ssh; that's why it's important that every node can access every other node through ssh.
If you're in doubt, you can even add every node's id_rsa.pub to /etc/pve/priv/authorized_keys once again. But after that, your issues with the storage in the GUI will most certainly be gone.
 
BTW, not sure what you are asking, as it doesn't relate. How often did I reinstall a node on VMware? Every host I've installed was installed once, and all have run for many, many years to date. I've never seen any problem at all where a host would do something odd out of the blue.
 
OK, I'll give the storage another try shortly and report back. As you said, since it was ssh based, that might be the cause of any weirdness.
Still, I'm not sure how it could have happened, but at least it might explain what's going on, which I can now be aware of.
 
