Shared storage across nodes not working - See comment #56.

Node 07 is where I tried to upload; the paste I shared is from that node.
I can upload to the NFS storage from any other node.

From node 02 to 07;
root@pro02:~# /usr/bin/ssh -o "BatchMode=yes" 10.0.0.76 -- usr/sbin/pvesm status --storage nfs-iso
root@10.0.0.76: Permission denied (publickey,password).

From the command line of 02, I can ssh to node 07, no errors.
 
Yeah well, there you have your problem. Fix the ssh issue and the upload will most probably work.
 
Sure, I'd love to but where is the issue to be found?
The command you gave me doesn't work but I can ssh to 07 from 02.
I've not had the chance to go through the two links shared but I'll look at those.
 
>Yeah well, there you have your problem. Fix the ssh issue and the upload will most probably work.
We are finally settled on what was said in comment #7. Eventually OP will get to comment #28 and the troubleshooting steps described here.

OP - you did a node replacement. You appear to have missed some steps and did not do it right, so the system is not in an expected stable state. It's impossible to advise on exact commands due to the unpredictability of how the system actually looks at this moment in time.
If you have production services running on this cluster, either pay for a subscription and open a case, or work with the volunteers who have been spending their time to help you fix your environment, by sharing the details asked for and following the guidance already provided.


 
It's a production test setup, meaning non-critical guests are installed across the nodes to learn more about Proxmox.
The only node that was re-installed was 04. I don't recall anything changing with node 07, but if it's simpler to just re-install node 07, that might be the solution.

I'm pretty sure I responded with details to comment 7.

I showed that node 07 was able to see the mount points on the NFS server.
I asked which log file and what I am looking for, since all of the tests but the ssh one seem to work.
I tested ssh as we all saw and found a problem which was fixed, yet the upload issue continues on node 07.
I used 'touch' to create a file from node 07 on the mounted storage and it worked. Below, you see the 'test' file I created.

Code:
root@pro07:~# ls -la /mnt/pve/nfs-iso/template/iso/
total 4948793
drwxr-xr-x 2 nobody root         10 Sep  9 09:12 .
drwxr-xr-x 3 nobody root          3 Feb 14  2022 ..
-rw-r--r-- 1 nobody root 1020264448 Feb 14  2022 CentOS-7.9-x86_64-Minimal-2009.iso
-rw-r--r-- 1 nobody root 1047048192 Sep  5 17:31 FreeBSD-13.1-RELEASE-amd64-disc1.iso
-rw-r--r-- 1 nobody root 1493499904 Sep  8 13:58 Rocky-9.0-x86_64-minimal.iso
-rw-r--r-- 1 nobody root          0 Sep  5 13:44 test

Comment 29 talks about replacing a node and as I've explained, node 04 was rebuilt, not node 07.

At this point, if you think node 07 has some sort of problem, I think it would be better if I simply removed it from the cluster, rebuilt it and added it back, as this troubleshooting is taking a long time. I don't know why this broke, and that's the most concerning thing, but I'm not giving up on Proxmox.
 
There are a lot of places to look regarding those ssh problems. You mentioned strict ssh rules; that's where I would begin. Loosen up on the rules and try to narrow the problem down.
We don't know why ssh works but the above command does not, and as long as this issue isn't resolved it's unnecessary to check other stuff.
 
The NFS storage is fine. There is no need to test it.

The problem is your ssh intra cluster communication. The evidence you've shown does not match your statements. Even your statements are contradictory, making it hard to isolate the issue:
>From node 02 to 07;
>root@pro02:~# /usr/bin/ssh -o "BatchMode=yes" 10.0.0.76 -- usr/sbin/pvesm status --storage nfs-iso
>root@10.0.0.76: Permission denied (publickey,password).
>
>From the command line of 02, I can ssh to node 07, no errors.
The first output shows an ssh failure, which you follow up with "works, no errors".

The BatchMode option only works with passwordless authentication. When you say you "can log in" - does that involve a password?

You can execute _any_ command from node 02, for example "/usr/bin/ssh -o 'BatchMode=yes' 10.0.0.76 ls", and it should fail now.
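A quick way to see the difference (just a sketch, assuming pro07 is still reachable at 10.0.0.76): an interactive ssh can silently fall back to asking for a password, while BatchMode forbids that fallback and fails immediately when key authentication is broken.

Code:
# interactive - may fall back to password auth and appear to "work"
root@pro02:~# ssh root@10.0.0.76 hostname

# BatchMode - no password fallback, fails right away if the key is not accepted
root@pro02:~# ssh -o 'BatchMode=yes' root@10.0.0.76 hostname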

You can add -vvvv to your ssh command to get more verbose output, you can try "pvecm updatecerts", or start reading carefully at around comment #17 of https://forum.proxmox.com/threads/migration-issue-after-replacing-a-node.13968/
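Something like the following (just a sketch; the grep filter is only a suggestion) makes the verbose output easier to read, showing which keys the client offers and how the server answers:

Code:
root@pro02:~# ssh -vvvv -o 'BatchMode=yes' root@10.0.0.76 /bin/true 2>&1 | grep -Ei 'offering|denied|no more'
# look for lines like "Offering public key: /root/.ssh/id_rsa" followed by
# "Permission denied (publickey,password)" - the key was offered but not accepted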


 
LOL, this is silly. You just keep trying to get personal with me and I keep trying to keep it professional.

>You mentioned strict ssh rules, that's where I would begin. Loosen up on the rules, try to encircle the problem.

I didn't do anything other than the defaults. The system gave me this notice at some point so I shared it.

My statements are contradictory? If they are, it certainly isn't by choice. I've tried to provide all of the information being asked.
You make it sound like I'm trying to be mysterious and not provide details.

>the first output shows ssh failure, which you follow up with with "works, no errors".

I stated and showed that I was not able to run the ssh command given to me; it failed, yet I was able to ssh to the node the command tested against. How is that contradictory? It's what happened and it's what I showed.
I also showed screenshots of what happened. I showed that I can reach the NFS storage from node 07 and that I was able to 'touch' a file.
Yet from the GUI, I cannot upload a file.

I'm not making anything up; I'm sharing the results of the tests you've asked me to run.
And no, I didn't notice the difference between password and no password, but since you now mention it, I'm aware of it.

I also mentioned I never re-installed node 07. Node 04 is what was re-installed, and then came this problem. It's a bit of whack-a-mole. If I had done something really stupid, I likely would have known by now.

I think it's best to simply re-install node 07 and move on.
 
Yes, I know the problem is ssh, we've confirmed that, but why did it happen, how did it happen, and why have the things you've suggested not worked? I've followed along. I was able to re-build the known_hosts file but the GUI still doesn't work.

It would be nice to know what could cause this, since I didn't mess with these systems and didn't re-install node 07. Something happened, even if it was by mistake, and what the causes are is what I'd like to understand.
 
LOL, just to add to my stress level on top of being raked over the coals here, of course I cannot even migrate guests off of 07.

2022-09-09 11:43:33 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pro02' root@10.0.0.71 /bin/true
2022-09-09 11:43:33 root@10.0.0.71: Permission denied (publickey,password).
2022-09-09 11:43:33 ERROR: migration aborted (duration 00:00:00): Can't connect to destination address using public key
TASK ERROR: migration aborted
 
Not sure why someone didn't specifically point to this, but a suggestion could have been this:

node07 # pvecm updatecerts

Done, everything works now.
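For anyone finding this later, my (possibly incomplete) understanding is that PVE manages root's ssh trust cluster-wide through symlinks into /etc/pve, and 'pvecm updatecerts' regenerates the node certificates and merges the node's keys back into those shared files. A quick sanity check, assuming a stock PVE 7 layout:

Code:
# these should be symlinks into the cluster filesystem (pmxcfs)
ls -l /etc/ssh/ssh_known_hosts       # expected: -> /etc/pve/priv/known_hosts
ls -l /root/.ssh/authorized_keys     # expected: -> /etc/pve/priv/authorized_keys
# regenerate certs and merge this node's keys back into the shared files
pvecm updatecerts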
 

That only fixed the migration, it turns out, but that's good. At least I can move the guests and rebuild now.
I'm sure there's one simple command to fix the rest, but I don't have the energy to contradict or confuse anyone anymore :).

Thanks for the help.
 
As I said, one cannot really know what causes the problem. It's kind of your task to ensure that passwordless login works flawlessly, because only you know your exact setup. It's definitely not a very common error, as you can tell from us poking in the dark. However, we were able to narrow the problem down. Besides, I'm not sure if reinstalling node 07 will solve the problem.

The problem in this forum is that a LOT of people come here with their issues, give little information, only reluctantly uncover details, and in the end the solution is something that could have been avoided if the user had just read the guides in the documentation.
And this repeats over and over again. That's why I took a year-long break, for example. Yes, Proxmox is easy to install and run, but apart from that it's a Linux system that the user is supposed to master.
That's why some guys around here sometimes have a thin skin when OPs are fast with conclusions about which suggestions are valid and which are not. It's nothing personal.
 
Come on now, I did not hold back on information and I've been polite and professional the entire time. You make it sound like I wanted to mislead and cause more confusion than there already is.

First you get your feelings hurt because it sounds like I'm blaming Proxmox, and then, no matter how hard I try, you just keep on telling me I'm not trying.

Can you PLEASE stop attacking people, not everyone is out to do those things. Some of us simply do not have the same level of knowledge.
You didn't 'master' anything without spending a lot of time on it and I've not had that time yet.

It's just using the basics for now, to see how reliable it is without getting into anything complicated. Just install, create a cluster, learn the basics along the way, add some guests, let it run, see how things go. Almost all good software is usable out of the box with just a little knowledge, and that's where I'm at. I don't recall doing anything that should cause the problem; otherwise I would have re-traced my steps.

Maybe you need to take another break from the forums because you're not able to recognize when someone is actually trying and you are pushing them away. I'm working at this to the best of my abilities.

I don't know if re-installing will work either but I'm not going to spend much more time in this post asking for help since you keep telling me I'm not trying. I'd rather rebuild and take my chances before getting to the point where I'm simply done with proxmox which is where you're pushing me with that attitude. If you feel I'm not doing my part then just walk away from the thread, don't keep accusing me of anything.

Try being kind, that's what I'm doing. We are both real human beings at the other end of the text. Let's treat each other with the same respect we would face to face.
 
UPDATE: Nope, this still didn't fix anything. I continue playing whack-a-mole: I fix one node, then another breaks, and it won't stop.

I even ran this on all of the nodes.

pvecm updatecerts

For a while I thought the solution was as follows.

This is done from each node to all the other nodes, to ensure the inter-node communications are working (a scripted version of this check is sketched after these steps).

# /usr/bin/ssh -o 'BatchMode=yes' 10.0.0.xx -- /usr/sbin/pvesm status --storage <nfs-share>

If there's an access error, ssh to the same host;
# ssh 10.0.0.7x
Most likely a key error will show up with a suggestion to remove the old key. Remove it.

After removing it, ssh to the same host and you should be prompted to accept the new host key; say yes.
This should automatically log you into the remote server via ssh.
Exit out and try the first command again. If it works, move on to the next node.
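To save typing, the same check can be scripted. A rough sketch, using the node IPs and storage name from this thread (adjust to your own setup):

Code:
#!/bin/sh
# from the current node, verify passwordless ssh and storage visibility to every other node
STORAGE="nfs-iso"
for ip in 10.0.0.71 10.0.0.72 10.0.0.73 10.0.0.76; do
    echo "== $ip =="
    /usr/bin/ssh -o 'BatchMode=yes' "root@$ip" -- /usr/sbin/pvesm status --storage "$STORAGE" \
        || echo "FAILED: fix ssh to $ip before moving on"
done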

Somehow, several keys got messed up.
I think it all started with one node but following suggestions in posts I found before asking here caused a bigger mess.
 
Whatever happened is, as we've come to conclude, ssh related, but something else is going on as well.
This started happening right after I upgraded three of the nodes from 7.1.7 to 7.2 and re-installed one of the nodes, 04.
I have nodes 01, 03, 04 and 07.

I noticed the problem when I could not move guests between hosts but, of course, I have since lost track of which node to which.

The problem simply won't go away. I can write up all of the things that work and don't work, but that only seems to confuse those trying to help me, whose help I appreciate. It's confusing to me too, because I fix one or two nodes, then another won't work; I fix that one, then another stops working.

From each node, the storage can be seen from the command line:

From node 02
# /usr/bin/ssh -o 'BatchMode=yes' 10.0.0.72 -- /usr/sbin/pvesm status --storage nfs-iso
# /usr/bin/ssh -o 'BatchMode=yes' 10.0.0.73 -- /usr/sbin/pvesm status --storage nfs-iso
# /usr/bin/ssh -o 'BatchMode=yes' 10.0.0.76 -- /usr/sbin/pvesm status --storage nfs-iso

# /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pro03' root@10.0.0.72 /bin/true
# /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pro04' root@10.0.0.73 /bin/true
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
# /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pro07' root@10.0.0.76 /bin/true

Then I check from node 04, since node 02 complains it cannot reach it, but it's fine from 04 to 02.
# /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pro02' root@10.0.0.71 /bin/true

So back to node 02, I remove the key as suggested.
# ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "pro04"

# ssh root@10.0.0.73
It logs me in automatically as root, with no prompt for a key or password, so I exit.

I run the tests again;

# /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pro04' root@10.0.0.73 /bin/true
Host key verification failed.

# /usr/bin/ssh -o 'BatchMode=yes' 10.0.0.73 -- /usr/sbin/pvesm status --storage nfs-iso
Name Type Status Total Used Available %
nfs-iso nfs active 3861157888 4948992 3856208896 0.13%
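Reading the ssh man page, my (unverified) understanding of why this keeps happening: HostKeyAlias makes ssh look up the entry named 'pro04' in known_hosts instead of the IP, and on PVE /etc/ssh/ssh_known_hosts is a symlink to the cluster-wide /etc/pve/priv/known_hosts. Removing the key and re-accepting it by IP only restores the 10.0.0.73 entry, so the alias lookup still fails. A hypothetical way to put the alias entry back by hand (adjust IP and name to your nodes):

Code:
# fetch the current host key and store it under the name the HostKeyAlias lookup expects
ssh-keyscan -t rsa 10.0.0.73 2>/dev/null | sed 's/^10\.0\.0\.73/pro04/' >> /etc/pve/priv/known_hosts
# re-test
/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pro04' root@10.0.0.73 /bin/true && echo OK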

At this point, I have one more option that I know of: I can run 'pvecm updatecerts'.
I've used this too to try to get things synced up, and I can get it so that everything works from the command line.
Then I go to the GUI and the whole thing starts all over again.

I'm obviously missing a step or an order of steps, but so far it's non-stop whack-a-mole: fixing it from the command line only to see the problem show up again in the GUI. This thread has not helped me understand how things work so far, but I'm trying to re-read it to see what I might have missed.

All I can think of at this point is to migrate all guests away from one host, rebuild that host from scratch using 7.2, move all guests onto it, destroy the cluster, remove any mention of the other nodes on this single host (/etc/pve/nodes/<nodename>), rebuild all the other hosts, then create a new cluster, migrate the guests back, etc.

Someone still learning but not wanting to post too quickly in the forums for fear of being told they aren't learning might find something like this and countless other ideas.

https://codingpackets.com/blog/proxmox-certificate-error-fix-after-node-replacement/

It's interesting because it coincides with my own experience of having upgraded nodes THEN starting to have these problems.

I don't know, is that a good idea? There are hundreds of posts on the net about how to fix this or that, and sometimes we try the ideas we've found: some work, and some probably cause new problems you're not aware of, because the article you read was about a problem that looked like your own but maybe was slightly different.

That's the Internet and trying to learn new things :)
 