@Tmanok didn't mean to sound unappreciative! I do need the help... so there is that.
I have 9 nodes (8 are active; node 6 is out while I put in a new power supply, so it is out of the cluster for now).
They talk to each other fine...
Nodes (Stack1-node8) all have a single 1TB spinning HDD that I...
OK, but assuming I don't want to redo everything and reinstall from scratch... just wondering where to change the config, or how it works to change it after it's already been on another subnet... or how to change to a different NIC within the config... oh well, I will look some more.
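(For anyone else searching: from what I can tell, the corosync side of the cluster network lives in /etc/pve/corosync.conf (each node's ring0_addr, and you have to bump config_version whenever you edit it), and the Ceph side lives in /etc/pve/ceph.conf (public_network / cluster_network plus the monitor addresses). A purely read-only look at what's currently set, assuming a standard install where those key names are used:

# corosync ring addresses and config version
grep -E 'ring0_addr|config_version' /etc/pve/corosync.conf

# Ceph networks and monitor list
grep -E 'public_network|cluster_network|mon_host' /etc/pve/ceph.conf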
#1 - I have 9 nodes
#2 - all nodes have plenty of resources
#4 - of course I rebooted all nodes...
this happened after the Octopus to Pacific update with the automatic update and upgrade script....
Aside from generic info - your answer has no value to my posted case... Ceph just hangs, and MDS and OSD...
any way to recover the OSDs, get the managers back, and rescue the map?
Nodes can see each other fine - just missing the Ceph managers, and no OSDs are showing up.
ceph -s hangs
timeout on any GUI screen and most Ceph commands
root@node900:/etc/pve/nodes/node2# ha-manager status
quorum OK
master node5...
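In case it helps anyone in the same spot: since the Proxmox cluster itself has quorum, a rough first pass (just a sketch, not something I've verified fixes the Octopus-to-Pacific situation) is to check whether the mon/mgr daemons are even running on each node, and to recreate a manager if none exist. The mon/mgr id is usually the short hostname:

# on each node, check the local Ceph daemons
systemctl status ceph-mon@$(hostname -s) ceph-mgr@$(hostname -s)

# if no manager exists anywhere, create one on a node
# (older releases spell it "pveceph createmgr")
pveceph mgr create

# ceph -s with a timeout so it doesn't hang forever
ceph --connect-timeout 10 -s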
Did you ever resolve this? I am having the same issue.
ceph -s just sits there and freezes
timeout (500) on the GUI for the Ceph status page/dashboard
config shows all the correct hosts for the monitors and the correct node IPs
Proxmox node-to-node connectivity is fine - just the Ceph MANAGERS are missing and no OSDs...
I somehow lost all my OSDs and the map too - when I did the PM GUI update... after reboot everything went to hell... any ideas on any of this?
ceph osd setcrushmap -i backup-crushmap
and just about any command for Ceph just hangs and/or times out...
Monitors are listed but no quorum
No OSDs are listed...
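One thing that still responds when there's no quorum (so ceph -s just hangs) is the local monitor's admin socket; it at least tells you whether the mon daemon is alive and what state it thinks it's in. Roughly (the mon id is usually the short hostname):

# see which Ceph admin sockets exist on this node
ls /var/run/ceph/

# ask the local mon for its status directly (works without quorum)
ceph daemon mon.$(hostname -s) mon_status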
To be honest - I did not even look to see what upgrades happened till it was too late.
The Octopus to Pacific upgrade apparently happened with the automatic GUI updates... I did not read the notes, and now my whole cluster's Ceph pool is dead as a rock.
I noticed timeout after timeout...
I manually...
I have a bunch of older servers - almost all have a 4-port 1Gb card and 2 onboard 1Gb ports.
Right now I am only using one of the onboard NICs on each node... I have a Linux bridge (vmbr0) assigned to that onboard port, and then all the VMs and LXCs run over that...
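For reference, the relevant bit of /etc/network/interfaces for that kind of setup looks roughly like this - the interface name and addresses below are placeholders for whatever your onboard port and subnet actually are:

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.10/24
        gateway 192.168.1.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0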
Hey, you're not alone... been having the same issues with pve-root maxing out... seems something got stuck on the recent upgrade... For me the Ceph log and other logs were HUGE and taking up all the space... so I deleted the Ceph log and removed the OSD from that specific node... completed the update...
So this caused all sorts of issues and the node would not update, then it froze up badly... restarted and it would not reconnect... got on the local console and saw it was up but out of root partition space...
ssh to the node
apt-get autoremove failed... no space
everything I did failed...
I found several...
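For anyone else stuck with a full root partition, something along these lines will find the space hogs and trim them - the Ceph log path is an assumption from a default install, so check what du actually points at first:

# find where the space went (stays on the root filesystem)
du -xh --max-depth=1 / | sort -h | tail -20

# shrink the systemd journal
journalctl --vacuum-size=100M

# zero out a runaway Ceph log instead of deleting the open file
truncate -s 0 /var/log/ceph/ceph.log

# then clean up apt
apt-get clean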
yes - I put that in the info above...
Using the entire 1TB host HDD as the OSD for CephPool1 (9 other nodes with 1TB drives as Ceph OSDs, and 1 machine with 8 more drives all set up as OSDs)...
Created a VM using CephPool1 for the HDD - VirtIO SCSI - default, no cache on the VM HDD setup.
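For anyone comparing notes, the rough CLI equivalent of that setup would be something like the following (VMID 100, the 32G size and /dev/sdb are just placeholders):

# whole-disk OSD on the node's 1TB spinner
pveceph osd create /dev/sdb

# VM disk on CephPool1 with VirtIO SCSI and no cache
qm set 100 --scsihw virtio-scsi-pci
qm set 100 --scsi0 CephPool1:32,cache=none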
wondering...
Testing further - changed the guest HDD to "emulate SSD"
and it seems to have increased performance a bit... from 7900-8200 to now 8200-8650 MB/sec, a 3 to 5% improvement, but not anywhere close to direct access on the PM node at around 11,500... So about 25% less disk performance on the VM than...
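If anyone wants to flip the same switch from the CLI, the "emulate SSD" checkbox corresponds to the ssd=1 flag on the disk line. The VMID and volume name here are placeholders, and note that qm set replaces the whole drive string, so repeat any other options you already had:

# enable SSD emulation on an existing Ceph-backed disk
qm set 100 --scsi0 CephPool1:vm-100-disk-0,cache=none,ssd=1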
Was looking today and noticed significantly slower speeds on an Ubuntu guest VM for hdparm -Tt /dev/sda2 than for that same partition on the host node directly on the console.
This is directly on the Proxmox node to the attached SSD (/sda2)
This is directly on the Proxmox node to the attached SATA HDD...
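For anyone wanting to reproduce the comparison, it is just hdparm's cached/buffered read timing run in both places - device names will differ depending on what the guest sees:

# on the Proxmox host
hdparm -Tt /dev/sda2

# inside the Ubuntu guest
hdparm -Tt /dev/sda2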
I spent the time to build a generic VM for a specific app using the Ubuntu 21 ISO. After hours of base setup I shut down the VM and clicked "Convert to template", thinking I could then zip the template package up and share it with others so it could save them 5 hours of basic setup...
How do we do this?
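Not sure if this is the officially intended route, but one way that should work is to back the template up with vzdump and pass the archive around; whoever receives it restores it with qmrestore and can convert it back to a template on their side. The VMIDs, path and file name below are placeholders:

# on the source host: dump the template/VM to a single archive
vzdump 9000 --mode stop --compress zstd --dumpdir /tmp

# on the receiving host: restore it under a free VMID
qmrestore /tmp/vzdump-qemu-9000-*.vma.zst 9001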
I...
So what I did to get it running again:
After rebooting all machines - no quorum, so all machines refused to start up VMs. I realized that the expected votes are the total number of joined machines in the cluster. Quorum is apparently defined as more than 50% of those votes. So - I am assuming a lot here from what I am...
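pvecm shows exactly those numbers, and if you ever need to start VMs while a node is legitimately down for repair (like node 6 off for its power supply), you can temporarily lower the expected votes - with the usual caveat that this weakens the split-brain protection, so only do it when you're sure the missing node is really off:

# see expected votes, total votes and whether the cluster is quorate
pvecm status

# temporarily tell corosync to expect fewer votes (example: 7)
pvecm expected 7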