After upgrading to 7.2-5, the error indeed disappeared!
I have removed my manually created monitors and recreated them via the GUI, and I can see in the Ceph config that the port numbers are no longer showing for the newly created monitors. (For good measure, I have replaced all monitors to...
Hi Fabian,
Do you know if the patch will be rolled out soon? I did go for the manual route and that did the trick for now, but it seems to cause some trouble when rebooting the box (the monitor doesn't start automatically) - putting it back the way Proxmox likes it seems to be the most suitable...
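For what it's worth, my guess (and it is only a guess) is that a hand-created monitor simply doesn't have its systemd unit enabled, which would explain it not coming back after a reboot. Something along these lines should cover that, with the hostname standing in for whatever the mon ID is:

systemctl enable ceph-mon@$(hostname)
systemctl start ceph-mon@$(hostname)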
In case it is useful for anyone else: switching my backup mode from "Snapshot" to "Suspend" seems to have done the trick, and the particular VMs are now more stable.
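For anyone who prefers the command line over the GUI, the same mode can be tried with vzdump directly - a quick sketch, with the VM ID and storage name as placeholders:

vzdump 101 --mode suspend --storage local-backup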
Not sure if this topic is still relevant, but I've seen similar issues over the last few months.
I've got a few VM load balancers that have become unstable on a regular basis (once a week). After first suspecting the internet line, as well as many other things, I now seem to have also found a...
>> I sent a preliminary patch.
Brilliant! Will keep an eye out for it. As long as nothing else goes wrong at the moment, should be fine running off 2 monitors for now.
Will update this thread if that's gone to plan.
>> You might want to switch your mon_host config line to use spaces as...
>> Thanks, I was now able to reproduce the issue! I'll see about fixing it, as old configurations should of course continue to work.
Great! Having a quick look at Ceph's documentation pages, I keep seeing the mon_addr in the config files WITH the port number, so I'll await the outcome of your assessment...
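For clarity, this is the sort of difference I mean (addresses made up) - the first form is what I keep seeing in my own files, the second is the space-separated form suggested above:

mon_host = 192.168.10.11:6789,192.168.10.12:6789,192.168.10.13:6789
mon_host = 192.168.10.11 192.168.10.12 192.168.10.13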
Hi Fabian,
Thanks for your help!
>> Was this configuration created via Proxmox VE? If yes, approximately what version?
Yes, I think it was originally created on version 5.4-ish.
I do seem to remember an upgrade of corosync from version 5 to 6 or so that was changing port numbers /...
Hi,
I've recently added a couple of new nodes to my system and now want to move the Ceph monitors to the new nodes, but am struggling.
When I go to Ceph - Monitor in the Proxmox GUI, click Create and select the new node, it almost instantly gives me the error message "Invalid IP string"...
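(For context, what I'm effectively trying to do - if I understand the tooling correctly; mon/node names below are placeholders - is roughly this on the CLI:

pveceph mon create
pveceph mon destroy oldnode

...but the GUI fails before I even get that far.)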
In case it's of any use to anyone; I had a similar problem and this thread helped me pinpoint the answer.
In my case, a newly added node that wasn't completely new had the wrong ceph key in a few files. Replacing the key value in the following two files with the key value of an existing cluster...
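In case it helps with diagnosing, a rough way to compare key values is below - note these are the standard Proxmox locations, not necessarily the exact two files I had to touch (that list got cut off above):

ceph auth get client.admin
cat /etc/pve/priv/ceph.client.admin.keyring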
Hi all,
Looks like I've got the same issue as you all - corosync failing randomly (but roughly every 12 - 48 hours), causing various management and connectivity issues.
If there's any logs that I can provide to shed more light on the problem, please shout.
Couple of things about my cluster...
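For starters, this is roughly what I can pull per node if useful (the time window is just an example):

journalctl -u corosync -u pve-cluster --since "2 days ago" > corosync-$(hostname).log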
Hi Alwin,
I've got a nice steaming pile of log files for you!
https://www.dropbox.com/s/iykxek4hwqj3sj2/ceph-logs-extended.zip?dl=0
This contains:
OSD logs - switched log/mem levels to 20/20 (assumed that was the best to choose; see the sketch after this list)
Ceph log
Ceph Audit log
Ceph Mon/Mgr/Mds logs
Order of...
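For reference, the 20/20 levels mentioned in the list were bumped roughly like this (a sketch - the same could of course be made persistent in ceph.conf instead):

ceph tell osd.* injectargs '--debug-osd 20/20'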
Thanks. I'm just moving the remainder of my VM disks across to the new pool - probably should be finished by tonight/tomorrow morning, after which I'll rebalance the cluster and then enable the logging.
My cluster setup:
- 4 nodes, 24-32 CPU cores each, 128 GB RAM each (CPU load normally...
I think I tried that last week, but let me indeed try again once the cluster is rebalanced and see what happens. From memory, I think it ended up just crashing the OSDs quicker ;)
I've started to do that with a few disks - was hoping I could avoid it as there's a good 100 VMs or so, and...
...Saying that, I did rebalance some of the OSDs yesterday to bring them back in (I had removed a bunch of them last week to see if it made a difference). It is still rebalancing the cluster and has about 12% of objects misplaced, which it is slowly putting back in the right place.
No, at the moment they're running only about 10-ish VMs (some websites and some Windows boxes) - all crucial VMs have been migrated to the backup cluster. Normally they run about 50 to 100-ish VMs that are used for a training lab environment.
root@prox7:~# ceph osd df tree
ID CLASS...
Fair suggestion, but I already tried that last week. It still has a copy of the data on osd.16, and that one then tries to replicate its data to the other ones, causing the same results.
When I wasn't aware of the norecover/nobackfill flags, the only way I had to stop the OSDs flapping was...
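For anyone else reading along, the flags in question are set and cleared like so:

ceph osd set norecover
ceph osd set nobackfill
ceph osd unset norecover
ceph osd unset nobackfill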
Hmm, that did something alright ;) The moment I stopped OSD 14, it also stopped OSDs 23 and 26 - the two it's currently trying to replicate to. Surprisingly, OSD 16 - another acting drive for this PG, which should have the correct data for it - did NOT go down.
I've uploaded the log files...
Oops... :) For completeness, I've attached the full output of "ceph pg 1.3e4 query"
A couple of things I did yesterday evening and this morning:
- Moved the VM disk to the different pool (using the Move Disk option in the Hardware section of the VM - Proxmox GUI), and as the different pool...
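(For completeness, the CLI equivalent of that Move Disk step should be roughly the following - VM ID, disk and target storage are placeholders, and --delete removes the source copy once the move succeeds:

qm move_disk 101 scsi0 new-pool --delete 1
)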