An interesting fact:
- This issue does not occur on clusters that were "born" as 7.x / Pacific or Quincy.
- The I/O impact is so severe that workloads can barely run; I/O is very laggy.
I will look into mClock tuning; a rough sketch of what I plan to check is below.
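For anyone else digging into this, these are the kinds of checks I have in mind, assuming the OSDs already run the mclock scheduler (profile names are the stock Quincy ones):

# Which op queue scheduler is active on an OSD (wpq vs. mclock_scheduler)?
ceph config show osd.0 osd_op_queue
# Which mClock profile is currently in effect?
ceph config show osd.0 osd_mclock_profile
# Prefer client I/O over background work (recovery, scrub, snap trim)
ceph config set osd osd_mclock_profile high_client_ops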
Hi,
I upgraded a cluster all the way from Proxmox 6.2/Ceph 14.x to Proxmox 8.0/Ceph 17.x (latest). The hardware is EPYC servers, all-flash/NVMe, so I can rule out hardware issues. I can also reproduce the issue.
Everything has been running fine so far, except that my whole system gets slowed down when I...
Thanks for the quick reply, @fiona.
Can you explain why I only see this on clusters that have been upgraded all the way from 6.x to 8, but not on clusters that were born as 7.x? I am just curious.
//edit: sorry, I have to correct myself. I also see this on clusters that came from 7.x.
Regards.
Well, it applies to all backups that rely on QEMU's snapshot mechanisms. There are backup/replication tools that use Ceph or ZFS snapshots and bypass the QEMU layer entirely.
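For illustration, on Ceph such a storage-level snapshot can be taken with plain rbd commands, completely outside of QEMU (pool and image names here are just examples):

# Create a snapshot of the RBD image backing the VM disk, without involving QEMU
rbd snap create rbd/vm-100-disk-0@manual-backup
# List the snapshots of that image
rbd snap ls rbd/vm-100-disk-0

Note that this alone is only crash-consistent; for application consistency you would still need to quiesce/freeze inside the guest first.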
Thanks again, I will implement additional checks. :)
Ok, forget about it.
It was not a deadman alert but a qemu-ga alert, which I built into our observability setup to get notified when guest agents crash somewhere. ( https://github.com/lephisto/check-ga )
It basically issues a QMP guest-ping every 5 minutes. If that happens while the...
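The equivalent check can be run manually from the PVE host, which is roughly what the plugin automates (VMID 100 is just an example):

# Ping the QEMU guest agent of VM 100 over QMP; this fails if the agent is not responding
qm guest cmd 100 ping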
I noticed this behaviour because the Telegraf agent inside the VM started transmitting telemetry to InfluxDB, and when it was shut down again at the end of the backup, a deadman alert was raised.
I guess this is not intended behaviour?
Hi,
I noticed one weird behaviour:
I have a few VMs in my cluster that are stopped on purpose. HA is set to request state = stopped.
Now when the PBS backup is running, I see this:
INFO: starting kvm to execute backup task
The VM is being booted up, which can cause some trouble.
Proxmox /...
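For context, this is how the affected VMs are configured on the HA side (VMID 100 is just an example):

# Request that HA keeps VM 100 in the stopped state
ha-manager set vm:100 --state stopped
# Show the current HA state of all resources
ha-manager status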
You might want to take a look at:
https://github.com/lephisto/crossover
I use this to keep incremental DR cold-standby copies in separate clusters, do (near-)live migrations with minimal downtime between different clusters, and so on.
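To illustrate the general mechanism such tools build on (not necessarily exactly what crossover does internally; pool, image, snapshot, and host names are just examples), incremental copies between Ceph clusters can be done with rbd export-diff / import-diff:

# Initial full copy: export the image at a snapshot and recreate that snapshot on the remote cluster
rbd snap create rbd/vm-100-disk-0@sync1
rbd export rbd/vm-100-disk-0@sync1 - | ssh remote-node rbd import - rbd/vm-100-disk-0
ssh remote-node rbd snap create rbd/vm-100-disk-0@sync1
# Incremental run: ship only the delta between sync1 and sync2
rbd snap create rbd/vm-100-disk-0@sync2
rbd export-diff --from-snap sync1 rbd/vm-100-disk-0@sync2 - | ssh remote-node rbd import-diff - rbd/vm-100-disk-0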
Hi,
since I have to maintain services at some geographically disjunct locations, I was looking for a way to near-live migrate VMs across different pools with the least downtime possible. Sure, you can back up to PBS or export and import, but depending on the size of the images you will...
There is progress on this:
https://tracker.ceph.com/issues/48276#note-32
The PR isn't merged upstream yet, so I guess we will see this (important) fix only in 14.2.16 or later.
Just an update: I filed an issue on the Ceph Redmine. There's a proposed patch which enables verbose logging for this specific failure, but it's unclear as of now when it will be backported to 14.x.
https://tracker.ceph.com/issues/48276
so long..
Hello,
I guess I have the same issue here. An OSD crash with no obvious hardware issues:
root@X# ceph crash info 2020-11-18_02:24:35.429967Z_800333e3-630a-406b-9a0e-c7c345336087
{
"os_version_id": "10",
"utsname_machine": "x86_64",
"entity_name": "osd.29",
"backtrace": [...
*bump
Since I follow Ceph development very closely, I can tell that there are a few additional regressions in Ceph 14.2.10. I advise you not to upgrade to 14.2.10 at the moment.