"happy" to see more people reporting this issue as well as the issue being recognised and work being done to remediate...
@apollo13 we have reinstalled the cluster back to version 7 since this issue was not resolved for more than a month.
Happy to sit on a call to discuss setups and potential...
Reviewing the backup jobs, I just noticed one VM that had an error:
INFO: Starting Backup of VM 4138 (qemu)
INFO: Backup started at 2023-11-21 13:20:11
INFO: status = running
INFO: VM Name: prod-lws138-dbcl33
INFO: include disk 'scsi0' 'ceph:vm-4138-disk-0' 32G
INFO: include disk 'scsi1'...
Thanks for engaging!
Some details on the backup infra:
PBS server is VM on the PVE cluster
TrueNAS server has 128GB RAM (plenty of ARC)
ZFS pool is striped mirror of HDDs
The VM for the example will be 4142; VM config:
root@pvelw11:~# cat /etc/pve/qemu-server/4142.conf
agent...
I need to come back to this...
Did additional validation and testing as follows:
OSD bench is consistent, no issues to report
Rados bench shows slightly better results than the tests we recorded 2 years ago
Did fio testing in the VM and compared to previous results we have - no...
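For reference, the commands behind those three checks look roughly like this; the pool name, OSD id and fio parameters below are illustrative, not our actual values:

```shell
#!/bin/sh
# Illustrative benchmark command lines; pool, OSD id and fio target are assumptions.
POOL=rbd
OSD=osd.0

OSD_BENCH="ceph tell $OSD bench"                          # per-OSD write bench
RADOS_BENCH="rados bench -p $POOL 60 write --no-cleanup"  # cluster-level write bench
FIO_BENCH="fio --name=randrw --rw=randrw --bs=4k --iodepth=32 --size=1G --runtime=60 --time_based --filename=/tmp/fio.test"

# Echoed here; run each one on the appropriate host (OSD/rados bench on a node,
# fio inside the guest) and compare against your recorded baselines.
for cmd in "$OSD_BENCH" "$RADOS_BENCH" "$FIO_BENCH"; do
    echo "$cmd"
done
```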
Thanks for the response... I would love to understand what monitoring you have implemented, sounds really good.
We only collect standard proxmox metrics -> influx -> grafana...
This cluster is really, really quiet, we use it as a hot standby to our production environment, and also for testing new...
virtio-scsi-single with iothreads was deemed better for our database servers long time ago when doing performance testing...
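For context, the relevant lines of such a VM config look roughly like this (disk name and size are illustrative):

```
scsihw: virtio-scsi-single
scsi0: ceph:vm-4142-disk-0,iothread=1,size=32G
```

With virtio-scsi-single each disk gets its own virtual SCSI controller, so iothread=1 can give each disk a dedicated I/O thread instead of sharing the main QEMU event loop.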
Can definitely give it a try, but I need to understand how to reproduce the problem (e.g. target a particular VM)
Can you please expand on what you mean by this?
Random VMs; it also looks like this is happening after backups (early morning), which I need to confirm once again
All VMs are configured in a similar way; this VM was hanging this morning:
agent: 1,fstrim_cloned_disks=1
boot: order=scsi0;net0
cores: 32
cpu: x86-64-v2-AES
memory: 65536
name...
Was this ever solved and how?
We have been observing this since a recent upgrade to PVE 8 and Ceph Quincy
All VM disks on Ceph
agent: 1,fstrim_cloned_disks=1
boot: order=scsi0;net0
cores: 32
cpu: x86-64-v2-AES
memory: 65536
name: prod-lws141-dbcl41
net0...
The issue continues (randomly)...
We have noticed that migrating the VM to a different node fixes the problem, i.e. the VM is responsive again immediately (yesterday's issue was resolved by the migration, not by rebooting the cluster nodes, nor by patches).
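As a sketch, that workaround is just an online migration; the VM ID and target node name below are assumed examples, not taken from our setup:

```shell
#!/bin/sh
# Hypothetical example: live-migrate a hung VM to another node (a workaround, not a fix).
VMID=4142          # example VM ID
TARGET=pvelw12     # assumed target node name
MIGRATE="qm migrate $VMID $TARGET --online"
echo "$MIGRATE"

# Execute only on an actual PVE node:
if command -v qm >/dev/null 2>&1; then
    $MIGRATE
fi
```

An online migration recreates the QEMU process on the target node, which fits the observation that the hang is cleared by migration rather than by rebooting anything.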
We have implemented detection mechanism to understand...
Just noticed a bunch of patches released in the non-sub repo, which I rushed to deploy.
The quickest/lamest way for me to detect non-responsive systems was to check whether an IP address is detected (as the QEMU agent is not running).
After patching, all VMs appear to be OK. Not sure if rebooting the...
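A minimal sketch of that agent-based check (assuming the standard PVE `qm` CLI, run on each node) could look like this:

```shell
#!/bin/sh
# Minimal sketch: flag VMs whose QEMU guest agent does not answer a ping.
check_vm() {
    # Print the VM ID plus whether its guest agent answered.
    if qm guest cmd "$1" ping >/dev/null 2>&1; then
        echo "$1 responsive"
    else
        echo "$1 NOT responsive"
    fi
}

# Iterate over all VMs on this node (guarded so the sketch is a no-op elsewhere).
if command -v qm >/dev/null 2>&1; then
    for vmid in $(qm list | awk 'NR>1 {print $1}'); do
        check_vm "$vmid"
    done
fi
```

Hooking the "NOT responsive" branch into your alerting gives a rough but cheap hang detector.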
Another issue we observe: VMs are becoming non-responsive (we cannot SSH to them), and the following messages are displayed on the console
I cannot reboot the VM cleanly as it lost connection to the storage...
Hi team,
We just finished upgrading to version 8....
We are running a 3-node cluster with Ceph, using the no-subscription repo on this cluster.
Syslog on all nodes has tons of the following:
Nov 18 19:38:40 pvelw11 ceph-crash[2163]: WARNING:ceph-crash:post...
Does anyone know if a feature/enhancement request was actually made?
We are running a dedicated PBS with HDDs in mirrored vdevs (ZFS); 7 nodes running backups at the same time are choking the server/disks.
A slow PBS server causes issues on the VMs while the backup is running.
To mitigate this at...
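One generic knob for this kind of contention (illustrative values, not necessarily what we ended up doing) is the per-node /etc/vzdump.conf, e.g. capping backup bandwidth; staggering the backup schedules across nodes helps as well:

```
# /etc/vzdump.conf (per node) -- illustrative values
bwlimit: 102400    # limit backup bandwidth to ~100 MiB/s per job (value is in KiB/s)
ionice: 8          # lowest I/O priority for the backup worker (BFQ scheduler only)
```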