Recent content by hepo

  1. hepo

    [SOLVED] Ceph (stretched cluster) performance troubleshooting

    @VictorSTS, technically you are right; we are using the 3rd datacenter only for a MON, to maintain quorum in the event of a datacenter failure. The Ceph Stretched Cluster is something we looked into a long time ago; it was only available with Pacific and above, which at the time was not yet available...
  2. hepo

    [SOLVED] Ceph (stretched cluster) performance troubleshooting

    Easy: go with a single datacenter if your business case allows it. The latency between datacenters has a significant impact on Ceph performance, even at a 1 ms round trip. Additionally, we keep 4 copies of each image (vs the default 3), which also hurts performance since Ceph uses synchronous...
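
    If it helps, a minimal sketch of how that 4-copy replication could be set on a pool via the Ceph CLI (the pool name "rbd-pool" is a hypothetical placeholder, not from the post):

        # Keep four replicas of every object (e.g. two per datacenter);
        # "rbd-pool" is a hypothetical pool name used for illustration.
        ceph osd pool set rbd-pool size 4
        # Allow I/O to continue with two surviving replicas.
        ceph osd pool set rbd-pool min_size 2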
  3. hepo

    [SOLVED] Ceph (stretched cluster) performance troubleshooting

    Excellent response above, nothing to add. Just know our journey went from consumer to enterprise NVMe drives; we never used SSDs for the Ceph cluster. With regard to the crush map, I found the following in our docs:

        ceph osd tree
        ceph osd crush add-bucket org-name root
        ceph osd crush add-bucket...
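
    A hedged sketch of how such a crush hierarchy is typically assembled (the post's actual commands are truncated above; bucket and host names here are illustrative placeholders):

        # Inspect the current hierarchy first.
        ceph osd tree
        # Create a custom root plus one bucket per datacenter.
        ceph osd crush add-bucket org-name root
        ceph osd crush add-bucket dc1 datacenter
        ceph osd crush add-bucket dc2 datacenter
        # Attach the datacenter buckets under the new root.
        ceph osd crush move dc1 root=org-name
        ceph osd crush move dc2 root=org-name
        # Move each host under its datacenter (repeat per host).
        ceph osd crush move pve-host1 datacenter=dc1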
  4. hepo

    [SOLVED] Ceph (stretched cluster) performance troubleshooting

    The results are from the following fio command, executed directly on the NVMe drives; Ceph is not involved in this test...

        fio --ioengine=libaio --filename=/dev/nvme... --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio

    Ceph is pretty much default...
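
    For readability, the same invocation annotated, with the truncated device path replaced by a hypothetical placeholder:

        # Single-job, queue-depth-1, 4K synchronous writes straight to the
        # raw device; /dev/nvme0n1 is a placeholder for the elided path.
        fio --ioengine=libaio --filename=/dev/nvme0n1 \
            --direct=1 --sync=1 --rw=write --bs=4K \
            --numjobs=1 --iodepth=1 \
            --runtime=60 --time_based --name=fio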
  5. hepo

    Proxmox scalability - max clusters in a datacenter

    I thought VMware had a limit of about 32 nodes per cluster... Multi-cluster management would be really nice!
  6. hepo

    Proxmox VE 8.0 released!

    Moving to version 8 is inevitable, but I would love to see the issue resolved first. I appreciate any updates you can provide!
  7. hepo

    Proxmox VE 8.0 released!

    "happy" to see more people reporting this issue as well as the issue being recognised and work being done to remediate... @apollo13 we have reinstalled the cluster back to version 7 since this issue was not resolved for more than a month. Happy to sit on a call to discuss setups and potential...
  8. hepo

    Kernel panic, machine stuck, task khugepaged:796 blocked for more than 120 seconds

    I have spewed tons of posts on this in the PVE 8 thread; last time I checked, the issue continued. Ceph Quincy here...
  9. hepo

    Proxmox VE 8.0 released!

    Reviewing the backup jobs, I just noticed one VM that had an error:

        INFO: Starting Backup of VM 4138 (qemu)
        INFO: Backup started at 2023-11-21 13:20:11
        INFO: status = running
        INFO: VM Name: prod-lws138-dbcl33
        INFO: include disk 'scsi0' 'ceph:vm-4138-disk-0' 32G
        INFO: include disk 'scsi1'...
  10. hepo

    Proxmox VE 8.0 released!

    Thanks for engaging! Some details on the backup infra:

        PBS server is a VM on the PVE cluster
        TrueNAS server has 128GB RAM (plenty of ARC)
        ZFS pool is a striped mirror of HDDs

    The VM for the example will be 4142; VM config:

        root@pvelw11:~# cat /etc/pve/qemu-server/4142.conf
        agent...
  11. hepo

    Proxmox VE 8.0 released!

    I need to come back to this... Did additional validation and testing as follows:

        OSD bench is consistent, no issues to report
        Rados bench shows slightly better results compared to the tests we have on record from 2 years ago
        Did fio testing in the VM and compared to previous results we have - no...
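
    For context, a sketch of the benchmarks being referred to (the OSD id and pool name are hypothetical placeholders):

        # Built-in OSD bench: by default writes 1 GiB in 4 MiB blocks to one OSD.
        ceph tell osd.0 bench
        # 60-second cluster-level write then sequential-read benchmark;
        # "testpool" is a placeholder pool name.
        rados bench -p testpool 60 write --no-cleanup
        rados bench -p testpool 60 seq
        rados -p testpool cleanup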
  12. hepo

    Proxmox VE 8.0 released!

    Thanks for the detailed write-up! We will also evaluate Ceph monitoring via Zabbix - https://www.zabbix.com/integrations/ceph
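
    One possible wiring, besides the linked Agent 2 integration, is Ceph's built-in zabbix mgr module; a minimal sketch, assuming a reachable Zabbix server (the address is a placeholder):

        # Enable the module and point it at the Zabbix server or proxy.
        ceph mgr module enable zabbix
        ceph zabbix config-set zabbix_host zabbix.example.com
        # Push one batch of metrics immediately to verify the pipeline.
        ceph zabbix send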
  13. hepo

    Proxmox VE 8.0 released!

    Thanks for the response... I would love to understand what monitoring you have implemented; it sounds really good. We only collect standard Proxmox metrics -> InfluxDB -> Grafana... This cluster is really, really quiet; we use it as a hot standby to our production environment, and also for testing new...
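
    The Proxmox -> InfluxDB leg is configured through an external metric server entry; a minimal sketch of /etc/pve/status.cfg, assuming an InfluxDB UDP listener (host and port are placeholders):

        # Hypothetical entry appended to the cluster-wide status.cfg.
        cat >> /etc/pve/status.cfg <<'EOF'
        influxdb: metrics
                server influx.example.com
                port 8089
        EOF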
  14. hepo

    Proxmox VE 8.0 released!

    virtio-scsi-single with iothreads was deemed better for our database servers a long time ago when we did performance testing... can definitely give it a try, but I need to understand how to reproduce the problem (e.g. target a particular VM). Can you please expand on what you mean by this?
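
    For reference, switching an existing VM to that layout might look roughly like this (the VMID and disk volume are hypothetical):

        # Per-disk controller with a dedicated I/O thread; VMID 1234 and
        # the Ceph volume name are placeholders.
        qm set 1234 --scsihw virtio-scsi-single
        qm set 1234 --scsi0 ceph:vm-1234-disk-0,iothread=1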
  15. hepo

    Proxmox VE 8.0 released!

    Random VMs; it also looks like this is happening after backup (early morning), which I need to confirm once again. All VMs are configured in a similar way; this VM was hanging this morning:

        agent: 1,fstrim_cloned_disks=1
        boot: order=scsi0;net0
        cores: 32
        cpu: x86-64-v2-AES
        memory: 65536
        name...