Search results

  1. J

    Scheduled downtime of large CEPH node

    Yes. If you are using separate DB/WAL drives, you can't move the OSDs around. If you have 20 OSDs concentrated on one node, I don't think that 2x OSDs on the 9 other nodes will provide enough capacity in a 3x replicated CRUSH map. It depends on OSD size. You are going to end up shuffling around a lot...
  2. J

    Scheduled downtime of large CEPH node

    I am a little confused. Is Ceph only on one node, and not on all 10? The entire purpose of Ceph is that it is a high-availability, clustered file system designed to be run on multiple nodes. The recommendation for a production cluster is a minimum of five nodes. You are saying you have Ceph on...
  3. J

    Shutdown of the Hyper-Converged Cluster (CEPH)

    Based on guidance from Ceph and experience, this is the best method. If you do not set the Ceph flags and a node does not boot properly within five minutes (a tunable that can be changed from the default), the OSDs on that node will be marked out. The rest of the Ceph cluster will...
  4. J

    Shutdown of the Hyper-Converged Cluster (CEPH)

    OP's originally stated method is the best method. If you don't set the Ceph flags, the cluster will begin to rebalance as soon as OSDs are marked out. If there is more than a five-minute gap between the first and last node shutdown, or similarly on restart, you will be dealing... (a sketch of the usual flag commands follows this list.)
  5. J

    VM Instability and KP's

    Unfortunately not. The PCIe slots are completely full and require both sockets to be populated. With only CPU0 (the first socket) populated, I would be forced to choose between networking cards and NVMe drives. I can't really operate an HCI infrastructure without storage or networking. However, I...
  6. J

    VM Instability and KP's

    It isn't hard to obtain, but it also isn't inexpensive.
  7. J

    VM Instability and KP's

    Thank you for your help. I actually did exactly that: I tested the third chassis with one CPU in Socket 1. I was able to narrow the issue down to the second CPU. It has some errata. Sadly, it will need to be replaced. Thank you for helping me find this little gremlin!
  8. J

    VM Instability and KP's

    Initial Disclaimer and Apology: The Proxmox forum isn't really the place for this, but I am casting a wide net in the hope that someone has ideas. I don't believe there is anything in any way wrong with the Proxmox software or with my installation at the software level. I apologize if this is long and overly...
  9. J

    [SOLVED] restore failed: detected chunk with wrong digest.

    The problem is not directly related to PBS. This thread can be closed. I will post separately in a desperate plea for help. Thank you.
  10. J

    [SOLVED] restore failed: detected chunk with wrong digest.

    Thanks for the tips. I will have to do another restore attempt and gather the logs from both ends for you. Regarding memory: I have problems regardless of the cluster I am restoring to. All nodes have 192 GB of ECC DDR4. I wasn't aware of a good way to test ECC, but I am open to any recommendations.
  11. J

    [SOLVED] restore failed: detected chunk with wrong digest.

    I am also encountering problems when syncing between the remote and the destination PBS. Remote: May 28 09:04:00 pbs proxmox-backup-proxy[194]: TASK ERROR: connection error: Connection reset by peer (os error 104) Destination: 2021-05-28T09:04:00-05:00: sync group vm/115 failed - detected chunk with wrong...
  12. J

    [SOLVED] restore failed: detected chunk with wrong digest.

    The same thing happens when taking a backup. Here are the logs. PVE log: INFO: starting new backup job: vzdump 112 --storage HomePBS --remove 0 --node viper --mode snapshot INFO: Starting Backup of VM 112 (qemu) INFO: Backup started at 2021-05-27 08:15:05 INFO: status = running INFO: VM Name...
  13. J

    [SOLVED] restore failed: detected chunk with wrong digest.

    I am attempting to restore a backup from PBS to a node. I have run a complete verification pass on all backups; everything passes. However, the restore immediately errors out with "restore failed: detected chunk with wrong digest". What is going on? new volume ID is...
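
The "Ceph flags" referenced in results 3 and 4 are OSD cluster flags that keep Ceph from marking stopped OSDs out and from rebalancing data during a planned, whole-cluster shutdown. A minimal sketch only, assuming shell access to a node running a Ceph monitor; the exact flag set shown here is an assumption and is not quoted from the truncated posts above:

    # Before shutting down the first node: tell Ceph the outage is planned.
    ceph osd set noout         # stopped OSDs are not marked "out" after the timeout
    ceph osd set norebalance   # no data movement while OSDs are down

    # After every node is back up:
    ceph osd unset norebalance
    ceph osd unset noout
    ceph -s                    # confirm the cluster returns to HEALTH_OK

Whether additional flags such as norecover or nobackfill are appropriate depends on the cluster; check the current Ceph documentation before a production shutdown.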