For my customers we run an RDS farm + FSLogix on Ceph, tbh, not on local disks. Try changing the CPU type to something other than host. Also, maybe post screenshots of the VM's load and I/O; maybe you have something like an I/O stall?
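A rough sketch of the CPU type change, assuming the `qm` CLI on the Proxmox host. The VM ID 100 and the model name `x86-64-v2-AES` are placeholders of mine, not something from the thread; pick whatever model your PVE version offers. The script only prints the command, it does not run it.

```python
import shlex


def build_set_cpu_cmd(vmid, cpu_type):
    """Build the argv for `qm set <vmid> --cpu <type>` (not executed here)."""
    return ["qm", "set", str(vmid), "--cpu", cpu_type]


if __name__ == "__main__":
    # Hypothetical VM ID and CPU model, adjust to your setup.
    cmd = build_set_cpu_cmd(100, "x86-64-v2-AES")
    print(shlex.join(cmd))
```

The new CPU type only takes effect after a full stop and start of the VM, not a reboot from inside the guest.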
I use those nested snapshots for some testing (trying different versions of some app), and I haven't found a reliable way to migrate or back up those snapshots.
Usually I don't agree with that, because I've had customers on 3-node Ceph/Proxmox for more than 3-4 years without a hiccup (not counting power errors, etc.). And nowadays, with Proxmox everything is batteries included; the only thing the admin needs to...
I don't agree that the Ceph learning curve is steep. Read the docs, get the network right (I would say everything else too, but this is primary), and start working with it. There are rarely problems, if we leave out physical-layer problems.
Why are you putting those two nodes in the same cluster as the earlier ones? Why not, just for starters, create them in separate clusters and work with that?
Why have external help? When an external audit comes (Grant Thornton, KPMG, etc.), the first thing they ask is "what if you get hit by a bus?", i.e. who will support this and that. That is why you need external maintenance contracts.
You mention failover, and in that regard it makes sense to use Proxmox <> Proxmox pve-zsync replication. Both Proxmox hosts have ZFS storage for VMs and CTs, and you set it up on one machine. Failover then consists of moving the .conf files and starting the...
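To make the pve-zsync setup a bit more concrete, here is a minimal sketch that builds one `pve-zsync create` job per guest. The guest IDs, the target address `192.0.2.10`, the dataset `tank/replica` and the job names are all made up for illustration; the script only prints the commands so you can review them before running anything.

```python
import shlex


def build_pve_zsync_create(vmid, dest_host, dest_dataset, job_name, maxsnap=7):
    """Build the argv for one `pve-zsync create` job (one guest per job)."""
    return [
        "pve-zsync", "create",
        "--source", str(vmid),                   # guest ID on the source host
        "--dest", f"{dest_host}:{dest_dataset}",  # ZFS dataset on the target host
        "--name", job_name,                      # job name, used in snapshot names
        "--maxsnap", str(maxsnap),               # snapshots to keep on the target
        "--verbose",
    ]


if __name__ == "__main__":
    # Hypothetical guests: (vmid, job name, snapshots to keep).
    guests = [(100, "db", 14), (101, "web", 7)]
    for vmid, name, keep in guests:
        cmd = build_pve_zsync_create(vmid, "192.0.2.10", "tank/replica", name, keep)
        print(shlex.join(cmd))
```

For failover, guest configs on a Proxmox host live under /etc/pve/qemu-server/ (VMs) and /etc/pve/lxc/ (CTs), so that is where the replicated .conf has to end up on the target. Check where your pve-zsync version stores the config copies on the destination before relying on this.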
1. Yes, it is, because you don't have shared storage for all your nodes, which is usually Ceph.
2. There are some aftermarket scripts to replicate a whole pool, but you usually want to replicate different machines on different schedules, e.g. DBs in...