Sorry, I'm not the OP ;). I too think that something isn't correctly configured in Corosync: the hosts lost quorum and HA made them reboot, hence I asked for the configuration files and other details.
Essentially it's the same procedure as with a major version upgrade, like you posted [1]. Install packages in all nodes, set noout, then restart the mons, managers, OSDs and MDSs one by one, waiting for Ceph status to be OK before restarting the next...
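The steps above can be sketched roughly like this on each node (a hedged sketch, not an official runbook; assumes a stock PVE/Ceph install with the usual systemd targets, and that you watch `ceph -s` between restarts):

```shell
# Run once, from any node, before touching anything:
ceph osd set noout                  # prevent rebalancing while OSDs restart

# Then, on each node in turn:
apt update && apt full-upgrade -y   # pull the new Ceph minor-version packages
systemctl restart ceph-mon.target   # wait for HEALTH_OK before continuing
systemctl restart ceph-mgr.target
systemctl restart ceph-osd.target   # wait for all PGs active+clean
systemctl restart ceph-mds.target   # only if you run CephFS

# Once every node is done:
ceph osd unset noout
```

Checking `ceph -s` after every restart and before moving to the next node is the important part; the commands themselves are quick.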
QEMU guest agent does not keep time in sync like e.g. VMware Tools does. It only syncs time on VM start, after suspend/resume, and after a snapshot is taken, either a manual one or one from a backup. You need some NTP/SNTP client to keep time synced...
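For example, inside a Debian/Ubuntu guest (chrony is just one option among several NTP clients; systemd-timesyncd or ntpsec work too):

```shell
# Install and enable chrony as the in-guest NTP client
apt update && apt install -y chrony
systemctl enable --now chrony

# Verify the guest is actually syncing against a time source
chronyc tracking
```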
Make vzdump backups of the PBS VM to a Samba or NFS share on that same NAS that hosts your datastore. If the cluster breaks, or in any other disaster, just install PVE anywhere, restore the backup of the VM and you'll regain access to your backups...
Post your /etc/pve/corosync.conf and /etc/network/interfaces of the nodes. Also the current output of corosync-cfgtool -n in each node. Full logs of each node at around the time of the reboot will be useful too. We can't really guess anything...
This post would be much more useful if you provided the exact version of the VirtIO ISO and the version of each component (SPICE vdagent, VirtIO drivers, etc.) that each one installs. That information will be needed for the bug report anyway...
Simply apt update && apt install ceph-common. That will update the Ceph packages and their dependencies. If in doubt, after apt update, check the available versions for each Ceph package with something like apt list --installed | grep ceph.
If you care about your data, buy second-hand enterprise drives instead of consumer ones. The performance you are seeing is expected with those drives: once the drive's SLC cache is full, writes are very slow.
Also, not sure if the...
Proxmox Mail Gateway 8.2 is available! The new version of our email security solution is based on Debian 12.9 “Bookworm”, but defaulting to the newer Linux kernel 6.8 and allowing for opt-in use of kernel 6.11. The latest versions of ZFS 2.2.7...
As detailed in the docs, max latency is 10ms [1], although IME staying below ~5ms is recommended. Clustering is not supported if latency is above those values.
In your use case, I would simply have independent hosts or at least an independent...
Glad to know it's OK now!
Add that third host ASAP, like yesterday. While you add the third node with HDDs, if for some reason you have to take down one of the HDD hosts, or if it breaks, change the pool to 2/1. That will allow keeping I/O to...
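Switching a pool to 2/1 and back is just two settings per pool (sketch only; "hdd-pool" is a hypothetical pool name, use yours, and remember 2/1 is for temporary/disaster situations only):

```shell
# Temporarily allow I/O with a single surviving replica
ceph osd pool set hdd-pool size 2
ceph osd pool set hdd-pool min_size 1

# ...and back to the safe defaults once the third node is in place
ceph osd pool set hdd-pool size 3
ceph osd pool set hdd-pool min_size 2
```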
As aaron pointed out, check the MTU. I've just noticed that you've set mtu 8972 in the NICs for the 10.22.0/24 network, instead of the more typical 9000. In your case, ping -M do -s 8972 {target host} won't work, as you have to subtract 28 from...
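The arithmetic, spelled out (the 28 bytes are the 20-byte IPv4 header plus the 8-byte ICMP header, which ping's -s payload does not include):

```shell
# Payload size for a non-fragmenting ping test at a given MTU
mtu=8972
overhead=28                      # 20-byte IPv4 header + 8-byte ICMP header
payload=$((mtu - overhead))      # 8944 for an MTU of 8972
echo "ping -M do -s ${payload} {target host}"
```

So with mtu 8972 the largest payload that fits unfragmented is 8944, and `ping -M do -s 8944 {target host}` is the test that should succeed.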
Please, post /etc/pve/ceph.conf and /etc/pve/storage.cfg so we get the full picture of the setup.
I would also take a look with tcpdump on the new node to check whether that host is getting any reply from the Ceph cluster when trying to access it.
Not exactly: data will remain in one node, but PGs will become inactive unless the pool is set to 2/1 replicas (which is not recommended in any case except for disaster recovery).
A lot of things here...
You are using 6 monitors, which isn't supported. Use either 3 or 5 (5 preferred, as it allows 2 mons to fail while still keeping quorum).
You have a 3/2 pool set to drive class "hdd", but only have 2 servers with "hdd"...
Need the output requested to be sure.
That was helpful in that case due to using just 3 hosts. With 6 it will probably help, but does not completely avoid the main issue and over time unbalanced distribution will arise, not to mention how unsafe...
Checking this again with the already provided data: pretty sure you are using crush rule(s) that do not use device class, and mixing 1T with 10T drives in the same pools with so few PGs will cause such imbalance.
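If that's the case, the usual fix is to create rules restricted to a device class and point the pools at them (sketch; "replicated-hdd" and the pool name are placeholders, and reassigning the rule will trigger data movement):

```shell
# Create a replication rule that only selects OSDs with device class "hdd",
# failure domain "host", in the "default" CRUSH root
ceph osd crush rule create-replicated replicated-hdd default host hdd

# Point a pool at the new rule (expect backfill while data moves)
ceph osd pool set {pool} crush_rule replicated-hdd
```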