Not entirely sure if this is saying much of use:
-- Journal begins at Thu 2021-07-08 11:26:31 PDT, ends at Fri 2021-10-08 08:05:08 PDT.
Oct 08 01:35:56 pvenode1 systemd[1]: Starting The Proxmox VE cluster filesystem...
Oct 08 01:35:56 pvenode1 pmxcfs[6572]: [quorum] crit: quorum_initialize...
Further investigation suggests a corosync congestion issue with the current layout. Is there a good place to look for logging regarding corosync errors?
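For reference, these are the spots I've been poking at so far, in case someone can point me somewhere better (assuming the stock systemd units on a standard PVE install):

journalctl -u corosync -b --no-pager      # corosync messages for the current boot
journalctl -u pve-cluster -b --no-pager   # pmxcfs/quorum messages like the ones above
corosync-cfgtool -s                       # link/ring status as seen from this node
corosync-quorumtool -s                    # current quorum state and membership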
@adriano_da_silva
From my experience with Ceph, I prefer it for the rebalancing and reliability it has offered. I have put a lab Ceph setup through hell with a mix of various drives, host capabilities, and even uneven node networking capabilities. The only thing it did not handle well (and...
I want to bring up that Ceph cluster again. Are you expecting there to be OSDs on the nodes you power off to save power? If so, Ceph will have to rebalance every time you turn servers off/on, and that will put considerable wear on your storage media, slow down your storage pools, and cause...
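If powering nodes down really is the plan, one common approach for planned downtime (a sketch only, double-check it against your own maintenance procedure) is to set the cluster flags before shutting down so Ceph doesn't start shuffling data, then clear them afterwards:

ceph osd set noout         # don't mark stopped OSDs "out", which is what triggers rebalancing
ceph osd set norebalance   # extra safety: block rebalancing while the flags are set
# ...power the node(s) off and back on, wait for the OSDs to rejoin...
ceph osd unset norebalance
ceph osd unset noout

Keep in mind the data on the powered-off OSDs still goes stale while they are down, so you get recovery traffic when they come back either way.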
Well regardless, I suggest starting your diagnostics with vda1, since it is the one showing I/O errors and buffer issues. It is likely what is causing both your storage problems and the massive amount of IO Wait I can see in the CPU graph.
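A rough starting point (assuming vda is a virtio disk inside a VM, so the real SMART data lives on whatever physical disk backs it on the host; /dev/sdX below is a placeholder for that disk):

dmesg -T | grep -iE 'vda|i/o error|buffer'   # inside the guest: confirm the errors and when they occur
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT           # inside the guest: confirm what vda1 actually holds
smartctl -a /dev/sdX                         # on the host: check the physical disk backing the VM image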
What is vda1?
Upgraded a 4-node production cluster running Ceph with no downtime or issues. I did the needful and ran the check script, ensured Proxmox 6.4 was fully updated, and reviewed my systems for any of the known issues/extra steps. For example, 3/4 nodes in this cluster run old boards and therefore...
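For anyone else doing the 6.4 to 7 jump, the check script I mean is the one shipped with the 6.4 packages; run it on each node before and after updating (verify the exact steps against the official upgrade guide rather than taking my word for it):

apt update && apt dist-upgrade   # make sure 6.4 is fully current first
pve6to7 --full                   # upgrade checklist script; re-run until it comes back clean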
Are you sure the heat isn't coming from somewhere else in the system and just being vented by the power supply?
Otherwise, are there any bulging capacitors on that motherboard?
Depending on the age of the system and the rarity of that hardware at this point, it very well may be that the drivers are not included in the Proxmox installer anymore. You may have to look into drivers for that system and load them into the installer.
You could also try putting the controller...
Also, in case the docs do not make this clear enough, I strongly suggest you do hardware testing on the drive(s) which held the malfunctioning PG(s).
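What I mean by hardware testing, as a rough sketch (replace /dev/sdX with whichever disk(s) backed the bad PG(s); the tests below are read-only, but still schedule them sensibly on disks in service):

smartctl -a /dev/sdX        # current SMART attributes, watch reallocated/pending sector counts
smartctl -t long /dev/sdX   # start an extended self-test, check the result later with -a
badblocks -sv /dev/sdX      # read-only surface scan; expect this to take a long time on big disks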
You may need to confirm what hardware you have, and whether the storage controller is still functional on an older unit like that. Doing some quick searching, I am unable to find an "H433" controller, but based on my test labs I can say that other storage controllers of the same generation as that...
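To pin down exactly what controller is in there, something along these lines from a live Linux environment should do it (output varies a lot by vendor, so treat this as a rough pointer):

lspci -nn | grep -iE 'raid|sas|sata|scsi'   # lists the storage controller with its PCI IDs
dmidecode -t system                         # confirms the exact server model/generation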
@akus I'm glad the SSD cache is helping with that drive. Definitely keep a backup of what's stored on that array though, since I have seen those SMR drives cause issues with cached setups before, notably on my home machine when I tried to make some use of the extra SMR disks I had lying around.
He may have more than one issue, fair, but having accidentally run VMs on that exact disk model before, I can say that my performance on an otherwise decent system was awful. The IO Wait those disks can pile up when you have a bunch of random writes will make them feel like a failed/failing...
I know exactly what your problem is.
First and foremost, that Seagate ST2000DM005 is a Shingled Magnetic Recording (SMR) disk. These are only good for slow, mostly read-only storage. They have terrible random write speeds due to the technology behind SMR. There is no way to fix this; there are...
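If you want to see the problem in numbers rather than take my word for it, a quick random-write test against that disk will show it (a sketch only; point --filename at a scratch file on a filesystem on the SMR disk, never at the raw device of anything holding data):

fio --name=smr-randwrite --filename=/mnt/smrdisk/fio.test --size=4G \
    --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based

Let it run long enough for the drive's internal cache region to fill and you will likely watch the write IOPS fall off a cliff.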
For an actual server storage controller, it can be an issue with command queue depths being filled with commands waiting on the slower HDDs. I've even seen decent server storage controllers choke a bit when there are SSDs and HDDs connected and the HDDs are loaded with a large write task...
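If you suspect that is what's happening, the per-device queue depth is visible (and on many controllers/drivers tunable) through sysfs for SAS/SATA devices; a quick way to eyeball it, with example device names:

for d in /sys/block/sd*/device/queue_depth; do echo "$d: $(cat "$d")"; done
# lower the HDDs' depth so they can't hog the controller, e.g.:
echo 4 > /sys/block/sdb/device/queue_depth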