I wrote in my OP:
Therefore I can't use ethX: the kernel can assign those names itself, and it messed the NICs up by swapping them around randomly.
I completely removed VLAN12: I first removed the interfaces from the firewall machines, then removed the SDN config. I then restarted the firewall, recreated the SDN for VLAN12, re-added the interface (which is a bridge on vmbr0), and recreated the rules on the pfSense after adding the IP...
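For reference, the recreated pieces end up looking roughly like this on the Proxmox side (zone and vnet names here are examples, not my exact ones):

# /etc/pve/sdn/zones.cfg -- a VLAN zone bound to the existing bridge
vlan: zvlan
        bridge vmbr0
        ipam pve

# /etc/pve/sdn/vnets.cfg -- the vnet carrying VLAN tag 12
vnet: vlan12
        zone zvlan
        tag 12

# apply the pending SDN configuration from the CLI
# (same as pressing Apply in the GUI, if I recall correctly)
pvesh set /cluster/sdn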
Yes, before I did that, the SDNs showed an error. After applying, the status returned to normal.
Yes, I have actually deleted a complete definition and recreated it, but it had no noticeable effect. I think I'll just do one again.
32 nodes with 2 x AMD 128-core EPYC v4 processors = 8192 vCPUs per cluster. Pretty big. But as others have said, the inter-cluster scripts and automation are the key to making it work.
We ran into a very nasty issue a few days ago.
Background:
systemd generates ridiculously long interface names (see https://manpages.debian.org/bookworm/udev/systemd.link.5.en.html, also referenced here: https://wiki.debian.org/NetworkInterfaceNames#CUSTOM_SCHEMES_USING_.LINK_FILES) like...
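The workaround I take from the Debian wiki page above is to pin each NIC to a short custom name with a .link file, roughly like this (the MAC address is obviously a placeholder):

# /etc/systemd/network/10-lan0.link
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=lan0

# as the wiki notes, rebuild the initramfs so the rename also applies early in boot
update-initramfs -u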
When viewing a QEMU machine console with noVNC, the only options are to scale the screen locally or not. When scaled locally, the text is so small that it's not practically usable. Disabling local scaling fixes that, but then the viewport cannot be shifted left/right or up/down, so...
I have since learned that live migration is a feature of QEMU. This article describes it nicely. I cannot, however, see how this would cause the problem we're having.
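For reference, this is the kind of live migration I mean, triggered from the CLI (VM ID and target node are placeholders):

# live-migrate VM 101 to node pve2 while it keeps running
qm migrate 101 pve2 --online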
I have a FreeBSD 12.3 guest running a poller node, and after installation everything runs just fine. We can stop and start the guest too, no problem. The guest uses VirtIO SCSI and a Ceph RBD image of 120GB. The FreeBSD qemu-guest-agent is installed.
If for some reason the VM is...
However, when I attempt to do this I get an error which is not documented anywhere afaict
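For completeness, the relevant lines of the VM config look roughly like this (VM ID, storage and volume names are from memory, not copied verbatim):

# qm config 105 (excerpt)
agent: 1
scsihw: virtio-scsi-pci
scsi0: ceph-rbd:vm-105-disk-0,size=120G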
# lsblk /dev/sdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb...
I got advice from a seasoned ceph expert to do the following:
Split the NVMe drive into 3 OSDs (with LVM) to really optimise the use of the speed the NVMe offers. So I created an additional 2 volumes (5% of the NVMe / 47GB each) on the NVMe to hold the RocksDB and WAL for 2 HDD drives. I'm in the...
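In case it helps someone else, the shape of what that looks like is roughly this (device paths, VG/LV names and sizes are illustrative, not my exact ones):

# one VG on the NVMe, three data LVs for the NVMe OSDs
vgcreate ceph-nvme /dev/nvme0n1
lvcreate -l 30%VG -n osd-nvme-1 ceph-nvme
lvcreate -l 30%VG -n osd-nvme-2 ceph-nvme
lvcreate -l 30%VG -n osd-nvme-3 ceph-nvme

# two small LVs (about 5% each, ~47G) for the RocksDB+WAL of two HDD OSDs
lvcreate -l 5%VG -n db-sdc ceph-nvme
lvcreate -l 5%VG -n db-sdd ceph-nvme

# NVMe OSDs on the data LVs
ceph-volume lvm create --bluestore --data ceph-nvme/osd-nvme-1
ceph-volume lvm create --bluestore --data ceph-nvme/osd-nvme-2
ceph-volume lvm create --bluestore --data ceph-nvme/osd-nvme-3

# HDD OSDs with their DB (and therefore WAL) placed on the NVMe LVs
ceph-volume lvm create --bluestore --data /dev/sdc --block.db ceph-nvme/db-sdc
ceph-volume lvm create --bluestore --data /dev/sdd --block.db ceph-nvme/db-sdd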
I need to do something about the horrible performance I get from the HDD pool on a production cluster (I get around 500 KB/s benchmark speeds!). As disk usage has been increasing, performance has been dropping. I'm not sure why this is, since I have a test cluster, which higher...
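For context, the numbers I quote come from something like this (pool name is a placeholder):

# 60s write benchmark against the HDD pool, then remove the test objects
rados bench -p hdd-pool 60 write --no-cleanup
rados -p hdd-pool cleanup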
Good news. I fixed the "pg incomplete" error in Ceph with the help of this post, and now that the cluster is healthy, the slow MDS message has gone away too!
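For anyone who finds this later: the commands I leaned on to see which PGs were affected (not the full fix from the linked post) were along these lines:

# list the unhealthy PGs and the reason Ceph gives for each
ceph health detail

# inspect one incomplete PG in depth (PG id is an example)
ceph pg 3.1f query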
It has now become clearer to me what happens. I removed all the MDSs and then the message changed. It seems that the active MDS generates this message, although I have trouble finding the message from the console. The message pertains to the active MDS. Previously it was mdssm1, but now...
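What finally let me tie the warning to a specific daemon was roughly this (the MDS name is a placeholder; the daemon command has to run on the node hosting that MDS):

# show which MDS is active and which are standby
ceph fs status

# ask the active MDS directly for its slow / in-flight operations
ceph daemon mds.sm1 dump_ops_in_flight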
This cluster is primarily used as a backup. We run Proxmox Backup Server on it, replicate some databases to it and use it for testing, so it's not primary production. We have had old drives fail a couple of times, though, and I hear what you're saying about too many MDS daemons.
I will do this.
However, I have created many MDS daemons because these machines are old. Any of them could go down at some point, and if the two that host the MDS daemons go down at the same time, I'd be screwed.
Is there a downside to having many mds daemons?
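For reference, my understanding of the usual knobs here (filesystem name is a placeholder) is to keep one active MDS and let the extra daemons sit as standbys, optionally with standby-replay for faster failover:

# keep a single active MDS; the other daemons remain standbys
ceph fs set cephfs max_mds 1

# optionally let one standby follow the active MDS journal for faster takeover
ceph fs set cephfs allow_standby_replay true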
Yes, the node was first removed and then rebuilt. The node was completely removed before I added the rebuilt one.
# ceph -s
  cluster:
    id:     a6092407-216f-41ff-bccb-9bed78587ac3
    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests...