Just to add to this.
So we have a dedicated 1Gb/s network for corosync. We then have two LACP-bonded 10Gb/s network cards in each host, which have been split into VLANs. We have a couple for Ceph and one for public, which at the time in question would have had zero traffic on it. We have defined...
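For reference, the bond/VLAN layout is roughly like this in /etc/network/interfaces (a sketch only; the interface names, VLAN IDs and addresses here are placeholders, not our exact config):

# Dedicated 1Gb/s NIC for corosync
auto eno1
iface eno1 inet static
        address 10.20.0.16/24

# LACP bond across the two 10Gb/s ports
auto bond0
iface bond0 inet manual
        bond-slaves enp65s0f0 enp65s0f1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-miimon 100

# Ceph VLAN riding on the bond
auto bond0.100
iface bond0.100 inet static
        address 10.10.100.16/24

# VLAN-aware bridge for guest/public traffic
auto vmbr0
iface vmbr0 inet static
        address 192.0.2.16/24
        gateway 192.0.2.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094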
So this weekend's problem.
We did some maintenance on one of our nodes this weekend. We have a cluster with Ceph running. It appeared that everything went OK, but we have had reports today from customers that their applications crashed around the time of the migration.
One VM that we...
Well, 3000 is quite a lot!
Our use case is as follows. We run a SaaS application that interfaces with a third-party application which is very badly behaved! As such, we need to keep it very tightly constrained resource-wise. We actually have the application packaged up into a docker container for...
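To give an idea of "tightly constrained", think hard CPU/memory caps on the container along these lines (a sketch only; the image name and exact limits are placeholders, not our real values):

docker run -d \
  --name legacy-app \
  --cpus="2" \
  --memory="2g" \
  --memory-swap="2g" \
  --pids-limit=256 \
  --restart=unless-stopped \
  registry.example.com/legacy-app:latest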
Thanks for your further replies.
I have checked the switches and they do seem to be working with 802.3ad:
[LN-LD4-SW-S6720]dis lacp stat eth-trunk 16
Eth-Trunk16's PDU statistic is:
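On the Proxmox side, the bond state can be cross-checked too (assuming the bond device is bond0):

cat /proc/net/bonding/bond0    # shows bond mode, LACP details, aggregator IDs and per-slave state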
So I did a full set of pings.
In the short-term runs I can see maximums above 6ms, but in the longer-running ones these spikes seem to vanish.
Looking at it, it would seem using this link would cause issues for corosync.
I also did a "jumbo-frame" ping between all the servers and could confirm that that is working...
Ok, so an update for you.
I have changed the primary corosync interface over to a standalone 1Gb one. For now it is still on the same switch, as changing that will require a DC visit, but I figured it would be better off like this.
For a start, the retransmit messages have stopped, and there is...
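For the record, I am keeping an eye on the link state with corosync's own tooling rather than just the syslog:

corosync-cfgtool -s    # prints the local node ID and the status of each configured knet link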
I have been taking a look at this in more depth, and I am starting to wonder if there is an issue with the overall network setup.
There are a lot of corosync entries such as this:
Jan 1 10:12:59 ld4-pve-n6 corosync: [KNET ] link: host: 3 link: 2 is down
Jan 1 10:12:59 ld4-pve-n6...
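In case it is useful to anyone else reading: the "host: 3 link: 2" numbers in those messages map back to the nodeid and ringX_addr entries in the corosync config, so you can work out which physical path is flapping (nothing assumed here beyond a standard Proxmox install):

grep -E "nodeid|name|ring[0-9]_addr" /etc/pve/corosync.conf   # map host/link numbers to nodes and interfaces
journalctl -u corosync --since "1 hour ago" | grep -i knet    # gather the recent knet link events in one place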
Thanks for the replies.
I am starting to realise that the way it has been set up is madness.
We don't have the resources to dedicate six switches to this (2 x public, 2 x cluster, 2 x storage), but we could put the cluster on its own single switch, maybe with a VLAN as a backup link (see the sketch below)?
I would not want to be in...
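What I have in mind for that, roughly, is the dedicated switch as link 0 and the VLAN as a fallback link in corosync.conf (a sketch only; the addresses are placeholders, and my understanding from the Proxmox/corosync docs is that the higher priority number is the preferred link):

totem {
  interface {
    linknumber: 0
    knet_link_priority: 10    # dedicated corosync switch, preferred
  }
  interface {
    linknumber: 1
    knet_link_priority: 5     # VLAN on the bonded network, fallback only
  }
}

nodelist {
  node {
    name: ld4-pve-n6
    nodeid: 6
    ring0_addr: 10.20.0.16    # dedicated switch
    ring1_addr: 10.30.0.16    # backup VLAN
  }
  # ...one entry per node
}

(Any edit like this goes into /etc/pve/corosync.conf with config_version bumped, as usual.)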
So the cluster was originally set up by some "consultants", but I have reason to question their abilities.
Every node is identical.
We have a two-port 10Gb Mellanox network card in place, which is configured as an LACP bond. That then has a single vmbr bridge on it, which is VLAN aware.
Thanks for your detailed reply, I am pretty much of the same mindset as you in that something to improve uptime should never behave like this.
We have been running VPS hosting for 11 years, the past 7 of those with Hyper-V, and never had such an issue as this. Windows gets its fair share...
So I tried again, making sure the IP addresses/VLANs were correct, and had exactly the same problem.
I have quite the list of issues with Proxmox now, every single one of which is disrupting my clients by having their VMs randomly go offline.
I am now at a crossroads. Do I shell out several...
We have just had a serious issue with our cluster.
We have 7 nodes in total, 4 of which are also running Ceph, and around 400 VMs running. We were in the process of adding an 8th node, and after adding it to the cluster everything started to lock up.
Upon investigation, it appeared that every...
Thanks for your detailed reply, this was just the info I was looking for.
The issue with the 820s is that the iDRAC is part of the system board, so it can't be replaced. It is also strange that it is now happening on all 7 servers that I have installed Proxmox on, and has never happened on the...
Just out of interest here.
Do you think any of these issues could be caused by the guest CPU type? We have set this to Sandy Bridge on all guests, as that is a CPU type common to all hosts; however, it is not the default of kvm64.
Also, would I be correct that adding a CPU of a newer...
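For reference, this is how we have been setting it per guest, in case that matters for reproducing anything (the VMID is a placeholder):

qm set 101 --cpu SandyBridge    # per-VM CPU model; kvm64 is the Proxmox default if nothing is set
qm config 101 | grep ^cpu       # confirm what a given guest is currently using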