Hello community!
We have 2 different 3 node setups (with community subscription) with ZFS replication instead of a shared storage. We know that's not ideal, that's not the point ;-)
Recently we had some bad side effects of unexpected node reboots (and one slooooow node reboot in particular) and I'd like to discuss if the lesson is to use very high replication frequencies.
For background, incident 1 (well, for the moment just as background info):
Recently we had another unexpected node reboot that looks like yet another corner-case instability with certain ASRock mainboards, as discussed in https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/page-4#post-692639 and below. The boards in this cluster have reached serials M80-H1025200nnn now, but the behaviour leaves the same bad taste; I really hope we won't get reboots every other day again, sigh. It might have been extraordinary network + disk I/O load combined with the virtio network driver, which seems less stable than the legacy e1000 emulation.
Incident 2:
About two weeks ago, in the early evening, I accidentally triggered a hard reboot of nodes 1 and 3 (of 3 overall) by stopping corosync (instead of just pausing HA; I can't even remember what I wanted in the first place).
Unfortunately, while node 1 was back after 2-3 minutes, node 3 took about 15 minutes to come back.
By that time node 2 had taken over the containers from node 3 - using the data replicated in the morning.
When switching back to node 3, that morning data from node 2 overruled the newer evening data on node 3.
(Why? OK, I understand why: once the services run on node 2, replication flows from node 2 back to node 3 and overwrites its newer local dataset. Still very inconvenient.)
We lost half a day of data. Customers are not happy.
I have since increased the replication frequency to every 30 minutes, but I'm very reluctant to go much higher, for fear the replication could dominate the systems' load.
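For reference, if I understand the tooling correctly, both the schedule and a bandwidth cap can be set per replication job with `pvesr` (the job ID, node name, and values below are made up for illustration):

```shell
# Replicate CT 100 to its target node every 15 minutes, capped at
# 50 MB/s so replication can't saturate disk/network under load
# (job ID and rate are example values, not a recommendation)
pvesr update 100-0 --schedule '*/15' --rate 50

# Check when jobs last ran and how long they took
pvesr status
```

The `--rate` cap might be the more important knob than the schedule itself: with a limit in place, a higher frequency should be less likely to dominate normal operation.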
Lesson 1:
Unfortunately, unexpected reboots seem far more likely than planned node maintenance or actual hardware failures, and they can burn data if the replication is not frequent enough.
Experimental "incident" 3:
Today, on our internal setup with mostly irrelevant containers whose data rarely changes, I played with the idea of setting the desired HA state of most services to "ignored", so I could migrate containers manually if I decide a node won't come back.
BUT that's not possible: the cluster insists on trying to connect to the dead node it last saw the containers running on, even if I set that node to maintenance mode.
I might be able to take a backup of the latest replicated data, kick out the stuck container, and more or less re-create it. But that's not practical at all.
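For the record, what I tried was roughly the following via the CLI (service ID and node name are examples; I'm assuming the `ha-manager` syntax from memory, so please double-check against the docs):

```shell
# Take a container out of active HA management so it can be
# handled manually instead of being auto-recovered
ha-manager set ct:100 --state ignored

# Mark the dead node as under maintenance
# (did not help in my test: migration still tried to reach it)
ha-manager crm-command node-maintenance enable pve3
```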
Lesson 2?
Ignore the load of frequent replication and replicate everything every minute?
Main question: How frequently should we replicate CTs and VMs with continuously changing data? Every minute? We don't want replication to dominate normal operation.
Question 2: Is it possible to slow down the automatic takeover of services, to accommodate slow node reboots? I searched but found nothing except wishes for even faster takeover. But for many services an outage would be much preferred over data loss like the one incident 2 describes above.
Side question: Is anyone else still experiencing unexpected reboots with ASRock mainboards or the virtio network driver?
Thanks in advance for any feedback!
Regards, Christoph