Hi everyone,
I've recently started facing issues with a large MySQL test database (20TB VM disk) and MySQL replication. Fuller details on the issue on the DB side can be found in another thread, but should be mostly irrelevant to the questions here: https://dba.stackexchange.com/quest...-uninterruptible-mysql-query-from-replication
During testing, I found that the MySQL replication failure and subsequent "pseudo-freeze" of the MySQL server process correlates with backup runs of the VM disk. What I've observed:
- MySQL replication is normally running
- The Proxmox backup starts, with an external PBS as target (10G link)
- There is a short hiccup in the VM at the beginning of the backup due to fsfreeze/thaw, but otherwise it runs normally
- The dirty bitmap finds some 500-1000GB of changed data (out of 20TB total disk size and >10TB used space)
- The backup data is written at wildly varying speeds, from 30 to 500 MB/s
- At some point during the backup, the MySQL replication will suddenly stall; not with an error, it just waits while keeping one CPU core loaded at 100%
- At some (other) point during the backup, the DB will fail to handle application requests
- Before or around the time the backup finishes, the DB will become responsive to select/insert/update/delete queries again, making the application resume work; data can be read from and written to the DB just fine, and files can also be written and read normally on both the VM's root and the DB data disk
- BUT: Even then, the MySQL replication stays stalled
- More severely, the MySQL replication can no longer be stopped; a "stop replica;" statement just hangs indefinitely
- Due to the unstoppable replication, the MySQL server can't be shut down properly; a SIGTERM will just lead to a SHUTDOWN message in the MySQL log, but the process stays alive, causing CPU load (but no or very low disk I/O, as far as I've seen)
- The only way to get out of this situation is to forcefully "kill -KILL" the MySQL server process, after which it will restart, do some cleanups and work again normally, until the next backup run
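To make the stall easier to reproduce for others, this is roughly the sequence of stock MySQL statements I used to observe it (nothing here is specific to my setup; the comments describe what I saw, not guaranteed output):

```sql
-- Standard MySQL 8.x statements used to observe the stall (illustrative):
SHOW REPLICA STATUS;  -- threads still report as running, no error, but no progress
STOP REPLICA;         -- hangs indefinitely once the stall has set in
SHOW PROCESSLIST;     -- the STOP REPLICA statement keeps sitting in this list
```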
Before running into these issues, I was under the impression that PVE backups make use of ZFS snapshots (as live migrations do), but according to Backup modes and especially the linked description Efficient VM backup for qemu, I see that this is not the case. What caught my attention in the guide is this statement about fleecing:
"With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space."
I didn't have fleecing enabled so far, but turned it on now for a test. The backup is still running, but so far the VM shows no signs of a problem. The fleecing image size seems to cycle around 2-3 GB, so if this really is a solution, I suppose I can work with that even on the fairly small "local-zfs" storage (only a few tens of GB).
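For reference, this is how I enabled fleecing for the test, if I read the vzdump docs correctly; the VMID placeholder and the "local-zfs" storage name are from my setup:

```shell
# One-off vzdump run with fleecing enabled (storage name from my setup);
# the same property string can also go into /etc/pve/vzdump.conf or a backup job:
vzdump <VMID> --fleecing enabled=1,storage=local-zfs
```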
That brings me to my main questions to hopefully better understand what's going on:
- How does the backup process interact with the VM in a way that could keep a killed process from being cleaned up until the end of the backup?
- I see that the interception of VM writes by the backup layer causes delays inside the VM, but has it already been known to cause other trouble with guest processes (especially DBs) beyond a plain slowdown?
- Does backup fleecing change the behavior of these VM interactions, beyond providing a quicker response to writes?
- Since it is mentioned in the docs that fleecing can "even prevent hangs in certain scenarios", is there an explanation or at least a theory about what could cause these hangs in the first place without fleecing? Or was "hang" in this case just about the VM lagging behind while the backup runs due to slow write performance?
- Are there plans to leverage ZFS snapshots for backups, as they are already used for VM live migration? Or has there been a change in this direction already since PVE 8.3?
Philipp