We're not able to fix this. Since we didn't have a problem for 20 years on QEMU, it must be a bug in the current version. For now we are able to create a shadow machine with a cron job that rsyncs the critical areas, and then the shadow machine is snapshotted. What we are finding is that if the rsync and the backup are run at different times, the corruption doesn't seem to occur. In my testing, where I continually copied files and loaded the CPU to 100% on all cores, I found that the snapshot took a long time, more than 8 seconds. That is probably part of the problem: the VM's writes probably fail during that window and are not handled properly. I guess we'll never know. (The server it's running on is no slouch: it has 64 cores, ultra-wide SSD bandwidth and plenty of RAM, and load and iowait are too low to measure.)
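For anyone wanting to enforce the "never at the same time" part rather than rely on cron timing alone, here is a rough sketch of one way to do it with a shared lock file. This is not what we actually run; it assumes both the rsync job and the snapshot job are launched from the same host, and the lock path and the `run_exclusive` name are made up for illustration.

```python
import fcntl
import subprocess
import sys

# Hypothetical lock file shared by the rsync job and the snapshot job.
LOCK_PATH = "/tmp/shadow-sync.lock"


def run_exclusive(cmd, lock_path=LOCK_PATH):
    """Run cmd only if no other job holds the lock.

    Returns the command's exit code, or None if the lock was busy
    and the command was skipped.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fail at once instead of waiting.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # The lock is released when lock_file is closed on leaving the block.
        return subprocess.run(cmd).returncode


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: run_exclusive.py COMMAND [ARGS...]")
    result = run_exclusive(sys.argv[1:])
    # Exit code 75 (EX_TEMPFAIL) signals "skipped because the other job is running".
    sys.exit(75 if result is None else result)
```

Both cron entries would call this wrapper with their own command (the rsync invocation in one, the snapshot command in the other), so whichever starts second is skipped instead of overlapping with the first.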
We have another warehouse that is running on kernel 3.2, which I think we can use as a template to upgrade the others stuck on 2.4. Those two kernels are light years apart; the whole filesystem, I/O, multitasking and networking stack are so different. We have a SUSE Enterprise machine on a 3.0 kernel that we back up every hour, and that's never remotely had an issue. So that will probably be the road we go down.