Unfortunately, my celebration was premature. The system died in the same way again after working fine for four days. This is pretty strange, because before the update the failure was very easy to trigger (even on a freshly booted system), and immediately after the update I consistently could not reproduce it. This time, I managed to capture the following backtraces of the KVM processes using sysrq. Again, other processes subsequently go into D state after the KVM ones lock up; I can provide those too if useful, but I assume the KVM ones are the most important.
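In case it helps anyone reproduce the capture, this is roughly how I'm grabbing the traces (a minimal sketch, assuming the standard /proc/sysrq-trigger interface and root; sysrq 'w' dumps the stacks of tasks stuck in D state to the kernel log, which I then pull out of dmesg):

```python
# Rough sketch: trigger sysrq 'w' (dump blocked/D-state tasks) and save the
# resulting backtraces from the kernel ring buffer. Needs root and kernel.sysrq
# enabled; output path is just an example.
import subprocess

def dump_blocked_tasks(out_path="/tmp/blocked-tasks.txt"):
    # 'w' asks the kernel to dump stack traces of all tasks in
    # uninterruptible (D) sleep to the kernel log.
    with open("/proc/sysrq-trigger", "w") as f:
        f.write("w")
    # Read the ring buffer; the traces appear as "task:... state:D ..." entries.
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    with open(out_path, "w") as f:
        f.write(log)
    return out_path

if __name__ == "__main__":
    print("wrote", dump_blocked_tasks())
```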
@fiona I know you have been asking for some backtraces, so perhaps these might mean something to you.
I may have more time to mess around with it in a while. I have a few random ideas to check, mainly NFS volume free space (maybe 4 days of backups brought it down to some critical level?) and running a memory test on the server (it passed fine about a year or two ago).
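For the free-space idea, I'll probably just log the mount's usage around backup time. Something like this sketch is what I have in mind (/mnt/pve/TrueNAS is my mount point; adjust for yours, and note the call will itself block if the mount hangs):

```python
# Quick sketch: periodically log free space on the NFS mount so I can see
# whether the nightly backups push it toward some critical level around the
# time of the lockups.
import shutil, time

MOUNT = "/mnt/pve/TrueNAS"  # my NFS mount point

def log_free_space(interval_s=300):
    while True:
        usage = shutil.disk_usage(MOUNT)
        free_gib = usage.free / 2**30
        pct_used = 100 * usage.used / usage.total
        print(f"{time.strftime('%F %T')}  free={free_gib:.1f} GiB  used={pct_used:.1f}%",
              flush=True)
        time.sleep(interval_s)

if __name__ == "__main__":
    log_free_space()
```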
One thing I find strange is that the hanging KVM processes belong to the TrueNAS VM, which has nothing stored on NFS (the NFS share is hosted by this VM, so it doesn't even exist yet when the VM boots). If something is causing the TrueNAS VM to block waiting on its own NFS server, that would explain the deadlock. The question then becomes why the TrueNAS KVM processes are accessing NFS at all if they have nothing stored on it.
I suppose the next step would be trying to figure out what files those processes are accessing over NFS.
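To that end, my plan is roughly the following: walk /proc/<pid>/fd and report any process holding a file open under the NFS mount (a rough sketch; /mnt/pve/TrueNAS is my mount point, it needs root to see other processes' fds, and it's essentially what running lsof against the mount point reports):

```python
# Rough sketch: find processes with files open under the NFS mount by walking
# /proc/<pid>/fd. Needs root to read other processes' fd symlinks.
import os

MOUNT = "/mnt/pve/TrueNAS"  # my NFS mount point; adjust for yours

def open_files_under(mount=MOUNT):
    hits = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = f"/proc/{pid}/fd"
        try:
            fds = os.listdir(fd_dir)
        except (PermissionError, FileNotFoundError):
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(f"{fd_dir}/{fd}")
            except OSError:
                continue
            if target == mount or target.startswith(mount + "/"):
                try:
                    with open(f"/proc/{pid}/comm") as f:
                        comm = f.read().strip()
                except OSError:
                    comm = "?"
                hits.append((int(pid), comm, target))
    return hits

if __name__ == "__main__":
    for pid, comm, path in open_files_under():
        print(pid, comm, path)
```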
EDIT: Just triggered a lockup again (it took copying 150GB of random junk this time), but I wasn't able to see any new file opens or existing open files on /mnt/pve/TrueNAS (the NFS mount).