MariaDB database corruption.

Jan-Willem Spuij

Hello,

We are experiencing database corruption in an Ubuntu 24.04 LTS container running the latest MariaDB, 10.11.8-MariaDB-ubuntu.24.04.1. The PVE kernel version is 6.8.12-6-pve.
Bugs have been reported in the MariaDB Jira:

https://jira.mariadb.org/browse/MDEV-35334
https://jira.mariadb.org/browse/MDEV-35886

However, the hangs described there seem to be connected to a series of patches backported to kernel version 6.1 (LTS), whereas we are running 6.8.12-5-pve. I'm at a loss as to what to do now. Is this known within the Proxmox community? Is anybody else experiencing corruption of InnoDB tables (especially the first bytes, which hold the checksum, ending up as 01 00 00 00)?
 
the linked bug report seems to indicate this was an issue limited to the 6.1 backport. do you really have the same symptoms? how are you running MariaDB, in a VM or in a container?
 
It's a container, not a VM, and I'm seeing one of the symptoms: in the IBD file, the first four bytes of the header are changed to 01 00 00 00. The hang does not occur. However, those Jira issues speculate that the hang and the corruption could have the same origin, namely those patches. In my case (not using row compression), it seems that the checksum is stored elsewhere, so I can just change the 01 00 00 00 to 00 00 00 00 and MariaDB can read the table again.

I would, however, like to avoid editing table files manually after every server restart. The corruption only occurs when the MariaDB server is stopped; as long as the server is running, it does not occur. Only the first byte is affected. Correcting that byte lets MariaDB read the table again immediately, and the checksum reported by innochecksum then matches again.
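
For anyone who wants to check their own tablespaces for the same symptom, here is a minimal read-only sketch in Python. It only looks at the first four bytes of each .ibd file and reports those that start with 01 00 00 00; it does not modify anything. The function names are made up for this example, and the sketch assumes it is run while the server is stopped, since that is when the corruption shows up.

```python
from pathlib import Path

# Byte pattern described in this thread: the first four bytes of the
# .ibd file header end up as 01 00 00 00.
SUSPECT_PREFIX = bytes([0x01, 0x00, 0x00, 0x00])


def read_header_prefix(ibd_path, length=4):
    """Return the first `length` bytes of a tablespace file (read-only)."""
    with open(ibd_path, "rb") as handle:
        return handle.read(length)


def has_suspect_prefix(ibd_path):
    """True if the file starts with the 01 00 00 00 pattern."""
    return read_header_prefix(ibd_path) == SUSPECT_PREFIX


def find_suspect_tablespaces(datadir):
    """Recursively list .ibd files under `datadir` that show the pattern."""
    return sorted(
        path for path in Path(datadir).rglob("*.ibd") if has_suspect_prefix(path)
    )
```

Calling `find_suspect_tablespaces("/var/lib/mysql")` would list the affected files; that path is the usual Ubuntu default and is an assumption here, so adjust it to your own datadir. A match is only a hint that a file shows the same symptom, not proof of the same cause.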

The servers use ECC memory and ZFS storage in RAID1, and scrubs reveal no errors.
 
okay, so a container should definitely use the host kernel and thus not be affected by the broken backport (the code in question looks quite different in 6.8). you could try our 6.11-based kernel, maybe there is a second bug that has already been fixed. otherwise we'd need to narrow it down a bit more.
 
keep us posted! if you find any more leads on the kernel side and can reproduce the corruption easily, we can also build test kernels if needed.