Database hosted on MSSQL server 2017 on Windows Server 2022 standard edition getting corrupted

bsinha

Hi,

We have been facing SQL Server database corruption for around the last 10 days and have been unable to resolve it.

We are running a 3-node Proxmox cluster with Ceph. There are 4 NVMe drives participating in the Ceph configuration. The 3 nodes are connected in a full-mesh network using the routed (with fallback) setup to support Ceph.

Please find hardware information of each of the servers below:

Disk model: 3.2TB Micron_7450_MTFD x4 (Participating in Ceph)
CPU: Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz x2
Memory: 32GB x12
Server Model: Supermicro SYS-620C-TN12R
Network card participating in Ceph: AOC-A25G-i4SM x2
Storage Controller: Broadcom MegaRAID 9560-16i 8GB


This 3-node cluster is hosting around 8 Windows Server VMs, all running Windows Server 2022 Standard edition. The servers are of the following kinds:

1) 5 Application servers (Running IIS)
2) 1 Active Directory (Microsoft Active Directory)
3) 2 Database servers (SQL Server 2017 with latest Cumulative Update applied) // Corruption issues are shown here

Please note all the servers are TPM enabled.

The problem started on 12th June 2025, when the SQL Server workload was brought live after enabling database encryption (TDE). The application requires 2 database servers: DB-1 acts as the transactional server and DB-2 is the reporting server. After the entire day of transactions on DB-1, all the databases are backed up at night and restored on DB-2 for reporting purposes.

DB-1 is hosting 5 databases. All of them are encryption enabled.

The problem is that, after enabling encryption, the databases have started getting corrupted. DBCC CHECKDB shows various allocation and page errors. Even when a new database is created with fresh data and DBCC CHECKDB shows no errors at all, the next day the database gets corrupted again and throws errors like the following:

[Two screenshots of the error messages; images not preserved]


It is happening randomly. So far 2 databases are affected. We do not know whether the other databases will be hit by this issue in the future. The daily routine has become: we find a corrupt database, create a new database, spend the entire night reconciling the data, and get the database to 0 errors. The next day another database gets corrupted and the cycle continues.

The same database has also been corrupted multiple times after being freshly created with 0 errors.

We have looked into the System and Hardware events in the VMs and found nothing related to Storage or I/O subsystem.

The storage of the Windows VM is configured in the following way:

[Screenshot: Storage configuration]


Is the VM configuration correct in Proxmox?
Is the hardware we chose compatible with running MSSQL Server 2017?



What is going wrong? We need urgent help on this.



PVE information is attached
 


Our company has been installing PVE for over 10 years, and we use Windows Server VMs with MS SQL versions ranging from 2008 to 2022 without any issues. The configuration you posted is correct.

I have never used CEPH or encrypted databases, so the problem might be related to these two configurations.
 
Instead of Ceph, what are you using? Do you use NVMe ssds for your workload?
 
I use both NVMe and SATA SSDs with LVM or ZFS storage.
I tested Ceph in a datacenter but noticed poor I/O performance with SQL.
Your issue might be related to the combination of Ceph's low I/O and database encryption.
Can you test the same hardware (single node) with a different type of storage, such as ZFS?
 
Using Ceph I get around 100,000 IOPS. With ZFS or RAID the IOPS are not nearly that high.
 
2 suggestions (not that you will implement either, but here goes):
1. The Proxmox host should only ever see drives used as VM storage in a RAID format of some kind (hardware-based, RAIDZ, or similar); the higher the RAID level the better (e.g. RAID10 is better than RAID5).
2. If you turn on a feature and it immediately causes an issue, then maybe you should turn that feature (encryption in your case) back off?
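
For the second suggestion, a rough sketch of how TDE could be checked and switched off for one database as a test. YourDB is a placeholder name, and this assumes a login with VIEW SERVER STATE and ALTER permission on the database. Decryption runs in the background and can take a while on a large database, so treat this as an outline rather than a tested procedure.

```sql
-- Encryption state per database.
-- encryption_state: 1 = unencrypted, 2 = encryption in progress,
--                   3 = encrypted,   5 = decryption in progress.
-- tempdb also shows up here once any user database is encrypted.
SELECT d.name,
       k.encryption_state,
       k.percent_complete,
       k.key_algorithm,
       k.key_length
FROM sys.dm_database_encryption_keys AS k
JOIN sys.databases AS d
  ON d.database_id = k.database_id;

-- Start decrypting one database (YourDB is a placeholder).
ALTER DATABASE YourDB SET ENCRYPTION OFF;

-- Only once encryption_state has dropped to 1 for that database,
-- the encryption key can be removed:
-- USE YourDB;
-- DROP DATABASE ENCRYPTION KEY;
```

If the corruption stops on the decrypted database but continues on the encrypted ones, that at least narrows down where to look.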
 
From your logs (824 and 9100 errors), this is SQL Server corruption triggered by I/O consistency issues, which are often tied to the underlying storage. Since it started right after enabling TDE, the extra encryption work is likely stressing the I/O path (Proxmox + Ceph + NVMe).

First, protect the data: take full VM disk snapshots or copy the MDF/LDF files off the cluster. Then run integrity checks:

SQL:
-- Check DB integrity
DBCC CHECKDB('YourDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Quick physical check
DBCC CHECKDB('YourDB') WITH PHYSICAL_ONLY;

-- If DB won’t come online
ALTER DATABASE YourDB SET EMERGENCY;
DBCC CHECKDB('YourDB') WITH ALL_ERRORMSGS;

If you have clean backups, test with:

SQL:
RESTORE VERIFYONLY FROM DISK='E:\Backups\YourDB.bak';
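
Note that RESTORE VERIFYONLY only validates page checksums if the backup itself was taken WITH CHECKSUM. A minimal sketch, assuming the databases use CHECKSUM page verification; the file name YourDB_checksum.bak is a placeholder for a new, not yet existing backup file:

```sql
-- Take the backup with page checksum validation. The backup fails
-- if a page with a bad checksum is read from the data files.
BACKUP DATABASE YourDB
TO DISK = 'E:\Backups\YourDB_checksum.bak'
WITH CHECKSUM;

-- Re-read the backup file and verify the checksums stored in it.
RESTORE VERIFYONLY
FROM DISK = 'E:\Backups\YourDB_checksum.bak'
WITH CHECKSUM;
```

If the nightly backup from DB-1 passes both steps but the database restored on DB-2 then fails CHECKDB, the damage is probably happening after the backup was written.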

If corruption repeats, review storage cache settings in Proxmox/Ceph and check NVMe/OSD logs for I/O errors.
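
On the SQL Server side, it can also help to look at which pages have been flagged and to confirm that page checksums are enabled. A small sketch using the standard catalog views, to be run on both DB-1 and DB-2:

```sql
-- Pages SQL Server has recorded as suspect.
-- event_type: 1 = 823 error or unspecified 824 error, 2 = bad checksum,
--             3 = torn page, 4 = restored, 5 = repaired, 7 = deallocated.
SELECT DB_NAME(database_id) AS database_name,
       file_id,
       page_id,
       event_type,
       error_count,
       last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;

-- Page verification setting per database; CHECKSUM is the one you want.
SELECT name, page_verify_option_desc
FROM sys.databases;
```

The timestamps in last_update_date can then be compared against the Ceph and Proxmox logs for the same period.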

If no good backup exists, last resort is DBCC CHECKDB ... REPAIR_ALLOW_DATA_LOSS (only on a copy). For safer recovery without losing critical records, consider using a third-party tool like Stellar Repair for MS SQL, which can open and rebuild corrupt MDF/NDF files even when CHECKDB fails.
 
Not sure why people here are suggesting ZFS or other storage solutions. The problem is not your storage or Proxmox, this happens just as well on VMware. Look up MSSQL TDE corruption and you’ll see a flood of issues and potential causes.

The reality is that (TDE in) MSSQL is buggy. It corrupts data randomly. This could be because your application (.NET framework) is out of date, your devs didn’t account for it in code or .NET doesn’t have a certain hotfix. Some other things that may trigger it: compression, clustered indexes, using ReFS instead of NTFS…

Use a real database :) or use encrypted Ceph instead of encrypting in the database. I highly doubt you are stressing this hardware. There are some things to check, like the MegaRAID, which you’re probably using as a SAS HBA to pass through disks. Ceph does notice corruption, though, and at least 2 copies would have to be corrupt for it not to raise an I/O error at the OS level, and you don’t seem to have any such errors. If it were the underlying storage, you’d see more than a single application affected; there would be logs, crashes, etc.

Here is a great article on TDE and the many paths of failure: https://medium.com/@s.mcauliffe_174...ent-data-encryption-tde-or-how-to-d164eb08564
 
I've run into the same situation. I'm running Windows Server 2019 in my VMs with SQL Server 2019. I have about six of these SQL Server instances spread across different clusters, and they are all experiencing intermittent DBCC check errors. My PVE version is 7.4-1.