Database hosted on MSSQL server 2017 on Windows Server 2022 standard edition getting corrupted

bsinha

Hi,

We have been facing SQL Server database corruption for roughly the last 10 days and have not been able to resolve it.

We are running a 3-node Proxmox cluster with Ceph. Each node has 4 NVMe drives participating in the Ceph configuration. The 3 nodes are connected in a full mesh network using the Routed (with fallback) setup to support Ceph.

Please find hardware information of each of the servers below:

Disk model: 3.2TB Micron_7450_MTFD x4 (Participating in Ceph)
CPU: Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz x2
Memory: 32GB x12
Server Model: Supermicro SYS-620C-TN12R
Network card participating in Ceph: AOC-A25G-i4SM x2
Storage Controller: Broadcom MegaRAID 9560-16i 8GB


This 3-node cluster hosts 8 Windows Server VMs, all running Windows Server 2022 Standard edition. The servers are as follows:

1) 5 Application servers (running IIS)
2) 1 Active Directory server (Microsoft Active Directory)
3) 2 Database servers (SQL Server 2017 with the latest Cumulative Update applied). The corruption occurs on these.

Please note all the servers are TPM enabled.

The problem started on 12th June 2025, when the SQL Server workload was brought live after enabling database encryption (TDE). The application requires 2 database servers: DB-1 acts as the transactional server and DB-2 is the reporting server. After a full day of transactions on DB-1, all the databases are backed up at night and restored on DB-2 for reporting.

DB-1 is hosting 5 databases. All of them are encryption enabled.
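For anyone reproducing the nightly DB-1 to DB-2 step, a minimal sketch of how the backup could be taken with page checksums verified, so that a damaged page is caught at backup time rather than after the restore. This is not the poster's actual job: the helper names, the database name and the backup path are all hypothetical, and the script only builds the T-SQL text (`BACKUP DATABASE ... WITH CHECKSUM` and `RESTORE VERIFYONLY ... WITH CHECKSUM`) without running it.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(text: str) -> str:
    """Quote a Unicode string literal, doubling any single quote."""
    return "N'" + text.replace("'", "''") + "'"


def nightly_backup_statements(database: str, backup_path: str) -> List[str]:
    """Return the T-SQL for a checksum-verified full backup.

    Hypothetical helper: it only builds the statements, it does not
    connect to SQL Server or execute anything.
    """
    db = quote_ident(database)
    disk = quote_literal(backup_path)
    return [
        # CHECKSUM makes the backup verify page checksums as it reads them.
        f"BACKUP DATABASE {db} TO DISK = {disk} WITH CHECKSUM, INIT;",
        # VERIFYONLY re-reads the backup file and re-checks the checksums.
        f"RESTORE VERIFYONLY FROM DISK = {disk} WITH CHECKSUM;",
    ]


if __name__ == "__main__":
    # Example values only; not taken from the thread.
    for stmt in nightly_backup_statements("SalesDB", r"D:\Backup\SalesDB.bak"):
        print(stmt)
```

If the backup itself fails with a checksum error on DB-1, the damage was already on disk before the restore, which narrows the search to the source VM and its storage.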

The problem is that, after enabling encryption, the databases have started getting corrupted. DBCC CHECKDB shows various allocation and page errors. Even when a new database is created with fresh data and DBCC CHECKDB reports no errors at all, by the next day the database is corrupted again and throws errors like the following:

[Screenshots: DBCC CHECKDB error output]



It is happening randomly. So far 2 databases are affected, and we do not know whether the other databases will be hit by this issue in the future. Our daily routine has become: find the corrupt database, create a new database, then spend the whole night reconciling the data until the database checks out with 0 errors. The next day another database gets corrupted and the cycle continues.

It has also happened that the same database got corrupted multiple times after being freshly recreated with 0 errors.
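Since the check is being repeated every day across 5 databases, a small sketch of how the DBCC CHECKDB summary line could be parsed from captured output to log which database failed and when. It assumes the output has already been saved as text; the function name and the sample message below are illustrative and are not copied from the screenshots in this thread.

```python
import re
from typing import Optional, Tuple

# Matches the summary line that DBCC CHECKDB prints at the end of a run.
SUMMARY = re.compile(
    r"CHECKDB found (\d+) allocation errors and (\d+) consistency errors "
    r"in database '([^']+)'"
)


def parse_checkdb_summary(output: str) -> Optional[Tuple[str, int, int]]:
    """Return (database, allocation_errors, consistency_errors), or None
    if the summary line is not present in the captured text."""
    match = SUMMARY.search(output)
    if match is None:
        return None
    return match.group(3), int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    # Illustrative sample text with a made-up database name and counts.
    sample = (
        "CHECKDB found 2 allocation errors and 5 consistency errors "
        "in database 'ReportDB'."
    )
    result = parse_checkdb_summary(sample)
    if result is not None:
        name, alloc, consistency = result
        status = "CLEAN" if alloc == 0 and consistency == 0 else "CORRUPT"
        print(f"{name}: {status} ({alloc} allocation, {consistency} consistency)")
```

Keeping a dated log of these results per database would show whether the corruption follows a particular database, a particular time window, or a particular node the VM was running on.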

We have looked into the System and Hardware events in the VMs and found nothing related to Storage or I/O subsystem.
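Besides the Windows event logs, SQL Server keeps its own record of damaged pages in `msdb.dbo.suspect_pages`. A rough sketch of the query and of how the `event_type` codes could be decoded is below; the helper name is hypothetical, the short descriptions are paraphrased from the Microsoft documentation for that table, and nothing here connects to a server.

```python
# Query text only; run it with whatever SQL client is already in use.
SUSPECT_PAGES_QUERY = """
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;
"""

# Paraphrased meanings of the event_type column in msdb.dbo.suspect_pages.
EVENT_TYPES = {
    1: "823 error, or 824 error other than bad checksum or torn page",
    2: "bad checksum",
    3: "torn page",
    4: "restored",
    5: "repaired by DBCC",
    7: "deallocated by DBCC",
}


def describe_event(event_type: int) -> str:
    """Translate a suspect_pages event_type code into a short description."""
    return EVENT_TYPES.get(event_type, f"unknown event type {event_type}")


if __name__ == "__main__":
    print(SUSPECT_PAGES_QUERY.strip())
    for code in sorted(EVENT_TYPES):
        print(code, "=", describe_event(code))
```

Rows with event types 1 to 3 mean SQL Server read back a page that differed from what it expected, which would point below the database engine even when the Windows logs are clean.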

The storage of the Windows VM is configured in the following way:

[Screenshot: Storage configuration]


Is the VM configuration correct in Proxmox?
Is the hardware we chose compatible with running MSSQL Server 2017?



What is going wrong? We need urgent help with this.



PVE information is attached
 


Our company has been installing PVE for over 10 years, and we use Windows Server VMs with MS SQL versions ranging from 2008 to 2022 without any issues. The configuration you posted is correct.

I have never used CEPH or encrypted databases, so the problem might be related to these two configurations.
 
Instead of Ceph, what are you using? Do you use NVMe ssds for your workload?
 
I use either NVMe or SATA SSDs with LVM or ZFS storage.
I tested Ceph in a datacenter but noticed poor I/O performance with SQL.
Your issue might be related to the combination of Ceph's low I/O and database encryption.
Can you test the same hardware (single node) with a different type of storage, such as ZFS?
 
Using Ceph I get around 100,000 IOPS. With ZFS or RAID the IOPS are not that high.
 
2 suggestions (not that you will implement either but here it goes):
1. The Proxmox host should only ever see drives used as VM storage in a RAID format of some kind (hardware based or RAIDZ or something), the higher the RAID the better (e.g. RAID10 is better than RAID5)
2. If you turn on a feature and it immediately gives an issue, then maybe you should turn that feature (encryption in your case) back off?