Dear All!
We have several systems with the following setup:
- Dell VRTX
- 4x m640
- 10x 3.84 TB SAS SSD (RAID 6) with 2x PERC8 shared RAID controllers (all nodes are connected to this "storage")
- PVE 7.4-18 (we plan to update!)
- 4 node PVE cluster (non HA)
- shared LVM storage on iSCSI multipath (this shared storage does not support snapshots!)
Problem:
- Windows Server 2022 VM + Oracle SQL (300+ days uptime!)
- 3 virtual disks
- a VM snapshot backup is taken every night to PBS
- QEMU guest agent is installed
- PVE starts the backup at 01:20 and creates the "snapshot"
- after PVE sends the "fs-freeze" QEMU guest agent command, Windows starts logging I/O errors:
"The IO operation at logical block address 0xd1c8 for Disk 1 (PDO name: \Device\0000004d) was retried."
This happens only on Disk 1.
- from then on, disk errors are logged continuously!
- Disk 1 ends up damaged and unusable (it could not be repaired)
- we restored the disk from backup and everything was fine! (from the very backup during which the disk got damaged!)
- the LVM storage has 3 TB of free space, so lack of space is not the problem.
- other VMs are not damaged, cluster uptime is 400+ days.
- the PVE host logs no error messages (nothing)
This is the second time we've experienced this.
We use PVE in many, many places, but we have only seen this on shared LVM storage!
What can be done to prevent this from happening?
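To put the logged address into context, here is a small Python sketch that pulls the disk number and logical block address out of the Windows message quoted above and converts the address to a byte offset. The function name parse_retry_event is made up for illustration, and the 512-byte logical sector size is an assumption (a 4Kn virtual disk would need 4096 instead).

```python
import re

# Assumption: 512-byte logical sectors; use 4096 for a 4Kn disk.
SECTOR_SIZE = 512

# Matches the address and disk number in the Windows warning quoted in this post.
EVENT_RE = re.compile(r"logical block address (0x[0-9a-fA-F]+) for Disk (\d+)")


def parse_retry_event(message, sector_size=SECTOR_SIZE):
    """Return (disk_number, lba, byte_offset), or None if the text does not match."""
    match = EVENT_RE.search(message)
    if match is None:
        return None
    lba = int(match.group(1), 16)   # the address is logged in hexadecimal
    disk = int(match.group(2))
    return disk, lba, lba * sector_size


message = (r"The IO operation at logical block address 0xd1c8 for Disk 1 "
           r"(PDO name: \Device\0000004d) was retried.")
disk, lba, offset = parse_retry_event(message)
print(f"disk {disk}: LBA {lba} = byte offset {offset} ({offset / 2**20:.1f} MiB)")
```

Under the 512-byte assumption, the logged address is about 26 MiB from the start of Disk 1.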