Hi,
I have a pretty weird problem and I am out of ideas and can't find anyone that has the same problem. I would be really grateful for some ideas.
Background info: I have one Proxmox host with an AMD EPYC 7443P, 128 GB RAM and three ZFS pools.
One is a mirror of two 1TB NVMe SSDs.
One is a mirror of two 4TB NVMe SSDs.
The third is RAIDZ1 with three 6TB WD Red Plus HDDs.
I have a Windows Server 2016 VM with a system disk on the 4TB SSD pool and a data disk on the HDD pool. The Windows server is a secondary domain controller and is also used as a file server. I have wanted to replace it with something else for a long time, but never got to it. Initially the data disk was also on the 4TB SSD pool, but I recently added the HDD pool and moved it over there. I then extended the virtual disk a couple of times and extended the NTFS partition on it.
I didn't notice anything until a couple of days ago, when I saw that the performance was really bad. When I copy files (it doesn't matter whether they are big or small), the transfer speed fluctuates a lot and sits at pretty much nothing most of the time, with spikes up to 100 or 200 MB/s. A copy usually finishes eventually if it's anywhere from a couple of GB up to 100 or 200 GB. I haven't tried more yet, because even that takes hours. Sometimes it hangs completely.
I thought maybe something was wrong with the Windows Server installation, so I moved the virtual disk to a Windows 11 VM, but that didn't change anything.
The drive was attached via VirtIO. I also tested the other bus options, but none of them made a difference.
I didn't make the connection to expanding the partition at first, but I now believe that is probably the cause. I am sure it worked normally a couple of months ago.
In the meantime I set up two new file servers: one on the new Univention domain controller that I am migrating to, for normal files, and one OMV for my (of course legally obtained) movie and TV show collection, which makes up most of the data. That collection was the reason I got the HDD pool and extended the disk.
The data drive for OMV is on the same HDD pool and it works great. I get 200-300 MB/s transfer speed over SMB, so I can pretty much rule out the physical disks.
I managed to copy some of the data to the new file server, but it's very, very slow. What's interesting is that the disk active time in Windows is almost always at 100% as soon as anything uses the disk, even just browsing files in Explorer. During a copy, whenever the transfer rate goes up, the disk active time goes down. Sometimes the transfer rate reaches 100 or 200 MB/s, but it quickly drops back to zero or almost zero.
I think something might be wrong with the NTFS file system. I ran chkdsk -f -x, but it didn't fix anything. I tried chkdsk -f -x -r too, but that would take forever and I don't know if it makes any sense on a virtual disk. The -r option is supposed to find bad sectors on the disk and move the data somewhere else.
A zpool scrub ran around two weeks ago and didn't find any errors. I am running one again right now and it says it will take over 7 days. So maybe there is something wrong with the ZFS volume? The new one on the same pool works great, as I mentioned.
I am going to let it run, but so far it hasn't repaired anything.
This is the current status:
1.48T / 7.02T scanned at 45.8M/s, 374G / 7.01T issued at 11.3M/s
0B repaired, 5.20% done, 7 days 03:15:21 to go
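To sanity-check that estimate, here is a rough Python sketch that recomputes the remaining time from the numbers in the status output. It assumes the sizes are binary units (1T = 1024^4 bytes) and that the estimate is driven by the "issued" rate rather than the "scanned" rate; the helper names are made up for illustration. With the values above it comes out a little over 7 days, so the long estimate seems to follow directly from the 11.3M/s issue rate.

```python
import re

# Binary unit multipliers (assumption: zpool status reports powers of 1024).
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(text):
    """Convert a size string such as '374G' or '7.01T' to bytes."""
    match = re.fullmatch(r"([\d.]+)([BKMGT])", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def scrub_eta_seconds(issued, total, rate_per_second):
    """Remaining scrub time, assuming the issue rate stays constant."""
    remaining = to_bytes(total) - to_bytes(issued)
    return remaining / to_bytes(rate_per_second)


def format_eta(seconds):
    """Format seconds as 'D days HH:MM:SS'."""
    days, rest = divmod(int(seconds), 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{days} days {hours:02d}:{minutes:02d}:{secs:02d}"


# Values copied from the status lines above.
eta = scrub_eta_seconds(issued="374G", total="7.01T", rate_per_second="11.3M")
print(format_eta(eta))
```

The result will not match the reported time to the second, because the status output only shows rounded values.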
I am not sure if that will solve anything or how long it will really take, so I wanted to ask whether anyone has experienced something similar before, or has any ideas what else I could try in the meantime. I just want to copy everything to the new file servers. I do have a cloud backup of the normal data that I could download, but that would take a long time since we only have a 20 Mbps internet connection. For the media data I don't have a backup and would really like to save it. I could get it again of course, but that would also take a very long time.
Is there maybe some way I can check and repair the virtual disk outside of Windows? I can move it to another VM or boot off an ISO, but I don't know what to use. I have never had to deal with a problem like this. If it weren't a VM I would say the disk is bad, but I can rule that out: the disks are new, SMART is perfect and the other ZFS volume on the same pool works great.
But it's also weird that the zpool scrub takes so long. The estimate just went up again by a couple of hours. I've been trying to fix this for days now and I am out of ideas.
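To put the cloud restore into numbers, here is a small Python sketch of the download time over a 20 Mbps line. The backup sizes in the loop are purely hypothetical placeholders, and the calculation assumes the link is fully saturated with no protocol overhead, so real times would be longer.

```python
def download_hours(size_gb, link_mbps=20.0):
    """Hours needed to download size_gb (decimal GB) at link_mbps megabits/s."""
    bits = size_gb * 1e9 * 8            # GB -> bytes -> bits
    seconds = bits / (link_mbps * 1e6)  # Mbit/s -> bit/s
    return seconds / 3600


# Hypothetical backup sizes in GB; replace with the real size of the backup.
for size_gb in (100, 500, 1000):
    print(f"{size_gb} GB: about {download_hours(size_gb):.1f} hours")
```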
Thanks for any input