Where to start to work out what's causing OS freezes

CelticWebs
Member
I've been transferring data from an old server to the new one for the last day or so, mainly larger files. I have also been transferring large archives from online sources to the storage on the server.

Server spec is shown below, and it has a 10Gb connection at the data centre. It was happily downloading at around 400MB/s direct to disk via a VM running the download, as well as transferring from the old machine, on the main OS, to disk via an NFS share at around 100MB/s (1Gb NIC). All was going well and I was quite impressed with the performance. Both the VM and the main system remained responsive, and I think the only restriction on downloading faster was disk speed.

Then I noticed the downloads would get slower and slower until they came to a halt. I stopped the downloading in the VM and left the NFS transfer on the main system going, assuming it was just a network congestion issue. Still, there were issues with it grinding to a halt. I started to investigate to see if there were any network issues, and when I looked at the summary screen I saw this, which shows that the GUI and the whole system must be freezing for periods of time. Where's the best place to start looking for what's causing this? I'm assuming logs would be stopping at the same time too? Any ideas?

I'm going to assume it's disk-related, because when I stop the transfers/downloads all seems to be fine. So where would I find logs relating to disk issues?

[Screenshot: Image 22-12-2023 at 20.10.jpeg]
 
I'm going to assume it's disk-related, because when I stop the transfers/downloads all seems to be fine. So where would I find logs relating to disk issues?
Do you use consumer QLC SSDs by any chance (or maybe SMR HDDs, with raidz1/2/3)? Please let us know the make and model of the drives that are giving you trouble, and the RAID setup if you use any.
 
The large pool being used as storage is made up of enterprise drives: Seagate 4TB ST4000NM0034. All Proxmox VMs are, admittedly, running from a pair of non-enterprise WD Black 1TB NVMe drives in RAID 1. Within the VM that's downloading, there are some Docker containers that store their system data on four Samsung 970 Evo NVMe drives in a mirrored-and-striped layout. All downloads and transfers are being saved/transferred into the enterprise disk pool, which is 24 of the 4TB HDDs in draid3:16d:24c:0s.

However, I think I may have just found part of the issue, though I'm not sure how I'd prevent it in future. When I went to get the disk models, the disks page timed out, so I decided to see if they were listed under the pool instead. While it didn't show the disk models, it did show that the pool was scrubbing and would complete in 9 days! I went to the command line and stopped the scrub, and so far it's working at full speed again. (Spoke too soon, minutes later it started locking up again!)
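
For anyone else reading, checking and stopping a scrub from the shell looks roughly like this (Storage is the pool name in this setup):

Code:
# show pool status, including any running scrub and its estimated finish time
zpool status Storage

# cancel the in-progress scrub
zpool scrub -s Storage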

The system has been up for quite a while, but didn't have anything on it for a long time while I planned the final setup and transfer. Surely a scrub shouldn't cause the system to lock up quite like that though?
 
The system has been up for quite a while, but didn't have anything on it for a long time while I planned the final setup and transfer. Surely a scrub shouldn't cause the system to lock up quite like that though?
A scrub on large (and not mostly empty) HDDs can take quite some time, and it does slow the IOPS a lot because there are constant read actions going on, as HDD are always already very limits in IOPS compared to SSDs (because of the high seek times required to move physical read/write heads). Note that your SSDs don't have PLP and might also cause increased IO delays and Proxmox warns about running Docker in a container.
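
One way to see whether the scrub is what's starving the pool is to watch per-vdev latency while a transfer runs; a minimal sketch, assuming the pool name Storage from this thread:

Code:
# per-vdev I/O statistics with latency columns, refreshed every 5 seconds
zpool iostat -vl Storage 5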
 
I spoke too soon, it still ground to a halt and froze again... Which logs are likely to show what's causing the lock-up?
 
I spoke too soon, it still ground to a halt and froze again... Which logs are likely to show what's causing the lock-up?
It's unlikely that there are logs about it. It's just your hardware behaving like this under long/heavy/sustained writes. Don't worry too much about the missing graphs; they'll pick up again once the writes are over and the drives have had time to settle.
 
I'll have to experiment a bit; I can't have the entire system locking up for a couple of minutes at a time.
 
Well this looks ominous! I'm guessing this is where the problem lies!


Code:
ZFS has finished a resilver:

  eid: 819
class: resilver_finish
 host: prox847
 time: 2023-12-23 00:15:41+0000
 pool: Storage
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
  see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
 scan: resilvered 37.3M in 00:00:14 with 0 errors on Sat Dec 23 00:15:41 2023
config:

    NAME                        STATE     READ WRITE CKSUM
    Storage                     ONLINE       0     0     0
      draid3:16d:24c:0s-0       ONLINE       0     0     0
        scsi-35000c50085e8709f  ONLINE       0     0     0
        scsi-35000c50085e914d7  ONLINE       0     0     0
        scsi-35000c50085e89123  ONLINE       0     0     0
        scsi-35000c50085e16087  ONLINE       0     0     0
        scsi-35000c50085e22d9b  ONLINE       0     0     0
        scsi-35000c50085e170ef  ONLINE       0     0     0
        scsi-35000c50085e230af  ONLINE       0     0     0
        scsi-35000c50085e87223  ONLINE       0     0     0
        scsi-35000c50085e8f2ef  ONLINE       0     0     0
        scsi-35000c50085e172c3  ONLINE       0     0     0
        scsi-35000c50085e85423  ONLINE       0     0     0
        scsi-35000c50085e890cb  ONLINE       0     0     0
        scsi-35000c50085e22dfb  ONLINE       0     0     0
        scsi-35000c50085e159ff  ONLINE       0     0     0
        scsi-35000c50085e92683  ONLINE       0     0     0
        scsi-35000c50085e234f7  ONLINE       0     0     0
        scsi-35000c50085e15df7  ONLINE       0     0     0
        scsi-35000c50085e877d3  ONLINE       0     0     0
        scsi-35000c50085e1ba93  ONLINE       3 4.36K     0
        scsi-35000c50085e160af  ONLINE       0     0     0
        scsi-35000c50085e85563  ONLINE       0     0     0
        scsi-35000c50085e8c2a7  ONLINE       0     0     0
        scsi-35000c50085e15ce7  ONLINE       0     0     0
        scsi-35000c50085e9130b  ONLINE       0     0     0

errors: No known data errors
 
Well this looks ominous! I'm guessing this is where the problem lies!
Write errors can be caused by drives that are too slow (ZFS assumes the drive is not responding), by a bad cable/connection, or by the drive itself reporting errors. Check journalctl for error messages and check SMART (maybe run a long test) to find out.
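
A rough sketch of those checks, assuming the suspect disk is the scsi-35000c50085e1ba93 device from the status output above (replace /dev/sdX with whatever node it maps to):

Code:
# find the kernel device node behind the by-id name
ls -l /dev/disk/by-id/scsi-35000c50085e1ba93

# kernel messages about I/O trouble
journalctl -k | grep -iE 'i/o error|medium error|timeout'

# SMART health, error counters and defect lists (works on SAS too)
smartctl -a /dev/sdX

# start a long self-test; check on it later with smartctl -a
smartctl -t long /dev/sdX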
 
Write errors can be caused by drives that are too slow (ZFS assumes the drive is not responding), by a bad cable/connection, or by the drive itself reporting errors. Check journalctl for error messages and check SMART (maybe run a long test) to find out.
What exactly am I looking for? That output is 85516 lines long. I grepped for lines with "error" and there's still a heck of a lot of lines.
 
What exactly am I looking for? That output is 85516 lines long. I grepped for lines with "error" and there's still a heck of a lot of lines.
See if those errors (maybe start with the most recent ones) are about that drive and/or ZFS (it reported at least 4360 write errors). Or just test the drive with smartctl, or simply replace it.
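
A way to narrow the journal down instead of grepping all 85516 lines (sdX stands in for whatever node the suspect drive maps to):

Code:
# kernel messages only, newest first, mentioning the suspect device
journalctl -k -r | grep -i sdX | head -n 50

# full details of recent ZFS events for the pool
zpool events -v Storage | tail -n 100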
 
The plot thickens...

Code:
ZFS has detected that a device was removed.

impact: Fault tolerance of the pool may be compromised.
   eid: 10628
 class: statechange
 state: REMOVED
  host: prox847
  time: 2023-12-25 19:38:15+0000
 vpath: /dev/disk/by-id/scsi-35000c50085e85423-part1
 vphys: pci-0000:86:00.0-scsi-0:1:50:0
 vguid: 0xBE045C4352B0BEE8
 devid: scsi-35000c50085e85423-part1
  pool: Storage (0x9386A9628C7E34B1)

Had a load of emails today with this. Interestingly, it's a disk that previously hadn't had a single read/write error. I'm beginning to wonder if it's just the controller overheating during the extended transfer rather than actual disk issues. Unfortunately they are SAS disks, so I can't find a decent way of testing them. Does Proxmox have anything that can fully test a SAS disk? I keep reading that smartctl doesn't do a proper job on them?
 
Just to update this thread: it turned out to be the RAID card, an Adaptec 71605. Swapped it today for an LSI 9300 and all is well again :)
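
For completeness: with the controller replaced, the stale error counters on the pool can be reset, as the earlier ZFS message suggests; a minimal sketch with this thread's pool name:

Code:
# clear the accumulated read/write/checksum error counters on the pool
zpool clear Storage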
 
