Suspended ZFS pool affects all other VMs on node (even if no vDisks are affected)

apoc

Hello all,

I have a situation here and I hope that someone can help me with this.

My setup runs 3 individual ZFS pools. All are attached locally via a SAS HBA (the disks themselves are SATA HDDs, no expanders).
One of those ZFS pools holds only temporary data that I couldn't care less about. 3 of my VMs have ZVOLs on it.
My expectation is that a ZFS pool with issues would affect only those 3 VMs (the ones with a relationship to it).

However that is not the case.
A few days ago this ZPOOL with the temporary data was apparently suspended due to I/O errors.
Screenshot from 2020-05-20 14-02-11.png

Some hours after the initial problem (4-6 hours later, according to the logs) all my VMs "crashed". They were no longer responsive (IP connections failed, etc.).

I have checked the console and every single VM looked like this:
Screenshot from 2020-05-20 14-02-45.png

The host itself kept running fine; I did not notice any problems there.
Rebooting the host (after shutting down the VMs via "sudo qm shutdown <id> -forceStop") solved all the issues, and the whole system is operating normally again.

Has anyone experienced something like this, or does anyone have an explanation for the behavior I am seeing?
Thanks for your help, and all the best.
 
Me again.
This behaviour still drives me nuts.
It seems to be time related.
I checked the uptime of the system and rebooted for patches after 24 days. All fine...
This morning, after 25 days of uptime, it struck again.
After a reboot everything is back to normal, scrubs are OK...
Any ideas other than rebooting every 21 days?
Thanks
 
Hi,

It could be a software bug or faulty hardware. In past years I have seen errors like yours correlated with ZFS bugs (and a reboot solved the problem). After upgrading to a new kernel and recompiling ZFS, the error disappeared. That was before I started using PMX, though; on Proxmox I have not seen such errors in my case.

It would also be interesting to check which application(s) own the PIDs that appear in your logs.

Good luck/Bafta!
 
Thanks!
I have only used official sources, so I wonder why (or whether) I would need to recompile anything.
Do you think that reinstalling Proxmox would help in this case?
I don't recall when this started, but I had (different) issues at the beginning of this year.
Upgrading to Proxmox 6 was kind of an emergency upgrade at the beginning of the year. Along with it I redesigned my disk layout, and that is when I started to notice the behavior I am trying to tackle here...
 
As you use failing SATA drives, when a failure occurs the system retries the access many times.
While it is doing so, it cannot use the other SATA drives, so other arrays can be affected as well.
I understand that you use different SATA controllers, but evidently they somehow influence each other.

If you were using SAS drives, I bet this would not happen, as the SAS controller would kick them out much faster instead of waiting for them. The logic is that SAS drives usually sit in a RAID setup, where it does not matter if one fails. SATA drives are usually not in a RAID array, as they are used by home users, and because there is no replication of data, they try very hard (wait and retry a lot) before they stop being used.
 
Thanks for your answer/opinion.
I do agree with you to some extent.

Where I disagree: the disks aren't failing (at least not really).
SMART values are OK, and after a reboot everything is fine. And yet it happens every 25 days. That sounds like an overflow issue to me.
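
To put a number on the overflow idea, here is a minimal sketch. It assumes a 32-bit counter that ticks once per millisecond and starts at zero at boot. That is purely an assumption: nothing in the logs confirms that the HBA, the driver or the disk holds such a counter. The helper name `wrap_after_ms` is made up for illustration.

```shell
# Time until an N-bit counter that ticks once per millisecond wraps around.
# Assumption: the counter starts at zero at boot; which component (if any)
# holds such a counter is not known from this thread.
wrap_after_ms() {
  bits=$1
  total_s=$(( (1 << bits) / 1000 ))
  printf '%dd %dh %dm %ds\n' \
    $(( total_s / 86400 )) \
    $(( total_s % 86400 / 3600 )) \
    $(( total_s % 3600 / 60 )) \
    $(( total_s % 60 ))
}

wrap_after_ms 31   # signed 32-bit counter
wrap_after_ms 32   # unsigned 32-bit counter
```

The signed case comes out a little under 25 days, which is in the same range as the observed interval but not an exact match, so it is only one candidate explanation.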

I am using LSI 9211 controllers. As you mentioned, the affected disks are on their own controller, so there should be some isolation. Blocked I/O or a queue overflow should stay contained within one pool.
I would understand if all the other pools got slower because of things piling up, but "failing" the whole system is quite unexpected and tough.

Imagine a RAID controller where one RAID set is failing, and that one failing set kills all the other RAID sets. I can't imagine that this is by design in ZFS.

It is some weird edge case I am running into. I have tried exchanging the cables, the HDD cages, the controller and the disks (using a completely different type of HDD).
The issue persists. I have even made sure that the PSU is not overloaded by adding capacity there. Overheating can also be ruled out.

I haven't had that for a long time and then it started. Today I think it came with some upgrade. Just can't tell when exactly.
 
"they try very hard (wait and retry a lot) before they stop being used."

You can limit this with:

Code:
smartctl -l scterc,RNUM,WNUM [drive]

RNUM and WNUM are the times, in tenths of a second, that the drive attempts a read or write before giving up.

For SATA I use this:

smartctl -l scterc,70,70 /dev/sdX
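
If several drives need the same setting, a small loop saves typing. This is a minimal sketch: the function name `scterc_cmds` and the device names are placeholders, and it only prints the commands so they can be reviewed before being run as root. On many drives the setting reportedly does not survive a power cycle, so it may have to be reapplied at boot; treat that as an assumption to verify for your disks.

```shell
# Print one "smartctl -l scterc" command per drive (dry run, nothing is changed).
# $1 = timeout in tenths of a second, remaining arguments = drives.
scterc_cmds() {
  tenths=$1
  shift
  for dev in "$@"; do
    printf 'smartctl -l scterc,%s,%s %s\n' "$tenths" "$tenths" "$dev"
  done
}

# Placeholder device names; substitute the drives of the affected pool.
scterc_cmds 70 /dev/sdX /dev/sdY
```

Running "smartctl -l scterc /dev/sdX" without values shows the current setting, which is a quick way to check whether the drive supports it at all.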
 
"I haven't had that for a long time and then it started. Today I think it came with some upgrade. Just can't tell when exactly."
Hi @tburger

Could you try running "smartctl -t long /dev/sdX" for each of your disks?
Also try "systemctl stop zfs-zed" and see whether your problem happens again.
If it does, I would look at the output of:

zpool events -v


Good luck / Bafta !
 
I have made other changes recently, which I will now monitor for a while:
- set the I/O scheduler to "none"
- reduced the queue depth to 1
- disabled atime on the ZFS pools (not necessary)
- switched to sync=always, as I now have a decent NVRAM device in place.
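
A sketch of how changes like these are commonly applied. The exact commands used here are not shown in the thread, so everything below is an assumption: the sysfs paths are the usual per-disk block-layer settings, "queue depth" is taken to mean the SCSI device queue depth, and `tuning_cmds`, `sdX` and the pool name are placeholders. The function only prints the commands so they can be reviewed first.

```shell
# Print the commands for the four changes listed above (dry run).
# $1 = disk name as it appears under /sys/block, $2 = pool name.
tuning_cmds() {
  disk=$1
  pool=$2
  printf 'echo none > /sys/block/%s/queue/scheduler\n' "$disk"
  printf 'echo 1 > /sys/block/%s/device/queue_depth\n' "$disk"
  printf 'zfs set atime=off %s\n' "$pool"
  printf 'zfs set sync=always %s\n' "$pool"
}

tuning_cmds sdX HDD-POOL
```

The two sysfs settings are lost on reboot, while the `zfs set` properties are stored with the pool.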

I don't think it is load related either, because the pool and the affected disks are barely used. 3 times a day the mail server writes backups to them, and at night the Ubuntu mirror is updated.
And when the issue happens, typically none of these processes is running.

Thanks for your tips! Will check them asap.

/edit: @guletz - can you explain the rationale behind your zed hint?
 
... and again. It happened after precisely 25 days of uptime.
Code:
uptime
13:42:13 up 25 days,  4:00,  1 user,  load average: 1.79, 1.91, 1.93

I think this time I "intercepted" the pool suspension, because I kept a close eye on the uptime and the "rollover" came at a decent time of day. So I was able to take measures and reboot the server.
Again it is C1-S4, i.e. slot 4 / cable position 4 on controller 1. Keep in mind that I have already swapped the cables, the hot-swap drive bays and the controller.

This is the affected pool:
Code:
  pool: HDD-POOL
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: scrub repaired 0B in 0 days 02:16:46 with 0 errors on Sun Feb 14 02:16:48 2021
config:

    NAME             STATE     READ WRITE CKSUM
    HDD-POOL         DEGRADED     0     0     0
      mirror-0       DEGRADED     0     0     0
        C1-S5        ONLINE       0     0     0
        C1-S4        FAULTED      7   290     0  too many errors
    logs  
      RMS-200-part3  ONLINE       0     0     0

errors: No known data errors

As you can see I scrub all my pools once a week, so last successful scrub was 7 days back.

After a reboot of the server the degraded pool resilvers just fine:
Code:
pool: HDD-POOL
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: resilvered 5.08G in 0 days 00:02:20 with 0 errors on Sat Feb 20 19:45:37 2021
config:

    NAME             STATE     READ WRITE CKSUM
    HDD-POOL         ONLINE       0     0     0
      mirror-0       ONLINE       0     0     0
        C1-S5        ONLINE       0     0     1
        C1-S4        ONLINE       0     0    16
    logs  
      RMS-200-part3  ONLINE       0     0     0

errors: No known data errors
The next scheduled scrub is tomorrow morning, so I have decided to skip an immediate one.
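
To keep an eye on the counters between scrubs, the device lines with non-zero values can be filtered out of the "zpool status" output. A minimal sketch, assuming the column layout shown above (NAME STATE READ WRITE CKSUM); `zpool_errors` is a made-up helper name, not part of ZFS.

```shell
# Read "zpool status" text on stdin and print every device line whose
# READ, WRITE or CKSUM counter is not zero.
zpool_errors() {
  while read -r name state rd wr ck rest; do
    case "$state" in
      ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED) ;;
      *) continue ;;
    esac
    # Header lines such as "state: ONLINE" carry no counters; skip them.
    [ -n "$ck" ] || continue
    if [ "$rd" != 0 ] || [ "$wr" != 0 ] || [ "$ck" != 0 ]; then
      printf '%s %s read=%s write=%s cksum=%s\n' "$name" "$state" "$rd" "$wr" "$ck"
    fi
  done
}

# Usage: zpool status HDD-POOL | zpool_errors
```

Fed with the post-reboot status above, this would list only C1-S5 and C1-S4, since every other line shows zero in all three columns.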

Syslog starts to go nuts at 12:02.
VM1009, my mail server, has a disk attached on the affected pool. It backs up the mail store around that time.
Before that there is nothing.

None of the changes mentioned in my last post helped.
Additionally, I moved the SAS HBAs into PCIe slots with 8 lanes (electrically).
I hoped that this might be the "magic bullet", but it seems it was not.
 

Hi,

Something like this (the most representative part for me):

"blk_update_request: I/O error, dev sdl, sector 881869296 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 0
sd 7:0:3:0: [sdl] tag#2256 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 7:0:3:0: [sdl] tag#2289 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
sd 7:0:3:0: [sdl] tag#2289 CDB: Write(10) 2a 00 34 8d a9 d8 00 00 80 00
sd 7:0:3:0: [sdl] tag#2256 Sense Key : Not Ready [current]
sd 7:0:3:0: [sdl] tag#2256 Add. Sense: Logical unit not ready, cause not reportable
Feb 20 12:13:45 proxmox kernel: [2169352.914274] mpt2sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221107000000)
mpt2sas_cm1: removing handle(0x000c), sas_addr(0x4433221107000000)
mpt2sas_cm1: enclosure logical id(0x500605b004d1e730), slot(4)"

.... could only be a hardware problem related to your storage, and has nothing to do with ZFS. If your storage cannot write/read a sector on disk (or the disk is not ready), this is not a ZFS fault/bug.
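
The number in square brackets in the kernel line quoted above is the time since boot in seconds, so it shows how long the host had been up when the HBA removed the disk. A minimal sketch that extracts it and prints it as days and hours; `kmsg_uptime` is a made-up helper name, and it assumes the default "[seconds.microseconds]" timestamp format.

```shell
# Extract the "[seconds.microseconds]" kernel timestamp from a log line
# and print it as days/hours/minutes/seconds since boot.
kmsg_uptime() {
  secs=$(printf '%s\n' "$1" | sed -n 's/.*\[ *\([0-9][0-9]*\)\.[0-9]*].*/\1/p')
  # No timestamp found: report failure instead of printing zeros.
  [ -n "$secs" ] || return 1
  printf '%dd %dh %dm %ds\n' \
    $(( secs / 86400 )) \
    $(( secs % 86400 / 3600 )) \
    $(( secs % 3600 / 60 )) \
    $(( secs % 60 ))
}

kmsg_uptime 'Feb 20 12:13:45 proxmox kernel: [2169352.914274] mpt2sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221107000000)'
```

For that line the result is 25d 2h 35m 52s, which is consistent with the roughly 25 days of uptime reported earlier in the thread.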

Good luck / Bafta !
 
Thanks for your response.
There are still two open questions:
1. Why would one suspended pool affect the others as well? I did not experience it this time (likely because I reacted fast enough).
2. What would cause the HBA/disk drive to go nuts after 25 days of uptime? I think this is the main question here because, as you said, it is likely the cause of (1).
Any thoughts on it?