Heavy I/O causes kernel panics, deadlocks, and file system corruption on ext4

0x2d206cff

Feb 1, 2023

Issue​

Whenever the system is under very high I/O load, it completely locks up or kernel panics. I am thus unable to migrate an existing data pool onto ZFS/Proxmox. I've seen posts about this here, but most are over a year old and based on kernel and Proxmox versions that no longer apply.

I should note that I am new to Proxmox, but reasonably seasoned with GNU/Linux in general. As such, I'd like to not only solve this issue but also make sure I actually understand what is going on, so apologies for the "blog post" format of this. I'm just trying to be thorough, so that a) I can figure this out and b) others can do the same.


Hardware​

  • AMD Ryzen 3 2200G
  • Gigabyte AB350-Gaming 3
  • 32GB non-ECC DDR4
  • 240 GB SSD with ZFS (Proxmox root)
  • 10 TB ZFS pool on 6 drives, arranged as mirrored pairs of different capacities, with an encrypted dataset
  • 8 TB USB external drive, ext4 + LUKS
All drives pass SMART. The SSD has 85% life left (25 TB of 100 TB lifetime writes).

Details​

Once I start copying ~6 TB with ~15 parallel rclone processes on the host (not in a VM), the system locks up or kernel panics anywhere between 5 minutes and 5 hours later. Only a physical hard reset fixes this; a physical tty session is not responsive.
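For context, the workload looks roughly like this (a sketch with placeholder paths and flag values, not my exact invocation):

```bash
# One rclone process per top-level directory: local ZFS source -> LUKS/ext4 USB target.
# Paths and flag values are placeholders.
for dir in /tank/data/*/; do
    rclone copy "$dir" "/mnt/usb-backup/$(basename "$dir")" \
        --transfers 4 --checkers 8 --log-level INFO &
done
wait
```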

Original kernel panic (screenshot attached):


This is not a hardware issue (q.e.d., see below): this exact machine ran standard Debian for years (with >365 days of uptime), handled heavy I/O every night (incoming backups from several machines), and did the same right before the migration (I/O to the same external disk, using the same rclone setup, via USB).
This only started after installing Proxmox. The only substantial difference (besides Proxmox instead of Debian) is the use of a single, large ZFS pool instead of 3 separate mdadm software RAID-1s with ext4 + LUKS on separate drives.

Attempts to fix​

I thus far tried the following, in this order:
  • Installed the amd64-microcode package. Note: I never had to do this on Debian or Arch, both of which I ran on AMD Ryzen hardware.
  • BIOS update to the most recent version
  • Set BIOS settings: reset everything, then disabled C6, enabled IOMMU, enabled SVM
    • C6 State / AMD "Cool'n'Quiet" - stops the BIOS from disabling cores (and the cache for said cores?) to save power. I don't understand how that can cause panics, but I can see how a core disappearing at runtime could be unpleasant :)
    • IOMMU - I understand this is needed for device passthrough for virtualization. I assume with this disabled I can't start VMs, so that makes sense.
    • SVM - that should be AMD's equivalent of VT-x, i.e. hardware virtualization support. That I understand.
  • aio=native on the VMs (this causes the boot disk not to be recognized during VM startup) and simply stopping all VMs. I understand that the kernel's asynchronous I/O can, theoretically, block, but I wouldn't expect it to deadlock. I also don't understand how this is better than aio=threads, since we don't care about performance with this fix. This was suggested by @Stefan_R, I believe. (See the sketch below this list for how the option is set.)
  • Installed the 6.1 pre-release kernel. I've been running a 5.x kernel for a long time on Ryzen, but I figured this won't hurt.
All this appeared to change the behavior from 100% kernel panics to simply freezing, which ultimately has the same effect, but indicates that something changed. My money is on the C6 setting. Since this is non-deterministic behavior, I couldn't try each potential fix one by one, so I'm not 100% sure about that.
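For reference, setting aio on a VM disk boils down to this (sketch; the VM ID, storage, and volume names are placeholders, check `qm config <vmid>` for the real ones):

```bash
# Switch an existing SCSI disk of VM 100 to native AIO (placeholder storage/volume).
qm set 100 --scsi0 local-zfs:vm-100-disk-0,size=32G,aio=native

# Equivalently, append ",aio=native" to the disk line in /etc/pve/qemu-server/100.conf:
# scsi0: local-zfs:vm-100-disk-0,size=32G,aio=native
```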

Kernel Panic -> Freezes/Deadlocks (xhci_hcd ?)​

I then made the entire ZFS pool available as an NFS share to a Debian-based VM (via zfs directly), alongside an NFS share for the external drive (via old-school /etc/exports, using no_root_squash and hence said root user), which had exactly the same effect: a freeze after 2 hours. I am not surprised, but at least for me, that places the issue firmly within kernel space; whatever I'm doing in userspace shouldn't matter.
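In case the export setup matters, it was roughly this (sketch; dataset names, mount points, and the subnet are placeholders):

```bash
# ZFS-native NFS export of the pool dataset.
zfs set sharenfs="rw=@192.168.1.0/24,no_root_squash" tank/data

# Old-school export of the mounted USB drive, via /etc/exports:
#   /mnt/usb-backup 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
exportfs -ra
```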

At this point, I got a bunch of logs that all pointed towards xhci_hcd, the USB driver:

(photo of the console output attached)

(OCR, apologies):
  • ERROR Transfer event for unknown stream ring slot
  • critical target error, dev sdh, sector 18962111858 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
  • AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000a address=0xb7000140 flags=0x0020]
sdh is the USB drive. I can't say I really understand these errors, but to me this reads like either the drive is kaput, or the USB port or controller is, or their respective kernel modules are.

At this point, I simply moved the drive from a USB 3.1 port to a USB 3.0 port (i.e. a different physical port), since I figured there could be either a hardware issue or a kernel issue with this driver + USB 3.1. I find the latter very unlikely, since I use this hard drive for monthly backups of this server, but I also wouldn't bet money on me always having used the same port.
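For anyone wanting to check which controller a given port actually hangs off of, these are the standard commands I used (nothing specific to my box):

```bash
# Show the USB topology: which root hub / xhci controller each device sits on.
lsusb -t

# Watch for xhci resets and disk errors live while the copy runs.
dmesg --follow | grep -iE 'xhci|usb|sdh'
```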

File System is now corrupt​

Changing the port did not fix it. I've been getting a whole bunch of errors like these:

Code:
[  +0.001284] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[  +0.000688] EXT4-fs (dm-0): I/O error while writing superblock
[  +2.628873] EXT4-fs warning (device dm-0): htree_dirblock_to_tree:1072: inode #127271092: lblock 0: comm nfsd: error -5 reading directory block
[  +0.357568] EXT4-fs warning (device dm-0): htree_dirblock_to_tree:1072: inode #200474667: lblock 0: comm nfsd: error -5 reading directory block
[  +0.764766] EXT4-fs warning (device dm-0): htree_dirblock_to_tree:1072: inode #200605711: lblock 0: comm nfsd: error -5 reading directory block
[  +0.002122] EXT4-fs error: 11 callbacks suppressed
[  +0.000003] EXT4-fs error (device dm-0): __ext4_get_inode_loc_noinmem:4596: inode #189137947: block 1513095265: comm nfsd: unable to read itable block
[  +0.001141] buffer_io_error: 11 callbacks suppressed
[  +0.000004] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[  +0.001243] EXT4-fs: 11 callbacks suppressed
[  +0.000004] EXT4-fs (dm-0): I/O error while writing superblock
[  +0.000421] EXT4-fs error (device dm-0): __ext4_get_inode_loc_noinmem:4596: inode #189137947: block 1513095265: comm nfsd: unable to read itable block
[  +0.001608] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[  +0.000676] EXT4-fs (dm-0): I/O error while writing superblock
[  +0.238153] EXT4-fs warning (device dm-0): htree_dirblock_to_tree:1072: inode #200605746: lblock 0: comm nfsd: error -5 reading directory block
[  +0.176934] EXT4-fs warning (device dm-0): htree_dirblock_to_tree:1072: inode #200540510: lblock 0: comm nfsd: error -5 reading directory block
...
[  +0.000796] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[  +0.000786] EXT4-fs (dm-0): I/O error while writing superblock

Which, again, I'm guessing here, are the result of many hard resets and crashes with no automatic fsck runs; I guess that can break a file system pretty badly. The only reason I'm not screaming "bad hard drive!" (besides the good SMART values) is the fact that I've just used it to write terabytes of data for this very migration.

After this, I couldn't even mount the drive - I was now looking at "can't read superblock" and other fun things hinting at a broken ext4. I killed the terminal and didn't save the exact logs.

I then took the hard drive out of its enclosure (breaking the enclosure, because this particular manufacturer is anti-consumer) and connected it directly via SATA, to rule out the enclosure's internal USB controller (I've yet to confirm that), and ran fsck from a live Debian ISO, which got stuck on - as per strace - FUTEX_WAIT_PRIVATE (which could be anything).
But I think it had fixed some bad blocks (or, rather, found a usable backup superblock?) by that point, since I was able to mount the drive and browse some files. That, however, can be misleading - with a bunch of broken sectors (inodes? other magic file system words I don't really understand?), you can still ls /mnt and see a bunch of directories, only to be met with I/O errors once you cd into them - which is what happened before. I expect some corrupted data.
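For anyone else in this spot, the backup-superblock dance is roughly this (sketch; the device names and the superblock number are placeholders for whatever cryptsetup/dumpe2fs report on your system):

```bash
# Open the LUKS container (placeholder device and mapping name).
cryptsetup open /dev/sdb1 rescue_crypt

# List where the backup superblocks live.
dumpe2fs /dev/mapper/rescue_crypt | grep -i superblock

# Check the filesystem against one of the backups (32768 is just an example).
e2fsck -f -b 32768 /dev/mapper/rescue_crypt
```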

Now/Next Steps​

As of right now, I am unhappily typing this (sitting in the basement, next to the stupid rack...) while running ddrescue onto a brand new (overpriced) 12 TB external drive I just bought from Best Buy tonight for exactly this, before more sh*t breaks, hoping for some insights or thoughts from this forum.
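The ddrescue run itself is the usual two-pass approach (sketch; source/target devices and the map file path are placeholders):

```bash
# Pass 1: grab everything readable, skip bad areas quickly (-n = no scraping).
ddrescue -f -n /dev/sdb /dev/sdc /root/rescue.map

# Pass 2: go back and retry the bad areas a few times with direct I/O.
ddrescue -d -f -r3 /dev/sdb /dev/sdc /root/rescue.map
```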

I am looking at the following dmesg errors while that's happening:
Code:
[ 4215.556397] ata1: EH complete
[ 4215.556399] EXT4-fs warning (device dm-0): htree_dirblock_to_tree:1042: inode #100298798: lblock 0: comm rm: error -5 reading directory block
[ 4218.633178] ata1.00: exception Emask 0x0 SAct 0x40000000 SErr 0x0 action 0x0
[ 4218.633813] ata1.00: irq_stat 0x40000008
[ 4218.634437] ata1.00: failed command: READ FPDMA QUEUED
[ 4218.635072] ata1.00: cmd 60/08:f0:48:7f:87/00:00:7f:03:00/40 tag 30 ncq dma 4096 in
                        res 43/40:08:48:7f:87/00:00:7f:03:00/00 Emask 0x409 (media error) <F>
[ 4218.636346] ata1.00: status: { DRDY SENSE ERR }
[ 4218.636964] ata1.00: error: { UNC }
[ 4218.726746] ata1.00: configured for UDMA/133
[ 4218.726753] sd 0:0:0:0: [sdb] tag#30 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
[ 4218.726754] sd 0:0:0:0: [sdb] tag#30 Sense Key : Medium Error [current]
[ 4218.726756] sd 0:0:0:0: [sdb] tag#30 Add. Sense: Unrecovered read error - auto reallocate failed
[ 4218.726758] sd 0:0:0:0: [sdb] tag#30 CDB: Read(16) 88 00 00 00 00 03 7f 87 7f 48 00 00 00 08 00 00
[ 4218.726760] blk_update_request: I/O error, dev sdb, sector 15024488264 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
[ 4218.728762] ata1: EH complete
[ 4218.728764] EXT4-fs warning (device dm-0): ext4_empty_dir:2934: inode #100298798: lblock 0: comm rm: error -5 reading directory block
[ 4264.998985] EXT4-fs (dm-0): error count since last fsck: 99
[ 4264.998995] EXT4-fs (dm-0): initial error at time 1582550487: ext4_find_entry:1466: inode 132317271
[ 4264.999004] EXT4-fs (dm-0): last error at time 1675278173: ext4_journal_check_start:83

Which, I think, are artifacts of a file system broken by the hard crashes and missing fsck runs, but I could be mistaken. If the HDD itself is broken, I'm just the unluckiest fella tonight: it would be breaking while I'm migrating the file server, having already erased all internal drives and figured the backup would work (needless to say, I did a million manual spot checks before wiping anything, but here we are). :(
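The recorded error state can also be read straight off the superblock, which matches what dmesg prints above (sketch; the device mapper name is a placeholder):

```bash
# Show the superblock header: filesystem state, error count, first/last error times.
dumpe2fs -h /dev/mapper/rescue_crypt | grep -iE 'state|error'
```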

On a side note - I'm running vanilla Proxmox on an old System76 laptop (-> Intel) without any issues...

Thanks much in advance.
 
Update (for those who care or find this in the future):
  • fsck deadlocked several times, but ultimately did fix the file system. Ish.
  • ddrescue ran for 26 hours, plus another 9 hours dealing with bad blocks, at which point I stopped it.
  • The disk had 13 bad areas, but a 99.99% rescue rate as per ddrescue. I could not be bothered to actually run badblocks to verify.
  • Using the same USB port as the target for ddrescue was not an issue, so I suspect the USB controller in the hard drive's enclosure, since I copied 8 TB from a SATA drive to... a USB hard drive. Same port - this time I did check (although from a live distro, so the Proxmox kernel might still be to blame).
  • The rclone instances have been running for >24 hours, with a load average of around 10 across the board and memory usage at 94% (note: counting cached RAM as "full" in the UI is very misleading, imo).
Once rclone is done and I have all backups migrated to ZFS snapshots (ETA is about 4-5 days at this rate), I will test various USB permutations (the old USB controller, and a different USB controller [methinks I have one floating around somewhere] with the same drive) and report back, in case anybody cares to find out how an f-up of this magnitude can happen.
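For completeness, the snapshot side of the migration is just a recursive snapshot per finished run (sketch; pool/dataset names and the date are placeholders):

```bash
# Recursive snapshot of the backup dataset after a completed run.
zfs snapshot -r tank/backups@migration-2023-02-05

# Confirm.
zfs list -t snapshot -r tank/backups
```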

The last time I saw a kernel explode and deadlock like this was with broken memory, never because of a broken USB port/controller, so I'm dying to find out what is going on here.
 
Well, the mystery is solved (ish):

```
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME    FLAG   VALUE WORST THRESH TYPE    UPDATED WHEN_FAILED RAW_VALUE
184 End-to-End_Error  0x0032 001   001   099    Old_age Always  FAILING_NOW 420
```

The self-test PASSED, but that drive is actually toast; I'll take it out behind the barn. All data has been rescued. This reading was taken with a different USB-to-SATA controller.
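For reference, that attribute table comes from plain smartctl; through a USB-to-SATA bridge you typically have to force SAT passthrough (device name is a placeholder):

```bash
# Vendor SMART attributes through a USB-to-SATA bridge.
smartctl -A -d sat /dev/sdX

# Full report, including the self-test log and overall health assessment.
smartctl -x -d sat /dev/sdX
```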
 
