WD Red NAS HDD Kernel Bug

drh

We are seeing a problem with Western Digital Red NAS HDDs that we believe is caused by a bug in the default kernel shipped with current versions of Proxmox VE 5.

We tested two WD20EFRX disks in three different Proxmox VE servers. The servers use different hardware, but unfortunately all three run an up-to-date Proxmox VE 5.3 with the same kernel: 4.15.18-11-pve #1 SMP PVE 4.15.18-34 (Mon, 25 Feb 2019 14:51:06 +0100) x86_64.

All three servers reproduce the same COMRESET failed (errno=-16) error. The dmesg output from Server 3 is shown below:

root@server3:~# dmesg | grep ata
[ 0.000000] BIOS-e820: [mem 0x00000000bc855000-0x00000000bc85dfff] ACPI data
[ 0.000000] Memory: 16256600K/16690700K available (12300K kernel code, 2480K rwdata, 4288K rodata, 2424K init, 2416K bss, 434100K reserved, 0K cma-reserved)
[ 0.082500] libata version 3.00 loaded.
[ 1.034853] scsi host0: ata_generic
[ 1.034995] scsi host1: ata_generic
[ 1.035018] ata1: PATA max UDMA/100 cmd 0xf130 ctl 0xf120 bmdma 0xf0f0 irq 18
[ 1.035020] ata2: PATA max UDMA/100 cmd 0xf110 ctl 0xf100 bmdma 0xf0f8 irq 18
[ 1.388146] Write protecting the kernel read-only data: 20480k
[ 1.705685] ata3: SATA max UDMA/133 abar m2048@0xfb925000 port 0xfb925100 irq 29
[ 1.705686] ata4: DUMMY
[ 1.705687] ata5: DUMMY
[ 1.705689] ata6: SATA max UDMA/133 abar m2048@0xfb925000 port 0xfb925280 irq 29
[ 1.705691] ata7: SATA max UDMA/133 abar m2048@0xfb925000 port 0xfb925300 irq 29
[ 1.705693] ata8: SATA max UDMA/133 abar m2048@0xfb925000 port 0xfb925380 irq 29
[ 2.019178] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 2.019210] ata8: SATA link down (SStatus 0 SControl 300)
[ 2.019299] ata7: SATA link down (SStatus 0 SControl 300)
[ 2.021069] ata3.00: supports DRM functions and may not be fully accessible
[ 2.022174] ata3.00: ATA-11: Samsung SSD 860 EVO 250GB, RVT01B6Q, max UDMA/133
[ 2.022177] ata3.00: 488397168 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[ 2.024940] ata3.00: supports DRM functions and may not be fully accessible
[ 2.028079] ata3.00: configured for UDMA/133
[ 2.028634] ata3.00: Enabling discard_zeroes_data
[ 2.028838] ata3.00: Enabling discard_zeroes_data
[ 2.030182] ata3.00: Enabling discard_zeroes_data
[ 7.056067] ata6: link is slow to respond, please be patient (ready=0)
[ 11.736068] ata6: COMRESET failed (errno=-16)
[ 17.088068] ata6: link is slow to respond, please be patient (ready=0)
[ 19.788080] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 19.788376] ata6.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[ 30.148069] ata6: link is slow to respond, please be patient (ready=0)
[ 34.828070] ata6: COMRESET failed (errno=-16)
[ 40.180071] ata6: link is slow to respond, please be patient (ready=0)
[ 42.520085] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 42.520377] ata6.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[ 42.520381] ata6: limiting SATA link speed to 1.5 Gbps
[ 52.932072] ata6: link is slow to respond, please be patient (ready=0)
[ 57.612072] ata6: COMRESET failed (errno=-16)
[ 62.964072] ata6: link is slow to respond, please be patient (ready=0)
[ 65.304083] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 65.304380] ata6.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[ 75.716074] ata6: link is slow to respond, please be patient (ready=0)
[ 80.396074] ata6: COMRESET failed (errno=-16)
[ 85.748075] ata6: link is slow to respond, please be patient (ready=0)
[ 87.968086] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 88.402834] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
[ 88.783893] systemd[1]: Listening on LVM2 metadata daemon socket.
root@server3:~#

The disk does not appear under /dev/sd*.
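
For anyone reading the log: the SStatus and SControl values that libata prints are hexadecimal register dumps, and decoding them matches the link speeds reported in the messages. Below is a minimal Python sketch, assuming the usual SATA register layout (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11). The function and table names are made up for illustration and are not part of any tool.

```python
# Assumed layout of the SATA SStatus/SControl registers:
#   bits 0-3  DET (device detection)
#   bits 4-7  SPD (negotiated speed in SStatus, speed limit in SControl)
#   bits 8-11 IPM (interface power management)
SPEEDS = {0: "none", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_sata_reg(hex_text):
    """Split a register value, as printed in hex by libata, into its fields."""
    value = int(hex_text, 16)
    return {
        "det": value & 0xF,
        "spd": (value >> 4) & 0xF,
        "ipm": (value >> 8) & 0xF,
    }


def link_speed(reg_hex):
    """Return the speed named by the SPD field of a register value."""
    return SPEEDS.get(decode_sata_reg(reg_hex)["spd"], "unknown")


# SStatus values taken from the dmesg output above.
for sstatus in ("133", "123", "113", "0"):
    print(sstatus, decode_sata_reg(sstatus), link_speed(sstatus))

# SControl 310 appears after "limiting SATA link speed to 1.5 Gbps".
print("SControl 310 limit:", link_speed("310"))
```

Read this way, ata6 does bring the link up at 3.0 Gbps and later at 1.5 Gbps, so the failure happens after link negotiation, at the IDENTIFY step.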

Server hardware:
  • Server 1: i7 7700, ASRock Z270-PRO4, Corsair 16GB RAM
  • Server 2: i5 4590, ASRock Z87-EXTREME3, Corsair 8GB RAM
  • Server 3: i5 2400, Intel BLKDQ67SWB3, Kingston 16GB RAM

Searching the web turned up an Ubuntu bug report and a possible patch:

Search for: Linux 4.15 and onwards fails to initialize some hard drives (new users aren't allowed to post external links)​

We have yet to test the patch in a Proxmox VE system (we don't currently have a test system on hand, and we're reluctant to build and apply a custom kernel on our three production servers).

Has anyone else come across this bug?

Other Western Digital HDDs do not appear to be affected: the several WD Gold and WD Purple HDDs deployed in the three servers do not show this error.
 
Hi Thomas, yes that's the link we couldn't post.

I've temporarily taken Server 3 out of production to use as a test system. It still had an older, pre-4.15 kernel installed: 4.10.17-2-pve #1 SMP PVE 4.10.17-20 (Mon, 14 Aug 2017 11:23:37 +0200) x86_64.

Booting with this 4.10 kernel unfortunately produced the same COMRESET failed (errno=-16) error as with the current 4.15 kernel, on both disks. That means our case is probably unrelated to the bug reported in the Ubuntu forum, and could be a simple hardware fault.
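
To compare boots under the two kernels more systematically, the failures can be tallied per port from saved dmesg output. This is a rough Python sketch, not something we ran on the servers: the regular expressions are modelled on the messages in the dmesg output above, and `tally_failures` is a hypothetical helper name.

```python
import errno
import re
from collections import Counter

# Patterns modelled on the libata messages seen in the log above.
COMRESET = re.compile(r"(ata\d+): COMRESET failed \(errno=-(\d+)\)")
IDENTIFY = re.compile(r"(ata\d+)\.\d+: failed to IDENTIFY")


def tally_failures(lines):
    """Count COMRESET and IDENTIFY failures per ATA port."""
    comreset, identify, errnos = Counter(), Counter(), set()
    for line in lines:
        match = COMRESET.search(line)
        if match:
            comreset[match.group(1)] += 1
            errnos.add(int(match.group(2)))
            continue
        match = IDENTIFY.search(line)
        if match:
            identify[match.group(1)] += 1
    return comreset, identify, errnos


# A few lines copied from the log; real use would read a saved dmesg file.
sample = [
    "[ 11.736068] ata6: COMRESET failed (errno=-16)",
    "[ 19.788376] ata6.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)",
    "[ 34.828070] ata6: COMRESET failed (errno=-16)",
]
comreset, identify, errnos = tally_failures(sample)
print("COMRESET failures:", dict(comreset))
print("IDENTIFY failures:", dict(identify))
for code in sorted(errnos):
    # Translate the numeric errno into its symbolic name on this platform.
    print("errno", code, "=", errno.errorcode.get(code, "unknown"))
```

On Linux, errno 16 is EBUSY (device or resource busy). Running the same tally over the output of each kernel would show whether the failure counts differ between 4.10 and 4.15.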

A double hardware fault: two brand new hard disks dead on arrival?! I'll do more testing with an external caddy on different systems tomorrow, and possibly get the warranty replacement underway. I'll post back with the results.
 
