PVE server becomes unreachable every week at the same time

boomhauer

New Member
May 31, 2019
Recently I migrated from XenServer to PVE by reinstalling from scratch, converting all VMs and re-importing them into PVE. All supported guests (Ubuntu / Debian / SLES / RHEL) are running the QEMU guest agent.

A week after the migration I discovered I could no longer access PVE via either SSH or the web interface. On the remote console the system was throwing continuous disk I/O errors, and logging in there didn't work either. Most of the VMs were not accessible, though some still responded to pings.

Forcefully rebooting the server brought everything back into a working state, until a week later, when the same thing happened at exactly the same time. Because of the I/O errors nothing was saved in the local logs, so I forwarded all logs to a remote server (see the rsyslog sketch below the log excerpt), waited another week and saw:

Code:
May 24 22:00:24 pve3 kernel: [1165963.600835] vmbr0: port 4(tap103i0) entered disabled state
May 24 22:00:40 pve3 kernel: [1165979.352121] vmbr0: port 13(tap112i0) entered disabled state
May 25 04:59:01 pve3 kernel: [1191079.147371] megaraid_sas 0000:01:00.0: 5555 (612068430s/0x0020/DEAD) - Fatal firmware error: Line 1026 in ../../dm/src/dm.c
May 25 04:59:01 pve3 kernel: [1191079.147371]
May 25 04:59:01 pve3 kernel: [1191079.151343] megaraid_sas 0000:01:00.0: Iop2SysDoorbellIntfor scsi0
May 25 04:59:01 pve3 kernel: [1191079.151387] megaraid_sas 0000:01:00.0: Found FW in FAULT state, will reset adapter scsi0.
May 25 04:59:01 pve3 kernel: [1191079.151391] megaraid_sas 0000:01:00.0: resetting fusion adapter scsi0.
May 25 04:59:01 pve3 kernel: [1191079.151412] megaraid_sas 0000:01:00.0: Reset not supported, killing adapter scsi0.
May 25 04:59:01 pve3 kernel: [1191079.425886] sd 0:2:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 25 04:59:01 pve3 kernel: [1191079.425889] sd 0:2:0:0: [sda] tag#0 CDB: Write(16) 8a 00 00 00 00 00 e0 22 b0 60 00 00 00 08 00 00
May 25 04:59:01 pve3 kernel: [1191079.425892] print_req_error: I/O error, dev sda, sector 3760369760
May 25 04:59:01 pve3 kernel: [1191079.425944] sd 0:2:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 25 04:59:01 pve3 kernel: [1191079.425945] sd 0:2:0:0: [sda] tag#1 CDB: Write(16) 8a 00 00 00 00 00 e3 18 f5 a8 00 00 00 08 00 00
May 25 04:59:01 pve3 kernel: [1191079.425946] print_req_error: I/O error, dev sda, sector 3810063784
May 25 04:59:01 pve3 kernel: [1191079.425970] sd 0:2:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 25 04:59:01 pve3 kernel: [1191079.425971] sd 0:2:0:0: [sda] tag#2 CDB: Write(16) 8a 00 00 00 00 00 e7 c1 f7 68 00 00 00 08 00 00
May 25 04:59:01 pve3 kernel: [1191079.425972] print_req_error: I/O error, dev sda, sector 3888248680
May 25 04:59:01 pve3 kernel: [1191079.433681] sd 0:2:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 25 04:59:01 pve3 kernel: [1191079.433683] sd 0:2:0:0: [sda] tag#0 CDB: Write(16) 8a 00 00 00 00 00 e0 22 b0 60 00 00 00 08 00 00
May 25 04:59:01 pve3 kernel: [1191079.433684] print_req_error: I/O error, dev sda, sector 3760369760
May 25 04:59:01 pve3 kernel: [1191079.433708] sd 0:2:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 25 04:59:01 pve3 kernel: [1191079.433710] sd 0:2:0:0: [sda] tag#1 CDB: Write(16) 8a 00 00 00 00 00 e3 18 f5 a8 00 00 00 08 00 00
May 25 04:59:01 pve3 kernel: [1191079.433710] print_req_error: I/O error, dev sda, sector 3810063784
May 25 04:59:01 pve3 kernel: [1191079.433731] sd 0:2:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 25 04:59:01 pve3 kernel: [1191079.433732] sd 0:2:0:0: [sda] tag#2 CDB: Write(16) 8a 00 00 00 00 00 e7 c1 f7 68 00 00 00 08 00 00
May 25 04:59:01 pve3 kernel: [1191079.433733] print_req_error: I/O error, dev sda, sector 3888248680
May 25 04:59:01 pve3 kernel: [1191079.441645] sd 0:2:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 25 04:59:01 pve3 kernel: [1191079.441647] sd 0:2:0:0: [sda] tag#0 CDB: Write(16) 8a 00 00 00 00 00 e0 22 b0 60 00 00 00 08 00 00
May 25 04:59:01 pve3 kernel: [1191079.441649] print_req_error: I/O error, dev sda, sector 3760369760
May 25 04:59:01 pve3 kernel: [1191079.441695] sd 0:2:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 25 04:59:01 pve3 kernel: [1191079.441696] sd 0:2:0:0: [sda] tag#1 CDB: Write(16) 8a 00 00 00 00 00 e3 18 f5 a8 00 00 00 08 00 00
May 25 04:59:01 pve3 kernel: [1191079.441697] print_req_error: I/O error, dev sda, sector 3810063784
May 25 04:59:01 pve3 kernel: [1191079.441721] sd 0:2:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 25 04:59:01 pve3 kernel: [1191079.441723] sd 0:2:0:0: [sda] tag#2 CDB: Write(16) 8a 00 00 00 00 00 e7 c1 f7 68 00 00 00 08 00 00
May 25 04:59:01 pve3 kernel: [1191079.441724] print_req_error: I/O error, dev sda, sector 3888248680
May 25 04:59:01 pve3 kernel: [1191079.449638] sd 0:2:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 25 04:59:01 pve3 kernel: [1191079.449640] sd 0:2:0:0: [sda] tag#0 CDB: Write(16) 8a 00 00 00 00 00 e0 22 b0 60 00 00 00 08 00 00
May 25 04:59:01 pve3 kernel: [1191079.449641] print_req_error: I/O error, dev sda, sector 3760369760
May 25 04:59:01 pve3 kernel: [1191079.538011] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
May 25 04:59:01 pve3 kernel: [1191079.545801] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
May 25 04:59:01 pve3 kernel: [1191079.553816] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
May 25 04:59:01 pve3 kernel: [1191079.561664] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
May 25 04:59:01 pve3 kernel: [1191079.569726] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
May 25 04:59:01 pve3 kernel: [1191079.577697] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
May 25 04:59:03 pve3 kernel: [1191081.263790] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
May 25 04:59:03 pve3 kernel: [1191081.271397] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
May 25 04:59:03 pve3 kernel: [1191081.279367] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
May 25 04:59:03 pve3 kernel: [1191081.287422] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
May 25 04:59:03 pve3 kernel: [1191081.479652] Read-error on swap-device (253:0:460040)
May 25 04:59:03 pve3 kernel: [1191081.479692] Read-error on swap-device (253:0:460048)
May 25 04:59:03 pve3 kernel: [1191081.479708] Read-error on swap-device (253:0:460056)
May 25 04:59:03 pve3 kernel: [1191081.479905] Read-error on swap-device (253:0:598848)
May 25 04:59:03 pve3 kernel: [1191081.479928] Read-error on swap-device (253:0:598856)
May 25 04:59:03 pve3 kernel: [1191081.483409] Buffer I/O error on dev dm-6, logical block 0, async page read
May 25 04:59:03 pve3 kernel: [1191081.600108] vmbr0v10: port 2(tap100i0) entered disabled state
May 25 04:59:03 pve3 kernel: [1191081.600531] vmbr0v10: port 2(tap100i0) entered disabled state
May 25 04:59:03 pve3 kernel: [1191081.840287] Aborting journal on device dm-1-8.
May 25 04:59:03 pve3 kernel: [1191081.840379] Buffer I/O error on dev dm-1, logical block 3702784, lost sync page write
May 25 04:59:03 pve3 kernel: [1191081.840395] Buffer I/O error on dev dm-1, logical block 0, lost sync page write
May 25 04:59:03 pve3 kernel: [1191081.840403] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
May 25 04:59:03 pve3 kernel: [1191081.840404] EXT4-fs (dm-1): Remounting filesystem read-only
May 25 04:59:03 pve3 kernel: [1191081.840408] EXT4-fs (dm-1): previous I/O error to superblock detected
May 25 04:59:03 pve3 kernel: [1191081.840426] Buffer I/O error on dev dm-1, logical block 0, lost sync page write
May 25 04:59:03 pve3 kernel: [1191081.840585] JBD2: Error -5 detected when updating journal superblock for dm-1-8.
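
For anyone wanting to capture logs the same way: once the controller dies the local disk stops accepting writes, so forwarding everything to a remote host is the only way to keep the evidence. A minimal sketch of one way to do it, assuming rsyslog is the syslog daemon and a hypothetical log host at 192.168.1.50 listening on UDP 514:

Code:
# /etc/rsyslog.d/remote.conf (hypothetical file name)
# Forward all facilities and priorities to the remote syslog server over UDP;
# use @@ instead of @ for TCP if the log host accepts it.
*.*  @192.168.1.50:514

# Activate the new rule
systemctl restart rsyslog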

Storage on the PVE server is configured as RAID1 (2 x 10 TB SAS + 1 hot spare) on a Fujitsu PRAID EP400i controller. A similar storage setup was used with XenServer before the migration and never showed any errors.

This failure has now happened 4 times in a row, always on Saturday morning at the same time. Naturally I first suspected some scheduled activity at that time, but I checked the PVE host and all VMs and nothing runs then (a sketch of the checks is below). All backups and more I/O-intensive tasks run during the week or on Saturday night (weekly VM backup). I also tried shutting down half of the VMs, with no success. I am not really considering a hardware failure, as the timing is just too regular and a reboot fixes the issue for another week every time.
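Roughly the checks involved (a sketch using the standard Debian locations; PVE keeps its backup schedule in /etc/pve/vzdump.cron):

Code:
# systemd timers scheduled on the PVE host
systemctl list-timers --all

# system-wide and per-user cron jobs
cat /etc/crontab
ls -l /etc/cron.d /etc/cron.daily /etc/cron.weekly
for u in $(cut -d: -f1 /etc/passwd); do crontab -l -u "$u" 2>/dev/null; done

# PVE's own scheduled backup jobs
cat /etc/pve/vzdump.cron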

I did notice that smartd runs on PVE and disabled it, but I will only be able to tell whether this has any effect tomorrow.
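For reference, disabling the SMART daemon is a one-liner (a sketch; depending on the Debian release the unit may be called smartd or smartmontools):

Code:
# stop smartd and keep it from starting at boot
systemctl disable --now smartd
# its monitoring/self-test schedule is configured in /etc/smartd.conf
cat /etc/smartd.conf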

If anyone has other ideas about why this might happen, or how to debug it further, that would be extremely helpful.

Thanks!
 
May 25 04:59:01 pve3 kernel: [1191079.151343] megaraid_sas 0000:01:00.0: Iop2SysDoorbellIntfor scsi0
May 25 04:59:01 pve3 kernel: [1191079.151387] megaraid_sas 0000:01:00.0: Found FW in FAULT state, will reset adapter scsi0.
May 25 04:59:01 pve3 kernel: [1191079.151391] megaraid_sas 0000:01:00.0: resetting fusion adapter scsi0.
May 25 04:59:01 pve3 kernel: [1191079.151412] megaraid_sas 0000:01:00.0: Reset not supported, killing adapter scsi0.
This looks like a problem with the RAID controller (or, more likely, its firmware). Make sure you have the latest updates installed for your system (BIOS/UEFI/firmware of all components).
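A quick way to read the controller's current firmware level from the running host, sketched here assuming Broadcom's storcli utility is installed (the EP400i uses the megaraid_sas driver, as the logs show; the binary may be named storcli or storcli64):

Code:
# list detected controllers
storcli64 show
# firmware package build and FW version of controller 0
storcli64 /c0 show | grep -i -e 'FW Package' -e 'FW Version'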
 
We fixed the issue by downgrading the server firmware to the previous release. For reference, we are running a Fujitsu PRIMERGY; the working firmware version is 8.43F (Jul 1 2016 00:15:19 CEST), and the affected firmware was 9.20F (Jan 24 2019 19:09:14 CEST).
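
For anyone comparing versions on their own machine: the installed BIOS and iRMC/BMC firmware revisions can be read from the running OS. A sketch assuming dmidecode and ipmitool are available; which of these strings corresponds to the 8.43F/9.20F numbering above is an assumption on my part, so cross-check against the iRMC web interface:

Code:
# BIOS version and release date from DMI
dmidecode -s bios-version
dmidecode -s bios-release-date
# BMC (iRMC) firmware revision via IPMI
ipmitool mc info | grep -i firmware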
 
