During the last two backup attempts of one of the VMs in my cluster, the guest's filesystem went read-only (RO).
I am running PVE 8.2.7 on an HPE G10+ server with hardware RAID 1 across two SSD drives. The server does not report anything wrong with the array.
The fs on VM 103 went into RO at 04:30:57, but the guest had already reported disk I/O problems at 04:29:20.
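For reference, the array can be checked from the host with HPE's ssacli tool, roughly like this (sketch only; exact subcommands may differ per setup):
Code:
ssacli ctrl all show status          # controller, cache and battery status
ssacli ctrl all show config detail   # logical and physical drive state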
The backup is scheduled at 01:00.
Here are some log lines from the PVE node prior to the guest going into RO.
Code:
Oct 27 01:16:25 pve2 pvestatd[1282]: got timeout
Oct 27 01:16:35 pve2 pvestatd[1282]: got timeout
Oct 27 01:16:45 pve2 pvestatd[1282]: got timeout
Oct 27 01:16:45 pve2 pvestatd[1282]: unable to activate storage 'mybackup' - directory '/mnt/pve/mybackup' does not exist or is unreachable
Oct 27 01:16:55 pve2 pvestatd[1282]: got timeout
Oct 27 01:16:55 pve2 pvestatd[1282]: unable to activate storage 'mybackup' - directory '/mnt/pve/mybackup' does not exist or is unreachable
Oct 27 01:17:01 pve2 CRON[1751254]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 27 01:17:01 pve2 CRON[1751255]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 27 01:17:01 pve2 CRON[1751254]: pam_unix(cron:session): session closed for user root
Oct 27 01:17:05 pve2 pvestatd[1282]: got timeout
Oct 27 01:17:05 pve2 pvestatd[1282]: unable to activate storage 'mybackup' - directory '/mnt/pve/mybackup' does not exist or is unreachable
Oct 27 01:17:15 pve2 pvestatd[1282]: got timeout
Oct 27 01:17:15 pve2 pvestatd[1282]: unable to activate storage 'mybackup' - directory '/mnt/pve/mybackup' does not exist or is unreachable
Oct 27 01:20:49 pve2 pvestatd[1282]: status update time (5.901 seconds)
Oct 27 01:20:55 pve2 pvestatd[1282]: got timeout
Oct 27 01:21:25 pve2 pvestatd[1282]: got timeout
Oct 27 01:21:59 pve2 pvestatd[1282]: status update time (6.045 seconds)
Oct 27 01:22:05 pve2 pvestatd[1282]: got timeout
Oct 27 01:22:16 pve2 pvestatd[1282]: got timeout
Oct 27 01:22:25 pve2 pvestatd[1282]: got timeout
Oct 27 01:22:45 pve2 pvestatd[1282]: got timeout
Oct 27 01:23:05 pve2 pvestatd[1282]: got timeout
Oct 27 01:23:05 pve2 pvestatd[1282]: unable to activate storage 'mybackup' - directory '/mnt/pve/mybackup' does not exist or is unreachable
Oct 27 01:25:45 pve2 systemd[1]: Starting man-db.service - Daily man-db regeneration...
Oct 27 01:25:45 pve2 systemd[1]: man-db.service: Deactivated successfully.
Oct 27 01:25:45 pve2 systemd[1]: Finished man-db.service - Daily man-db regeneration.
Oct 27 02:09:34 pve2 pmxcfs[1139]: [dcdb] notice: data verification successful
Oct 27 02:17:01 pve2 CRON[1772959]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 27 02:17:01 pve2 CRON[1772960]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 27 02:17:01 pve2 CRON[1772959]: pam_unix(cron:session): session closed for user root
Oct 27 02:19:08 pve2 pvestatd[1282]: status update time (5.137 seconds)
Oct 27 02:41:24 pve2 pmxcfs[1139]: [status] notice: received log
Oct 27 02:41:28 pve2 pmxcfs[1139]: [status] notice: received log
Oct 27 02:47:55 pve2 systemd[1]: Starting pve-daily-update.service - Daily PVE download activities...
Oct 27 02:47:56 pve2 pveupdate[1778552]: <root@pam> starting task UPID:pve2:001B237D:037CA273:671D9BCC:aptupdate::root@pam:
Oct 27 02:47:58 pve2 pveupdate[1778557]: update new package list: /var/lib/pve-manager/pkgupdates
Oct 27 02:48:00 pve2 pveupdate[1778552]: <root@pam> end task UPID:pve2:001B237D:037CA273:671D9BCC:aptupdate::root@pam: OK
Oct 27 02:48:00 pve2 systemd[1]: pve-daily-update.service: Deactivated successfully.
Oct 27 02:48:00 pve2 systemd[1]: Finished pve-daily-update.service - Daily PVE download activities.
Oct 27 02:48:00 pve2 systemd[1]: pve-daily-update.service: Consumed 2.856s CPU time.
Oct 27 02:56:28 pve2 pvestatd[1282]: status update time (5.125 seconds)
Oct 27 03:09:34 pve2 pmxcfs[1139]: [dcdb] notice: data verification successful
Oct 27 03:10:01 pve2 CRON[1783082]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 27 03:10:01 pve2 CRON[1783083]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Oct 27 03:10:01 pve2 CRON[1783082]: pam_unix(cron:session): session closed for user root
Oct 27 03:10:55 pve2 systemd[1]: Starting e2scrub_all.service - Online ext4 Metadata Check for All Filesystems...
Oct 27 03:10:55 pve2 systemd[1]: e2scrub_all.service: Deactivated successfully.
Oct 27 03:10:55 pve2 systemd[1]: Finished e2scrub_all.service - Online ext4 Metadata Check for All Filesystems.
Oct 27 03:17:01 pve2 CRON[1784350]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 27 03:17:01 pve2 CRON[1784351]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 27 03:17:01 pve2 CRON[1784350]: pam_unix(cron:session): session closed for user root
Oct 27 03:21:09 pve2 pvestatd[1282]: status update time (5.832 seconds)
Oct 27 03:24:15 pve2 pvestatd[1282]: got timeout
Oct 27 03:30:01 pve2 CRON[1786700]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 27 03:30:01 pve2 CRON[1786701]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /usr/lib/x86_64-linux-gnu/e2fsprogs/e2scrub_all_cron)
Oct 27 03:30:01 pve2 CRON[1786700]: pam_unix(cron:session): session closed for user root
Oct 27 04:09:34 pve2 pmxcfs[1139]: [dcdb] notice: data verification successful
Oct 27 04:17:01 pve2 CRON[1795159]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 27 04:17:01 pve2 CRON[1795160]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 27 04:17:01 pve2 CRON[1795159]: pam_unix(cron:session): session closed for user root
Oct 27 04:18:11 pve2 pmxcfs[1139]: [status] notice: received log
Oct 27 04:18:17 pve2 pmxcfs[1139]: [status] notice: received log
Oct 27 04:26:18 pve2 pvestatd[1282]: status update time (5.258 seconds)
Oct 27 04:26:25 pve2 pvestatd[1282]: got timeout
Oct 27 04:26:35 pve2 pvestatd[1282]: got timeout
Oct 27 04:31:00 pve2 pvescheduler[1748135]: ERROR: Backup of VM 103 failed - vma_queue_write: write error - Broken pipe
Oct 27 04:31:00 pve2 pvescheduler[1748135]: INFO: Backup job finished with errors
Oct 27 04:31:00 pve2 perl[1748135]: notified via target `<.....>`
Oct 27 04:31:00 pve2 pvescheduler[1748135]: job errors
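Given the repeated "unable to activate storage 'mybackup'" messages, the state of the backup mount can be checked from the node roughly like this (sketch only; 'mybackup' is the backup storage from the log above):
Code:
pvesm status                     # PVE's view of all configured storages
findmnt /mnt/pve/mybackup        # is the directory actually a mount point?
df -h /mnt/pve/mybackup          # does it respond, and how much space is left?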
This is the first error reported on the VM:
Code:
Oct 27 04:29:20 vm103.mydomain kernel: INFO: task jbd2/sda3-8:646 blocked for more than 120 seconds.
Oct 27 04:29:20 vm103.mydomain kernel: Not tainted 4.18.0-553.22.1.el8_10.x86_64 #1
Oct 27 04:29:20 vm103.mydomain kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 27 04:29:20 vm103.mydomain kernel: task:jbd2/sda3-8 state:D stack:0 pid:646 ppid:2 flags:0x80004000
Oct 27 04:29:20 vm103.mydomain kernel: Call Trace:
Oct 27 04:29:20 vm103.mydomain kernel: __schedule+0x2d1/0x870
Oct 27 04:29:20 vm103.mydomain kernel: ? bit_wait_timeout+0xa0/0xa0
Oct 27 04:29:20 vm103.mydomain kernel: schedule+0x55/0xf0
Oct 27 04:29:20 vm103.mydomain kernel: io_schedule+0x12/0x40
Oct 27 04:29:20 vm103.mydomain kernel: bit_wait_io+0xd/0x50
Oct 27 04:29:20 vm103.mydomain kernel: __wait_on_bit+0x2d/0x90
Oct 27 04:29:20 vm103.mydomain kernel: out_of_line_wait_on_bit+0x91/0xb0
Oct 27 04:29:20 vm103.mydomain kernel: ? var_wake_function+0x30/0x30
Oct 27 04:29:20 vm103.mydomain kernel: jbd2_journal_commit_transaction+0x14d3/0x19f0 [jbd2]
Oct 27 04:29:20 vm103.mydomain kernel: ? finish_task_switch+0x86/0x2f0
Oct 27 04:29:20 vm103.mydomain kernel: kjournald2+0xbd/0x270 [jbd2]
Oct 27 04:29:20 vm103.mydomain kernel: ? finish_wait+0x80/0x80
Oct 27 04:29:20 vm103.mydomain kernel: ? commit_timeout+0x10/0x10 [jbd2]
Oct 27 04:29:20 vm103.mydomain kernel: kthread+0x134/0x150
Oct 27 04:29:20 vm103.mydomain kernel: ? set_kthread_struct+0x50/0x50
Oct 27 04:29:20 vm103.mydomain kernel: ret_from_fork+0x1f/0x40
And this is when the fs went into RO:
Code:
Oct 27 04:30:56 vm103.mydomain kernel: sd 0:0:0:0: [sda] tag#153 timing out command, waited 180s
Oct 27 04:30:56 vm103.mydomain kernel: sd 0:0:0:0: [sda] tag#153 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=261s
Oct 27 04:30:56 vm103.mydomain kernel: sd 0:0:0:0: [sda] tag#153 Sense Key : Aborted Command [current]
Oct 27 04:30:56 vm103.mydomain kernel: sd 0:0:0:0: [sda] tag#153 Add. Sense: I/O process terminated
Oct 27 04:30:56 vm103.mydomain kernel: sd 0:0:0:0: [sda] tag#153 CDB: Write(10) 2a 00 00 39 fc 80 00 00 08 00
Oct 27 04:30:56 vm103.mydomain kernel: blk_update_request: I/O error, dev sda, sector 3800192 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 27 04:30:56 vm103.mydomain kernel: blk_update_request: I/O error, dev sda, sector 3800192 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 27 04:30:56 vm103.mydomain kernel: Aborting journal on device sda3-8.
Oct 27 04:30:57 vm103.mydomain kernel: EXT4-fs error (device sda3): ext4_journal_check_start:61: Detected aborted journal
Oct 27 04:30:57 vm103.mydomain kernel: EXT4-fs (sda3): Remounting filesystem read-only
Oct 27 04:45:01 vm103.mydomain auditd[873]: Audit daemon is suspending logging due to previously mentioned write error
To fix the fs I had to run fsck:
Code:
fsck from util-linux 2.32.1
e2fsck 1.45.6 (20-Mar-2020)
/dev/sda3: recovering journal
JBD2: Invalid checksum recovering block 3145873 in log
....................
JBD2: Invalid checksum recovering block 3145873 in log
Journal checksum error found in /dev/sda3
/dev/sda3 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 735 has zero dtime. Fix? yes
Inodes that were part of a corrupted orphan linked list found. Fix? yes
Inode 2442 was part of the orphaned inode list. FIXED.
..............................
Inode 5505128, i_blocks is 10336, should be 10328. Fix? yes
Inode 5505144, end of extent exceeds allowed value
(logical block 80, physical block 2293866, len 18)
Clear? yes
Inode 5505144, i_blocks is 784, should be 648. Fix? yes
Pass 2: Checking directory structure
Entry 'Preferences' in /tmp/systemd-private-11c10ea2a2b84a6da18bdd4173e7882e-html2pdf.service-JrHkMG/tmp/puppeteer_dev_chrome_profile-XXXXXXsgymsd/Default (788259) has deleted/unused inode 788428. Clear? yes
Entry 'drift' in /var/lib/chrony (5506837) has deleted/unused inode 5507374. Clear? yes
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Inode 5507340 ref count is 1, should be 2. Fix? yes
Pass 5: Checking group summary information
Block bitmap differences: +516299 -(523002--523017) -(1021091--1021098) -1021146 -(2293866--2293882) -(2327808--2327871) -(2328032--2328063) -(2328128--2328191) -(2331168--2331199) -(2331264--2331391) -(2555904--2556415) -2557194 -(2559232--2559295) -(2562496--2562559) -(2563392--2563455) -(2563520--2563583) -(2566592--2566655) -(2570656--2570751) -(2574752--2574815) -(2578624--2578687) -(2582944--2583039) -2630712 -(2752512--2752703) -(2752736--2752863) -(2752896--2752927) -(2754944--2755071) -(2755584--2756607) -(2757888--2757983) -(2761088--2761119) -(2761216--2761727) -3154345 -3183813 -3184129 -(3186692--3186695) +22028581 -22028582 -22055153
Fix? yes
Free blocks count wrong for group #15 (10550, counted=10565).
Fix? yes
Free blocks count wrong for group #31 (13525, counted=13534).
Fix? yes
Free blocks count wrong for group #70 (15389, counted=15407).
Fix? yes
Free blocks count wrong for group #71 (22750, counted=23070).
Fix? yes
Free blocks count wrong for group #78 (31080, counted=32233).
Fix? yes
Free blocks count wrong for group #84 (21122, counted=23266).
Fix? yes
Free blocks count wrong for group #96 (23897, counted=23898).
Fix? yes
Free blocks count wrong for group #97 (12892, counted=12898).
Fix? yes
Free blocks count wrong for group #672 (23791, counted=23792).
Fix? yes
Free blocks count wrong for group #673 (26941, counted=26942).
Fix? yes
Free blocks count wrong (62047071, counted=62050704).
Fix? yes
Inode bitmap differences: -735 -2442 -2444 -2450 -788233 -788236 -(788245--788247) -(788254--788256) -788263 -788275 -788360 -788376 -(788415--788416) -788418 -788423 -788426 -(788430--788431) -(788433--788434) -(788449--788450) -788730 -788732 -788737 -788741 -788747 -788750 -788752 -788824 -789199 -(789201--789206) -789208 -789373 -789600 -789605 -789607 -789611 -(918400--918401) -5507332 -5507374
Fix? yes
Free inodes count wrong for group #0 (0, counted=4).
Fix? yes
Free inodes count wrong for group #96 (5024, counted=5068).
Fix? yes
Directories count wrong for group #96 (645, counted=644).
Fix? yes
Free inodes count wrong for group #672 (5421, counted=5422).
Fix? yes
Free inodes count wrong (16174898, counted=16174957).
Fix? yes
/dev/sda3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda3: ***** REBOOT SYSTEM *****
/dev/sda3: 143507/16318464 files (0.3% non-contiguous), 3222384/65273088 blocks
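For anyone following along, the repair was essentially just running fsck against the partition while it was not mounted writable (a rescue boot or similar), roughly:
Code:
fsck /dev/sda3    # answered "yes" to each fix prompt shown above
reboot            # as requested by e2fsck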
Last week when this happened I thought it was a freak occurrence, but now I am seriously worried about this system.
Edit: I have cleared some space on the backup location and tried re-running the backup of this VM, but it started exhibiting similar behavior. Twenty minutes in it had completed only 4% of the backup, reading at about 20 Mbit/s from enterprise SSDs in hardware RAID 1. At that point I stopped the backup to avoid any fs damage again.
The VM disk is hosted on lvmthin on the node, with cache set to "Default (No cache)" and "IO thread" enabled.
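For completeness, the disk line in the VM config looks roughly like this (as shown by qm config 103; the storage/disk names and size here are illustrative, not copied from my config):
Code:
# qm config 103  -- disk line only, names and size illustrative
scsi0: local-lvm:vm-103-disk-0,iothread=1,size=250G
# the absence of a cache=... option corresponds to "Default (No cache)" in the GUI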