VM fs going ro when PVE backup is executed

shunkica

During the last two backup attempts of one of the VMs in my cluster, the filesystem on the guest went read-only (RO).

I am running PVE 8.2.7 on an HPE G10+ server with hardware RAID 1 across two SSD drives. The server does not report anything wrong with the array.
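
For reference, the array state can be cross-checked from the OS with something like the following (just a sketch, assuming HPE's Smart Array CLI `ssacli` is installed; the controller slot number is a placeholder):
Code:
# overall controller / cache / battery status
ssacli ctrl all show status
# logical and physical drive status (slot number may differ on your controller)
ssacli ctrl slot=0 ld all show status
ssacli ctrl slot=0 pd all show status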

The filesystem on VM 103 went read-only at 04:30:57, but the guest had already reported problems with disk I/O at 04:29:20.

The backup is scheduled at 01:00.

Here are some log lines from the PVE node before the guest went read-only.
Code:
Oct 27 01:16:25 pve2 pvestatd[1282]: got timeout
Oct 27 01:16:35 pve2 pvestatd[1282]: got timeout
Oct 27 01:16:45 pve2 pvestatd[1282]: got timeout
Oct 27 01:16:45 pve2 pvestatd[1282]: unable to activate storage 'mybackup' - directory '/mnt/pve/mybackup' does not exist or is unreachable
Oct 27 01:16:55 pve2 pvestatd[1282]: got timeout
Oct 27 01:16:55 pve2 pvestatd[1282]: unable to activate storage 'mybackup' - directory '/mnt/pve/mybackup' does not exist or is unreachable
Oct 27 01:17:01 pve2 CRON[1751254]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 27 01:17:01 pve2 CRON[1751255]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 27 01:17:01 pve2 CRON[1751254]: pam_unix(cron:session): session closed for user root
Oct 27 01:17:05 pve2 pvestatd[1282]: got timeout
Oct 27 01:17:05 pve2 pvestatd[1282]: unable to activate storage 'mybackup' - directory '/mnt/pve/mybackup' does not exist or is unreachable
Oct 27 01:17:15 pve2 pvestatd[1282]: got timeout
Oct 27 01:17:15 pve2 pvestatd[1282]: unable to activate storage 'mybackup' - directory '/mnt/pve/mybackup' does not exist or is unreachable
Oct 27 01:20:49 pve2 pvestatd[1282]: status update time (5.901 seconds)
Oct 27 01:20:55 pve2 pvestatd[1282]: got timeout
Oct 27 01:21:25 pve2 pvestatd[1282]: got timeout
Oct 27 01:21:59 pve2 pvestatd[1282]: status update time (6.045 seconds)
Oct 27 01:22:05 pve2 pvestatd[1282]: got timeout
Oct 27 01:22:16 pve2 pvestatd[1282]: got timeout
Oct 27 01:22:25 pve2 pvestatd[1282]: got timeout
Oct 27 01:22:45 pve2 pvestatd[1282]: got timeout
Oct 27 01:23:05 pve2 pvestatd[1282]: got timeout
Oct 27 01:23:05 pve2 pvestatd[1282]: unable to activate storage 'mybackup' - directory '/mnt/pve/mybackup' does not exist or is unreachable
Oct 27 01:25:45 pve2 systemd[1]: Starting man-db.service - Daily man-db regeneration...
Oct 27 01:25:45 pve2 systemd[1]: man-db.service: Deactivated successfully.
Oct 27 01:25:45 pve2 systemd[1]: Finished man-db.service - Daily man-db regeneration.
Oct 27 02:09:34 pve2 pmxcfs[1139]: [dcdb] notice: data verification successful
Oct 27 02:17:01 pve2 CRON[1762116]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 27 02:17:01 pve2 CRON[1762117]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 27 02:17:01 pve2 CRON[1762116]: pam_unix(cron:session): session closed for user root
Oct 27 02:42:09 pve2 pvestatd[1282]: status update time (5.374 seconds)
Oct 27 02:17:01 pve2 CRON[1772959]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 27 02:17:01 pve2 CRON[1772960]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 27 02:17:01 pve2 CRON[1772959]: pam_unix(cron:session): session closed for user root
Oct 27 02:19:08 pve2 pvestatd[1282]: status update time (5.137 seconds)
Oct 27 02:41:24 pve2 pmxcfs[1139]: [status] notice: received log
Oct 27 02:41:28 pve2 pmxcfs[1139]: [status] notice: received log
Oct 27 02:47:55 pve2 systemd[1]: Starting pve-daily-update.service - Daily PVE download activities...
Oct 27 02:47:56 pve2 pveupdate[1778552]: <root@pam> starting task UPID:pve2:001B237D:037CA273:671D9BCC:aptupdate::root@pam:
Oct 27 02:47:58 pve2 pveupdate[1778557]: update new package list: /var/lib/pve-manager/pkgupdates
Oct 27 02:48:00 pve2 pveupdate[1778552]: <root@pam> end task UPID:pve2:001B237D:037CA273:671D9BCC:aptupdate::root@pam: OK
Oct 27 02:48:00 pve2 systemd[1]: pve-daily-update.service: Deactivated successfully.
Oct 27 02:48:00 pve2 systemd[1]: Finished pve-daily-update.service - Daily PVE download activities.
Oct 27 02:48:00 pve2 systemd[1]: pve-daily-update.service: Consumed 2.856s CPU time.
Oct 27 02:56:28 pve2 pvestatd[1282]: status update time (5.125 seconds)
Oct 27 03:09:34 pve2 pmxcfs[1139]: [dcdb] notice: data verification successful
Oct 27 03:10:01 pve2 CRON[1783082]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 27 03:10:01 pve2 CRON[1783083]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Oct 27 03:10:01 pve2 CRON[1783082]: pam_unix(cron:session): session closed for user root
Oct 27 03:10:55 pve2 systemd[1]: Starting e2scrub_all.service - Online ext4 Metadata Check for All Filesystems...
Oct 27 03:10:55 pve2 systemd[1]: e2scrub_all.service: Deactivated successfully.
Oct 27 03:10:55 pve2 systemd[1]: Finished e2scrub_all.service - Online ext4 Metadata Check for All Filesystems.
Oct 27 03:17:01 pve2 CRON[1784350]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 27 03:17:01 pve2 CRON[1784351]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 27 03:17:01 pve2 CRON[1784350]: pam_unix(cron:session): session closed for user root
Oct 27 03:21:09 pve2 pvestatd[1282]: status update time (5.832 seconds)
Oct 27 03:24:15 pve2 pvestatd[1282]: got timeout
Oct 27 03:30:01 pve2 CRON[1786700]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 27 03:30:01 pve2 CRON[1786701]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /usr/lib/x86_64-linux-gnu/e2fsprogs/e2scrub_all_cron)
Oct 27 03:30:01 pve2 CRON[1786700]: pam_unix(cron:session): session closed for user root
Oct 27 04:09:34 pve2 pmxcfs[1139]: [dcdb] notice: data verification successful
Oct 27 04:17:01 pve2 CRON[1795159]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 27 04:17:01 pve2 CRON[1795160]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 27 04:17:01 pve2 CRON[1795159]: pam_unix(cron:session): session closed for user root
Oct 27 04:18:11 pve2 pmxcfs[1139]: [status] notice: received log
Oct 27 04:18:17 pve2 pmxcfs[1139]: [status] notice: received log
Oct 27 04:26:18 pve2 pvestatd[1282]: status update time (5.258 seconds)
Oct 27 04:26:25 pve2 pvestatd[1282]: got timeout
Oct 27 04:26:35 pve2 pvestatd[1282]: got timeout
Oct 27 04:31:00 pve2 pvescheduler[1748135]: ERROR: Backup of VM 103 failed - vma_queue_write: write error - Broken pipe
Oct 27 04:31:00 pve2 pvescheduler[1748135]: INFO: Backup job finished with errors
Oct 27 04:31:00 pve2 perl[1748135]: notified via target `<.....>`
Oct 27 04:31:00 pve2 pvescheduler[1748135]: job errors

This is the first error reported on the VM:

Code:
Oct 27 04:29:20 vm103.mydomain kernel: INFO: task jbd2/sda3-8:646 blocked for more than 120 seconds.
Oct 27 04:29:20 vm103.mydomain kernel:       Not tainted 4.18.0-553.22.1.el8_10.x86_64 #1
Oct 27 04:29:20 vm103.mydomain kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 27 04:29:20 vm103.mydomain kernel: task:jbd2/sda3-8     state:D stack:0     pid:646   ppid:2      flags:0x80004000
Oct 27 04:29:20 vm103.mydomain kernel: Call Trace:
Oct 27 04:29:20 vm103.mydomain kernel:  __schedule+0x2d1/0x870
Oct 27 04:29:20 vm103.mydomain kernel:  ? bit_wait_timeout+0xa0/0xa0
Oct 27 04:29:20 vm103.mydomain kernel:  schedule+0x55/0xf0
Oct 27 04:29:20 vm103.mydomain kernel:  io_schedule+0x12/0x40
Oct 27 04:29:20 vm103.mydomain kernel:  bit_wait_io+0xd/0x50
Oct 27 04:29:20 vm103.mydomain kernel:  __wait_on_bit+0x2d/0x90
Oct 27 04:29:20 vm103.mydomain kernel:  out_of_line_wait_on_bit+0x91/0xb0
Oct 27 04:29:20 vm103.mydomain kernel:  ? var_wake_function+0x30/0x30
Oct 27 04:29:20 vm103.mydomain kernel:  jbd2_journal_commit_transaction+0x14d3/0x19f0 [jbd2]
Oct 27 04:29:20 vm103.mydomain kernel:  ? finish_task_switch+0x86/0x2f0
Oct 27 04:29:20 vm103.mydomain kernel:  kjournald2+0xbd/0x270 [jbd2]
Oct 27 04:29:20 vm103.mydomain kernel:  ? finish_wait+0x80/0x80
Oct 27 04:29:20 vm103.mydomain kernel:  ? commit_timeout+0x10/0x10 [jbd2]
Oct 27 04:29:20 vm103.mydomain kernel:  kthread+0x134/0x150
Oct 27 04:29:20 vm103.mydomain kernel:  ? set_kthread_struct+0x50/0x50
Oct 27 04:29:20 vm103.mydomain kernel:  ret_from_fork+0x1f/0x40

And this is when the filesystem went read-only:

Code:
Oct 27 04:30:56 vm103.mydomain kernel: sd 0:0:0:0: [sda] tag#153 timing out command, waited 180s
Oct 27 04:30:56 vm103.mydomain kernel: sd 0:0:0:0: [sda] tag#153 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=261s
Oct 27 04:30:56 vm103.mydomain kernel: sd 0:0:0:0: [sda] tag#153 Sense Key : Aborted Command [current]
Oct 27 04:30:56 vm103.mydomain kernel: sd 0:0:0:0: [sda] tag#153 Add. Sense: I/O process terminated
Oct 27 04:30:56 vm103.mydomain kernel: sd 0:0:0:0: [sda] tag#153 CDB: Write(10) 2a 00 00 39 fc 80 00 00 08 00
Oct 27 04:30:56 vm103.mydomain kernel: blk_update_request: I/O error, dev sda, sector 3800192 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 27 04:30:56 vm103.mydomain kernel: blk_update_request: I/O error, dev sda, sector 3800192 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 27 04:30:56 vm103.mydomain kernel: Aborting journal on device sda3-8.
Oct 27 04:30:57 vm103.mydomain kernel: EXT4-fs error (device sda3): ext4_journal_check_start:61: Detected aborted journal
Oct 27 04:30:57 vm103.mydomain kernel: EXT4-fs (sda3): Remounting filesystem read-only
Oct 27 04:45:01 vm103.mydomain auditd[873]: Audit daemon is suspending logging due to previously mentioned write error

To fix the filesystem I had to run fsck:

Code:
fsck from util-linux 2.32.1
e2fsck 1.45.6 (20-Mar-2020)
/dev/sda3: recovering journal
JBD2: Invalid checksum recovering block 3145873 in log
....................
JBD2: Invalid checksum recovering block 3145873 in log
Journal checksum error found in /dev/sda3
/dev/sda3 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 735 has zero dtime.  Fix? yes
Inodes that were part of a corrupted orphan linked list found.  Fix? yes
Inode 2442 was part of the orphaned inode list.  FIXED.
..............................
Inode 5505128, i_blocks is 10336, should be 10328.  Fix? yes

Inode 5505144, end of extent exceeds allowed value
        (logical block 80, physical block 2293866, len 18)
Clear? yes

Inode 5505144, i_blocks is 784, should be 648.  Fix? yes

Pass 2: Checking directory structure
Entry 'Preferences' in /tmp/systemd-private-11c10ea2a2b84a6da18bdd4173e7882e-html2pdf.service-JrHkMG/tmp/puppeteer_dev_chrome_profile-XXXXXXsgymsd/Default (788259) has deleted/unused inode 788428.  Clear? yes

Entry 'drift' in /var/lib/chrony (5506837) has deleted/unused inode 5507374.  Clear? yes

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Inode 5507340 ref count is 1, should be 2.  Fix? yes

Pass 5: Checking group summary information
Block bitmap differences:  +516299 -(523002--523017) -(1021091--1021098) -1021146 -(2293866--2293882) -(2327808--2327871) -(2328032--2328063) -(2328128--2328191) -(2331168--2331199) -(2331264--2331391) -(2555904--2556415) -2557194 -(2559232--2559295) -(2562496--2562559) -(2563392--2563455) -(2563520--2563583) -(2566592--2566655) -(2570656--2570751) -(2574752--2574815) -(2578624--2578687) -(2582944--2583039) -2630712 -(2752512--2752703) -(2752736--2752863) -(2752896--2752927) -(2754944--2755071) -(2755584--2756607) -(2757888--2757983) -(2761088--2761119) -(2761216--2761727) -3154345 -3183813 -3184129 -(3186692--3186695) +22028581 -22028582 -22055153
Fix? yes

Free blocks count wrong for group #15 (10550, counted=10565).
Fix? yes

Free blocks count wrong for group #31 (13525, counted=13534).
Fix? yes

Free blocks count wrong for group #70 (15389, counted=15407).
Fix? yes

Free blocks count wrong for group #71 (22750, counted=23070).
Fix? yes

Free blocks count wrong for group #78 (31080, counted=32233).
Fix? yes

Free blocks count wrong for group #84 (21122, counted=23266).
Fix? yes

Free blocks count wrong for group #96 (23897, counted=23898).
Fix? yes

Free blocks count wrong for group #97 (12892, counted=12898).
Fix? yes

Free blocks count wrong for group #672 (23791, counted=23792).
Fix? yes

Free blocks count wrong for group #673 (26941, counted=26942).
Fix? yes

Free blocks count wrong (62047071, counted=62050704).
Fix? yes

Inode bitmap differences:  -735 -2442 -2444 -2450 -788233 -788236 -(788245--788247) -(788254--788256) -788263 -788275 -788360 -788376 -(788415--788416) -788418 -788423 -788426 -(788430--788431) -(788433--788434) -(788449--788450) -788730 -788732 -788737 -788741 -788747 -788750 -788752 -788824 -789199 -(789201--789206) -789208 -789373 -789600 -789605 -789607 -789611 -(918400--918401) -5507332 -5507374
Fix? yes

Free inodes count wrong for group #0 (0, counted=4).
Fix? yes

Free inodes count wrong for group #96 (5024, counted=5068).
Fix? yes

Directories count wrong for group #96 (645, counted=644).
Fix? yes

Free inodes count wrong for group #672 (5421, counted=5422).
Fix? yes

Free inodes count wrong (16174898, counted=16174957).
Fix? yes


/dev/sda3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda3: ***** REBOOT SYSTEM *****
/dev/sda3: 143507/16318464 files (0.3% non-contiguous), 3222384/65273088 blocks
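
For anyone who ends up in the same spot: since sda3 appears to hold the guest's root filesystem, the check has to run while it is unmounted, e.g. with the VM booted from a rescue/live ISO (a sketch; the device name is taken from the logs above):
Code:
# run from a rescue environment while /dev/sda3 is NOT mounted
e2fsck -f /dev/sda3
# or, to answer yes to all prompts automatically:
# e2fsck -f -y /dev/sda3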

Last week when this happened I thought it was a freak occurrence, but now I am seriously worried about this system.

Edit: I have cleared some space on the backup location and tried re-running the backup of this VM, but it started exhibiting similar behavior. Twenty minutes in it had only reached 4% of the backup, reading at 20 Mbit/s from an enterprise SSD in hardware RAID 1. At that point I stopped the backup to avoid damaging the filesystem again.
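
In case it helps with troubleshooting, the state of the backup storage and a manual single-VM run can be checked with something like this (a sketch; the storage name is the one from the logs above):
Code:
# is the backup storage active and does the mount respond?
pvesm status
df -h /mnt/pve/mybackup

# manual backup of just this VM to the same storage
vzdump 103 --storage mybackup --mode snapshot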

The VM disk is hosted on lvmthin storage local to the node, with cache set to "Default (No cache)" and "IO thread" enabled.
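
For context, with those settings the disk line in the VM config looks roughly like this (a sketch; storage name, bus and size are placeholders, and cache is simply omitted when it is left at the default):
Code:
# qm config 103   (relevant line only)
scsi0: local-lvm:vm-103-disk-0,iothread=1,size=100G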
 
Has this kept happening for you? I keep getting these "ERROR: vma_queue_write: write error - Broken pipe" failures about once a month, seemingly at random, and I have been unable to find any root cause. The VM keeps running fine, and if I kick the backup off again it completes without issue. Only once did the VM actually hang, but it was fine after a reboot.
 
I have the same issue with backups, but only when both storage controllers are connected to the switch. I have one NIC on the server in the same subnet as the storage controllers (Dell Compellent), dedicated to iSCSI. If I disconnect one controller, everything works fine and there is no corruption in the VMs' filesystems. If both controllers stay connected, some VMs get corrupted and I either need to reset them or, in bad cases, restore them from an older backup.
If I run journalctl -u pvestatd --since="1 day ago" I get errors like this:
Code:
Aug 29 07:45:51 pve3-srv pvestatd[2067]: VM 9002 qmp command failed - VM 9002 qmp command 'query-proxmox-support' failed - got timeout
Aug 29 07:46:12 pve3-srv pvestatd[2067]: status update time (28.699 seconds)
Aug 29 07:49:42 pve3-srv pvestatd[2067]: status update time (10.107 seconds)
Aug 29 07:49:59 pve3-srv pvestatd[2067]: status update time (16.872 seconds)
Aug 29 07:56:25 pve3-srv pvestatd[2067]: status update time (6.248 seconds)
Aug 29 08:00:27 pve3-srv pvestatd[2067]: pbs1-srv: error fetching datastores - 500 read timeout
Aug 29 08:00:28 pve3-srv pvestatd[2067]: status update time (8.093 seconds)
Aug 29 08:00:48 pve3-srv pvestatd[2067]: status update time (9.529 seconds)
Aug 29 08:01:54 pve3-srv pvestatd[2067]: status update time (5.239 seconds)

Compellent is a bit strange with the iSCSI target. You define IP addresses on each controller within the same fault domain (for instance 192.168.77.11 and 192.168.77.12), while the fault domain has its own IP address (for instance 192.168.77.10). The iSCSI target is exposed on the "portal" IP of 192.168.77.10. So I define two iSCSI targets, both on the "portal" IP of .10, using the IQNs of each controller port. On top of this, I define one LVM storage.
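
To make that concrete, the resulting /etc/pve/storage.cfg entries look roughly like this (a sketch; storage names, IQNs and the volume group are placeholders, and the portal IP is the example one from above):
Code:
iscsi: compellent-ctrl-a
        portal 192.168.77.10
        target iqn.2002-03.com.compellent:placeholder-ctrl-a-port
        content none

iscsi: compellent-ctrl-b
        portal 192.168.77.10
        target iqn.2002-03.com.compellent:placeholder-ctrl-b-port
        content none

lvm: compellent-lvm
        vgname vg_compellent
        content images
        shared 1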

I'm fully up to date on 8.4, pve-manager/8.4.12/c2ea8261d32a5020 (running kernel: 6.8.12-13-pve). I haven't jumped to 9.0 yet, as this is a production environment.

Does anyone have any idea why the storage goes berserk during the backup, but only if both controllers are connected? It seems like a network issue at this point, but I'm not sure where.
 
Has this kept happening for you? I keep getting these "ERROR: vma_queue_write: write error - Broken pipe" failures about once a month, seemingly at random, and I have been unable to find any root cause. The VM keeps running fine, and if I kick the backup off again it completes without issue. Only once did the VM actually hang, but it was fine after a reboot.
Does anyone have any idea why the storage goes berserk during the backup, but only if both controllers are connected? It seems like a network issue at this point, but I'm not sure where.

I was eventually able to resolve my own issue. While reviewing the interface statistics on my Mikrotik switch, I discovered a lot of Tx/Rx pauses and some errors.
After turning off flow control for Rx and Tx on the relevant switch ports, all of my Proxmox backup-over-network issues were resolved.
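
For anyone chasing the same symptom, the host side of flow control can be checked as well (a sketch; `enp1s0` is a placeholder for the iSCSI/backup NIC, and the exact pause counter names vary by driver):
Code:
# current pause (flow control) settings on the NIC
ethtool -a enp1s0
# pause-related counters, if the driver exposes them
ethtool -S enp1s0 | grep -i pause
# disable RX/TX flow control on the host side too
ethtool -A enp1s0 rx off tx off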
 
I think I figured out my issue. The volume had a multipath configuration for its mappings, but only one path was actually available from the server: both controllers were connected to the switch, but the server has only the one NIC for iSCSI. I'm not sure how or why this configuration would cause data corruption, but since I changed to a single-path mapping, backups no longer cause any issues. This is probably a Dell issue, unrelated to Proxmox.
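
In case it is useful to others: whether every configured path is actually usable can be verified on the node before (or instead of) relying on multipath (a sketch; assumes open-iscsi and multipath-tools are installed):
Code:
# list active iSCSI sessions and which portal/interface each one uses
iscsiadm -m session -P 1
# show the multipath topology; every listed path should be active/ready
multipath -ll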