Root partition switches to Read Only during VM backup

Sitram

Hello,

I have an issue which I don't know how to start debugging. Perhaps someone on this forum with more experience than me can give me some pointers.

I have Proxmox running on my HomeLab with the following configuration:

Bash:
CPU(s)                   32 x Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz (2 Sockets)
Kernel Version           Linux 5.13.19-2-pve #1 SMP PVE 5.13.19-4 (Mon, 29 Nov 2021 12:10:09 +0100)
PVE Manager Version      pve-manager/7.1-8/5b267f33

Since I switched to version 7 I have been having a strange issue, and I can't figure out whether it is a hardware or a software problem.

From time to time the root partition switches to read-only and I have to physically reboot the server.

I did some digging and found that this behavior appears when the nightly backup job runs on VM 105. The backup is written to two HDDs running in RAID 1 with ZFS. The Proxmox log from the most recent instance is further below. As you can see, at roughly 8% the log simply stops (a job status of error is reported in the web interface). When I log in physically on the server, I see the error message that the / partition is in read-only mode. After a manual reboot, everything goes back to normal.

This doesn't happen all the time. The last time was during last night's backup, after 34 days of successful backups of this VM. Before that, I had a period when it happened daily. I initially thought the behavior occurred because VM 105 had a disk located on the same ZFS RAID 1 HDDs that receive the backup, so I moved all the data from that disk to local-lvm:vm-105-disk-0 and removed it. The issue didn't reproduce for 34 days, until last night.

Any idea how to investigate this issue?
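
So far the only things I could think of checking myself are the kernel messages from the previous boot, the mount state of / while the problem is active, and the health of the ZFS pool that receives the backup. This is roughly what I run (a sketch: it assumes persistent journaling, and that the pool is named tank1-backup, which I am only inferring from the storage name):

Bash:
# Keep kernel messages across reboots (assumption: the journal is not yet persistent)
mkdir -p /var/log/journal
systemctl restart systemd-journald

# After the next forced reboot, inspect the previous boot for storage errors
journalctl -k -b -1 | grep -iE 'nvme|blk_update_request|i/o error|ext4|remount|read-only'

# While the server is still up in the broken state, confirm / is really mounted read-only
findmnt -no OPTIONS /

# Check the mirror that receives the backups (pool name is my assumption)
zpool status -v tank1-backup

If the journal shows NVMe resets or I/O errors right before the remount, that would point at the SSD or its firmware rather than at the backup job itself.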

Bash:
INFO: Starting Backup of VM 105 (qemu)
INFO: Backup started at 2022-01-12 00:04:38
INFO: status = running
INFO: VM Name: Hercules
INFO: include disk 'scsi0' 'local-lvm:vm-105-disk-0' 52G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/tank1-backup/dump/vzdump-qemu-105-2022_01_12-00_04_38.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'a12fe354-f63c-4584-bdd5-2bb0cb7b0866'
INFO: resuming VM again
INFO:   2% (1.2 GiB of 52.0 GiB) in 3s, read: 416.9 MiB/s, write: 194.8 MiB/s
INFO:   3% (1.8 GiB of 52.0 GiB) in 6s, read: 212.7 MiB/s, write: 169.3 MiB/s
INFO:   4% (2.4 GiB of 52.0 GiB) in 9s, read: 190.9 MiB/s, write: 151.0 MiB/s
INFO:   5% (3.0 GiB of 52.0 GiB) in 12s, read: 199.4 MiB/s, write: 189.0 MiB/s
INFO:   6% (3.6 GiB of 52.0 GiB) in 15s, read: 209.0 MiB/s, write: 187.9 MiB/s
INFO:   7% (4.1 GiB of 52.0 GiB) in 18s, read: 181.2 MiB/s, write: 171.0 MiB/s
INFO:   8% (4.6 GiB of 52.0 GiB) in 21s, read: 175.2 MiB/s, write: 167.7 MiB/s
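
To try to reproduce this outside of the nightly schedule, I can also trigger the same backup by hand and cap the bandwidth, to see whether the amount of I/O load makes a difference (a sketch: the storage name matches my configuration, and the limit value is just an example; vzdump takes it in KiB/s):

Bash:
# Manually run the snapshot-mode backup of VM 105, capped at ~100 MiB/s
vzdump 105 --mode snapshot --storage tank1-backup --bwlimit 102400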

Output of the lsblk command:

Bash:
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1                      259:0    0 931.5G  0 disk
├─nvme0n1p1                  259:1    0  1007K  0 part
├─nvme0n1p2                  259:2    0   512M  0 part
└─nvme0n1p3                  259:3    0   931G  0 part
  ├─pve-swap                 253:0    0     1G  0 lvm  [SWAP]
  ├─pve-root                 253:1    0    30G  0 lvm  /
  ├─pve-data_tmeta           253:2    0   8.8G  0 lvm
  │ └─pve-data-tpool         253:4    0 866.3G  0 lvm
  │   ├─pve-data             253:5    0 866.3G  1 lvm
  │   ├─pve-vm--105--disk--0 253:6    0    52G  0 lvm
  │   ├─pve-vm--104--disk--0 253:7    0     4M  0 lvm
  │   ├─pve-vm--104--disk--1 253:8    0    32G  0 lvm
  │   ├─pve-vm--101--disk--0 253:9    0    10G  0 lvm
  │   ├─pve-vm--152--disk--0 253:10   0    64G  0 lvm
  │   ├─pve-vm--153--disk--0 253:11   0    16G  0 lvm
  │   ├─pve-vm--154--disk--0 253:12   0    16G  0 lvm
  │   ├─pve-vm--155--disk--0 253:13   0    16G  0 lvm
  │   ├─pve-vm--156--disk--0 253:14   0    32G  0 lvm
  │   ├─pve-vm--157--disk--0 253:15   0     8G  0 lvm
  │   ├─pve-vm--158--disk--0 253:16   0    32G  0 lvm
  │   ├─pve-vm--159--disk--0 253:17   0    32G  0 lvm
  │   ├─pve-vm--100--disk--0 253:18   0    32G  0 lvm
  │   ├─pve-vm--150--disk--0 253:19   0   120G  0 lvm
  │   ├─pve-vm--106--disk--0 253:20   0    60G  0 lvm
  │   ├─pve-vm--102--disk--0 253:21   0    32G  0 lvm
  │   └─pve-vm--151--disk--0 253:22   0    60G  0 lvm
  └─pve-data_tdata           253:3    0 866.3G  0 lvm
    └─pve-data-tpool         253:4    0 866.3G  0 lvm
      ├─pve-data             253:5    0 866.3G  1 lvm
      ├─pve-vm--105--disk--0 253:6    0    52G  0 lvm
      ├─pve-vm--104--disk--0 253:7    0     4M  0 lvm
      ├─pve-vm--104--disk--1 253:8    0    32G  0 lvm
      ├─pve-vm--101--disk--0 253:9    0    10G  0 lvm
      ├─pve-vm--152--disk--0 253:10   0    64G  0 lvm
      ├─pve-vm--153--disk--0 253:11   0    16G  0 lvm
      ├─pve-vm--154--disk--0 253:12   0    16G  0 lvm
      ├─pve-vm--155--disk--0 253:13   0    16G  0 lvm
      ├─pve-vm--156--disk--0 253:14   0    32G  0 lvm
      ├─pve-vm--157--disk--0 253:15   0     8G  0 lvm
      ├─pve-vm--158--disk--0 253:16   0    32G  0 lvm
      ├─pve-vm--159--disk--0 253:17   0    32G  0 lvm
      ├─pve-vm--100--disk--0 253:18   0    32G  0 lvm
      ├─pve-vm--150--disk--0 253:19   0   120G  0 lvm
      ├─pve-vm--106--disk--0 253:20   0    60G  0 lvm
      ├─pve-vm--102--disk--0 253:21   0    32G  0 lvm
      └─pve-vm--151--disk--0 253:22   0    60G  0 lvm

Output of the smartctl -a /dev/nvme0n1 command on the NVMe SSD where the root partition and VM disks are located:

Bash:
sitram@serenity:~$ sudo smartctl -a /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.13.19-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       ADATA SWORDFISH
Serial Number:                      2K302L1KGBWU
Firmware Version:                   V9002s45
PCI Vendor/Subsystem ID:            0x10ec
IEEE OUI Identifier:                0x00e04c
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Wed Jan 12 12:19:25 2022 EET
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     115 Celsius
Critical Comp. Temp. Threshold:     120 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.00W       -        -    0  0  0  0        0       0
 1 +     4.00W       -        -    1  1  1  1        0       0
 2 +     3.00W       -        -    2  2  2  2        0       0
 3 -   0.0128W       -        -    3  3  3  3     4000    8000
 4 -   0.0080W       -        -    4  4  4  4     8000   30000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          32%
Percentage Used:                    2%
Data Units Read:                    116,663,831 [59.7 TB]
Data Units Written:                 32,424,125 [16.6 TB]
Host Read Commands:                 1,402,182,530
Host Write Commands:                1,112,104,350
Controller Busy Time:               0
Power Cycles:                       58
Power On Hours:                     3,855
Unsafe Shutdowns:                   40
Media and Data Integrity Errors:    0
Error Information Log Entries:      12
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Thermal Temp. 1 Transition Count:   1

Error Information (NVMe Log 0x01, 8 of 8 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  - [1 unused entry]
  1 1219368206220736577     0  0x0000  0x0000  0x000            0     0     -
  - [1 unused entry]
  3          1     0  0x0000  0x0000  0x000            0     0     -

Output of the nvme-cli error-log command on the SSD drive:

Bash:
sudo nvme error-log /dev/nvme0n1
Error Log Entries for device:nvme0n1 entries:8
.................
 Entry[ 0]
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 1]
.................
error_count     : 1219368206220736577
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 2]
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 3]
.................
error_count     : 1
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 4]
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 5]
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 6]
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 7]
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................

Thank you in advance for your suggestions!
 
I found a newer BIOS version on the ASUS website for my Z10PE-D8 WS motherboard than the one I had flashed. The description only says "Improved system performance", which doesn't help much, but I upgraded to this latest version anyway; maybe it will help.

The only way to update the ADATA SWORDFISH firmware is through a Windows utility from the ADATA website. I was able to make a Windows live USB with this utility on it and boot the server from it. Unfortunately, the firmware update utility needs an internet connection to check for updates, and since my firewall runs as a VM on this very server, I had no connection and no way to check for firmware updates for the SSD.

The tool from ADATA shows that the SSD is in good health. The wear level is at 2%, the same as in the SMART readings from Proxmox. I couldn't find any other useful information in it.

Since my initial post, I have had one more instance where the root partition switched to read-only, even with the backup of VM 105 deactivated.
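
Since it now happens even without the backup job, I'm thinking of leaving a small watchdog running that dumps diagnostics the moment / flips to read-only. A minimal sketch (the script path, the log location on the ZFS pool and the cron schedule are my own choices):

Bash:
cat > /usr/local/bin/ro-watch.sh <<'EOF'
#!/bin/bash
# If / is mounted read-only, append diagnostics to a log on the (hopefully still writable) ZFS pool
if findmnt -no OPTIONS / | grep -qE '^ro(,|$)'; then
    {
        date
        dmesg | tail -n 100
        smartctl -a /dev/nvme0n1
    } >> /mnt/pve/tank1-backup/ro-incident.log 2>&1
fi
EOF
chmod +x /usr/local/bin/ro-watch.sh

# Run the check every minute via cron
echo '* * * * * root /usr/local/bin/ro-watch.sh' > /etc/cron.d/ro-watch

That way I should at least capture the kernel messages and SMART state from the exact moment it happens, instead of reconstructing them after a reboot.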

If I cannot figure out this issue, I'm considering, as a fallback solution, replacing the SSD with a Samsung 970 EVO Plus 1 TB NVMe M.2 SSD.

The plot thickens!
 
