Hello,
I have an issue that I don't know how to start debugging. Perhaps someone on this forum with more experience than me can give me some pointers.
I have Proxmox running on my HomeLab with the following configuration:
Bash:
CPU(s) 32 x Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz (2 Sockets)
Kernel Version Linux 5.13.19-2-pve #1 SMP PVE 5.13.19-4 (Mon, 29 Nov 2021 12:10:09 +0100)
PVE Manager Version pve-manager/7.1-8/5b267f33
Since I switched to version 7 I have been having a strange issue, and I can't figure out whether it is a hardware or a software problem.
From time to time the root partition switches to read-only and I have to physically reboot the server.
I did some digging and found that this behavior appears when the nightly backup job runs on VM105. The backup is written to two HDDs running in RAID 1 with ZFS. Below is the log from Proxmox for the most recent instance. As you can see, the log stops at ~8% of the backup (the web interface reports a job status error). When I physically log in on the server, I see an error message saying that the / partition is in read-only mode. After a manual reboot, everything goes back to normal.
This doesn't happen every time. The most recent occurrence was during last night's backup, after 34 days of successful backups of this VM. Before that, there was a period when it happened daily. I thought the behavior occurred because VM105 had a drive located on the same ZFS RAID 1 HDDs that I back up to. I moved all the data from that drive to local-lvm:vm-105-disk-0 and removed the drive. The issue did not reappear for 34 days, until last night.
Any idea how to investigate this issue?
Bash:
INFO: Starting Backup of VM 105 (qemu)
INFO: Backup started at 2022-01-12 00:04:38
INFO: status = running
INFO: VM Name: Hercules
INFO: include disk 'scsi0' 'local-lvm:vm-105-disk-0' 52G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/tank1-backup/dump/vzdump-qemu-105-2022_01_12-00_04_38.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'a12fe354-f63c-4584-bdd5-2bb0cb7b0866'
INFO: resuming VM again
INFO: 2% (1.2 GiB of 52.0 GiB) in 3s, read: 416.9 MiB/s, write: 194.8 MiB/s
INFO: 3% (1.8 GiB of 52.0 GiB) in 6s, read: 212.7 MiB/s, write: 169.3 MiB/s
INFO: 4% (2.4 GiB of 52.0 GiB) in 9s, read: 190.9 MiB/s, write: 151.0 MiB/s
INFO: 5% (3.0 GiB of 52.0 GiB) in 12s, read: 199.4 MiB/s, write: 189.0 MiB/s
INFO: 6% (3.6 GiB of 52.0 GiB) in 15s, read: 209.0 MiB/s, write: 187.9 MiB/s
INFO: 7% (4.1 GiB of 52.0 GiB) in 18s, read: 181.2 MiB/s, write: 171.0 MiB/s
INFO: 8% (4.6 GiB of 52.0 GiB) in 21s, read: 175.2 MiB/s, write: 167.7 MiB/s
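Because the log simply stops, it may help to catch the moment / flips to read-only and save the kernel messages somewhere that is still writable. The following is only a minimal sketch: `mount_mode` is a hypothetical helper name, and it assumes the standard `/proc/mounts` layout (device, mount point, filesystem type, options). It compares whole option names, so an option such as `errors=remount-ro` (typical for ext4, which has not been confirmed as the root filesystem here) is not mistaken for `ro`.

```shell
#!/bin/sh
# Print "ro" or "rw" for a mount point, reading a table in /proc/mounts format.
# Usage: mount_mode <mountpoint> [mounts_file]
# Exits with status 1 (and prints nothing) if the mount point is not listed.
mount_mode() {
    mp="$1"
    table="${2:-/proc/mounts}"
    awk -v mp="$mp" '
        # Field 2 is the mount point, field 4 the comma-separated options.
        # Keep the last match, because later mounts shadow earlier ones.
        $2 == mp { opts = $4 }
        END {
            if (opts == "") exit 1
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++) {
                if (o[i] == "ro") { print "ro"; exit 0 }
            }
            print "rw"
        }
    ' "$table"
}
```

A cron job could call `mount_mode /` every minute during the backup window and, when it prints `ro`, write the output of `dmesg -T` to the backup storage under `/mnt/pve/tank1-backup`, since / itself can no longer be written at that point.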
Output of the lsblk command:
Bash:
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1                      259:0    0 931.5G  0 disk
├─nvme0n1p1                  259:1    0  1007K  0 part
├─nvme0n1p2                  259:2    0   512M  0 part
└─nvme0n1p3                  259:3    0   931G  0 part
  ├─pve-swap                 253:0    0     1G  0 lvm  [SWAP]
  ├─pve-root                 253:1    0    30G  0 lvm  /
  ├─pve-data_tmeta           253:2    0   8.8G  0 lvm
  │ └─pve-data-tpool         253:4    0 866.3G  0 lvm
  │   ├─pve-data             253:5    0 866.3G  1 lvm
  │   ├─pve-vm--105--disk--0 253:6    0    52G  0 lvm
  │   ├─pve-vm--104--disk--0 253:7    0     4M  0 lvm
  │   ├─pve-vm--104--disk--1 253:8    0    32G  0 lvm
  │   ├─pve-vm--101--disk--0 253:9    0    10G  0 lvm
  │   ├─pve-vm--152--disk--0 253:10   0    64G  0 lvm
  │   ├─pve-vm--153--disk--0 253:11   0    16G  0 lvm
  │   ├─pve-vm--154--disk--0 253:12   0    16G  0 lvm
  │   ├─pve-vm--155--disk--0 253:13   0    16G  0 lvm
  │   ├─pve-vm--156--disk--0 253:14   0    32G  0 lvm
  │   ├─pve-vm--157--disk--0 253:15   0     8G  0 lvm
  │   ├─pve-vm--158--disk--0 253:16   0    32G  0 lvm
  │   ├─pve-vm--159--disk--0 253:17   0    32G  0 lvm
  │   ├─pve-vm--100--disk--0 253:18   0    32G  0 lvm
  │   ├─pve-vm--150--disk--0 253:19   0   120G  0 lvm
  │   ├─pve-vm--106--disk--0 253:20   0    60G  0 lvm
  │   ├─pve-vm--102--disk--0 253:21   0    32G  0 lvm
  │   └─pve-vm--151--disk--0 253:22   0    60G  0 lvm
  └─pve-data_tdata           253:3    0 866.3G  0 lvm
    └─pve-data-tpool         253:4    0 866.3G  0 lvm
      ├─pve-data             253:5    0 866.3G  1 lvm
      ├─pve-vm--105--disk--0 253:6    0    52G  0 lvm
      ├─pve-vm--104--disk--0 253:7    0     4M  0 lvm
      ├─pve-vm--104--disk--1 253:8    0    32G  0 lvm
      ├─pve-vm--101--disk--0 253:9    0    10G  0 lvm
      ├─pve-vm--152--disk--0 253:10   0    64G  0 lvm
      ├─pve-vm--153--disk--0 253:11   0    16G  0 lvm
      ├─pve-vm--154--disk--0 253:12   0    16G  0 lvm
      ├─pve-vm--155--disk--0 253:13   0    16G  0 lvm
      ├─pve-vm--156--disk--0 253:14   0    32G  0 lvm
      ├─pve-vm--157--disk--0 253:15   0     8G  0 lvm
      ├─pve-vm--158--disk--0 253:16   0    32G  0 lvm
      ├─pve-vm--159--disk--0 253:17   0    32G  0 lvm
      ├─pve-vm--100--disk--0 253:18   0    32G  0 lvm
      ├─pve-vm--150--disk--0 253:19   0   120G  0 lvm
      ├─pve-vm--106--disk--0 253:20   0    60G  0 lvm
      ├─pve-vm--102--disk--0 253:21   0    32G  0 lvm
      └─pve-vm--151--disk--0 253:22   0    60G  0 lvm
Output of the smartctl -a /dev/nvme0n1 command for the NVMe SSD where the root partition and the VM disks are located:
Bash:
sitram@serenity:~$ sudo smartctl -a /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.13.19-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: ADATA SWORDFISH
Serial Number: 2K302L1KGBWU
Firmware Version: V9002s45
PCI Vendor/Subsystem ID: 0x10ec
IEEE OUI Identifier: 0x00e04c
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Wed Jan 12 12:19:25 2022 EET
Firmware Updates (0x02): 1 Slot
Optional Admin Commands (0x0007): Security Format Frmw_DL
Optional NVM Commands (0x0014): DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 115 Celsius
Critical Comp. Temp. Threshold: 120 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 8.00W - - 0 0 0 0 0 0
1 + 4.00W - - 1 1 1 1 0 0
2 + 3.00W - - 2 2 2 2 0 0
3 - 0.0128W - - 3 3 3 3 4000 8000
4 - 0.0080W - - 4 4 4 4 8000 30000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 32%
Percentage Used: 2%
Data Units Read: 116,663,831 [59.7 TB]
Data Units Written: 32,424,125 [16.6 TB]
Host Read Commands: 1,402,182,530
Host Write Commands: 1,112,104,350
Controller Busy Time: 0
Power Cycles: 58
Power On Hours: 3,855
Unsafe Shutdowns: 40
Media and Data Integrity Errors: 0
Error Information Log Entries: 12
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Thermal Temp. 1 Transition Count: 1
Error Information (NVMe Log 0x01, 8 of 8 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
- [1 unused entry]
1 1219368206220736577 0 0x0000 0x0000 0x000 0 0 -
- [1 unused entry]
3 1 0 0x0000 0x0000 0x000 0 0 -
Output of the nvme-cli error-log command for the same SSD:
Bash:
sudo nvme error-log /dev/nvme0n1
Error Log Entries for device:nvme0n1 entries:8
.................
Entry[ 0]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 1]
.................
error_count : 1219368206220736577
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 2]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 3]
.................
error_count : 1
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 4]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 5]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 6]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 7]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
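Most of the eight entries above are empty, so the interesting ones are easy to miss. As a rough sketch, a filter like the one below prints only the entries whose `error_count` is non-zero. `nonzero_error_entries` is a hypothetical helper name, and the parsing assumes the plain-text layout shown above (an `Entry[ N]` line followed by `key : value` lines). Values are compared as strings, so very large counters are printed unchanged.

```shell
#!/bin/sh
# Read `nvme error-log` text output on stdin and print "<entry> <error_count>"
# for every entry whose error_count is not 0.
nonzero_error_entries() {
    awk '
        # Remember the index of the entry currently being read, e.g. "Entry[ 3]" -> 3.
        /^[ \t]*Entry\[/ {
            idx = $0
            gsub(/[^0-9]/, "", idx)
        }
        # Report the entry if its error_count value is anything other than "0".
        /^[ \t]*error_count/ {
            split($0, kv, ":")
            gsub(/[ \t]/, "", kv[2])
            if (kv[2] != "0") print idx, kv[2]
        }
    '
}
```

It would be used as `sudo nvme error-log /dev/nvme0n1 | nonzero_error_entries`; for the output above this should list only entries 1 and 3.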
Thank you in advance for your suggestions!