Hello everyone,
First off, my hardware configuration:
Mainboard: ASRock X570 Pro4
CPU: AMD Ryzen 7 PRO 5750G
GPU: Intel Arc A380
RAM: 128GB 3200MHz ECC UDIMM
NIC: X540-T2
SSD: 2x Samsung 980 Pro 2TB for rpool
HDD: 3x 10TB & 3x 18TB for HDD-pool
Last weekend I installed Proxmox Backup Server inside an LXC on my pve node, with a bind mount to my HDD-pool.
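For context, the bind mount is set up roughly like this (the container ID, datastore path and mount point below are placeholders, not my real values):
Code:

# Placeholder container ID and paths, only to illustrate the kind of bind mount I mean
pct set 200 -mp0 /hddpool/pbs-datastore,mp=/mnt/datastore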
I first created a backup of an external server, which worked fine, but then I started to back up the VMs/LXCs on my pve node itself.
Most worked fine, but I ran into errors with a few LXCs/VMs, like:
Code:

INFO: Error: error at "root/world/region/r.-16.4.mca": No data available (os error 61)
ERROR: job failed with err -61 - No data available

After looking in the syslog I saw messages like these:
Code:
	2023-08-13T13:07:05.034645+02:00 pve zed: eid=35 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=131072 offset=140912259072 priority=4 err=61 flags=0x1808b0 delay=253ms bookmark=76013:96729:0:1
2023-08-13T13:07:38.859831+02:00 pve kernel: [  655.716472] critical medium error, dev nvme1n1, sector 1411234976 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 2
2023-08-13T13:07:38.860361+02:00 pve zed: eid=36 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=131072 offset=722014388224 priority=4 err=61 flags=0x40080cb0 delay=308ms
2023-08-13T13:07:38.860478+02:00 pve zed: eid=37 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014502912 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496371
2023-08-13T13:07:38.860684+02:00 pve zed: eid=38 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014494720 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496370
2023-08-13T13:07:38.861065+02:00 pve zed: eid=39 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014470144 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496367
2023-08-13T13:07:38.861231+02:00 pve zed: eid=40 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014453760 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496365
2023-08-13T13:07:38.861348+02:00 pve zed: eid=41 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014445568 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496364
2023-08-13T13:07:38.861834+02:00 pve zed: eid=42 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014511104 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496374
2023-08-13T13:07:38.862249+02:00 pve zed: eid=43 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014486528 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496369
2023-08-13T13:07:38.862515+02:00 pve zed: eid=44 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014478336 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496368
2023-08-13T13:07:38.863063+02:00 pve zed: eid=45 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014437376 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496363
2023-08-13T13:07:38.863366+02:00 pve zed: eid=46 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014388224 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496357
2023-08-13T13:07:38.863638+02:00 pve zed: eid=47 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014461952 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496366
2023-08-13T13:07:38.863831+02:00 pve zed: eid=48 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014429184 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496362
2023-08-13T13:07:38.864352+02:00 pve zed: eid=49 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014412800 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496360
2023-08-13T13:07:38.864754+02:00 pve zed: eid=50 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014404608 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496359
2023-08-13T13:07:38.864963+02:00 pve zed: eid=51 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014396416 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496358
2023-08-13T13:07:38.865598+02:00 pve zed: eid=52 class=checksum pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 algorithm=fletcher4 size=8192 offset=722014420992 priority=4 err=52 flags=0x3808b0 bookmark=172:1:0:1496361
2023-08-13T13:07:39.171431+02:00 pve zed: eid=53 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=131072 offset=722041638912 priority=4 err=61 flags=0x40080cb0 delay=599ms
2023-08-13T13:07:39.171578+02:00 pve zed: eid=54 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722041745408 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1444633
2023-08-13T13:07:39.171673+02:00 pve zed: eid=55 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722041753600 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1444634
2023-08-13T13:07:39.171829+02:00 pve kernel: [  656.027934] nvme1n1: I/O Cmd(0x2) @ LBA 1411288200, 256 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR
2023-08-13T13:07:39.171835+02:00 pve kernel: [  656.027944] critical medium error, dev nvme1n1, sector 1411288200 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 2
2023-08-13T13:07:39.172145+02:00 pve zed: eid=56 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722041704448 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1444628
2023-08-13T13:07:39.172202+02:00 pve zed: eid=57 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722041696256 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1444627
2023-08-13T13:07:39.172528+02:00 pve zed: eid=58 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722041679872 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1444625

I had a bad feeling about this, and sadly, when checking zpool status I was greeted by this:
Code:
	  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 1.12M in 00:12:57 with 191 errors on Sun Aug 13 13:16:06 2023
config:
        NAME                                                    STATE     READ WRITE CKSUM
        rpool                                                   DEGRADED     0     0     0
          mirror-0                                              DEGRADED 2.77K     0     0
            nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510655P-part3  DEGRADED 2.98K     0   491  too many errors
            nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3  FAULTED     75     0     9  too many errors
errors: Permanent errors have been detected in the following files:
        rpool/data/vm-110-disk-0:<0x1>
        rpool/data/vm-105-disk-0:<0x1>
        <0x70>:<0x7e43f>
        <0x70>:<0x7e4c3>
        <0x70>:<0x7aac5>
        <0x70>:<0x7e4d5>
        /rpool/data/subvol-109-disk-0/root/world/region/r.-16.4.mca

The scrub shown in the scan line (scrub repaired 1.12M in 00:12:57 with 191 errors on Sun Aug 13 13:16:06 2023) was one I started manually.
Apparently I was stupid and had only implemented scrubbing and mail notifications for my HDD-pool.
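What I had in place for the HDD-pool (but never replicated for rpool) is roughly the following; the pool name, schedule and zed.rc values here are just examples:
Code:

# Example: monthly scrub of the HDD-pool via cron (pool name and schedule are examples)
echo '0 3 1 * * root /usr/sbin/zpool scrub hddpool' > /etc/cron.d/zfs-scrub-hddpool
# Mail notifications from ZED, configured in /etc/zfs/zed.d/zed.rc:
#   ZED_EMAIL_ADDR="root"
#   ZED_NOTIFY_INTERVAL_SECS=3600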
About a month ago I made some changes to my server:
- Upgraded to PVE 8
- Installed the Intel Arc A380
- Upgraded the BIOS
- Enabled Resizable BAR
- Enabled SR-IOV
Both SSDs' SMART values look fine: only 2% and 3% wear, with a 24TB difference in total writes (109TB vs. 133TB). (I bought them 2 years ago.)
I think the difference in TBW indicates that the problem has existed for longer than just the changes I made a month ago.
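(For reference, the wear and written-data figures come from the NVMe SMART attributes, read roughly like this; the device names are just whatever the two mirror members are called on my system:)
Code:

# Wear level and total data written from the NVMe SMART log
smartctl -a /dev/nvme0 | grep -E 'Percentage Used|Data Units Written'
smartctl -a /dev/nvme1 | grep -E 'Percentage Used|Data Units Written'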
Memtest ran fine multiple times.
Right now everything still works fine.
Is there any way to salvage this?
As I see it, there is already data loss, and replacing one drive after the other and resilvering would not give me a clean pool again.
I could ditch VMs 105 and 110 and delete the affected file in LXC 109, as none of them hold important data.
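If that is a sane approach, I would do something like this (the VM IDs and file path are taken from the zpool status output above; please tell me if this sequence is wrong):
Code:

# Throw away the two affected VMs and the one affected file, then re-check the pool
qm destroy 105
qm destroy 110
rm /rpool/data/subvol-109-disk-0/root/world/region/r.-16.4.mca
zpool scrub rpool    # let ZFS re-evaluate the remaining errors
zpool clear rpool    # reset the error counters once the scrub comes back clean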
But I do not understand what these errors mean:
Code:
	        <0x70>:<0x7e43f>
        <0x70>:<0x7e4c3>
        <0x70>:<0x7aac5>
        <0x70>:<0x7e4d5>

What would be the best thing to do right now?
I backed up all the VMs and LXCs I care about to my HDD-pool via the Proxmox Backup Server I mentioned earlier.
There is no encryption on my pools or my backup.
The plan I came up with is:
- Back up the VMs/LXCs again
- Back up the pve configs to the PBS LXC
- Remove the old SSDs
- Install Proxmox on new SSDs
- Import the HDD-pool
- Create the PBS LXC with the same config/mount point as before
- Restore the pve configs
- Restore the VMs/LXCs
What paths are needed for a pve backup?
Right now I would back up the following (a rough sketch of how I would do it follows the list):
- /etc
- /var
- /root
- /opt
- /boot
- installed apt packages and repos
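Something like this is what I had in mind for the host part (the repository string and archive names are only examples; mine would point at the PBS LXC):
Code:

# Record installed packages so they can be reinstalled after the fresh install
dpkg --get-selections > /root/pkg-selections.txt

# Push the config directories to the PBS datastore as a host backup
proxmox-backup-client backup \
    etc.pxar:/etc var.pxar:/var root.pxar:/root opt.pxar:/opt boot.pxar:/boot \
    --repository root@pam@192.168.1.10:hdd-datastore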
Will this work or am I missing something?
Or is this idea stupid, and is there a better/easier way?
Thanks in advance!