Hello everyone,
First off, my hardware configuration:
Mainboard: ASRock X570 Pro4
CPU: Ryzen 5750G
GPU: Intel A380
RAM: 128GB 3200MHz ECC UDIMM
NIC: X540-T2
SSD: 2x Samsung 980 Pro 2TB for rpool
HDD: 3x 10TB & 3x 18TB for the HDD-pool
Last weekend I installed Proxmox Backup Server inside an LXC with a bind mount to my HDD-pool on my PVE node.
I wanted to create a backup of an external server, which worked fine, but then I started backing up the VMs/LXCs on my PVE node.
Most worked fine, but I ran into errors with a few LXCs/VMs, like:
Code:
INFO: Error: error at "root/world/region/r.-16.4.mca": No data available (os error 61)
ERROR: job failed with err -61 - No data available
After looking in syslog I saw messages like these:
Code:
2023-08-13T13:07:05.034645+02:00 pve zed: eid=35 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=131072 offset=140912259072 priority=4 err=61 flags=0x1808b0 delay=253ms bookmark=76013:96729:0:1
2023-08-13T13:07:38.859831+02:00 pve kernel: [ 655.716472] critical medium error, dev nvme1n1, sector 1411234976 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 2
2023-08-13T13:07:38.860361+02:00 pve zed: eid=36 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=131072 offset=722014388224 priority=4 err=61 flags=0x40080cb0 delay=308ms
2023-08-13T13:07:38.860478+02:00 pve zed: eid=37 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014502912 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496371
2023-08-13T13:07:38.860684+02:00 pve zed: eid=38 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014494720 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496370
2023-08-13T13:07:38.861065+02:00 pve zed: eid=39 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014470144 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496367
2023-08-13T13:07:38.861231+02:00 pve zed: eid=40 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014453760 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496365
2023-08-13T13:07:38.861348+02:00 pve zed: eid=41 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014445568 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496364
2023-08-13T13:07:38.861834+02:00 pve zed: eid=42 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014511104 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496374
2023-08-13T13:07:38.862249+02:00 pve zed: eid=43 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014486528 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496369
2023-08-13T13:07:38.862515+02:00 pve zed: eid=44 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014478336 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496368
2023-08-13T13:07:38.863063+02:00 pve zed: eid=45 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014437376 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496363
2023-08-13T13:07:38.863366+02:00 pve zed: eid=46 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014388224 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496357
2023-08-13T13:07:38.863638+02:00 pve zed: eid=47 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014461952 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496366
2023-08-13T13:07:38.863831+02:00 pve zed: eid=48 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014429184 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496362
2023-08-13T13:07:38.864352+02:00 pve zed: eid=49 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014412800 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496360
2023-08-13T13:07:38.864754+02:00 pve zed: eid=50 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014404608 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496359
2023-08-13T13:07:38.864963+02:00 pve zed: eid=51 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722014396416 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1496358
2023-08-13T13:07:38.865598+02:00 pve zed: eid=52 class=checksum pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 algorithm=fletcher4 size=8192 offset=722014420992 priority=4 err=52 flags=0x3808b0 bookmark=172:1:0:1496361
2023-08-13T13:07:39.171431+02:00 pve zed: eid=53 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=131072 offset=722041638912 priority=4 err=61 flags=0x40080cb0 delay=599ms
2023-08-13T13:07:39.171578+02:00 pve zed: eid=54 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722041745408 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1444633
2023-08-13T13:07:39.171673+02:00 pve zed: eid=55 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722041753600 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1444634
2023-08-13T13:07:39.171829+02:00 pve kernel: [ 656.027934] nvme1n1: I/O Cmd(0x2) @ LBA 1411288200, 256 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR
2023-08-13T13:07:39.171835+02:00 pve kernel: [ 656.027944] critical medium error, dev nvme1n1, sector 1411288200 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 2
2023-08-13T13:07:39.172145+02:00 pve zed: eid=56 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722041704448 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1444628
2023-08-13T13:07:39.172202+02:00 pve zed: eid=57 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722041696256 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1444627
2023-08-13T13:07:39.172528+02:00 pve zed: eid=58 class=io pool='rpool' vdev=nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3 size=8192 offset=722041679872 priority=4 err=61 flags=0x3808b0 bookmark=172:1:0:1444625
I had a bad feeling about this and, sadly, after checking zpool status I was greeted by this:
Code:
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 1.12M in 00:12:57 with 191 errors on Sun Aug 13 13:16:06 2023
config:

        NAME                                                    STATE     READ WRITE CKSUM
        rpool                                                   DEGRADED     0     0     0
          mirror-0                                              DEGRADED 2.77K     0     0
            nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510655P-part3  DEGRADED 2.98K     0   491  too many errors
            nvme-Samsung_SSD_980_PRO_2TB_S69ENF0R510639Y-part3  FAULTED     75     0     9  too many errors

errors: Permanent errors have been detected in the following files:

        rpool/data/vm-110-disk-0:<0x1>
        rpool/data/vm-105-disk-0:<0x1>
        <0x70>:<0x7e43f>
        <0x70>:<0x7e4c3>
        <0x70>:<0x7aac5>
        <0x70>:<0x7e4d5>
        /rpool/data/subvol-109-disk-0/root/world/region/r.-16.4.mca
The scrub ("scrub repaired 1.12M in 00:12:57 with 191 errors on Sun Aug 13 13:16:06 2023") was started manually by me. Apparently I was stupid and had only set up scrubbing and mail notifications for my HDD-pool.
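For completeness, this is roughly how I intend to cover rpool as well; the schedule and mail address below are just placeholders:
Code:
# Debian/Proxmox should already scrub all healthy pools monthly via
# /etc/cron.d/zfsutils-linux; an explicit extra entry could look like this:
# 0 3 * * 0   root   /usr/sbin/zpool scrub rpool

# mail notifications come from ZED; in /etc/zfs/zed.d/zed.rc set:
ZED_EMAIL_ADDR="admin@example.com"   # placeholder address
ZED_NOTIFY_VERBOSE=1                 # also mail on successful scrubs
# then restart the daemon: systemctl restart zfs-zed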
About a month ago I made some changes to my server:
- Upgraded to PVE 8
- Installed the Intel A380
- Upgraded the BIOS
- Enabled Resizable BAR
- Enabled SR-IOV
Both SSDs' SMART values look fine; they show only 2% and 3% wear, with a 24TB difference in total writes (109TB vs. 133TB). (I bought them 2 years ago.)
I think the difference in TBW indicates that the problem has existed for longer than the changes I made a month ago.
Memtest ran fine multiple times.
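For reference, this is how I checked the drives (device names are whatever they happen to be on my box):
Code:
# SMART/health summary for each NVMe
smartctl -a /dev/nvme0
smartctl -a /dev/nvme1

# or with nvme-cli; "Media and Data Integrity Errors" should correspond
# to the critical medium errors showing up in syslog
nvme smart-log /dev/nvme0
nvme smart-log /dev/nvme1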
Right now everything still works fine.
Is there any way to salvage this?
As I see it, there is already data loss, and replacing one drive after the other and resilvering would not give me a clean pool again.
I could ditch both VMs 105 and 110 and delete the file in LXC 109, as there is no important data in them.
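If it comes to that, the sequence I have in mind is roughly this (assuming 105/110 and the file in CT 109 really are expendable, and that no snapshots still reference the bad blocks):
Code:
# remove the affected guests and the damaged file
qm destroy 105
qm destroy 110
rm /rpool/data/subvol-109-disk-0/root/world/region/r.-16.4.mca

# clear the error counters and let ZFS re-check the pool;
# as far as I know the permanent-error list may only empty after one or two scrubs
zpool clear rpool
zpool scrub rpool
zpool status -v rpool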
But I do not understand what these errors mean:
Code:
<0x70>:<0x7e43f>
<0x70>:<0x7e4c3>
<0x70>:<0x7aac5>
<0x70>:<0x7e4d5>
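From what I understand these are dataset-ID:object-ID pairs that ZFS can no longer resolve to a path (e.g. objects in snapshots or already-deleted datasets); something like this should reveal which dataset 0x70 (= 112 decimal) is, though I have not tried it yet:
Code:
# list datasets with their IDs and look for ID 112 (0x70)
zdb -d rpool | grep -w "ID 112"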
I backed up all the VMs and LXCs I care about to my HDD-pool via the Proxmox Backup Server I mentioned earlier.
There is no encryption on my pools or my backups.
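Before wiping anything I would also verify those backups on the PBS side; if I read the docs right, something like this starts a one-off verification of the whole datastore (the datastore name is a placeholder):
Code:
# inside the PBS LXC, verify all snapshots in the datastore
proxmox-backup-manager verify hddpool-datastore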
The plan I thought of is:
- Back up VMs/LXCs again
- Back up PVE configs to the PBS LXC
- Remove the old SSDs
- Install Proxmox on the new SSDs
- Import the HDD-pool
- Create a PBS LXC with the same config/mount point as before (see the sketch below)
- Restore PVE
- Restore VMs/LXCs
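For the bind mount step, I assume recreating it comes down to something like this (container ID and paths are placeholders for my setup):
Code:
# bind-mount the datastore directory on the HDD-pool into the new PBS container
pct set <pbs-vmid> -mp0 /<hdd-pool>/pbs-datastore,mp=/mnt/datastore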
What paths are needed for a PVE host backup?
Right now I would back up:
- /etc
- /var
- /root
- /opt
- /boot
- installed apt packages and repos
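Concretely, I was thinking of something like this (the repository string is a placeholder, and the path list is exactly what I am unsure about):
Code:
# package state (the repos themselves live under /etc/apt, which /etc already covers)
dpkg --get-selections > /root/pkg.selections
apt-mark showmanual > /root/pkg.manual

# host-type backup of the listed paths to the PBS datastore
proxmox-backup-client backup \
    etc.pxar:/etc var.pxar:/var root.pxar:/root opt.pxar:/opt boot.pxar:/boot \
    --repository root@pam@localhost:hddpool-datastore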
Will this work or am I missing something?
Or is this idea stupid, and is there a better/easier way?
Thanks in advance!