Utterly confused trying to fix my Proxmox install

jolly · Mar 2, 2024

I'm quite confused trying to recover my Proxmox install, and would greatly appreciate any assistance.
The server is mostly unresponsive.
I've reinstalled Proxmox (to a new, different SSD), which worked fine, but if I try restoring config.db and reboot, it immediately gives the same sort of errors and is unusable.
I've tried checking the disks offline for errors; as far as I can tell there are none (both using TrueNAS and in Windows using StableBit Scanner).
I've switched SATA ports/cables.
Ran full memtests, and the machine has ECC memory.

Feb 27 - wake up, VMs are offline, reboot. This fixes the problem, but it goes down again and I've been unable to get it up and going since.
The same machine was hosting a TrueNAS VM (on raw disks), and I can boot into that fine and access my Proxmox files. I've copied over everything in /var/log and /etc/ (but that doesn't include everything that gets sourced from the config.db file).

I did seem to get low on space on feb 14, but journalctl --vacuum-time=1h seems to have fixed that.

SSHing into the server gives:
-bash: /etc/profile: Input/output error
-bash: /root/.profile: Input/output error

Syslog includes errors such as:

Code:

192.168.40.100    Mar  1 14:03:03    pve    kern    err    kernel    [   49.132109] Buffer I/O error on dev dm-1, logical block 6883811, lost async page write
192.168.40.100    Mar  1 14:03:03    pve    syslog    err    rsyslogd    file '/var/log/syslog'[9] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: Read-only file system [v8.2102.0 try https://www.rsyslog.com/e/2027 ]
192.168.40.100    Mar  1 14:03:03    pve    syslog    err    rsyslogd    action 'action-2-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2102.0 try https://www.rsyslog.com/e/2027 ]
192.168.40.100    Mar  1 14:03:03    pve    syslog    err    rsyslogd    file '/var/log/syslog'[9] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: Read-only file system [v8.2102.0 try https://www.rsyslog.com/e/2027 ]
192.168.40.100    Mar  1 14:03:03    pve    syslog    err    rsyslogd    action 'action-2-builtin:omfile' (module 'builtin:o

I believe dm-1 = pve-root ie my proxmox boot drive.
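One way to confirm that guess, sketched as a small shell helper: the kernel exposes each device-mapper node's name under /sys/block/dm-N/dm/name. The function name dm_name and the SYSFS_ROOT override are made up for this example (the override only exists so the helper can be tried against a fake directory tree).

```shell
# Resolve a kernel device-mapper node (e.g. dm-1) to its mapper name.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
dm_name() {
    dm_name_file="${SYSFS_ROOT:-/sys}/block/$1/dm/name"
    if [ -r "$dm_name_file" ]; then
        cat "$dm_name_file"
    else
        # No such node, or not a device-mapper device.
        echo "unknown"
        return 1
    fi
}
```

On the affected host, `dm_name dm-1` should print pve-root if the guess above is right; `ls -l /dev/mapper` shows the same mapping as symlinks.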

Other weirdness from digging through old log files: issues with sda/sdb and /dev/md127.
These should be two paired large SATA drives running a ZFS pool, tank. They seem fine within TrueNAS, but not according to some of the logs in Proxmox.
Within ZFS:

I got an automated message a few days ago:

Code:

"This is an automatically generated mail message from mdadm
running on pve

A Fail event had been detected on md device /dev/md127.

It could be related to component device /dev/sda1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md127 : active (auto-read-only) raid1 sdb1[1] sda1[0](F)
      2095040 blocks super 1.2 [2/1] [_U]

unused devices: <none>"
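The (F) flag in that mail can be picked out mechanically. A minimal sketch, assuming the usual /proc/mdstat layout shown above; the function name md_failed_members is hypothetical.

```shell
# Print "<array> <member>" for every member flagged (F) in an mdstat-style
# file. Reads /proc/mdstat unless another path is given as $1.
md_failed_members() {
    awk '
        $1 ~ /^md[0-9]+$/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    member = $i
                    # Strip the "[slot](F)" suffix, leaving the device name.
                    sub(/\[[0-9]+\]\(F\)$/, "", member)
                    print $1, member
                }
            }
        }
    ' "${1:-/proc/mdstat}"
}
```

Run against the mdstat text from the mail, this prints `md127 sda1`, i.e. the array is running on sdb1 only.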

Looking through related log files:

Code:

"Feb 24 13:36:20 pve udisksd[3144]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda1/block symlink
Feb 24 13:36:20 pve udisksd[3144]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda1/block symlink
Feb 24 13:36:20 pve udisksd[3144]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sdb1/block symlink
Feb 24 13:36:20 pve udisksd[3144]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sdb1/block symlink
Feb 24 13:36:20 pve udisksd[3144]: Error reading sysfs attr `/sys/devices/virtual/block/md127/md/degraded': Failed to open file “/sys/devices/virtual/block/md127/md/degraded”: No such file or directory (g-file-error-quark, 4)
Feb 24 13:36:20 pve udisksd[3144]: Error reading sysfs attr `/sys/devices/virtual/block/md127/md/sync_action': Failed to open file “/sys/devices/virtual/block/md127/md/sync_action”: No such file or directory (g-file-error-quark, 4)
Feb 24 13:36:20 pve udisksd[3144]: Error reading sysfs attr `/sys/devices/virtual/block/md127/md/sync_completed': Failed to open file “/sys/devices/virtual/block/md127/md/sync_completed”: No such file or directory (g-file-error-quark, 4)
Feb 24 13:36:20 pve udisksd[3144]: Error reading sysfs attr `/sys/devices/virtual/block/md127/md/bitmap/location': Failed to open file “/sys/devices/virtual/block/md127/md/bitmap/location”: No such file or directory (g-file-error-quark, 4)

 MESSAGE=Registering new address record for fe80::42:97ff:fea0:dc69 on br-57d53181c146.*.
 _SOURCE_REALTIME_TIMESTAMP=1708788916590128
 _SOURCE_MONOTONIC_TIMESTAMP=18873332
 MESSAGE=ata1.00: disabled
 _SOURCE_MONOTONIC_TIMESTAMP=18873882
 MESSAGE=sd 0:0:0:0: [sda] Synchronizing SCSI cache
 _SOURCE_MONOTONIC_TIMESTAMP=18873903
 MESSAGE=sd 0:0:0:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
 _SOURCE_MONOTONIC_TIMESTAMP=18873906
 MESSAGE=sd 0:0:0:0: [sda] Stopping disk
 _SOURCE_MONOTONIC_TIMESTAMP=18873911
 MESSAGE=sd 0:0:0:0: [sda] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
 _SOURCE_MONOTONIC_TIMESTAMP=18879282
 PRIORITY=2
 MESSAGE=md/raid1:md127: Disk failure on sda1, disabling device.
 md/raid1:md127: Operation continuing on 1 devices.
 MESSAGE=Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda1/block symlink
 CODE_FUNC=udisks_linux_mdraid_update
 CODE_FILE=udiskslinuxmdraid.c:444
 _SOURCE_REALTIME_TIMESTAMP=1708788916666222
 _SOURCE_MONOTONIC_TIMESTAMP=18904084
 MESSAGE=ata2.00: disabled
 _SOURCE_MONOTONIC_TIMESTAMP=18904898
 MESSAGE=sd 1:0:0:0: [sdb] Synchronizing SCSI cache
 _SOURCE_MONOTONIC_TIMESTAMP=18904921
 MESSAGE=sd 1:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
 _SOURCE_MONOTONIC_TIMESTAMP=18904926
 MESSAGE=sd 1:0:0:0: [sdb] Stopping disk
 _SOURCE_MONOTONIC_TIMESTAMP=18904932
 MESSAGE=sd 1:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
 _SOURCE_MONOTONIC_TIMESTAMP=18934191
 MESSAGE=md127: detected capacity change from 4190080 to 0
 _SOURCE_MONOTONIC_TIMESTAMP=18934197
 MESSAGE=md: md127 stopped.
 _SOURCE_MONOTONIC_TIMESTAMP=18956063
 MESSAGE=ata3.00: disabled
 _SOURCE_MONOTONIC_TIMESTAMP=18956802
 MESSAGE=sd 2:0:0:0: [sdc] Synchronizing SCSI cache
 _SOURCE_MONOTONIC_TIMESTAMP=18956835
 MESSAGE=sd 2:0:0:0: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
 _SOURCE_MONOTONIC_TIMESTAMP=18956842
 MESSAGE=sd 2:0:0:0: [sdc] Stopping disk
 _SOURCE_MONOTONIC_TIMESTAMP=18956853
 MESSAGE=sd 2:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
 _SOURCE_MONOTONIC_TIMESTAMP=19064106
 MESSAGE=ata4.00: disabled
 _SOURCE_MONOTONIC_TIMESTAMP=19064712
 MESSAGE=sd 3:0:0:0: [sdd] Synchronizing SCSI cache
 _SOURCE_MONOTONIC_TIMESTAMP=19064743
 MESSAGE=sd 3:0:0:0: [sdd] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
 _SOURCE_MONOTONIC_TIMESTAMP=19064748
 MESSAGE=sd 3:0:0:0: [sdd] Stopping disk
 _SOURCE_MONOTONIC_TIMESTAMP=19064755
 MESSAGE=sd 3:0:0:0: [sdd] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
 SYSLOG_PID=5242
 MESSAGE=AA8981A7128: uid=0 from=<root>
 _PID=5242
 _SOURCE_REALTIME_TIMESTAMP=1708788916698606
 _SOURCE_REALTIME_TIMESTAMP=1708788916666734
 SYSLOG_PID=5755
 MESSAGE=AA8981A7128: message-id=<20240224153516.AA8981A7128@pve.lan>
 _PID=5755
 _SOURCE_REALTIME_TIMESTAMP=1708788916702785
 MESSAGE=Unable to resolve /sys/devices/virtual/block/md127/md/dev-sdb1/block symlink
 _SOURCE_REALTIME_TIMESTAMP=1708788916667357
 SYSLOG_PID=5243
 MESSAGE=AA8981A7128: from=<root@pve.lan>, size=798, nrcpt=1 (queue active)
 _PID=5243
 _SOURCE_REALTIME_TIMESTAMP=1708788916743565
 _SOURCE_REALTIME_TIMESTAMP=1708788916694276
 SYSLOG_PID=5772
 _PID=5772
 _SOURCE_REALTIME_TIMESTAMP=1708788916748046
 _SOURCE_REALTIME_TIMESTAMP=1708788916694853
 MESSAGE=C27331A712E: uid=65534 from=<root>
 _SOURCE_REALTIME_TIMESTAMP=1708788916796528
 MESSAGE=Error reading sysfs attr `/sys/devices/virtual/block/md127/md/degraded': Failed to open file “/sys/devices/virtual/block/md127/md/degraded”: No such file or directory (g-file-error-quark, 4)
 CODE_FUNC=read_sysfs_attr
 CODE_FILE=udiskslinuxmdraidhelpers.c:59
 _SOURCE_REALTIME_TIMESTAMP=1708788916771064
 MESSAGE=C27331A712E: message-id=<20240224153516.AA8981A7128@pve.lan>
 _SOURCE_REALTIME_TIMESTAMP=1708788916796677
 MESSAGE=Error reading sysfs attr `/sys/devices/virtual/block/md127/md/sync_action': Failed to open file “/sys/devices/virtual/block/md127/md/sync_action”: No such file or directory (g-file-error-quark, 4)
 _SOURCE_REALTIME_TIMESTAMP=1708788916771078
 SYSLOG_PID=5771
 MESSAGE=AA8981A7128: to=<root@pve.lan>, orig_to=<root>, relay=local, delay=0.15, delays=0.09/0/0/0.05, dsn=2.0.0, status=sent (delivered to command: /usr/bin/proxmox-mail-forward)
 _PID=5771
 _SOURCE_REALTIME_TIMESTAMP=1708788916797164
 MESSAGE=Error reading sysfs attr `/sys/devices/virtual/block/md127/md/sync_completed': Failed to open file “/sys/devices/virtual/block/md127/md/sync_completed”: No such file or directory (g-file-error-quark, 4)
 _SOURCE_REALTIME_TIMESTAMP=1708788916771089
 MESSAGE=AA8981A7128: removed
 _SOURCE_REALTIME_TIMESTAMP=1708788916797283
 MESSAGE=Error reading sysfs attr `/sys/devices/virtual/block/md127/md/bitmap/location': Failed to open file “/sys/devices/virtual/block/md127/md/bitmap/location”: No such file or directory (g-file-error-quark, 4)
 _SOURCE_REALTIME_TIMESTAMP=1708788916771098
 MESSAGE=C27331A712E: from=<root@pve.lan>, size=954, nrcpt=1 (queue active)
 _SOURCE_REALTIME_TIMESTAMP=1708788916826438
 MESSAGE=time="2024-02-24T11:35:17.012577062-04:00" level=warning msg="WARNING: bridge-nf-call-iptables is disabled"
 MESSAGE=time="2024-02-24T11:35:17.012600646-04:00" level=warning msg="WARNING: bridge-nf-call-ip6tables is disabled"
 MESSAGE=time="2024-02-24T11:35:17.012622357-04:00" level=info msg="Docker daemon" commit=f417435 containerd-snapshotter=false storage-driver=overlay2 version=25.0.3
 MESSAGE=time="2024-02-24T11:35:17.013904561-04:00" level=info msg="Daemon has completed initialization"
 MESSAGE=time="2024-02-24T11:35:17.052552377-04:00" level=info msg="API listen on /run/docker.sock"
 MESSAGE=Started Docker Application Container Engine.
 _SOURCE_REALTIME_TIMESTAMP=1708788917052655"



MESSAGE=Device: /dev/sdb [SAT], opened
 SYSLOG_RAW=<30>Feb 27 16:40:52 smartd[3123]: Device: /dev/sdb [SAT], opened

 _SOURCE_REALTIME_TIMESTAMP=1709066452183733
 MESSAGE=Device: /dev/sdb [SAT], WDC WD140EDGZ-11B2DA2, S/N:2BG59KYE, WWN:5-000cca-295c269d0, FW:85.00A85, 14.0 TB
 SYSLOG_RAW=<30>Feb 27 16:40:52 smartd[3123]: Device: /dev/sdb [SAT], WDC WD140EDGZ-11B2DA2, S/N:2BG59KYE, WWN:5-000cca-295c269d0, FW:85.00A85, 14.0 TB

 _SOURCE_REALTIME_TIMESTAMP=1709066452183881
 MESSAGE=Device: /dev/sdb [SAT], not found in smartd database.
 SYSLOG_RAW=<30>Feb 27 16:40:52 smartd[3123]: Device: /dev/sdb [SAT], not found in smartd database.

 _SOURCE_REALTIME_TIMESTAMP=1709066452188172
 MESSAGE=Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
 SYSLOG_RAW=<30>Feb 27 16:40:52 smartd[3123]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.

 _SOURCE_REALTIME_TIMESTAMP=1709066452192352
 MESSAGE=Device: /dev/sdb [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD140EDGZ_11B2DA2-2BG59KYE.ata.state
 SYSLOG_RAW=<30>Feb 27 16:40:52 smartd[3123]: Device: /dev/sdb [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD140EDGZ_11B2DA2-2BG59KYE.ata.state
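A side note on reading these dumps: the _SOURCE_REALTIME_TIMESTAMP fields are microseconds since the Unix epoch, so they can be lined up with the syslog-style timestamps. A small sketch, assuming GNU date (for the -d @SECONDS form); the helper name journal_usec_to_utc is made up.

```shell
# Convert a journal _SOURCE_REALTIME_TIMESTAMP (microseconds since the epoch)
# to a UTC wall-clock string. Requires GNU date for "-d @SECONDS".
journal_usec_to_utc() {
    date -u -d "@$(( $1 / 1000000 ))" '+%Y-%m-%d %H:%M:%S'
}

# Example:
#   journal_usec_to_utc 1708788916590128   # -> 2024-02-24 15:35:16
```

That result is UTC, which lines up with the 11:35 -04:00 Docker lines in the same dump. Binary journal files copied from the old install can also be read directly with journalctl --directory=DIR instead of being viewed as text.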

Dunuin · Mar 3, 2024

jolly said:
Other weirdness digging through old log files: DEV sda/sdb /dev/md127 issues:
This should be two paired large sata drives, running a zfs pool tank. They seem fine within truenas, but not looking through some of the logs in proxmox???
Within ZFS:

I got an automated message a few days ago:

Code:

"This is an automatically generated mail message from mdadm running on pve A Fail event had been detected on md device /dev/md127. It could be related to component device /dev/sda1. Faithfully yours, etc. P.S. The /proc/mdstat file currently contains the following: Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] md127 : active (auto-read-only) raid1 sdb1[1] sda1[0](F) 2095040 blocks super 1.2 [2/1] [_U] unused devices: <none>"

You are mixing ZFS and mdadm here. Or did you create a raid1 using mdadm and ZFS as a "single disk" on top of that mdadm raid array?

jolly · Mar 3, 2024

Dunuin said:
You are mixing ZFS and mdadm here. Or did you create a raid1 using mdadm and ZFS as a "single disk" on top of that mdadm raid array?

I believe I passed the two disks through to TrueNAS, and then had TrueNAS create the storage pool, which is why it says 1x mirror 2 wide. I don't think I ever created a Linux mdadm RAID array, which is why I was surprised to get that e-mail. Unless that's just how Proxmox interprets the TrueNAS pool?

And Truenas can boot fine if I change my bios boot order to directly boot into Truenas, it's the Proxmox install that I'm really struggling with.

Dunuin · Mar 3, 2024

jolly said:
Unless that's just how Proxmox interprets the TrueNAS pool?

No. PVE doesn't officially support mdadm. It's either some custom setup from your hoster, something you set up yourself, or an onboard RAID implementation that uses mdadm.

emunt6 · Mar 3, 2024

Hi!

You installed a new system on the SSD, and you have 2x SATA disks you want to pass through to the VM (FreeNAS).
Errors like this:

Code:

-bash: /etc/profile: Input/output error
-bash: /root/.profile: Input/output error

mean "no space left on the disk" or "disk read/write error".
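To help tell those cases apart, it is worth checking whether the root filesystem has been remounted read-only, which is what the "Read-only file system" lines in the syslog point to. A rough sketch that reads a /proc/mounts-style table; the function name mount_is_readonly is hypothetical, and the sample mount options in the comment are an assumption about a typical ext4 root, not taken from this system.

```shell
# Succeed (exit 0) if the given mount point is mounted read-only according
# to a /proc/mounts-style table: device, mount point, type, options, ...
# Example line (assumed):
#   /dev/mapper/pve-root / ext4 ro,relatime,errors=remount-ro 0 0
mount_is_readonly() {
    awk -v mp="$1" '
        $2 == mp {
            ro = 0
            n = split($4, opts, ",")
            # Compare whole options so "errors=remount-ro" does not count.
            for (i = 1; i <= n; i++) if (opts[i] == "ro") ro = 1
        }
        END { exit (ro ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}
```

Usage: `mount_is_readonly / && echo "root is read-only"`. A read-only root together with kernel "Buffer I/O error" lines would point towards write failures rather than a full disk; `df -h /` covers the out-of-space case.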

First problem, the Proxmox OS: when you create the MDADM arrays, you need to specify "--metadata 1.0"
(https://raid.wiki.kernel.org/index.php/RAID_superblock_formats)

Code:

Sub-Version     Superblock Position on Device
0.9     At the end of the device
1.0     At the end of the device
1.1     At the beginning of the device
1.2     4K from the beginning of the device

because you are using the SSD for booting: the bootloader will overwrite the mdadm metadata, thus corrupting the array/filesystem.
Check your MDADM arrays for "version":

Code:

$> mdadm --detail --scan

If you are not using metadata 0.9 or 1.0, you need to recreate the MDADM arrays (reinstalling the base system):

Code:

$> mdadm --create /dev/md125 --name="PROX:0" --assume-clean --verbose --metadata 1.0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
$> mdadm --create /dev/md126 --name="PROX:1" --assume-clean --verbose --metadata 1.0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

Second problem: the ZFS module in the Proxmox OS imports the SATA disks too (= parallel write problem). You need to prevent this (assuming you don't want to use ZFS on the Proxmox host, only inside the VM).

Code:

/etc/modprobe.d/zfs.conf
blacklist zfs

Code:

$> update-initramfs -u -k all
$> update-grub
$> reboot

jolly · Mar 3, 2024

Hi, thanks again for your help.
The server had been running for two years without incident.

The core proxmox install consists of:
1 128gig SSD for proxmox to boot from - Toshiba. I believe this is the md127 device, I think I used to have it in a raid config. If memory is right, it was raided but the second device was using a usb adapter which failed. (this was two years ago) This must have been done from the bios, as I'm super comfortable in linux.

1 WD nvme 2tb SSD running ZFS which Proxmox VM's are run off. "ZFSWD"

In addition we have:
2x128gig disks passed through to the TrueNas vm. 1xlexar, 1xinland professional(SATA_SSD). zfs name is boot-pool
2x14tb spinning disks also passed through to the VM. zfs name is tank

Trying your command:

Code:

root@pve:~# mdadm --detail  --scan
-bash: mdadm: command not found
root@pve:~#

I can access the zfs web/console interface fine, unless I try restoring the config.db

If I do try restoring config.db, that's when I end up with the I/O errors, and the console/web interface is inaccessible.

I've looked through the config.db, and I can't see anything that would cause issues; it just seems like my VM configs.

Code:

From the config.db file
config:
# ls -al /sys/block/sd* - list drives
#
#ls pci
#
#Bios%3A
#1 -
#2 16tb         sdae
#3
#4 Tosh 128   sdaf
#5 Lexar   sdac
#6
#7 14tb wd sdaa
#8 14tb   sdab
#
#
#
#
#2/3 on pci 8
#
#5-8 on 9
#
#
#
#zpool list - lists pools, how much is used/available.
#
#
#
#
#
#
#
#
#PCI\VEN_10DE&DEV_2531&SUBSYS_151D103C&REV_A1\4&1ebe8d07&0&00C0 = a2000
#
#Ven 1002 Dev 7422
#
#
##sata1%3A ZFSWD%3Avm-103-disk-1,size=700G,ssd=1
##hostpci0%3A 0000%3A0a%3A00,pcie=1,romfile=1030_GP108_patched.rom
#agent%3A 1
#
#
#
#hostpci0%3A 0000%3A0a%3A00,pcie=1,romfile=1030_GP108_patched.rom
#
#
#
#
#root@pve%3A~# zpool list
#NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
#ZFSWD  1.81T   852G  1004G        -         -    33%    45%  1.00x    ONLINE  -
#
#root@pve%3A~# zfs get all ZFSWD | grep used
#ZFSWD  used                  1.73T                  -
#ZFSWD  usedbysnapshots       0B                     -
#ZFSWD  usedbydataset         96K                    -
#ZFSWD  usedbychildren        1.73T                  -
#ZFSWD  usedbyrefreservation  0B                     -
#ZFSWD  logicalused           847G                   -
#
#
#
#root@pve%3A~# zfs get all | grep vm | grep -i referenced
#ZFSWD/vm-101-disk-0             referenced            35.8G                                    -
#ZFSWD/vm-101-disk-0             logicalreferenced     35.6G                                    -
#ZFSWD/vm-103-disk-0             referenced            102G                                     -
#ZFSWD/vm-103-disk-0             logicalreferenced     101G                                     -
#ZFSWD/vm-103-disk-1             referenced            591G                                     -
#ZFSWD/vm-103-disk-1             logicalreferenced     588G                                     -
#ZFSWD/vm-105-disk-0             referenced            25.1G                                    -
#ZFSWD/vm-105-disk-0             logicalreferenced     25.0G                                    -
#ZFSWD/vm-106-disk-0             referenced            33.8G                                    -
#ZFSWD/vm-106-disk-0             logicalreferenced     33.6G                                    -
#ZFSWD/vm-106-disk-0@base        referenced            15.5G                                    -
#ZFSWD/vm-106-disk-0@base        logicalreferenced     15.4G                                    -
#ZFSWD/vm-106-disk-0@automating  referenced            18.6G                                    -
#ZFSWD/vm-106-disk-0@automating  logicalreferenced     18.5G                                    -
#ZFSWD/vm-108-disk-0             referenced            21.9G                                    -
#ZFSWD/vm-108-disk-0             logicalreferenced     21.8G                                    -
#ZFSWD/vm-109-disk-1             referenced            192K                                     -
#ZFSWD/vm-109-disk-1             logicalreferenced     160K                                     -
#ZFSWD/vm-110-disk-0             referenced            18.5G                                    -
#ZFSWD/vm-110-disk-0             logicalreferenced     18.4G                                    -
#
#
#
#root@pve%3A~# pvesm status
#Name               Type     Status           Total            Used       Available        %
#ZFSWD           zfspool     active      1885861344      1855228324        30633020   98.38%
#dappnode-vg         lvm     active      1952485376      1952485376               0  100.00%
#local               dir     active        30316484        24514304         4239148   80.86%
#local-lvm       lvmthin     active        67620864               0        67620864    0.00%
#
#
#root@pve%3A~# zfs set refreservation=0G ZFSWD/base-100-disk-0
#root@pve%3A~# zfs get all | grep usedbyref
#ZFSWD                           usedbyrefreservation  0B                                       -
#ZFSWD/base-100-disk-0           usedbyrefreservation  0B                                       -
#ZFSWD/base-102-disk-0           usedbyrefreservation  0B                                       -
#ZFSWD/vm-101-disk-0             usedbyrefreservation  0B                                       -
#ZFSWD/vm-103-disk-0             usedbyrefreservation  0B                                       -
#ZFSWD/vm-103-disk-1             usedbyrefreservation  79.2G                                    -
#ZFSWD/vm-105-disk-0             usedbyrefreservation  0B                                       -
#ZFSWD/vm-106-disk-0             usedbyrefreservation  0B                                       -
#ZFSWD/vm-108-disk-0             usedbyrefreservation  242G                                     -
#ZFSWD/vm-109-disk-1             usedbyrefreservation  2.81M                                    -
#ZFSWD/vm-110-disk-0             usedbyrefreservation  80.5G
#
#zfs set refreservation=0G ZFSWD/vm-103-disk-1
#zfs set refreservation=0G ZFSWD/vm-108-disk-0
#zfs set refreservation=0G ZFSWD/vm-109-disk-1
#zfs set refreservation=0G ZFSWD/vm-110-disk-0
#
#
#root@pve%3A~# pvesm status
#Name               Type     Status           Total            Used       Available        %
#ZFSWD           zfspool     active      1885863384       893241056       992622328   47.37%
#dappnode-vg         lvm     active      1952485376      1952485376               0  100.00%
#local               dir     active        30316484        24515884         4237568   80.87%
#local-lvm       lvmthin     active        67620864               0        67620864    0.00%
#root@pve%3A~#
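For reference, the % column in pvesm status is Used divided by Total, which makes the effect of clearing the refreservations easy to check. A tiny sketch (pct_used is a made-up helper name):

```shell
# Percentage of a storage that is in use, to two decimals.
# Arguments: used and total, in the same unit as printed by pvesm status.
pct_used() {
    awk -v used="$1" -v total="$2" 'BEGIN { printf "%.2f\n", used * 100 / total }'
}

# Example with the ZFSWD rows from the notes above:
#   pct_used 1855228324 1885861344   # -> 98.38 (before)
#   pct_used 893241056 1885863384    # -> 47.37 (after refreservation=0G)
```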

Code:

storage.cfg:
dir: local
    path /var/lib/vz
    content vztmpl,backup,iso

lvmthin: local-lvm
    thinpool data
    vgname pve
    content rootdir,images

zfspool: ZFSWD
    pool ZFSWD
    content rootdir,images
    mountpoint /ZFSWD
    nodes pve
    sparse 1

lvm: dappnode-vg
    vgname dappnode-vg
    content images,rootdir
    shared 0

emunt6 · Mar 4, 2024

The ZFS error messages already told you: Proxmox is trying to import the ZFS disks, but the VM (FreeNAS) uses different ZFS specs, so you need to prevent the import (I already described how).

First, you need to recover/repair the Proxmox base system; unplug/remove every other disk.

How was your RAID created?
1. In the BIOS (fakeraid / dmraid)
2. With the Debian Linux installer (mdadm)

Fakeraid is never recommended; it is impossible to properly "repair". Read about it on Google.

I suggest reinstalling Proxmox without fakeraid, so change the disk to "simple disk" in the BIOS. (Later, if you want RAID on the Proxmox SSD, you can do it with mdadm.)
Before doing that, you need some files from the Proxmox disk (as I read your post, you don't store any VM data on the Proxmox SSD).

Once you have booted up, save the files you need from the Proxmox SSD (to a USB pen drive).
For the new PVE you need the (.conf) files from:

Code:

/etc/pve/qemu-server/    - contains the configuration of the VMs; since you don't have a cluster, you don't need other files from the PVE folder

So once you have saved all the necessary files from the SSD, including the config files, you need to reinstall the Proxmox base system.
Once the Proxmox OS is running without errors, you can copy back the .conf files, add the disks again, add the ZFS blacklist, reboot, done.

jolly · Mar 4, 2024

emunt6 said:
The ZFS error messages already told you: Proxmox is trying to import the ZFS disks, but the VM (FreeNAS) uses different ZFS specs, so you need to prevent the import (I already described how).

I can't blacklist ZFS as my VM drive "ZFSWD" uses ZFS.
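Since blacklisting the module is not an option here, a narrower check is to see which pools the host has actually imported and flag anything other than ZFSWD. A minimal sketch; unexpected_pools is a hypothetical helper, fed by the real command zpool list -H -o name.

```shell
# Read imported pool names (one per line) on stdin and print every pool that
# is not in the allow-list passed as arguments.
unexpected_pools() {
    while IFS= read -r pool; do
        [ -n "$pool" ] || continue
        expected=0
        for allowed in "$@"; do
            if [ "$pool" = "$allowed" ]; then
                expected=1
            fi
        done
        if [ "$expected" -eq 0 ]; then
            printf '%s\n' "$pool"
        fi
    done
}

# Usage on the host (only ZFSWD is expected to be imported there):
#   zpool list -H -o name | unexpected_pools ZFSWD
```

If tank or boot-pool shows up in the output, the host has imported a pool that is meant to belong to the TrueNAS VM.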

emunt6 said:
First, you need to recover/repair the Proxmox base system; unplug/remove every other disk.

How was your RAID created?
1. In the BIOS (fakeraid / dmraid)
2. With the Debian Linux installer (mdadm)

There is no RAID as far as I can tell; I'm not sure where it's picking this up from.
The BIOS says AHCI, not RAID mode.

The new Proxmox install definitely did not have any RAID set up.

I can't copy /etc/pve/qemu-server directly, since that only exists when proxmox is running. I can export them out of the config.db file though.
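A hedged sketch of that export, assuming the sqlite3 command-line tool is installed and assuming config.db keeps its files in a table called tree with name and data columns (worth confirming with .schema on a copy first). The function name export_vm_conf is made up.

```shell
# Print the stored contents of one VM config (e.g. 101.conf) from a COPY of
# config.db. ASSUMPTION: a table "tree" with columns "name" and "data";
# check with: sqlite3 config.db .schema
export_vm_conf() {
    db="$1"
    conf="$2"
    # Accept only names made of digits, lowercase letters and dots
    # (such as 101.conf) so nothing unexpected reaches the SQL query.
    case "$conf" in
        '' | *[!0-9a-z.]*)
            echo "bad config name: $conf" >&2
            return 2
            ;;
    esac
    sqlite3 -readonly "$db" "SELECT data FROM tree WHERE name = '$conf';"
}

# Usage (the path is a placeholder for wherever the copy lives):
#   export_vm_conf /root/config.db.copy 101.conf > 101.conf
```

Working on a copy and opening it read-only keeps the original database untouched while the configs are pulled out one by one.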

jolly · Mar 4, 2024

If I connect only the OS disk and the ZFSWD disk, copy /etc/pve/qemu-server/ over, and do a zpool import -f ZFSWD, that sort of works, but I get:

192.168.40.100 Mar 3 21:45:38 pve daemon err pvedaemon[1890] storage 'ZFSWD' does not exist
192.168.40.100 Mar 3 21:45:38 pve daemon err pvedaemon[1770] <root@pam> end task UPID:pve:00000762:000012B4:65E527C2:qmstart:101:root@pam: storage 'ZFSWD' does not exist

If I then try restoring storage.cfg it goes back to freaking out and warning about read-only file systems.
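The "does not exist" message suggests the fresh install's storage.cfg simply has no ZFSWD section yet. A small sketch to check for a storage ID without restoring the whole file; storage_defined is a hypothetical helper, and it assumes section headers of the form "type: id" as in the storage.cfg posted earlier.

```shell
# Succeed (exit 0) if a storage ID has a section header such as
# "zfspool: ZFSWD" in a storage.cfg-style file (default: /etc/pve/storage.cfg).
storage_defined() {
    awk -v id="$1" '
        # Section headers look like "<type>: <id>"; property lines have no
        # colon on their first word.
        $1 ~ /^[a-z]+:$/ && $2 == id { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "${2:-/etc/pve/storage.cfg}"
}

# Usage:
#   storage_defined ZFSWD || echo "ZFSWD is not defined in storage.cfg"
```

If it is missing, adding only the zfspool: ZFSWD section from the old file, rather than the whole storage.cfg, might help narrow down which entry triggers the read-only errors.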

jolly · Mar 19, 2024

To follow up: this turned out to be some weird hardware issue. If I move the same disks into a different machine, everything works perfectly.

This is despite me switching ports, cables, and power supply, and fully testing the memory.
