I'm quite confused trying to recover my Proxmox install, and would greatly appreciate any assistance.
The server is mostly unresponsive.
I've reinstalled Proxmox (to a new, different SSD), which worked fine, but if I restore config.db and reboot, it immediately gives the same sort of errors and is unusable.
I've tried checking the disks offline for errors; as far as I can tell there are none (checked both in TrueNAS and in Windows using StableBit Scanner).
I've switched SATA ports/cables.
Ran full memtests; the machine has ECC memory.
Feb 27: woke up, VMs were offline, rebooted. This fixed the problem at first, but it went down again and I've been unable to get it up and running since.
The same machine was hosting a TrueNAS VM (on raw disks), and I can boot into that fine and access my Proxmox files. I've copied over everything in /var/log and /etc/ (but that doesn't include everything that gets sourced from the config.db file).
I did seem to get low on space on Feb 14, but journalctl --vacuum-time=1h seems to have fixed that.
SSHing into the server gives:
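Since config.db is an SQLite database, one thing I could try before restoring it again is an integrity check on a copy. A minimal sketch, assuming Python is available on the machine holding the copy; the function name is my own, and the path you pass in is wherever you copied the file (on a stock install it lives at /var/lib/pve-cluster/config.db):

```python
import sqlite3


def check_sqlite_integrity(db_path):
    """Return the messages reported by SQLite's PRAGMA integrity_check.

    A healthy database yields the single message "ok".
    """
    # Open read-only through a URI so the check itself cannot modify the file.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

Calling `check_sqlite_integrity("/path/to/copy/of/config.db")` and getting anything other than `["ok"]` would point at the database itself rather than the disks. I would only run this against a copy, not the live file.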
-bash: /etc/profile: Input/output error
-bash: /root/.profile: Input/output error
Syslog includes errors such as:
Code:
192.168.40.100 Mar 1 14:03:03 pve kern err kernel [ 49.132109] Buffer I/O error on dev dm-1, logical block 6883811, lost async page write
192.168.40.100 Mar 1 14:03:03 pve syslog err rsyslogd file '/var/log/syslog'[9] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: Read-only file system [v8.2102.0 try https://www.rsyslog.com/e/2027 ]
192.168.40.100 Mar 1 14:03:03 pve syslog err rsyslogd action 'action-2-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2102.0 try https://www.rsyslog.com/e/2027 ]
192.168.40.100 Mar 1 14:03:03 pve syslog err rsyslogd file '/var/log/syslog'[9] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: Read-only file system [v8.2102.0 try https://www.rsyslog.com/e/2027 ]
192.168.40.100 Mar 1 14:03:03 pve syslog err rsyslogd action 'action-2-builtin:omfile' (module 'builtin:o
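The "Read-only file system" errors would be consistent with the root filesystem having been remounted read-only after the I/O errors. A small sketch of how that could be confirmed from the contents of /proc/mounts; the function name is hypothetical, and it only looks for the plain `ro` mount option:

```python
def read_only_mounts(mounts_text):
    """Return (device, mountpoint) pairs whose mount options include 'ro'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, _fstype, options = fields[:4]
        # Options are comma-separated; match 'ro' exactly so that
        # 'errors=remount-ro' on a read-write mount is not counted.
        if "ro" in options.split(","):
            result.append((device, mountpoint))
    return result
```

On the server this could be called as `read_only_mounts(open("/proc/mounts").read())`, assuming a shell on the box stays usable long enough to run it.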
I believe dm-1 = pve-root, i.e. my Proxmox boot drive.
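To check that guess rather than assume it, the device-mapper name behind a dm-N node can be read from sysfs. A minimal sketch, assuming the usual /sys/block/dm-N/dm/name layout; the function name and the `sysfs_root` parameter are my own additions so it can also be pointed at a copied tree:

```python
from pathlib import Path


def dm_name(dm_device, sysfs_root="/sys"):
    """Return the device-mapper name (e.g. an LVM volume) behind a dm-N node."""
    name_file = Path(sysfs_root) / "block" / dm_device / "dm" / "name"
    return name_file.read_text().strip()
```

If `dm_name("dm-1")` does not return the root volume's name, the I/O errors are on some other volume and the boot drive may not be the culprit.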
Other weirdness from digging through old log files: issues with sda/sdb and /dev/md127:
This should be two paired large SATA drives, running the ZFS pool tank. They seem fine within TrueNAS, but not according to some of the logs in Proxmox.
Within ZFS:
I got an automated message a few days ago:
Code:
"This is an automatically generated mail message from mdadm
running on pve
A Fail event had been detected on md device /dev/md127.
It could be related to component device /dev/sda1.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md127 : active (auto-read-only) raid1 sdb1[1] sda1[0](F)
2095040 blocks super 1.2 [2/1] [_U]
unused devices: <none>"
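For anyone wanting to pull the relevant bits out of a /proc/mdstat dump like the one in that mail, here is a rough parsing sketch. It is only an illustration under my own assumptions (array names of the form mdN, failed members marked with `(F)`), not a complete mdstat parser:

```python
import re


def parse_mdstat(text):
    """Map each md array to its state, members, failed members and device counts."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\d+)\s*:\s*(.*)$", line.strip())
        if header:
            name, rest = header.groups()
            # Members look like 'sda1[0]' optionally followed by flags such as '(F)'.
            members = re.findall(r"(\w+)\[\d+\]((?:\([A-Z]\))*)", rest)
            words = rest.split()
            current = {
                "state": words[0] if words else "",
                "members": [dev for dev, _flags in members],
                "failed": [dev for dev, flags in members if "(F)" in flags],
            }
            arrays[name] = current
            continue
        # The status line carries '[expected/active]', e.g. '[2/1]'.
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if current is not None and counts:
            current["expected"] = int(counts.group(1))
            current["active"] = int(counts.group(2))
    return arrays
```

Fed the mdstat text from the mail, this reports md127 with members sdb1 and sda1, sda1 as failed, and 1 of 2 devices active.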
Looking through related log files:
Code:
"Feb 24 13:36:20 pve udisksd[3144]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda1/block symlink
Feb 24 13:36:20 pve udisksd[3144]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda1/block symlink
Feb 24 13:36:20 pve udisksd[3144]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sdb1/block symlink
Feb 24 13:36:20 pve udisksd[3144]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sdb1/block symlink
Feb 24 13:36:20 pve udisksd[3144]: Error reading sysfs attr `/sys/devices/virtual/block/md127/md/degraded': Failed to open file “/sys/devices/virtual/block/md127/md/degraded”: No such file or directory (g-file-error-quark, 4)
Feb 24 13:36:20 pve udisksd[3144]: Error reading sysfs attr `/sys/devices/virtual/block/md127/md/sync_action': Failed to open file “/sys/devices/virtual/block/md127/md/sync_action”: No such file or directory (g-file-error-quark, 4)
Feb 24 13:36:20 pve udisksd[3144]: Error reading sysfs attr `/sys/devices/virtual/block/md127/md/sync_completed': Failed to open file “/sys/devices/virtual/block/md127/md/sync_completed”: No such file or directory (g-file-error-quark, 4)
Feb 24 13:36:20 pve udisksd[3144]: Error reading sysfs attr `/sys/devices/virtual/block/md127/md/bitmap/location': Failed to open file “/sys/devices/virtual/block/md127/md/bitmap/location”: No such file or directory (g-file-error-quark, 4)
MESSAGE=Registering new address record for fe80::42:97ff:fea0:dc69 on br-57d53181c146.*.
_SOURCE_REALTIME_TIMESTAMP=1708788916590128
_SOURCE_MONOTONIC_TIMESTAMP=18873332
MESSAGE=ata1.00: disabled
_SOURCE_MONOTONIC_TIMESTAMP=18873882
MESSAGE=sd 0:0:0:0: [sda] Synchronizing SCSI cache
_SOURCE_MONOTONIC_TIMESTAMP=18873903
MESSAGE=sd 0:0:0:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
_SOURCE_MONOTONIC_TIMESTAMP=18873906
MESSAGE=sd 0:0:0:0: [sda] Stopping disk
_SOURCE_MONOTONIC_TIMESTAMP=18873911
MESSAGE=sd 0:0:0:0: [sda] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
_SOURCE_MONOTONIC_TIMESTAMP=18879282
PRIORITY=2
MESSAGE=md/raid1:md127: Disk failure on sda1, disabling device.
md/raid1:md127: Operation continuing on 1 devices.
MESSAGE=Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda1/block symlink
CODE_FUNC=udisks_linux_mdraid_update
CODE_FILE=udiskslinuxmdraid.c:444
_SOURCE_REALTIME_TIMESTAMP=1708788916666222
_SOURCE_MONOTONIC_TIMESTAMP=18904084
MESSAGE=ata2.00: disabled
_SOURCE_MONOTONIC_TIMESTAMP=18904898
MESSAGE=sd 1:0:0:0: [sdb] Synchronizing SCSI cache
_SOURCE_MONOTONIC_TIMESTAMP=18904921
MESSAGE=sd 1:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
_SOURCE_MONOTONIC_TIMESTAMP=18904926
MESSAGE=sd 1:0:0:0: [sdb] Stopping disk
_SOURCE_MONOTONIC_TIMESTAMP=18904932
MESSAGE=sd 1:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
_SOURCE_MONOTONIC_TIMESTAMP=18934191
MESSAGE=md127: detected capacity change from 4190080 to 0
_SOURCE_MONOTONIC_TIMESTAMP=18934197
MESSAGE=md: md127 stopped.
_SOURCE_MONOTONIC_TIMESTAMP=18956063
MESSAGE=ata3.00: disabled
_SOURCE_MONOTONIC_TIMESTAMP=18956802
MESSAGE=sd 2:0:0:0: [sdc] Synchronizing SCSI cache
_SOURCE_MONOTONIC_TIMESTAMP=18956835
MESSAGE=sd 2:0:0:0: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
_SOURCE_MONOTONIC_TIMESTAMP=18956842
MESSAGE=sd 2:0:0:0: [sdc] Stopping disk
_SOURCE_MONOTONIC_TIMESTAMP=18956853
MESSAGE=sd 2:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
_SOURCE_MONOTONIC_TIMESTAMP=19064106
MESSAGE=ata4.00: disabled
_SOURCE_MONOTONIC_TIMESTAMP=19064712
MESSAGE=sd 3:0:0:0: [sdd] Synchronizing SCSI cache
_SOURCE_MONOTONIC_TIMESTAMP=19064743
MESSAGE=sd 3:0:0:0: [sdd] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
_SOURCE_MONOTONIC_TIMESTAMP=19064748
MESSAGE=sd 3:0:0:0: [sdd] Stopping disk
_SOURCE_MONOTONIC_TIMESTAMP=19064755
MESSAGE=sd 3:0:0:0: [sdd] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
SYSLOG_PID=5242
MESSAGE=AA8981A7128: uid=0 from=<root>
_PID=5242
_SOURCE_REALTIME_TIMESTAMP=1708788916698606
_SOURCE_REALTIME_TIMESTAMP=1708788916666734
SYSLOG_PID=5755
MESSAGE=AA8981A7128: message-id=<20240224153516.AA8981A7128@pve.lan>
_PID=5755
_SOURCE_REALTIME_TIMESTAMP=1708788916702785
MESSAGE=Unable to resolve /sys/devices/virtual/block/md127/md/dev-sdb1/block symlink
_SOURCE_REALTIME_TIMESTAMP=1708788916667357
SYSLOG_PID=5243
MESSAGE=AA8981A7128: from=<root@pve.lan>, size=798, nrcpt=1 (queue active)
_PID=5243
_SOURCE_REALTIME_TIMESTAMP=1708788916743565
_SOURCE_REALTIME_TIMESTAMP=1708788916694276
SYSLOG_PID=5772
_PID=5772
_SOURCE_REALTIME_TIMESTAMP=1708788916748046
_SOURCE_REALTIME_TIMESTAMP=1708788916694853
MESSAGE=C27331A712E: uid=65534 from=<root>
_SOURCE_REALTIME_TIMESTAMP=1708788916796528
MESSAGE=Error reading sysfs attr `/sys/devices/virtual/block/md127/md/degraded': Failed to open file “/sys/devices/virtual/block/md127/md/degraded”: No such file or directory (g-file-error-quark, 4)
CODE_FUNC=read_sysfs_attr
CODE_FILE=udiskslinuxmdraidhelpers.c:59
_SOURCE_REALTIME_TIMESTAMP=1708788916771064
MESSAGE=C27331A712E: message-id=<20240224153516.AA8981A7128@pve.lan>
_SOURCE_REALTIME_TIMESTAMP=1708788916796677
MESSAGE=Error reading sysfs attr `/sys/devices/virtual/block/md127/md/sync_action': Failed to open file “/sys/devices/virtual/block/md127/md/sync_action”: No such file or directory (g-file-error-quark, 4)
_SOURCE_REALTIME_TIMESTAMP=1708788916771078
SYSLOG_PID=5771
MESSAGE=AA8981A7128: to=<root@pve.lan>, orig_to=<root>, relay=local, delay=0.15, delays=0.09/0/0/0.05, dsn=2.0.0, status=sent (delivered to command: /usr/bin/proxmox-mail-forward)
_PID=5771
_SOURCE_REALTIME_TIMESTAMP=1708788916797164
MESSAGE=Error reading sysfs attr `/sys/devices/virtual/block/md127/md/sync_completed': Failed to open file “/sys/devices/virtual/block/md127/md/sync_completed”: No such file or directory (g-file-error-quark, 4)
_SOURCE_REALTIME_TIMESTAMP=1708788916771089
MESSAGE=AA8981A7128: removed
_SOURCE_REALTIME_TIMESTAMP=1708788916797283
MESSAGE=Error reading sysfs attr `/sys/devices/virtual/block/md127/md/bitmap/location': Failed to open file “/sys/devices/virtual/block/md127/md/bitmap/location”: No such file or directory (g-file-error-quark, 4)
_SOURCE_REALTIME_TIMESTAMP=1708788916771098
MESSAGE=C27331A712E: from=<root@pve.lan>, size=954, nrcpt=1 (queue active)
_SOURCE_REALTIME_TIMESTAMP=1708788916826438
MESSAGE=time="2024-02-24T11:35:17.012577062-04:00" level=warning msg="WARNING: bridge-nf-call-iptables is disabled"
MESSAGE=time="2024-02-24T11:35:17.012600646-04:00" level=warning msg="WARNING: bridge-nf-call-ip6tables is disabled"
MESSAGE=time="2024-02-24T11:35:17.012622357-04:00" level=info msg="Docker daemon" commit=f417435 containerd-snapshotter=false storage-driver=overlay2 version=25.0.3
MESSAGE=time="2024-02-24T11:35:17.013904561-04:00" level=info msg="Daemon has completed initialization"
MESSAGE=time="2024-02-24T11:35:17.052552377-04:00" level=info msg="API listen on /run/docker.sock"
MESSAGE=Started Docker Application Container Engine.
_SOURCE_REALTIME_TIMESTAMP=1708788917052655"
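The _SOURCE_REALTIME_TIMESTAMP values in this dump are microseconds since the Unix epoch, so they can be lined up with the syslog-style timestamps elsewhere in the logs. A small conversion sketch (the function name is my own):

```python
from datetime import datetime, timedelta, timezone


def journal_timestamp_to_utc(microseconds):
    """Convert a journal realtime timestamp (microseconds since epoch) to UTC."""
    # Split with integer arithmetic to avoid float rounding of the microseconds.
    seconds, micros = divmod(int(microseconds), 1_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(microseconds=micros)
```

For example, `journal_timestamp_to_utc(1708788916590128)` gives 2024-02-24 15:35:16.590128 UTC, which matches the 11:35:17 -04:00 Docker entries and the 20240224153516 message-id in the same dump.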
MESSAGE=Device: /dev/sdb [SAT], opened
SYSLOG_RAW=<30>Feb 27 16:40:52 smartd[3123]: Device: /dev/sdb [SAT], opened
_SOURCE_REALTIME_TIMESTAMP=1709066452183733
MESSAGE=Device: /dev/sdb [SAT], WDC WD140EDGZ-11B2DA2, S/N:2BG59KYE, WWN:5-000cca-295c269d0, FW:85.00A85, 14.0 TB
SYSLOG_RAW=<30>Feb 27 16:40:52 smartd[3123]: Device: /dev/sdb [SAT], WDC WD140EDGZ-11B2DA2, S/N:2BG59KYE, WWN:5-000cca-295c269d0, FW:85.00A85, 14.0 TB
_SOURCE_REALTIME_TIMESTAMP=1709066452183881
MESSAGE=Device: /dev/sdb [SAT], not found in smartd database.
SYSLOG_RAW=<30>Feb 27 16:40:52 smartd[3123]: Device: /dev/sdb [SAT], not found in smartd database.
_SOURCE_REALTIME_TIMESTAMP=1709066452188172
MESSAGE=Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
SYSLOG_RAW=<30>Feb 27 16:40:52 smartd[3123]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
_SOURCE_REALTIME_TIMESTAMP=1709066452192352
MESSAGE=Device: /dev/sdb [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD140EDGZ_11B2DA2-2BG59KYE.ata.state
SYSLOG_RAW=<30>Feb 27 16:40:52 smartd[3123]: Device: /dev/sdb [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD140EDGZ_11B2DA2-2BG59KYE.ata.state