[FIXED] Drive issues since upgrading kernel and pve-firmware

laubi

Nov 8, 2024
Hello guys, yesterday I updated my Proxmox installation via the web GUI. The update installed this kernel:
Install: proxmox-kernel-6.8.12-3-pve-signed:amd64 (6.8.12-3, automatic)
and upgraded the firmware:
Upgrade: pve-firmware:amd64 (3.13-2, 3.14-1), proxmox-kernel-6.8:amd64 (6.8.12-2, 6.8.12-3)

Today when I tried to store some files and extract some of them, I got some weird timeouts. WinRAR (run against an SMB share hosted on a TrueNAS VM) either told me the archives were corrupted, which they weren't, or crashed completely. I tried a different storage pool hosted on a single NVMe, same behaviour.
When I tried to reboot the node it took longer than usual, same for booting. Some of the LXC containers didn't even start and failed with these errors:

Code:
failed to connect to monitor socket: Connection refused
Failed to start pve-container@105.service: Transaction for pve-container@105.service/start is destructive (dev-disk-by\x2did-dm\x2duuid\x2dLVM\x2dW7A3ezZkrfdpoX3YuW3i3kIhPv6sE3QRgDgKSm40BPVqmGkBCTObNEj30mmXiURp.swap has 'stop' job queued, but 'start' is included in transaction).
See system logs and 'systemctl status pve-container@105.service' for details.
TASK ERROR: command 'systemctl start pve-container@105' failed: exit code 4
Systemctl didn't show anything special.

So I investigated further and found "Error Information Log Entries: 3,158" in the SMART data of one of my NVMe drives (Samsung 970 EVO, latest firmware).
Code:
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        24 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    8%
Data Units Read:                    195,590,836 [100 TB]
Data Units Written:                 301,215,368 [154 TB]
Host Read Commands:                 1,882,338,232
Host Write Commands:                1,463,810,401
Controller Busy Time:               6,314
Power Cycles:                       1,592
Power On Hours:                     11,418
Unsafe Shutdowns:                   118
Media and Data Integrity Errors:    0
Error Information Log Entries:      3,158
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               24 Celsius
Temperature Sensor 2:               24 Celsius
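
For anyone reading a SMART dump like the one above: "Error Information Log Entries" only counts entries written to the controller's error log, while "Media and Data Integrity Errors" is the counter for unrecovered data integrity errors, and that one is 0 here. As far as I know, the error log count can grow without any data actually being damaged, so treat that reading as an assumption and not a diagnosis. A minimal Python sketch for pulling the plain numeric counters out of such output so they can be compared over time (the function name `parse_smart_counters` is made up for illustration, and the sample text is just three lines copied from the output above):

```python
import re

# Three lines copied from the SMART output in this post.
SMART_TEXT = """\
Media and Data Integrity Errors:    0
Error Information Log Entries:      3,158
Unsafe Shutdowns:                   118
"""


def parse_smart_counters(text):
    """Return {field name: int} for fields whose value is a plain number.

    Fields with units or extra text (e.g. "24 Celsius", "0x00",
    "195,590,836 [100 TB]") are skipped on purpose.
    """
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "name: value" line
        value = value.strip().replace(",", "")  # drop thousands separators
        if re.fullmatch(r"\d+", value):
            counters[name.strip()] = int(value)
    return counters


if __name__ == "__main__":
    counters = parse_smart_counters(SMART_TEXT)
    for name, value in counters.items():
        print(f"{name}: {value}")
```

Saving the parsed counters before and after a test copy would show whether the drive's own integrity counter moves at all while the corruption happens.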

I tried booting with all the VMs/LXC containers offline: same long boot time. After starting each of them I got the same behaviour.
I also tried the older 6.8.12-2 kernel, but same problem.

I'm not that experienced and kinda lost now.
Any help would be appreciated.

Forgot something: I tried updating Nextcloud before it happened and got several errors in the LXC container. I had to copy files manually and the update eventually failed. The next thing I noticed was the RAR archives not working.


Edit: I might have fixed it; the culprit seemed to be a faulty LXC container with a passthrough HDD from TrueNAS. Nope, still damaged archives... Newly copied working archives are instantly corrupted.
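
One way to confirm that a copy is really being corrupted on the share (and not just misreported by WinRAR) is to compare checksums of the source and the copy. A minimal Python sketch, assuming both paths are reachable from the machine running it; `sha256_of` and `copy_and_verify` are made-up helper names, and the demo uses temporary files instead of a real share:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large archives do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst and report whether both files hash identically."""
    shutil.copyfile(src, dst)
    return sha256_of(src) == sha256_of(dst)


if __name__ == "__main__":
    # Demo with temporary files; on a real system dst would be a path
    # on the mounted share (hypothetical, not taken from this thread).
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "source.bin"
        dst = Path(tmp) / "copy.bin"
        src.write_bytes(b"example archive contents" * 1000)
        print("copy matches source:", copy_and_verify(src, dst))
```

A mismatch is conclusive. A match right after copying is weaker evidence, since the re-read may be served from a cache, so repeating the hash later or from another client is worth doing.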

Edit2: I've now noticed that video files, already stored on different vdevs, are being aborted at some random point when accessed over SMB from Windows. I don't really think it's TrueNAS related, since it happens on both HDD and NVMe and on 2 different vdevs.

Edit3: NVM. It was a bug caused by a TrueNAS update (NAS-132166) and was fixed this morning by the latest version: Dragonfish-24.04.2.5
 