[FIXED] Drive issues since upgrading kernel and pve-firmware

laubi

Nov 8, 2024
Hello guys, yesterday I updated my Proxmox node via the web GUI. The update installed a new kernel:
Install: proxmox-kernel-6.8.12-3-pve-signed:amd64 (6.8.12-3, automatic)
and upgraded the firmware:
Upgrade: pve-firmware:amd64 (3.13-2, 3.14-1), proxmox-kernel-6.8:amd64 (6.8.12-2, 6.8.12-3)

Today, when I tried to store some files and extract some of them, I got some weird timeouts. WinRAR (run against an SMB share hosted on a TrueNAS VM) either told me the archives were corrupted, which they weren't, or crashed completely. I tried a different storage pool hosted on a single NVMe drive and got the same behaviour.
When I tried to reboot the node, it took longer than usual, and so did booting. Some of the LXC containers didn't even start and failed with these errors:

Code:
failed to connect to monitor socket: Connection refused
Failed to start pve-container@105.service: Transaction for pve-container@105.service/start is destructive (dev-disk-by\x2did-dm\x2duuid\x2dLVM\x2dW7A3ezZkrfdpoX3YuW3i3kIhPv6sE3QRgDgKSm40BPVqmGkBCTObNEj30mmXiURp.swap has 'stop' job queued, but 'start' is included in transaction).
See system logs and 'systemctl status pve-container@105.service' for details.
TASK ERROR: command 'systemctl start pve-container@105' failed: exit code 4
Systemctl didn't show anything special.
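
For anyone reading along: the long unit name in that error is a systemd-escaped device path, and decoding it makes it easier to see which device has the 'stop' job queued. The following is only a minimal sketch of the unescaping rule (unescaped dashes become slashes, "\xNN" sequences become literal characters); the helper name unescape_systemd_path is made up for illustration and it assumes ASCII-only escapes.

```python
import re

def unescape_systemd_path(unit_name: str) -> str:
    """Reverse systemd's path escaping for a swap/device/mount unit name."""
    # Drop the unit type suffix (".swap", ".device", ".mount", ...).
    stem = unit_name.rsplit(".", 1)[0]
    # Unescaped dashes stand for path separators.
    stem = stem.replace("-", "/")
    # "\xNN" sequences encode literal characters, e.g. "\x2d" is a real dash.
    stem = re.sub(r"\\x([0-9a-fA-F]{2})",
                  lambda m: chr(int(m.group(1), 16)), stem)
    return "/" + stem

# Unit name copied from the task error above.
unit = (r"dev-disk-by\x2did-dm\x2duuid\x2dLVM\x2dW7A3ezZkrfdpoX3YuW3i3kIhPv6sE3QRgD"
        r"gKSm40BPVqmGkBCTObNEj30mmXiURp.swap")
print(unescape_systemd_path(unit))
```

The decoded path starts with /dev/disk/by-id/dm-uuid-LVM-, so the unit with the queued 'stop' job is a swap unit on an LVM volume, not the container itself.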

So I investigated further and found 3,158 Error Information Log Entries on one of my NVMe drives (Samsung 970 EVO, latest firmware).
Code:
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        24 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    8%
Data Units Read:                    195,590,836 [100 TB]
Data Units Written:                 301,215,368 [154 TB]
Host Read Commands:                 1,882,338,232
Host Write Commands:                1,463,810,401
Controller Busy Time:               6,314
Power Cycles:                       1,592
Power On Hours:                     11,418
Unsafe Shutdowns:                   118
Media and Data Integrity Errors:    0
Error Information Log Entries:      3,158
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               24 Celsius
Temperature Sensor 2:               24 Celsius
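
To compare the counters that matter here without reading the whole table, the output above can be parsed with a few lines of Python. This is a rough sketch, not a diagnostic tool: SMART_TEXT is a hand-copied excerpt of the log above, parse_smart is a hypothetical helper, and the only outside fact used is that NVMe reports data units in thousands of 512-byte blocks.

```python
import re

# Excerpt copied from the SMART/Health output above.
SMART_TEXT = """\
Data Units Read:                    195,590,836 [100 TB]
Data Units Written:                 301,215,368 [154 TB]
Unsafe Shutdowns:                   118
Media and Data Integrity Errors:    0
Error Information Log Entries:      3,158
"""

# NVMe counts data units in thousands of 512-byte blocks.
BYTES_PER_DATA_UNIT = 1000 * 512

def parse_smart(text: str) -> dict[str, int]:
    """Map each 'Name: value' line to its leading integer value."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        match = re.match(r"\s*([\d,]+)", value)
        if sep and match:
            fields[name.strip()] = int(match.group(1).replace(",", ""))
    return fields

fields = parse_smart(SMART_TEXT)
written_tb = fields["Data Units Written"] * BYTES_PER_DATA_UNIT / 1e12
print(f"written: {written_tb:.0f} TB")
print("media errors:", fields["Media and Data Integrity Errors"])
print("error log entries:", fields["Error Information Log Entries"])
```

For this drive the sketch reports zero media and data integrity errors next to the 3,158 error log entries, and the written volume works out to the same 154 TB shown in brackets.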

I tried booting with all the VMs and LXC containers offline and got the same "long" boot time. After starting each of them, I got the same behaviour again.
I also tried the older 6.8.12-2 kernel, but the problem was the same.

I'm not that experienced and kind of lost now.
Any help would be appreciated.

Forgot something: I tried updating Nextcloud before it happened and got several errors in the LXC container. I had to copy files manually, and the update eventually failed. The next thing I noticed was the errors with RAR archives not working.


Edit: I thought I might have fixed it; the cause seemed to be a faulty LXC container with a passthrough HDD from TrueNAS. Nope, still damaged archives... Newly copied working archives are instantly corrupted.
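
One way to confirm that a freshly copied archive really differs from its source, rather than trusting WinRAR's error message, is to hash both files and compare. This was not part of the original troubleshooting; it is just a small sketch, and sha256_of, copies_match and the example paths are hypothetical names.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large archives don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def copies_match(original: Path, copy: Path) -> bool:
    """Return True if both files have identical contents."""
    return sha256_of(original) == sha256_of(copy)

# Example call with placeholder paths (adjust to your own setup):
# copies_match(Path("C:/local/archive.rar"), Path("//truenas/share/archive.rar"))
```

If the hashes differ right after copying, the data is being changed somewhere between the client and the pool, which matches the symptom described in the edit above.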

Edit2: I've now noticed that transfers of video files, already stored on different vdevs, are being aborted over SMB on Windows at some random point. I don't really think it's TrueNAS related, since it happens on both HDD and NVMe and on 2 different vdevs.

Edit3: Never mind. It was a bug caused by a TrueNAS update (NAS-132166) and was fixed this morning by the latest version, Dragonfish-24.04.2.5.