Hello guys, yesterday I updated my Proxmox node via the web GUI to kernel
Install: proxmox-kernel-6.8.12-3-pve-signed:amd64 (6.8.12-3, automatic)
and firmware: Upgrade: pve-firmware:amd64 (3.13-2, 3.14-1), proxmox-kernel-6.8:amd64 (6.8.12-2, 6.8.12-3)
Today, when I tried to store some files and extract some of them, I got some weird timeouts. WinRAR (run against an SMB share hosted on a TrueNAS VM) either told me the archives were corrupted, which they weren't, or crashed completely. I tried a different storage pool hosted on a single NVMe, same behaviour.
When I tried to reboot the node, it took longer than usual, same for booting. Some of the LXC containers didn't even start. Systemctl didn't show anything special, but starting them gave these errors:
Code:
failed to connect to monitor socket: Connection refused
Failed to start pve-container@105.service: Transaction for pve-container@105.service/start is destructive (dev-disk-by\x2did-dm\x2duuid\x2dLVM\x2dW7A3ezZkrfdpoX3YuW3i3kIhPv6sE3QRgDgKSm40BPVqmGkBCTObNEj30mmXiURp.swap has 'stop' job queued, but 'start' is included in transaction).
See system logs and 'systemctl status pve-container@105.service' for details.
TASK ERROR: command 'systemctl start pve-container@105' failed: exit code 4
So I investigated further and found Error Information Log Entries: 3,158 on one of my NVMe drives (Samsung 970 EVO, latest firmware).
Code:
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 24 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 8%
Data Units Read: 195,590,836 [100 TB]
Data Units Written: 301,215,368 [154 TB]
Host Read Commands: 1,882,338,232
Host Write Commands: 1,463,810,401
Controller Busy Time: 6,314
Power Cycles: 1,592
Power On Hours: 11,418
Unsafe Shutdowns: 118
Media and Data Integrity Errors: 0
Error Information Log Entries: 3,158
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 24 Celsius
Temperature Sensor 2: 24 Celsius
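To read a log like this more systematically, here is a minimal Python sketch that pulls out the counters that matter for a "is the drive dying?" check. It assumes the text is in the `Key: value` layout shown above; the function names (`parse_smart_log`, `as_int`, `summarize`) are made up for illustration and are not part of smartctl. As far as I understand the NVMe health log, "Media and Data Integrity Errors" counts unrecovered data errors, while "Error Information Log Entries" only counts entries written to the controller's error log over its lifetime, so a non-zero value there does not by itself prove damaged media.

```python
import re

# A few lines copied from the SMART output above, used as sample input.
SAMPLE = """\
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Available Spare: 100%
Data Units Read: 195,590,836 [100 TB]
Unsafe Shutdowns: 118
Media and Data Integrity Errors: 0
Error Information Log Entries: 3,158
"""

def parse_smart_log(text):
    """Collect 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def as_int(value):
    """Return the leading number of a value such as '3,158' or '195,590,836 [100 TB]'."""
    match = re.match(r"[\d,]+", value)
    return int(match.group().replace(",", "")) if match else None

def summarize(fields):
    """Pick the counters most relevant to drive health."""
    return {
        "critical_warning": int(fields["Critical Warning"], 16),
        "media_errors": as_int(fields["Media and Data Integrity Errors"]),
        "error_log_entries": as_int(fields["Error Information Log Entries"]),
        "unsafe_shutdowns": as_int(fields["Unsafe Shutdowns"]),
    }

if __name__ == "__main__":
    print(summarize(parse_smart_log(SAMPLE)))
```

With the values from this drive, the critical warning and media error counters are both zero, and only the error log count and the unsafe shutdowns stand out.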
I tried booting with all the VMs/LXC containers offline, same "long" boot time. And after starting each one I got the same behaviour.
Tried the older 6.8.12-2 kernel, but same problem.
I'm not that experienced and kinda lost now.
Any help would be appreciated.
Forgot something: I tried updating Nextcloud before it happened, got several errors on the LXC, had to manually copy files, and the update eventually failed. The next thing I noticed was the errors with RAR archives not working.
Edit: I thought I might have fixed it, the suspect being a faulty LXC container with a passthrough HDD from TrueNAS / Nope, still damaged archives... Newly copied, working archives are instantly corrupted.
Edit2: I've now noticed that video files, already stored on different vdevs, are being aborted via SMB on Windows at some random point. I don't really think it's TrueNAS related, since it happens on both HDD and NVMe and on 2 different vdevs.
Edit3: NVM. It was a bug caused by a TrueNAS update (NAS-132166) and was fixed this morning by the latest version: Dragonfish-24.04.2.5
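For anyone chasing similar "instantly corrupted" copies, one way to confirm whether the data really changes in transit (instead of trusting WinRAR's error message) is to hash the source file and the copy on the share and compare the two. This is only a rough Python sketch; the helper names `sha256_of` and `copies_match` and the paths in the usage comment are hypothetical.

```python
import hashlib

def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large archives do not have to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copies_match(source, copy):
    """Return True when both files have identical SHA-256 digests."""
    return sha256_of(source) == sha256_of(copy)

# Hypothetical usage: compare a local archive with its copy on the SMB share.
# print(copies_match(r"C:\temp\backup.rar", r"\\truenas\share\backup.rar"))
```

If the digests differ right after copying, the corruption happens somewhere between the client and the pool, which narrows things down before digging into kernel versions or SMART logs.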