Proxmox freeze hardware

ivan.audero

Member
Jun 12, 2020
Good evening, I am facing a strange issue. I installed Proxmox on an old PC (an old Asus M4A78LT-M LE motherboard, upgraded with a 512GB SSD and 16GB of RAM) and everything seemed to work. I now have two VMs running, a Debian server and a TrueNAS distro (with a lot of disks passed directly to the VM).

Everything is basically working, except that after a couple of hours of uptime Proxmox becomes unreachable (both SSH and the web GUI). journalctl -f shows a lot of "authentication failure" messages, even though I am perfectly able to log in as root both over SSH and in the web GUI. I also changed the password (removing the special characters in it) and restarted the services and the server many times, but I can't figure out what the problem is. Here's the output of journalctl -f:

Feb 16 23:00:00 proxmox sshd[6560]: Accepted password for root from 192.168.168.3 port 5630 ssh2
Feb 16 23:00:00 proxmox sshd[6560]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Feb 16 23:00:00 proxmox systemd-logind[1273]: New session 5 of user root.
Feb 16 23:00:00 proxmox systemd[1]: Started Session 5 of user root.
Feb 16 23:00:33 proxmox IPCC.xs[4938]: pam_unix(proxmox-ve-auth:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost= user=root
Feb 16 23:00:33 proxmox IPCC.xs[4940]: pam_unix(proxmox-ve-auth:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost= user=root
Feb 16 23:00:34 proxmox pvedaemon[4938]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 23:00:34 proxmox pvedaemon[4940]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 23:01:36 proxmox IPCC.xs[4940]: pam_unix(proxmox-ve-auth:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost= user=root
Feb 16 23:01:36 proxmox IPCC.xs[4938]: pam_unix(proxmox-ve-auth:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost= user=root
Feb 16 23:01:39 proxmox pvedaemon[4940]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 23:01:39 proxmox pvedaemon[4938]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 23:02:05 proxmox IPCC.xs[4940]: pam_unix(proxmox-ve-auth:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost= user=root
Feb 16 23:02:06 proxmox pvedaemon[4940]: authentication failure; rhost=::ffff:5.42.199.51 user=root@pam msg=Authentication failure
Feb 16 23:02:39 proxmox IPCC.xs[4939]: pam_unix(proxmox-ve-auth:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost= user=root
Feb 16 23:02:39 proxmox IPCC.xs[4940]: pam_unix(proxmox-ve-auth:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost= user=root
Feb 16 23:02:41 proxmox pvedaemon[4939]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 23:02:41 proxmox pvedaemon[4940]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 23:03:42 proxmox IPCC.xs[4938]: pam_unix(proxmox-ve-auth:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost= user=root
Feb 16 23:03:42 proxmox IPCC.xs[4939]: pam_unix(proxmox-ve-auth:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost= user=root
Feb 16 23:03:44 proxmox pvedaemon[4938]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 23:03:44 proxmox pvedaemon[4939]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure

I suppose the first row is the successful login through the web UI, but I do not know what the following lines are.
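One way to see at a glance which hosts those failures come from is to tally the rhost= field of the saved journal output. A self-contained sketch (auth.log and the sample lines are placeholders standing in for the real log):

```shell
# Tally the "authentication failure" lines per source host. In real use,
# auth.log would be saved journal output, e.g.:
#   journalctl -u pvedaemon > auth.log
# Here a three-line sample stands in for it:
printf '%s\n' \
  'pvedaemon[4938]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam' \
  'pvedaemon[4940]: authentication failure; rhost=::ffff:5.42.199.51 user=root@pam' \
  'pvedaemon[4938]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam' \
  > auth.log
grep 'authentication failure' auth.log \
  | grep -o 'rhost=[^ ]*' \
  | sort | uniq -c | sort -rn
```

For the sample data this prints one count per host, most frequent first, which makes an unknown external address stand out immediately.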

Below is the syslog output from the last crash (the log stops at 13:04, and at 22:33 I physically restarted the PC).

Feb 16 12:47:20 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:47:21 proxmox pvedaemon[4030]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:47:49 proxmox pvedaemon[4032]: authentication failure; rhost=::ffff:5.42.199.51 user=root@pam msg=Authentication failure
Feb 16 12:48:23 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:48:24 proxmox pvedaemon[4032]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:48:40 proxmox pvedaemon[4030]: authentication failure; rhost=::ffff:5.42.199.51 user=root@pam msg=Authentication failure
Feb 16 12:49:21 proxmox kernel: [ 3798.478541] ata2: lost interrupt (Status 0x50)
Feb 16 12:49:21 proxmox kernel: [ 3798.478565] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 16 12:49:21 proxmox kernel: [ 3798.478675] ata2.00: failed command: WRITE DMA EXT
Feb 16 12:49:21 proxmox kernel: [ 3798.478741] ata2.00: cmd 35/00:08:58:5c:c0/00:00:1d:00:00/e0 tag 0 dma 4096 out
Feb 16 12:49:21 proxmox kernel: [ 3798.478741] res 40/00:01:01:4f:c2/00:00:00:00:00/10 Emask 0x4 (timeout)
Feb 16 12:49:21 proxmox kernel: [ 3798.478932] ata2.00: status: { DRDY }
Feb 16 12:49:21 proxmox kernel: [ 3798.479000] ata2: soft resetting link
Feb 16 12:49:22 proxmox kernel: [ 3798.700576] ata2.00: configured for UDMA/100
Feb 16 12:49:22 proxmox kernel: [ 3799.051077] ata2.01: configured for UDMA/100
Feb 16 12:49:22 proxmox kernel: [ 3799.051100] ata2: EH complete
Feb 16 12:49:26 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:49:27 proxmox pvedaemon[4032]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:50:29 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:50:29 proxmox pvedaemon[4032]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:50:59 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:5.42.199.51 user=root@pam msg=Authentication failure
Feb 16 12:51:26 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:5.42.199.51 user=root@pam msg=Authentication failure
Feb 16 12:51:32 proxmox pvedaemon[4032]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:51:33 proxmox pvedaemon[4030]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:52:35 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:52:37 proxmox pvedaemon[4032]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:53:38 proxmox pvedaemon[4030]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:53:39 proxmox kernel: [ 4056.520679] ata2: lost interrupt (Status 0x50)
Feb 16 12:53:39 proxmox kernel: [ 4056.520703] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 16 12:53:39 proxmox kernel: [ 4056.520811] ata2.00: failed command: WRITE DMA EXT
Feb 16 12:53:39 proxmox kernel: [ 4056.520876] ata2.00: cmd 35/00:08:88:8e:c0/00:00:1e:00:00/e0 tag 0 dma 4096 out
Feb 16 12:53:39 proxmox kernel: [ 4056.520876] res 40/00:01:01:4f:c2/00:00:00:00:00/10 Emask 0x4 (timeout)
Feb 16 12:53:39 proxmox kernel: [ 4056.521066] ata2.00: status: { DRDY }
Feb 16 12:53:39 proxmox kernel: [ 4056.521134] ata2: soft resetting link
Feb 16 12:53:40 proxmox kernel: [ 4056.746662] ata2.00: configured for UDMA/100
Feb 16 12:53:40 proxmox kernel: [ 4057.105229] ata2.01: configured for UDMA/100
Feb 16 12:53:40 proxmox kernel: [ 4057.105251] ata2: EH complete
Feb 16 12:53:40 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:54:41 proxmox pvedaemon[4030]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:54:43 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:54:57 proxmox pvedaemon[4030]: authentication failure; rhost=::ffff:5.42.199.51 user=root@pam msg=Authentication failure
Feb 16 12:55:28 proxmox pvedaemon[4032]: authentication failure; rhost=::ffff:5.42.199.51 user=root@pam msg=Authentication failure
Feb 16 12:55:44 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:55:46 proxmox pvedaemon[4030]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:56:46 proxmox pvedaemon[4032]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:56:48 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:57:50 proxmox pvedaemon[4030]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:57:52 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:58:00 proxmox pvedaemon[4030]: authentication failure; rhost=::ffff:5.42.199.51 user=root@pam msg=Authentication failure
Feb 16 12:58:03 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:5.42.199.51 user=root@pam msg=Authentication failure
Feb 16 12:58:53 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:58:55 proxmox pvedaemon[4032]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:59:56 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 12:59:58 proxmox pvedaemon[4032]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 13:00:59 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 13:01:01 proxmox pvedaemon[4032]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 13:01:39 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:5.42.199.51 user=root@pam msg=Authentication failure
Feb 16 13:02:03 proxmox pvedaemon[4030]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 13:02:04 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 13:02:09 proxmox pvedaemon[4032]: authentication failure; rhost=::ffff:5.42.199.51 user=root@pam msg=Authentication failure
Feb 16 13:03:05 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 13:03:06 proxmox pvedaemon[4032]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 13:04:08 proxmox pvedaemon[4031]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 13:04:10 proxmox pvedaemon[4030]: authentication failure; rhost=::ffff:192.168.168.4 user=root@pam msg=Authentication failure
Feb 16 13:04:18 proxmox kernel: [ 4694.742041] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 16 13:04:18 proxmox kernel: [ 4694.766182] ata5.00: configured for UDMA/133
Feb 16 13:04:18 proxmox kernel: [ 4695.042017] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 16 13:04:18 proxmox kernel: [ 4695.091017] ata6.00: configured for UDMA/133
Feb 16 22:33:55 proxmox systemd-modules-load[410]: Inserted module 'iscsi_tcp'
Feb 16 22:33:55 proxmox dmeventd[429]: dmeventd ready for processing.
Feb 16 22:33:55 proxmox systemd-modules-load[410]: Inserted module 'ib_iser'
Feb 16 22:33:55 proxmox lvm[429]: Monitoring thin pool pve-data-tpool.
Feb 16 22:33:55 proxmox systemd-modules-load[410]: Inserted module 'vhost_net'
Feb 16 22:33:55 proxmox systemd[1]: Starting Flush Journal to Persistent Storage...
Feb 16 22:33:55 proxmox lvm[404]: 8 logical volume(s) in volume group "pve" monitored
Feb 16 22:33:55 proxmox systemd[1]: Started Rule-based Manager for Device Events and Files.
Feb 16 22:33:55 proxmox systemd[1]: Finished Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
Feb 16 22:33:55 proxmox systemd[1]: Reached target Local File Systems (Pre).
Feb 16 22:33:55 proxmox systemd[1]: Finished Flush Journal to Persistent Storage.
Feb 16 22:33:55 proxmox systemd-modules-load[410]: Inserted module 'zfs'
Feb 16 22:33:55 proxmox systemd[1]: Finished Load Kernel Modules.
Feb 16 22:33:55 proxmox systemd-udevd[455]: Using default interface naming scheme 'v247'.
Feb 16 22:33:55 proxmox systemd[1]: Starting Apply Kernel Variables...
Feb 16 22:33:55 proxmox systemd[1]: Finished Apply Kernel Variables.
Feb 16 22:33:55 proxmox systemd-udevd[457]: Using default interface naming scheme 'v247'.
Feb 16 22:33:55 proxmox systemd-udevd[455]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 16 22:33:55 proxmox systemd-udevd[457]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 16 22:33:55 proxmox systemd[1]: Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch.
Feb 16 22:33:55 proxmox systemd-udevd[454]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 16 22:33:55 proxmox udevadm[436]: systemd-udev-settle.service is deprecated. Please fix zfs-import-cache.service, zfs-import-scan.service not to pull it in.
Feb 16 22:33:55 proxmox systemd[1]: Found device /dev/pve/swap.
Feb 16 22:33:55 proxmox systemd[1]: Activating swap /dev/pve/swap...
Feb 16 22:33:55 proxmox systemd[1]: Reached target Sound Card.
Feb 16 22:33:55 proxmox systemd[1]: Activated swap /dev/pve/swap.
Feb 16 22:33:55 proxmox systemd[1]: Reached target Swap.
Feb 16 22:33:55 proxmox systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
Feb 16 22:33:55 proxmox systemd[1]: Starting LVM event activation on device 8:3...



Hope you can help; I have been struggling with this for days :-(
 
lot of "authentication failure"
Smells like hammering/password guessing from attackers, if 5.42.199.51 is not your public IP or otherwise related to you.
For security, the Proxmox web interface should only be reachable from your internal 192.168.168.0 net!

Also, ata2, ata5 and ata6 throw errors. It is unlikely that all three disks are dying at the same time, so I guess your disk controller is looping link resets and then gives up completely, which freezes the whole system in the end.
It could be flaky cables, or firmware bugs if these are SSDs (a Samsung SSD on an old AMD controller is not a good idea, google around), but the link resets of the disk controller should be the root of your problem.
 
5.42.199.51 - this IP is very well known to us; it has tried to break into every node...
Set up an ACL for GUI access, or Fail2Ban.

https://www.abuseipdb.com/check/5.42.199.51 <-- You can check it here. It breaks into publicly reachable Proxmox instances around the world. It is hosted in Russia, so you can't do anything about it, as its hosting provider does not care :confused:
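For the Fail2Ban route, here is a sketch of what the jail and filter might look like. The paths, ports and failregex are assumptions based on the pvedaemon log lines quoted above; check them against the Proxmox wiki and your own logs for your PVE version before relying on them:

```ini
; /etc/fail2ban/jail.local (sketch, not a verified config)
[proxmox]
enabled  = true
port     = https,http,8006
filter   = proxmox
logpath  = /var/log/daemon.log
maxretry = 3
bantime  = 3600

; /etc/fail2ban/filter.d/proxmox.conf (sketch)
; The regex mirrors the failure lines seen in this thread:
[Definition]
failregex = pvedaemon\[.*authentication failure; rhost=<HOST> user=.* msg=.*
```

After creating both files, restarting fail2ban and checking `fail2ban-client status proxmox` would show whether the jail picks up the failures.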
 
Smells like hammering/password guessing from attackers, if 5.42.199.51 is not your public IP or otherwise related to you.
For security, the Proxmox web interface should only be reachable from your internal 192.168.168.0 net!

Also, ata2, ata5 and ata6 throw errors. It is unlikely that all three disks are dying at the same time, so I guess your disk controller is looping link resets and then gives up completely, which freezes the whole system in the end.
It could be flaky cables, or firmware bugs if these are SSDs (a Samsung SSD on an old AMD controller is not a good idea, google around), but the link resets of the disk controller should be the root of your problem.
Thank you very much mr44er, I will immediately restrict access to the web GUI (I set up a VPN instead of opening a port on the router).

Regarding the disks: I can see the explicit ata2 error and the consequent soft reset (I can also think about replacing this disk; it is an old HDD which I wanted to use as a "shared bin"), but are the messages related to ata5 and ata6 ("configured for UDMA/133") also critical errors?

Anyway, I will investigate the hardware compatibility further... Thank you very much!
 
5.42.199.51 - this IP is very well known to us; it has tried to break into every node...
Set up an ACL for GUI access, or Fail2Ban.

https://www.abuseipdb.com/check/5.42.199.51 <-- You can check it here. It breaks into publicly reachable Proxmox instances around the world. It is hosted in Russia, so you can't do anything about it, as its hosting provider does not care :confused:
Thank you very much for your kind reply! At least I immediately learnt that opening a port on the router with a public IP is not good practice :)
 
Here I am again... My system keeps freezing after some time, and I am now almost sure it depends on a hard disk problem. The thing is, I have no idea how to proceed. Here are some log lines that show hard drive errors; could you give me some hints on what I could do to solve this issue? Are there any checks I can run? I forgot to mention that the drives are passed through to a TrueNAS VM; can this create problems?

THANKS!

Feb 27 07:40:50 proxmox kernel: [ 3054.575870] ata2: lost interrupt (Status 0x50)
Feb 27 07:40:50 proxmox kernel: [ 3054.575902] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 27 07:40:50 proxmox kernel: [ 3054.576115] ata2.00: failed command: WRITE DMA
Feb 27 07:40:50 proxmox kernel: [ 3054.576253] ata2.00: cmd ca/00:18:d8:b0:c0/00:00:00:00:00/e0 tag 0 dma 12288 out
Feb 27 07:40:50 proxmox kernel: [ 3054.576253] res 40/00:01:01:4f:c2/00:00:00:00:00/10 Emask 0x4 (timeout)
Feb 27 07:40:50 proxmox kernel: [ 3054.576644] ata2.00: status: { DRDY }
Feb 27 07:40:50 proxmox kernel: [ 3054.576779] ata2: soft resetting link
Feb 27 07:40:50 proxmox kernel: [ 3054.798395] ata2.00: configured for UDMA/100
Feb 27 07:40:51 proxmox kernel: [ 3055.148644] ata2.01: configured for UDMA/100
Feb 27 07:40:51 proxmox kernel: [ 3055.148664] ata2: EH complete
Feb 27 07:50:13 proxmox smartd[1179]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 72
Feb 27 07:50:13 proxmox smartd[1179]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 28
Feb 27 07:50:14 proxmox smartd[1179]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 241 to 240
 
Are there any checks I can do?
smartctl -t long /dev/sdb starts a long/complete self-test of the disk. It will really take some hours; you can check the status with smartctl -a /dev/sdb or smartctl -x /dev/sdb. You can run this check on all problematic disks in parallel to save time.

passed through to a TrueNAS VM, can this create problems?
It can, but I'm not so sure according to the logs. Either the controller is flapping (caused by a dying disk) or the disk has broken sectors (which also means the disk is done, but not fully dead yet).
Please post the SMART output here after the scans are done, so we can decide on the next steps. If the scans are 100% good and there are no broken sectors on any disk, we can rule the disks out.
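Kicking off the long test on several disks in one go can be sketched as a small loop. The device names are placeholders, and the echo prefix keeps the sketch harmless to run as-is; remove the echo to issue the real commands (smartctl -t returns immediately while the test runs inside the drive):

```shell
# Start long self-tests on several disks "in parallel" by issuing the
# command once per disk. Device names are placeholders for your setup;
# the echo makes the sketch safe to run without root or smartctl.
for d in /dev/sdb /dev/sdd /dev/sde; do
  echo smartctl -t long "$d"
done
```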
 
Thank you very much for your help! Here's the output of the command you suggested:


smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.85-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate Laptop HDD
Device Model: ST500LT012-1DG142
Serial Number: WBY9RRMS
LU WWN Device Id: 5 000c50 0a92e3955
Firmware Version: 0007LIM1
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Thu Mar 9 17:44:53 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 103) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x1031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 120 100 034 Pre-fail Always - 239133280
3 Spin_Up_Time 0x0003 100 099 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 099 099 020 Old_age Always - 1146
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 069 060 030 Pre-fail Always - 9381226
9 Power_On_Hours 0x0032 081 081 000 Old_age Always - 16895 (36 163 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 120
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 065 045 045 Old_age Always In_the_past 35 (Min/Max 29/37)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 6
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 32
193 Load_Cycle_Count 0x0032 087 087 000 Old_age Always - 26647
194 Temperature_Celsius 0x0022 035 055 000 Old_age Always - 35 (0 13 0 0 0)
196 Reallocated_Event_Count 0x000f 100 100 030 Pre-fail Always - 653 (11059 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
254 Free_Fall_Sensor 0x0032 100 100 000 Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 16895 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay
 
Ok, looks good. Maybe useful to learn for the future what to look for on hard drives:
5 Reallocated_Sector_Ct -> anything non-zero = bad; 1-10 is maybe tolerable, but better to replace the disk and take no risks
193 Load_Cycle_Count -> caution when reaching ~300,000 on a consumer disk (a laptop disk is consumer grade and will reach this even earlier; side note: I've seen that Proxmox sets APM 255 to prevent that where possible) and ~600,000 on SAS/enterprise grade
197 Current_Pending_Sector -> anything non-zero = bad; 1-10 is maybe tolerable, but better to replace the disk and take no risks
198 Offline_Uncorrectable -> anything non-zero = bad; 1-10 is maybe tolerable, but better to replace the disk and take no risks

But all in all the most important thing is the long test, which is perfect:
# 1 Extended offline Completed without error 00% 16895 -

What about the other disks? In the initial post there were errors on ata2, ata5 and ata6. (I don't know the corresponding /dev/sdX devices.)
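To keep an eye on exactly those attributes across several disks, one can filter them out of saved smartctl reports. A self-contained sketch (report.txt and the sample lines are placeholders; in real use the file would come from smartctl -a /dev/sdX > report.txt):

```shell
# Extract the attributes worth watching (IDs 5, 193, 197, 198) from a
# saved SMART report. A small sample stands in for the real report:
printf '%s\n' \
  '  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0' \
  '193 Load_Cycle_Count        0x0032   087   087   000    Old_age   Always       -       26647' \
  '197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0' \
  '198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0' \
  > report.txt
# Print ID, attribute name and raw value for just those four rows:
awk '$1 ~ /^(5|193|197|198)$/ {print $1, $2, $NF}' report.txt
```

Running this per disk gives a compact four-line summary per drive, which makes it easy to spot any non-zero reallocated/pending counts at a glance.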
 
Thank you very much for your super-precious information! I will run the command on all the other disks when I have time, but I suspect that the problem is somewhere else, since the disks are almost new! I have doubts about the controller, since the hardware is so old, but I don't know what to do :)
 
Yes, it's unlikely, but not impossible. Fiddling with the controller only to find out in the end that it was a disk error costs pulled-out hair and time. ;)
Double-check all the cables to the disks.
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s) -> either your hardware is really old and only has SATA-300 ports, or it could be a sign of flaky cables.

Freezes could also be caused by an old (half-dead) power supply that doesn't deliver enough energy quickly enough when the disks need it. Another classic cause of freezes is broken RAM, but that shouldn't show up as controller/disk errors. I can't rule out anything for now.
 
Ok, I'm running the check on all disks. I encountered this message on /dev/sdd:

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Interrupted (host reset) 00% 720 -



and this one on /dev/sde:

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Interrupted (host reset) 00% 7980 -
# 2 Extended offline Completed without error 00% 7895 -


Are they normal?
 
Are they normal?
If the interruption really happened while the test was running, then yes. It could be just the controller resetting itself (I saw that in your first post) or the whole host crashing and rebooting.
So I think we can rule out the disks here... plus there is the fact that they are indeed new.

Double-check the cables of the disks and try other ones if you have some spares. Do they sit tight or wobbly? It would also be a good idea to try a spare power supply, to check whether it behaves the same.
 
Thank you very much mr44er. I will of course check the cables, and also try to replace the one disk that seems to have damaged sectors (it is one of the old ones, recovered from a laptop). Anyway, since the power supply can be one of the causes, I really suspect that is the problem. The power supply unit is very old, and I connected 6 SATA disks with several power splitters... I imagine this is not the best design :-D

I will try to change the power supply in the next few days; it could be a long operation since I do not have much time. I will report back here when I have some news!

Thank you very very much! <3
 
several power splitters... I imagine this is not the best design :-D
Uhm... yes :) Been there, done that. Sometimes it works OK, but a flapping controller is normal behaviour when it doesn't. A new (modular!) power supply with more than enough SATA connectors is really the better choice. ;)
 
Safe is the wrong word. I've seen images of burnt/molten ones in Amazon reviews. :oops:
It could be just a bad batch, short circuits from the cheapest materials... or running 10+ disks on a single rail. I don't know which brand is recommendable or what is available in your area. I would buy 2 or 3 different brands and test them one after another, with a fire extinguisher at hand. :)
Also, the problem doesn't have to be the power cabling... with SATA data cables it's the same story regarding quality. Avoid the cheap ones, and only use ones with a metal clip. These give a better chance of a tight fit, but no guarantee of good contact.
 

Attachments

  • 1.jpg (10.9 KB)
  • 2.jpg (13.1 KB)
Safe is the wrong word. I've seen images of burnt/molten ones in Amazon reviews. :oops:
It could be just a bad batch, short circuits from the cheapest materials... or running 10+ disks on a single rail. I don't know which brand is recommendable or what is available in your area. I would buy 2 or 3 different brands and test them one after another, with a fire extinguisher at hand. :)
Also, the problem doesn't have to be the power cabling... with SATA data cables it's the same story regarding quality. Avoid the cheap ones, and only use ones with a metal clip. These give a better chance of a tight fit, but no guarantee of good contact.
Your overview was extremely precise and exhaustive, thank you very much! Yesterday I bought a new PSU and installed it, and the SATA errors seem to be gone! Proxmox ran all night without the ATA errors I encountered in the past. I will monitor the situation in the next few days, but I am quite confident :-)

There is just one error left (and this one could be real, since the disk is quite old):

Mar 21 09:23:12 proxmox smartd[1180]: Device: /dev/sdf [SAT], 8 Currently unreadable (pending) sectors
Mar 21 09:23:12 proxmox smartd[1180]: Device: /dev/sdf [SAT], 8 Offline uncorrectable sectors

Are those errors fixable in some way? Or do I need to replace the disk?

After this, I need to understand why the TrueNAS VM keeps freezing. Could it be due to the errors above?

I don't know if this is the right thread, but I'll try to ask: is the "pass through" solution the best one? I chose this setup just to get rid of one additional layer, but I am not sure it's the best (5 SATA disks passed through to the TrueNAS VM).

Thank you very very very much!
 
Yesterday I bought a new PSU and installed it, and the SATA errors seem to be gone! Proxmox ran all night without the ATA errors I encountered in the past. I will monitor the situation in the next few days, but I am quite confident
Yeah, that sounds good :)

Are those errors fixable in some way? Or I need to replace the disk?
It depends. Forget about fixing broken sectors; it's not worth it, and the damage to that area of the platter is irreparable anyway (for end users). A disk has a certain number of reserve sectors: either it handles these itself (blocking the 8 bad ones and using spares, in your case) or the number of broken sectors rises ever faster and climbs past the number of available spares. What you can do is a one-time write run with zeros. Be careful: this wipes all the data, so make sure you target the right disk!
dd bs=1M if=/dev/zero of=/dev/sdf status=progress (if that does not work, drop status=progress and run it without any progress output). It will take some hours.
If the disk throws any (write) errors, or errors show up on the Proxmox console, the disk is dead and should be replaced. If the run completes without error, check SMART again. It should show 16 (or more) offline uncorrectable sectors. If more -> replace it; more and more sectors will break. If 16, or still 8, after the run -> you decide whether to take the risk. Check SMART again after a week, and after a month. If the number stays low over time (unlikely), it can be tolerable.
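The wipe-and-recheck workflow can be rehearsed safely first. In this sketch the dd target is a scratch file standing in for /dev/sdf, so it is safe to run as-is; the SMART re-check is left as a comment since it needs the real hardware:

```shell
# DANGEROUS on a real device: the dd below erases its target completely.
# Here the target is a scratch file standing in for /dev/sdf, so the
# sketch is harmless; substitute the real device only after triple-
# checking it with lsblk and smartctl -i.
target=scratch.img
dd bs=1M count=8 if=/dev/zero of="$target" status=none
# Afterwards, re-check the sector counters on the real disk, e.g.:
#   smartctl -A /dev/sdf | grep -E 'Pending_Sector|Offline_Uncorrectable'
```

On the real disk you would drop the count= limit so dd runs to the end of the device, and use status=progress to watch it.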

After this, I need to understand why the TrueNAS VM keeps freezing. Could it be for the errors above?
Yes, possible.

is the "pass through" solution the best one? I chose this setup just to get rid of one additional layer, but I am not sure it's the best (5 sata disks passed through the TrueNAS VM).
Mhm, the best is passthrough of the whole disk controller (with everything connected to it), but that's not possible here. I don't know much about passing through single disks and their behaviour; I think it's better if you open a new thread with that specific question.