HELP Please All of a sudden Proxmox won't boot. It boots until the login screen and then stops all services and shuts down.

Jarvar

The system reports [sde] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
and then all services are stopped and the shutdown process begins.
 
A quick skim of Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK on Google seems to indicate that this is very frequently a hardware problem with the disk.

Are you using HDDs or SSDs with your setup? Booting with a rescue drive and invoking smartctl -a /dev/<disk-id> on your root disk should give you some vital information about your device's health.
If you haven't been able to resolve your problem yet, please post some more information about your setup. This is essential to narrow down what the problem could be.
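
For anyone following along, here is a rough sketch of the health check suggested above. smartctl -H prints the overall self-assessment and smartctl -A prints the attribute table. The smart_failing_attrs function and the /dev/sdX device name are made up for illustration. The function only flags attributes whose normalized value has reached a non-zero threshold or whose WHEN_FAILED column is set, so treat it as a quick filter and not a full verdict on the disk.

```shell
#!/bin/sh
# Hypothetical helper (not part of smartmontools): reads `smartctl -A` output
# on stdin and prints attributes that look unhealthy.
# Column layout assumed: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        value = $4 + 0      # normalized value, e.g. "097" -> 97
        thresh = $6 + 0     # failure threshold, 0 means "no threshold"
        if ((thresh > 0 && value <= thresh) || $9 != "-")
            print $1, $2, "value=" $4, "thresh=" $6, "when_failed=" $9
    }'
}

# Usage on a real system (as root; /dev/sdX is a placeholder):
#   smartctl -H /dev/sdX
#   smartctl -A /dev/sdX | smart_failing_attrs
```

No output from the filter means no attribute tripped these two checks.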
 
Thank you @datschlatscher
I'm using SSDs, dual Raidz.
I have a pair of 240GB SSDs in Mirror configuration for the Proxmox install and then another two 960 GB SSDs for the zpool storage of VMs.
Both sets of drives are enterprise Intel SSDs.
Here is the smartctl output.


Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Dell Certified Intel S4x00/D3-S4x10 Series SSDs
Device Model:     SSDSC2KG240G7R
Serial Number:
LU WWN Device Id: 5 5cd2e4 14f1454fb
Add. Product Id:  DELL(tm)
Firmware Version: SCV1DL57
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jul 11 10:45:45 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  36) seconds.
Offline data collection
capabilities:                    (0x79) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  60) minutes.
Conveyance self-test routine
recommended polling time:        (  60) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000e   130   130   039    Old_age   Always       -       314034
  5 Reallocated_Sector_Ct   0x0033   100   100   001    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       33126
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       123
 13 Read_Soft_Error_Rate    0x001e   130   130   000    Old_age   Always       -       2027224877746
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       66
179 Used_Rsvd_Blk_Cnt_Tot   0x0033   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0032   100   100   000    Old_age   Always       -       9865
181 Program_Fail_Cnt_Total  0x003a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x003a   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       22
195 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
201 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2789 (334 8937)
202 End_of_Life             0x0027   100   100   000    Pre-fail  Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1150957
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       3542
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       34
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       1986949
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       1150957
234 Thermal_Throttle_Status 0x0032   100   100   000    Old_age   Always       -       0/0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       1150957
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       629107
245 Percent_Life_Remaining  0x0032   097   097   000    Old_age   Always       -       97

SMART Error Log not supported

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     32977         -
# 2  Extended offline    Completed without error       00%     32298         -
# 3  Extended offline    Completed without error       00%     31560         -
# 4  Extended offline    Completed without error       00%     30847         -
# 5  Extended offline    Completed without error       00%     30110         -
# 6  Extended offline    Completed without error       00%     29445         -
# 7  Extended offline    Completed without error       00%     28707         -
# 8  Extended offline    Completed without error       00%     27970         -
# 9  Extended offline    Completed without error       00%     27255         -
#10  Extended offline    Completed without error       00%     26518         -
#11  Extended offline    Completed without error       00%     25804         -
#12  Extended offline    Completed without error       00%     25067         -
#13  Extended offline    Completed without error       00%     24330         -
#14  Extended offline    Completed without error       00%     23616         -
#15  Extended offline    Completed without error       00%     22879         -
#16  Extended offline    Completed without error       00%     22165         -
#17  Extended offline    Completed without error       00%     21429         -
#18  Extended offline    Completed without error       00%     20763         -
#19  Extended offline    Completed without error       00%     20026         -
#20  Extended offline    Completed without error       00%     19288         -
#21  Extended offline    Completed without error       00%     18575         -

Read SMART Selective Self-test Log failed: scsi error aborted command

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.39-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Dell Certified Intel S4x00/D3-S4x10 Series SSDs
Device Model:     SSDSC2KG240G7R
Serial Number:
LU WWN Device Id: 5 5cd2e4 14f142274
Add. Product Id:  DELL(tm)
Firmware Version: SCV1DL57
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jul 11 10:51:16 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  36) seconds.
Offline data collection
capabilities:                    (0x79) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  60) minutes.
Conveyance self-test routine
recommended polling time:        (  60) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000e   130   130   039    Old_age   Always       -       398871
  5 Reallocated_Sector_Ct   0x0033   100   100   001    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       33126
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       123
 13 Read_Soft_Error_Rate    0x001e   130   130   000    Old_age   Always       -       2680059991575
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       66
179 Used_Rsvd_Blk_Cnt_Tot   0x0033   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0032   100   100   000    Old_age   Always       -       9818
181 Program_Fail_Cnt_Total  0x003a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x003a   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       24
195 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
201 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2563 (328 8883)
202 End_of_Life             0x0027   100   100   000    Pre-fail  Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1150932
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       3563
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       33
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       1986971
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       1150932
234 Thermal_Throttle_Status 0x0032   100   100   000    Old_age   Always       -       0/0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       1150932
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       605961
245 Percent_Life_Remaining  0x0032   097   097   000    Old_age   Always       -       97

SMART Error Log not supported

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     32978         -
# 2  Extended offline    Completed without error       00%     32299         -
# 3  Extended offline    Completed without error       00%     31562         -
# 4  Extended offline    Completed without error       00%     30848         -
# 5  Extended offline    Completed without error       00%     30112         -
# 6  Extended offline    Completed without error       00%     29446         -
# 7  Extended offline    Completed without error       00%     28708         -
# 8  Extended offline    Completed without error       00%     27971         -
# 9  Extended offline    Completed without error       00%     27257         -
#10  Extended offline    Completed without error       00%     26519         -
#11  Extended offline    Completed without error       00%     25806         -
#12  Extended offline    Completed without error       00%     25068         -
#13  Extended offline    Completed without error       00%     24331         -
#14  Extended offline    Completed without error       00%     23617         -
#15  Extended offline    Completed without error       00%     22880         -
#16  Extended offline    Completed without error       00%     22167         -
#17  Extended offline    Completed without error       00%     21430         -
#18  Extended offline    Completed without error       00%     20764         -
#19  Extended offline    Completed without error       00%     20027         -
#20  Extended offline    Completed without error       00%     19290         -
#21  Extended offline    Completed without error       00%     18576         -

Read SMART Selective Self-test Log failed: scsi error aborted command
 
One other thing I'm wondering about is whether it could be a loop: I had set up the Proxmox server as a NUT client of a Synology NAS acting as UPS server on the same network. It was running for a few days, but maybe there was an error and it was constantly being sent a shutdown signal?
Is that possible?
It shut down yesterday after I reinstalled Proxmox. I then turned off the NUT server on the UPS and the shutdowns seem to have stopped. It was on 6.4 and now it's on 7.2-7.
 
Your SMART data looks fine to me, I can't see any problem indicators.
Is the error message you posted above the only error message you are getting? I just realized that /dev/sde seems to be the culprit.
That, however, makes me a bit confused about your setup, as you said you have only 4 disks (but nothing is guaranteed with /dev/sd* device mappings).
Could you please post the output of lsblk?
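
Since kernel names like /dev/sde can change between boots, it may help to look up the persistent name that udev assigns under /dev/disk/by-id. A minimal sketch, assuming a standard udev setup; the stable_names_for function is a made-up helper, and the directory argument exists only so the function can be tried against a test directory.

```shell
#!/bin/sh
# Hypothetical helper: print every persistent name under /dev/disk/by-id
# (or another directory passed as $2) that resolves to the device node in $1.
stable_names_for() {
    target=$(readlink -f "$1") || return 1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -e "$link" ] || continue          # skip an unmatched glob or dangling link
        if [ "$(readlink -f "$link")" = "$target" ]; then
            printf '%s\n' "$link"
        fi
    done
}

# Usage on a real system:
#   stable_names_for /dev/sde
#   lsblk -o NAME,SIZE,TRAN,MODEL,SERIAL    # TRAN shows "usb" or "sata" per disk
```

A by-id name starting with usb- would point to an external drive rather than one of the internal SSDs.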

Also, have you tried launching Proxmox in Rescue Mode from the boot menu? The output from journalctl or dmesg might also be very valuable, but I'm not sure how to access those/whether you can somehow chroot into your setup.

PS: (I've been told) This may be of interest, are you using a multipath setup?
 
Thank you @datschlatscher
I think /dev/sde is most likely an external USB drive.
On the Proxmox there was a PBS installed on a VM with USB pass through for backups.
The Proxmox was being backed up to the PBS locally onsite, and then synced to a remote offsite nightly.
I've run journalctl and dmesg, but both outputs are really long. Should I post them here? Or what should I be looking for?
 
Another thing I saw in a few forums regarding this was that some people received these errors because of, for example, faulty SATA connectors. If you have spare ones at hand, please try swapping them out.

Should I post them here?
Yes, please do so. The best way would be to pipe them into a file and attach it to your next post here.
The only important thing is to include the relevant, recent entries. You could either post all entries from today via e.g. journalctl --since="today" or just the messages since the last boot via journalctl -b 0.
The output of lsblk would still be useful.
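
As a rough sketch of how the log could be saved and narrowed down to the device in question: journalctl -b 0 and --no-pager are standard options, while the lines_for_device function and the output file name are only placeholders for illustration. If the journal is persistent, journalctl -b -1 would show the previous boot, which is where an unexpected shutdown would have been logged.

```shell
#!/bin/sh
# Hypothetical helper: keep only log lines tagged with a given block device,
# e.g. "[sde]", reading journalctl or dmesg text from stdin.
lines_for_device() {
    grep -F "[$1]"
}

# Usage on a real system (file name is a placeholder):
#   journalctl -b 0 --no-pager > journal_boot0.txt
#   journalctl -b 0 --no-pager | lines_for_device sde
#   dmesg | lines_for_device sde
```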
 
Here is my lsblk output:
Code:
 lsblk
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda        8:0    0 223.6G  0 disk
├─sda1     8:1    0  1007K  0 part
├─sda2     8:2    0   512M  0 part
└─sda3     8:3    0 223.1G  0 part
sdb        8:16   0 223.6G  0 disk
├─sdb1     8:17   0  1007K  0 part
├─sdb2     8:18   0   512M  0 part
└─sdb3     8:19   0 223.1G  0 part
sdc        8:32   0 894.3G  0 disk
├─sdc1     8:33   0 894.2G  0 part
└─sdc9     8:41   0     8M  0 part
sdd        8:48   0 894.3G  0 disk
├─sdd1     8:49   0 894.2G  0 part
└─sdd9     8:57   0     8M  0 part
sr0       11:0    1  1024M  0 rom
zd0      230:0    0   240G  0 disk
├─zd0p1  230:1    0   549M  0 part
└─zd0p2  230:2    0 239.5G  0 part
zd16     230:16   0    32G  0 disk
├─zd16p1 230:17   0    30G  0 part
├─zd16p2 230:18   0     1K  0 part
└─zd16p5 230:21   0     2G  0 part
zd32     230:32   0   960G  0 disk
├─zd32p1 230:33   0   128M  0 part
└─zd32p2 230:34   0 959.9G  0 part
zd48     230:48   0   100G  0 disk
├─zd48p1 230:49   0  1007K  0 part
├─zd48p2 230:50   0   512M  0 part
└─zd48p3 230:51   0  99.5G  0 part

I have since reinstalled Proxmox. Last night I went ahead and updated the BIOS and the drivers for the server, a Dell T330, to the latest versions. The [sde] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK message did show up again, but the server did not shut down or reboot.

I attached the latest journalctl -b 0
 

Attachments

  • 07122022_journalctl_pve.txt
    158 KB
Ok, thanks for the log.
According to your journalctl messages, it looks like you are using external hard drives connected via USB. As for why you experienced these crashes/errors during system start, that is hard to say; unfortunately, I don't know. Here's hoping they don't come back on the new install.

It seems the Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK error was not the thing causing your crashes then.
As I mentioned earlier, all these messages are preceded by [sde], which in this case, should reference a disk at /dev/sde. (As far as I can tell from the log this is a portable USB hard disk drive?)

It does not show up in the lsblk output though, did you unplug it? The caching errors might still signal a hardware error with the drive or the connector, you should try to test this if you can. Considering that this is not your root disk, though, it should not be the cause of your problems.
Feel free to post again if you experience some issues again.
 
The USB drive in particular is passed through to a Proxmox Backup Server VM. Another drive, possibly /dev/sdf, is passed through to a Windows Server VM. I'm guessing that when drives are claimed by a VM or container they no longer show up in the primary PVE OS unless I disconnect them from the VM.
So my likely suspect would be the NUT server initiating a shutdown because of improper messages being sent about a power outage.
I'll monitor to see if that happens again. At least the shutdown is a graceful one and not an abrupt disconnect.
Thank you so much for your time @datschlatscher
I have put in requests to replace the external drives, as they're going on 4 years of 24/7 operation. A UPS also died on them recently, and everything had a cold disconnect 3-4 times in the past couple of months, which can be hard on the systems.
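
One possible way to check the NUT suspicion is to look for messages from the NUT client daemon around the time of a shutdown. This is only a sketch under assumptions: it assumes the client daemon logs under the name upsmon (the usual NUT monitoring daemon), and the upsmon_lines function is a made-up helper. The exact service and identifier names may differ between NUT versions and distributions.

```shell
#!/bin/sh
# Hypothetical helper: keep only log lines written by upsmon, the NUT client
# daemon, reading journal text from stdin.
# Assumption: lines look like "... upsmon[1234]: message".
upsmon_lines() {
    grep -E '(^|[[:space:]])upsmon(\[[0-9]+\])?:'
}

# Usage on a real system:
#   journalctl -b 0 --no-pager | upsmon_lines
#   journalctl -b -1 --no-pager | upsmon_lines   # previous boot, needs a persistent journal
```

If upsmon entries about the UPS being on battery or a forced shutdown show up right before the services stop, that would support the NUT theory. No such entries would point elsewhere.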
 