[SOLVED] Proxmox PVE disappear from the host

Guillaume Soucy

Hello,

I tried to log in to my PVE host and it failed, saying that the credentials weren't valid. I rebooted; when connected over SSH, the machine is very slow, and the weirdest part is that the directory /etc/pve is completely empty. No more VM configuration files.

I will do a fresh install and restore the backups, but how can I find out what really happened?

Thanks,

Guillaume
 
The /etc/pve directory is empty when the PVE services do not (all) start, because the entries in /etc/pve/ are not real files but come from a database (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_pmxcfs). This happens sometimes and can probably be repaired by fixing the underlying cause.
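A quick way to tell whether /etc/pve is empty because pmxcfs is not mounted (rather than because files were deleted) is to look for the mount itself. This is only a minimal sketch, assuming the usual setup where pmxcfs is started by pve-cluster.service and mounted at /etc/pve; the helper names `is_mounted` and `check_pve_mount` and the optional mounts-file argument are made up for illustration:

```shell
# Succeeds if the given path appears as a mount point in a mounts table.
# $1: mount point to look for, $2: mounts file (defaults to /proc/mounts).
is_mounted() {
    awk -v target="$1" '$2 == target { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Report whether something is mounted on /etc/pve or only the bare,
# empty directory of the root filesystem is visible.
check_pve_mount() {
    if is_mounted /etc/pve "${1:-/proc/mounts}"; then
        echo "/etc/pve is mounted"
    else
        echo "/etc/pve is NOT mounted"
    fi
}
```

Running `check_pve_mount` on the host would be the usage. If it reports NOT mounted, `systemctl status pve-cluster` and `journalctl -b 0 -u pve-cluster` are probably the next places to look.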
As a start, check which services did not start (systemctl --failed) and look through the system log (journalctl -b 0) for error messages to find out where the problem might be.
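The first of these checks can be scripted. A minimal sketch that extracts just the unit names from `systemctl --failed` output so they can be passed on to other commands; the function name `failed_units` is hypothetical:

```shell
# Read `systemctl --failed` output on stdin and print one failed unit name
# per line. Header and legend lines are skipped because they never contain
# the word "failed" as a separate column.
failed_units() {
    awk '/ failed / {
        for (i = 1; i <= NF; i++)
            if ($i ~ /\.[a-z]+$/) { print $i; break }
    }'
}
```

For example, `systemctl --failed | failed_units` should list only the unit names, and `journalctl -b 0 -p err` narrows the log of the current boot to error-level messages.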
 
Okay, I will try this first and let you know the results.

Thanks
 
Code:
systemctl --failed
returns:

Code:
systemctl --failed
  UNIT                     LOAD   ACTIVE SUB    DESCRIPTION
● pve-daily-update.service loaded failed failed Daily PVE download activities
● pve-firewall.service     loaded failed failed Proxmox VE firewall
● pve-guests.service       loaded failed failed PVE guests
● pve-ha-crm.service       loaded failed failed PVE Cluster HA Resource Manager Daemon
● pve-ha-lrm.service       loaded failed failed PVE Local HA Resource Manager Daemon
● pvescheduler.service     loaded failed failed Proxmox VE scheduler
● pvestatd.service         loaded failed failed PVE Status Daemon

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
7 loaded units listed.

May I post the logs here?
 
systemctl start $(systemctl --failed|grep failed|awk '{print $2}'|xargs)
systemctl --failed
 
Check your syslog or journalctl logs.
All the services are down, so a prerequisite they depend on may have been modified or may have failed to start.

What was the last change you made on your host?
 
This is the return of the above commands:

Code:
Failed to start pve-guests.service: Operation refused, unit pve-guests.service may be requested by dependency only (it is configured to refuse manual start/stop).
See system logs and 'systemctl status pve-guests.service' for details.
Job for pve-daily-update.service failed because the control process exited with error code.
See "systemctl status pve-daily-update.service" and "journalctl -xe" for details.
Job for pvestatd.service failed because the control process exited with error code.
See "systemctl status pvestatd.service" and "journalctl -xe" for details.
Job for pve-ha-crm.service failed because the control process exited with error code.
See "systemctl status pve-ha-crm.service" and "journalctl -xe" for details.
Job for pvescheduler.service failed because the control process exited with error code.
See "systemctl status pvescheduler.service" and "journalctl -xe" for details.
Job for pve-firewall.service failed because the control process exited with error code.
See "systemctl status pve-firewall.service" and "journalctl -xe" for details.
Job for pve-ha-lrm.service failed because the control process exited with error code.
See "systemctl status pve-ha-lrm.service" and "journalctl -xe" for details.
  UNIT                     LOAD   ACTIVE SUB    DESCRIPTION
● pve-daily-update.service loaded failed failed Daily PVE download activities
● pve-firewall.service     loaded failed failed Proxmox VE firewall
● pve-guests.service       loaded failed failed PVE guests
● pve-ha-crm.service       loaded failed failed PVE Cluster HA Resource Manager Daemon
● pve-ha-lrm.service       loaded failed failed PVE Local HA Resource Manager Daemon
● pvescheduler.service     loaded failed failed Proxmox VE scheduler
● pvestatd.service         loaded failed failed PVE Status Daemon

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
7 loaded units listed.


The only change I made to the host wasn't really one: I added a VM, but that was months ago.

I noticed the load is also very high for a host with no VM running.

Code:
uptime
returns:

Code:
10:56:31 up 1 day,  1:54,  1 user,  load average: 4.61, 5.01, 5.22
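
On Linux the load average counts tasks in uninterruptible sleep (state D, usually waiting on disk I/O) as well as runnable ones, so a high load on an otherwise idle host can point at storage. A small sketch for listing such tasks; `blocked_tasks` is just an illustrative name:

```shell
# Read `ps -eo state,pid,comm` output on stdin and print "PID COMMAND" for
# tasks in uninterruptible sleep (state column starts with "D").
# The numeric PID check keeps the header line out of the result.
blocked_tasks() {
    awk '$1 ~ /^D/ && $2 ~ /^[0-9]+$/ { print $2, $3 }'
}
```

Usage would be `ps -eo state,pid,comm | blocked_tasks`. If the same processes keep showing up there, they are likely stuck waiting on a device.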

Should I post the logs here or use a pastebin to do so?
 
Something is still going really wrong (maybe a broken disk or similar). Look at these to find out more:
systemctl status pve-ha-crm
systemctl status pve-ha-lrm
systemctl status pvescheduler
systemctl status pvestatd
journalctl -xe
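
The status checks can also be run in one pass. A rough sketch, using the unit name pvestatd as it appears in the `systemctl --failed` listing; the function name `show_status` and the 20-line limit are arbitrary choices for this example:

```shell
# Show a short status for each unit given as an argument. systemctl exits
# non-zero for failed units, so the result is ignored to keep the loop going.
show_status() {
    for unit in "$@"; do
        echo "=== $unit ==="
        systemctl status --no-pager --lines=20 "$unit" || true
    done
}
```

It could be called as `show_status pve-ha-crm pve-ha-lrm pvescheduler pvestatd`.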
 
Yes, the SSD went bad. I just received the replacement.

This is a sample of the syslog:

Code:
Oct 12 10:40:34 pve-02 pveproxy[207479]: worker exit
Oct 12 10:40:34 pve-02 pveproxy[207505]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1996.
Oct 12 10:40:34 pve-02 pveproxy[207480]: worker exit
Oct 12 10:40:34 pve-02 pveproxy[1049]: worker 207479 finished
Oct 12 10:40:34 pve-02 pveproxy[1049]: starting 1 worker(s)
Oct 12 10:40:34 pve-02 pveproxy[1049]: worker 207506 started
Oct 12 10:40:34 pve-02 pveproxy[1049]: worker 207480 finished
Oct 12 10:40:34 pve-02 pveproxy[1049]: starting 1 worker(s)
Oct 12 10:40:34 pve-02 pveproxy[1049]: worker 207507 started
Oct 12 10:40:34 pve-02 pveproxy[207506]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1996.
Oct 12 10:40:34 pve-02 pveproxy[207507]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1996.
Oct 12 10:40:35 pve-02 kernel: [265118.252476] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 12 10:40:35 pve-02 kernel: [265118.253581] ata2.00: BMDMA stat 0x25
Oct 12 10:40:35 pve-02 kernel: [265118.254647] ata2.00: failed command: READ DMA
Oct 12 10:40:35 pve-02 kernel: [265118.255694] ata2.00: cmd c8/00:08:c0:97:73/00:00:00:00:00/e2 tag 0 dma 4096 in
Oct 12 10:40:35 pve-02 kernel: [265118.255694]          res 51/40:00:f8:ff:ff/40:00:ff:00:00/e2 Emask 0x9 (media error)
Oct 12 10:40:35 pve-02 kernel: [265118.257857] ata2.00: status: { DRDY ERR }
Oct 12 10:40:35 pve-02 kernel: [265118.258962] ata2.00: error: { UNC }
Oct 12 10:40:35 pve-02 kernel: [265118.324676] ata2.00: configured for UDMA/133
Oct 12 10:40:35 pve-02 kernel: [265118.324691] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Oct 12 10:40:35 pve-02 kernel: [265118.324694] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
Oct 12 10:40:35 pve-02 kernel: [265118.324696] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
Oct 12 10:40:35 pve-02 kernel: [265118.324699] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 02 73 97 c0 00 00 08 00
Oct 12 10:40:35 pve-02 kernel: [265118.324700] blk_update_request: I/O error, dev sdb, sector 41129920 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Oct 12 10:40:35 pve-02 kernel: [265118.325789] ata2: EH complete
Oct 12 10:40:37 pve-02 kernel: [265120.036527] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 12 10:40:37 pve-02 kernel: [265120.037616] ata2.00: BMDMA stat 0x25
Oct 12 10:40:37 pve-02 kernel: [265120.038679] ata2.00: failed command: READ DMA
Oct 12 10:40:37 pve-02 kernel: [265120.039726] ata2.00: cmd c8/00:08:c0:97:73/00:00:00:00:00/e2 tag 0 dma 4096 in
Oct 12 10:40:37 pve-02 kernel: [265120.039726]          res 51/40:00:f8:ff:ff/40:00:ff:00:00/e2 Emask 0x9 (media error)
Oct 12 10:40:37 pve-02 kernel: [265120.041841] ata2.00: status: { DRDY ERR }
Oct 12 10:40:37 pve-02 kernel: [265120.042914] ata2.00: error: { UNC }
Oct 12 10:40:37 pve-02 kernel: [265120.108685] ata2.00: configured for UDMA/133
Oct 12 10:40:37 pve-02 kernel: [265120.108693] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
Oct 12 10:40:37 pve-02 kernel: [265120.108696] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
Oct 12 10:40:37 pve-02 kernel: [265120.108697] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
Oct 12 10:40:37 pve-02 kernel: [265120.108699] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 02 73 97 c0 00 00 08 00
Oct 12 10:40:37 pve-02 kernel: [265120.108700] blk_update_request: I/O error, dev sdb, sector 41129920 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Oct 12 10:40:37 pve-02 kernel: [265120.109779] ata2: EH complete
Oct 12 10:40:38 pve-02 kernel: [265121.820496] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 12 10:40:38 pve-02 kernel: [265121.821591] ata2.00: BMDMA stat 0x25
Oct 12 10:40:38 pve-02 kernel: [265121.822653] ata2.00: failed command: READ DMA
Oct 12 10:40:38 pve-02 kernel: [265121.823701] ata2.00: cmd c8/00:08:c0:97:73/00:00:00:00:00/e2 tag 0 dma 4096 in
Oct 12 10:40:38 pve-02 kernel: [265121.823701]          res 51/40:00:f8:ff:ff/40:00:ff:00:00/e2 Emask 0x9 (media error)
Oct 12 10:40:38 pve-02 kernel: [265121.825847] ata2.00: status: { DRDY ERR }
Oct 12 10:40:38 pve-02 kernel: [265121.826934] ata2.00: error: { UNC }
Oct 12 10:40:39 pve-02 pmxcfs[207503]: [database] crit: unable to set WAL mode: disk I/O error#010
Oct 12 10:40:39 pve-02 pmxcfs[207503]: [database] crit: unable to set WAL mode: disk I/O error#010
Oct 12 10:40:39 pve-02 kernel: [265121.892689] ata2.00: configured for UDMA/133
Oct 12 10:40:39 pve-02 kernel: [265121.892709] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Oct 12 10:40:39 pve-02 kernel: [265121.892713] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
Oct 12 10:40:39 pve-02 kernel: [265121.892714] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
Oct 12 10:40:39 pve-02 kernel: [265121.892717] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 02 73 97 c0 00 00 08 00
Oct 12 10:40:39 pve-02 kernel: [265121.892718] blk_update_request: I/O error, dev sdb, sector 41129920 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Oct 12 10:40:39 pve-02 kernel: [265121.893821] ata2: EH complete
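
The kernel lines in this sample already name the failing device and sector, and the pmxcfs "disk I/O error" lines sit right between them. A minimal sketch that summarises the block I/O errors from a saved log or from `journalctl -k` output; the function name `io_error_summary` is made up for this example:

```shell
# Read kernel log lines on stdin and print "count device sector" for each
# distinct block I/O error, most frequent first.
io_error_summary() {
    sed -n 's/.*I\/O error, dev \([^,]*\), sector \([0-9]*\).*/\1 \2/p' |
        sort | uniq -c | sort -rn | awk '{ print $1, $2, $3 }'
}
```

Usage would be something like `journalctl -k -b 0 | io_error_summary`. A single sector on one device repeated many times, as in the sample above, suggests a medium error on that disk rather than a cabling or controller problem, though that is an interpretation and not a certainty.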
 
Hello,

How did you confirm that the disk was the issue?
I have exactly the same issue and error messages, and I cannot access the web interface either.
But when I run smartctl on the disk, the test passes.
I would not be surprised if it were the disk, as I have suspected it of causing issues before, but I want to make sure before wiping everything.

Bash:
root@server:~# systemctl --failed
  UNIT                 LOAD   ACTIVE SUB    DESCRIPTION                           
● pve-firewall.service loaded failed failed Proxmox VE firewall
● pve-guests.service   loaded failed failed PVE guests
● pve-ha-crm.service   loaded failed failed PVE Cluster HA Resource Manager Daemon
● pve-ha-lrm.service   loaded failed failed PVE Local HA Resource Manager Daemon
● pvescheduler.service loaded failed failed Proxmox VE scheduler
● pvestatd.service     loaded failed failed PVE Status Daemon

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
6 loaded units listed.

And the SMART test:

Bash:
root@server:~# smartctl -a /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-10-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ADATA SU635
Serial Number:    2K1120125862
LU WWN Device Id: 0 000000 000000000
Firmware Version: S1120A0
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Mar  6 12:21:32 2025 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       28031
 12 Power_Cycle_Count       0x0032   100   100   050    Old_age   Always       -       608
160 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
161 Unknown_Attribute       0x0033   100   100   050    Pre-fail  Always       -       100
163 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       20
164 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       249690
165 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       3705
166 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       100
167 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       532
168 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       1500
169 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       65
175 Program_Fail_Count_Chip 0x0032   100   100   050    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   100   100   050    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0032   100   100   050    Old_age   Always       -       0
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   050    Old_age   Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   050    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   050    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       423
194 Temperature_Celsius     0x0022   100   100   050    Old_age   Always       -       30
195 Hardware_ECC_Recovered  0x0032   100   100   050    Old_age   Always       -       11231945
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0032   100   100   050    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   050    Old_age   Always       -       0
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       100
241 Total_LBAs_Written      0x0030   100   100   050    Old_age   Offline      -       549465
242 Total_LBAs_Read         0x0030   100   100   050    Old_age   Offline      -       3266377
245 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       839100

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     12189         -

Selective Self-tests/Logging not supported

Thanks
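
An overall PASSED verdict generally only means that no attribute has crossed its vendor threshold, so on its own it does not rule out a failing disk. One thing to compare against is the raw values of the sector-related attributes. A sketch that pulls those out of `smartctl -A` output; the attribute IDs chosen (5, 197, 198, 199) are a common convention rather than anything specified for this drive, and `smart_raw_values` is a hypothetical helper name:

```shell
# Read `smartctl -A` (or `-a`) output on stdin and print "ID NAME RAW" for
# the attributes most often tied to bad sectors or cabling problems.
# The flag column check ($3 starts with 0x) skips non-attribute lines.
smart_raw_values() {
    awk '$1 ~ /^(5|197|198|199)$/ && $3 ~ /^0x/ && NF >= 10 { print $1, $2, $10 }'
}
```

Usage would be `smartctl -A /dev/sdb | smart_raw_values`. In the output above all four are zero, and the only logged extended self-test ran at 12189 power-on hours while the drive now reports 28031, so a fresh `smartctl -t long /dev/sdb` and a look at the kernel log for lines such as `I/O error, dev sdb` are probably more telling than the summary verdict.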