[SOLVED] Proxmox PVE disappear from the host

Guillaume Soucy

Hello,

I tried to log in to my PVE and it failed, saying that the credentials are wrong. I rebooted, and when I connect over SSH the machine is very slow. The weirdest part is that the directory /etc/pve is completely empty: no more VM configuration files.

I will do a fresh install and restore the backups, but how can I find out what really happened?

Thanks,

Guillaume
 
I tried to login into my PVE and it fails saying that the credentials aren't good. I did a reboot, when connected using SSH, the machine is very slow and the weirdest part is that the directory /etc/pve is completely empty. No more VM configuration files.
The /etc/pve directory is empty when not all of the PVE services start, because the entries under /etc/pve/ are not real files but are served from a database (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_pmxcfs). This happens sometimes, and the contents can probably be restored by fixing the underlying cause.
I will do a fresh install and restore the backups but, how to know what really happened?
Check which services did not start (systemctl --failed) and check the system log (journalctl -b 0) for error messages to find out where the problem might be, for a start.
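
Since /etc/pve is a filesystem served by pmxcfs rather than a plain directory, one quick check is whether anything is mounted there at all. A minimal sketch, assuming a Linux mounts table; the helper name is_mounted and its optional second argument (an alternative mounts file, handy for trying it out) are made up for illustration:

```shell
# is_mounted DIR [MOUNTS_FILE]
# Succeeds (exit 0) if DIR is listed as a mount point in MOUNTS_FILE,
# which defaults to /proc/self/mounts. Hypothetical helper, not a PVE tool.
is_mounted() {
    awk -v dir="${1%/}" '
        BEGIN { if (dir == "") dir = "/" }   # "/" loses its slash above
        $2 == dir { found = 1 }              # field 2 is the mount point
        END { exit !found }
    ' "${2:-/proc/self/mounts}"
}

# Example use on the host (commented out so sourcing this has no side effects):
# is_mounted /etc/pve || echo "/etc/pve is not mounted, see: systemctl status pve-cluster"
```

pve-cluster is the unit that runs pmxcfs, so if the check fails, its status and journal entries are probably the first place to look.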
 
Okay, I will try this first and let you know the results.

Thanks
 
Code:
systemctl --failed
returns:

Code:
systemctl --failed
  UNIT                     LOAD   ACTIVE SUB    DESCRIPTION
● pve-daily-update.service loaded failed failed Daily PVE download activities
● pve-firewall.service     loaded failed failed Proxmox VE firewall
● pve-guests.service       loaded failed failed PVE guests
● pve-ha-crm.service       loaded failed failed PVE Cluster HA Resource Manager Daemon
● pve-ha-lrm.service       loaded failed failed PVE Local HA Resource Manager Daemon
● pvescheduler.service     loaded failed failed Proxmox VE scheduler
● pvestatd.service         loaded failed failed PVE Status Daemon

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
7 loaded units listed.

May I post the logs here?
 
systemctl start $(systemctl --failed|grep failed|awk '{print $2}'|xargs)
systemctl --failed
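
The one-liner above works, but grepping for the word "failed" is a little fragile. Here is a sketch of a slightly more defensive variant. failed_units is a hypothetical helper name; it only reads `systemctl --failed` style text on stdin, so it can be tried on saved output before touching any service:

```shell
# failed_units: read `systemctl --failed` style output on stdin and print
# one unit name per line. Only table rows (containing both "loaded" and
# "failed") are considered; header and legend lines are skipped.
failed_units() {
    awk '
        / loaded / && / failed / {
            for (i = 1; i <= NF; i++)
                if ($i ~ /\.(service|socket|timer|mount|target)$/) {
                    print $i
                    break
                }
        }
    '
}

# Example use on the host (left commented out on purpose):
# systemctl --failed --plain --no-legend | failed_units | xargs -r systemctl start
```

Units that are configured to refuse a manual start will still report an error when started this way.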
 
Check your syslog or journalctl logs.
All services are down; a prerequisite may have been modified or may have failed to come up.

What was the last modification on your host?
 
This is the return of the above commands:

Code:
Failed to start pve-guests.service: Operation refused, unit pve-guests.service may be requested by dependency only (it is configured to refuse manual start/stop).
See system logs and 'systemctl status pve-guests.service' for details.
Job for pve-daily-update.service failed because the control process exited with error code.
See "systemctl status pve-daily-update.service" and "journalctl -xe" for details.
Job for pvestatd.service failed because the control process exited with error code.
See "systemctl status pvestatd.service" and "journalctl -xe" for details.
Job for pve-ha-crm.service failed because the control process exited with error code.
See "systemctl status pve-ha-crm.service" and "journalctl -xe" for details.
Job for pvescheduler.service failed because the control process exited with error code.
See "systemctl status pvescheduler.service" and "journalctl -xe" for details.
Job for pve-firewall.service failed because the control process exited with error code.
See "systemctl status pve-firewall.service" and "journalctl -xe" for details.
Job for pve-ha-lrm.service failed because the control process exited with error code.
See "systemctl status pve-ha-lrm.service" and "journalctl -xe" for details.
  UNIT                     LOAD   ACTIVE SUB    DESCRIPTION
● pve-daily-update.service loaded failed failed Daily PVE download activities
● pve-firewall.service     loaded failed failed Proxmox VE firewall
● pve-guests.service       loaded failed failed PVE guests
● pve-ha-crm.service       loaded failed failed PVE Cluster HA Resource Manager Daemon
● pve-ha-lrm.service       loaded failed failed PVE Local HA Resource Manager Daemon
● pvescheduler.service     loaded failed failed Proxmox VE scheduler
● pvestatd.service         loaded failed failed PVE Status Daemon

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
7 loaded units listed.

What's your last modification on your host ?

The only modification I made to the host wasn't really one: I added a VM, but that was months ago.

I also noticed that the load is very high for a host with no VMs running.

Code:
uptime
returns:

Code:
10:56:31 up 1 day,  1:54,  1 user,  load average: 4.61, 5.01, 5.22
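
On Linux the load average also counts tasks in uninterruptible sleep (usually waiting on disk I/O), so a high load with no guests running often points at storage rather than CPU. A rough sketch for listing such tasks follows. d_state_tasks is a hypothetical helper name, and it assumes input in the format of `ps -eo state,pid,comm` (one header line, then state, PID and command name):

```shell
# d_state_tasks: read `ps -eo state,pid,comm` style output on stdin and
# print "PID COMMAND" for every task whose state starts with D
# (uninterruptible sleep). The first line is treated as the header.
d_state_tasks() {
    awk '
        NR > 1 && $1 ~ /^D/ {
            pid = $2
            $1 = ""; $2 = ""        # drop state and PID, keep the command
            sub(/^ +/, "")
            print pid, $0
        }
    '
}

# Example use on the host (commented out on purpose):
# ps -eo state,pid,comm | d_state_tasks
```

If the list is non-empty and stays that way across several runs, the kernel log is worth checking for I/O errors.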

Should I post the logs here or use a pastebin to do so?
 
Something is still going badly wrong (maybe a broken disk or similar). Look at these to find out more:
systemctl status pve-ha-crm
systemctl status pve-ha-lrm
systemctl status pvescheduler
systemctl status pvestatd
journalctl -xe
 
Yes, the SSD went bad. I just got the replacement one here.

This is a sample of the syslog:

Code:
Oct 12 10:40:34 pve-02 pveproxy[207479]: worker exit
Oct 12 10:40:34 pve-02 pveproxy[207505]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1996.
Oct 12 10:40:34 pve-02 pveproxy[207480]: worker exit
Oct 12 10:40:34 pve-02 pveproxy[1049]: worker 207479 finished
Oct 12 10:40:34 pve-02 pveproxy[1049]: starting 1 worker(s)
Oct 12 10:40:34 pve-02 pveproxy[1049]: worker 207506 started
Oct 12 10:40:34 pve-02 pveproxy[1049]: worker 207480 finished
Oct 12 10:40:34 pve-02 pveproxy[1049]: starting 1 worker(s)
Oct 12 10:40:34 pve-02 pveproxy[1049]: worker 207507 started
Oct 12 10:40:34 pve-02 pveproxy[207506]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1996.
Oct 12 10:40:34 pve-02 pveproxy[207507]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1996.
Oct 12 10:40:35 pve-02 kernel: [265118.252476] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 12 10:40:35 pve-02 kernel: [265118.253581] ata2.00: BMDMA stat 0x25
Oct 12 10:40:35 pve-02 kernel: [265118.254647] ata2.00: failed command: READ DMA
Oct 12 10:40:35 pve-02 kernel: [265118.255694] ata2.00: cmd c8/00:08:c0:97:73/00:00:00:00:00/e2 tag 0 dma 4096 in
Oct 12 10:40:35 pve-02 kernel: [265118.255694]          res 51/40:00:f8:ff:ff/40:00:ff:00:00/e2 Emask 0x9 (media error)
Oct 12 10:40:35 pve-02 kernel: [265118.257857] ata2.00: status: { DRDY ERR }
Oct 12 10:40:35 pve-02 kernel: [265118.258962] ata2.00: error: { UNC }
Oct 12 10:40:35 pve-02 kernel: [265118.324676] ata2.00: configured for UDMA/133
Oct 12 10:40:35 pve-02 kernel: [265118.324691] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Oct 12 10:40:35 pve-02 kernel: [265118.324694] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
Oct 12 10:40:35 pve-02 kernel: [265118.324696] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
Oct 12 10:40:35 pve-02 kernel: [265118.324699] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 02 73 97 c0 00 00 08 00
Oct 12 10:40:35 pve-02 kernel: [265118.324700] blk_update_request: I/O error, dev sdb, sector 41129920 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Oct 12 10:40:35 pve-02 kernel: [265118.325789] ata2: EH complete
Oct 12 10:40:37 pve-02 kernel: [265120.036527] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 12 10:40:37 pve-02 kernel: [265120.037616] ata2.00: BMDMA stat 0x25
Oct 12 10:40:37 pve-02 kernel: [265120.038679] ata2.00: failed command: READ DMA
Oct 12 10:40:37 pve-02 kernel: [265120.039726] ata2.00: cmd c8/00:08:c0:97:73/00:00:00:00:00/e2 tag 0 dma 4096 in
Oct 12 10:40:37 pve-02 kernel: [265120.039726]          res 51/40:00:f8:ff:ff/40:00:ff:00:00/e2 Emask 0x9 (media error)
Oct 12 10:40:37 pve-02 kernel: [265120.041841] ata2.00: status: { DRDY ERR }
Oct 12 10:40:37 pve-02 kernel: [265120.042914] ata2.00: error: { UNC }
Oct 12 10:40:37 pve-02 kernel: [265120.108685] ata2.00: configured for UDMA/133
Oct 12 10:40:37 pve-02 kernel: [265120.108693] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
Oct 12 10:40:37 pve-02 kernel: [265120.108696] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
Oct 12 10:40:37 pve-02 kernel: [265120.108697] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
Oct 12 10:40:37 pve-02 kernel: [265120.108699] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 02 73 97 c0 00 00 08 00
Oct 12 10:40:37 pve-02 kernel: [265120.108700] blk_update_request: I/O error, dev sdb, sector 41129920 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Oct 12 10:40:37 pve-02 kernel: [265120.109779] ata2: EH complete
Oct 12 10:40:38 pve-02 kernel: [265121.820496] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 12 10:40:38 pve-02 kernel: [265121.821591] ata2.00: BMDMA stat 0x25
Oct 12 10:40:38 pve-02 kernel: [265121.822653] ata2.00: failed command: READ DMA
Oct 12 10:40:38 pve-02 kernel: [265121.823701] ata2.00: cmd c8/00:08:c0:97:73/00:00:00:00:00/e2 tag 0 dma 4096 in
Oct 12 10:40:38 pve-02 kernel: [265121.823701]          res 51/40:00:f8:ff:ff/40:00:ff:00:00/e2 Emask 0x9 (media error)
Oct 12 10:40:38 pve-02 kernel: [265121.825847] ata2.00: status: { DRDY ERR }
Oct 12 10:40:38 pve-02 kernel: [265121.826934] ata2.00: error: { UNC }
Oct 12 10:40:39 pve-02 pmxcfs[207503]: [database] crit: unable to set WAL mode: disk I/O error#010
Oct 12 10:40:39 pve-02 pmxcfs[207503]: [database] crit: unable to set WAL mode: disk I/O error#010
Oct 12 10:40:39 pve-02 kernel: [265121.892689] ata2.00: configured for UDMA/133
Oct 12 10:40:39 pve-02 kernel: [265121.892709] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Oct 12 10:40:39 pve-02 kernel: [265121.892713] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
Oct 12 10:40:39 pve-02 kernel: [265121.892714] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
Oct 12 10:40:39 pve-02 kernel: [265121.892717] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 02 73 97 c0 00 00 08 00
Oct 12 10:40:39 pve-02 kernel: [265121.892718] blk_update_request: I/O error, dev sdb, sector 41129920 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Oct 12 10:40:39 pve-02 kernel: [265121.893821] ata2: EH complete
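
A log like the one above is easier to read once the repeated I/O errors are grouped. The following is a small sketch, not an official tool: io_error_summary is a hypothetical helper name, and it assumes kernel lines of the form "I/O error, dev DEVICE, sector N" as seen in this sample.

```shell
# io_error_summary: read kernel log lines on stdin and print
# "DEVICE SECTOR COUNT" for each distinct device/sector pair that
# appears in an "I/O error, dev ..., sector ..." message.
io_error_summary() {
    sed -n 's/.*I\/O error, dev \([^,]*\), sector \([0-9][0-9]*\).*/\1 \2/p' |
        sort |
        uniq -c |
        awk '{ print $2, $3, $1 }'
}

# Example use on the host (commented out on purpose):
# journalctl -k -b 0 | io_error_summary
```

In the sample above every error hits sector 41129920 on sdb, and pmxcfs reports a disk I/O error at the same time, which is consistent with the configuration database sitting on the failing disk. If smartmontools is installed, `smartctl -a /dev/sdb` would be a reasonable next check.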