[SOLVED] Proxmox PVE disappeared from the host

Guillaume Soucy

Hello,

I tried to log in to my PVE and it fails, saying that the credentials aren't good. I rebooted, and when connected over SSH the machine is very slow. The weirdest part is that the directory /etc/pve is completely empty: no more VM configuration files.

I will do a fresh install and restore the backups, but how can I find out what really happened?

Thanks,

Guillaume
 
The /etc/pve directory is empty when not all PVE services have started, because the entries under /etc/pve/ are not real files on disk but are served from a database by pmxcfs ( https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_pmxcfs ). This happens sometimes, and the contents can probably be restored by fixing the underlying cause.
As a start, check which services did not start (systemctl --failed) and look through the system log of the current boot (journalctl -b 0) for error messages that show where the problem might be.
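Since /etc/pve is a FUSE mount provided by pmxcfs rather than a normal directory, one quick check is whether anything is mounted there at all. This is only a minimal sketch, assuming a Linux host with /proc/mounts; the function name pve_fs_mounted is made up for this example and is not part of Proxmox.

```shell
# Hypothetical helper (not a Proxmox tool): return 0 if a FUSE filesystem
# is mounted at /etc/pve according to the given mounts table, 1 otherwise.
# The mounts table defaults to /proc/mounts.
pve_fs_mounted() {
    mounts_file="${1:-/proc/mounts}"
    # Field 2 is the mount point, field 3 the filesystem type.
    awk '$2 == "/etc/pve" && $3 ~ /^fuse/ { found = 1 }
         END { exit !found }' "$mounts_file"
}
```

It could be used as `pve_fs_mounted && echo "pmxcfs is mounted" || echo "nothing mounted on /etc/pve"`. If nothing is mounted, `journalctl -b 0 -u pve-cluster` is a reasonable next place to look, since pve-cluster.service is the unit that runs pmxcfs.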
 
Okay, I will try this first and let you know the results.

Thanks
 
Code:
systemctl --failed
returns:

Code:
systemctl --failed
  UNIT                     LOAD   ACTIVE SUB    DESCRIPTION
● pve-daily-update.service loaded failed failed Daily PVE download activities
● pve-firewall.service     loaded failed failed Proxmox VE firewall
● pve-guests.service       loaded failed failed PVE guests
● pve-ha-crm.service       loaded failed failed PVE Cluster HA Resource Manager Daemon
● pve-ha-lrm.service       loaded failed failed PVE Local HA Resource Manager Daemon
● pvescheduler.service     loaded failed failed Proxmox VE scheduler
● pvestatd.service         loaded failed failed PVE Status Daemon

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
7 loaded units listed.

May I post the logs here?
 
systemctl start $(systemctl --failed|grep failed|awk '{print $2}'|xargs)
systemctl --failed
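The one-liner above takes the second whitespace-separated field, which works only while each line starts with the "●" marker. A slightly more defensive variant picks the first field that looks like a unit name instead. This is a sketch under that assumption; failed_units is a hypothetical name chosen for the example.

```shell
# Hypothetical helper: read `systemctl --failed` style output on stdin and
# print one unit name per line, whether or not a line starts with "●".
# Header and legend lines are skipped because no field ends in a unit suffix.
failed_units() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /\.(service|socket|timer|mount|target)$/) {
                print $i
                break
            }
        }
    }'
}
```

For example, `systemctl --failed --plain --no-legend | failed_units` lists the names, and `systemctl start $(systemctl --failed --plain --no-legend | failed_units)` tries to start them again. Units that refuse manual start will still report an error.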
 
Check your syslog or journalctl logs.
All the services are down, so a prerequisite may have been modified or may have failed to come up.

What was the last modification you made on your host?
 
This is the return of the above commands:

Code:
Failed to start pve-guests.service: Operation refused, unit pve-guests.service may be requested by dependency only (it is configured to refuse manual start/stop).
See system logs and 'systemctl status pve-guests.service' for details.
Job for pve-daily-update.service failed because the control process exited with error code.
See "systemctl status pve-daily-update.service" and "journalctl -xe" for details.
Job for pvestatd.service failed because the control process exited with error code.
See "systemctl status pvestatd.service" and "journalctl -xe" for details.
Job for pve-ha-crm.service failed because the control process exited with error code.
See "systemctl status pve-ha-crm.service" and "journalctl -xe" for details.
Job for pvescheduler.service failed because the control process exited with error code.
See "systemctl status pvescheduler.service" and "journalctl -xe" for details.
Job for pve-firewall.service failed because the control process exited with error code.
See "systemctl status pve-firewall.service" and "journalctl -xe" for details.
Job for pve-ha-lrm.service failed because the control process exited with error code.
See "systemctl status pve-ha-lrm.service" and "journalctl -xe" for details.
  UNIT                     LOAD   ACTIVE SUB    DESCRIPTION
● pve-daily-update.service loaded failed failed Daily PVE download activities
● pve-firewall.service     loaded failed failed Proxmox VE firewall
● pve-guests.service       loaded failed failed PVE guests
● pve-ha-crm.service       loaded failed failed PVE Cluster HA Resource Manager Daemon
● pve-ha-lrm.service       loaded failed failed PVE Local HA Resource Manager Daemon
● pvescheduler.service     loaded failed failed Proxmox VE scheduler
● pvestatd.service         loaded failed failed PVE Status Daemon

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
7 loaded units listed.

What was the last modification you made on your host?

The only change I made to the host wasn't really one: I added a VM, but that was months ago.

I also noticed that the load is very high for a host with no VMs running.

Code:
uptime
returns:

Code:
10:56:31 up 1 day,  1:54,  1 user,  load average: 4.61, 5.01, 5.22

Should I post the logs here or use a pastebin to do so?
 
Something is still going badly wrong (maybe a broken disk or similar). Look at these to find out more:
systemctl status pve-ha-crm
systemctl status pve-ha-lrm
systemctl status pvescheduler
systemctl status pvestatd
journalctl -xe
 
Yes, the SSD went bad. I just got the replacement one here.

This is a sample of the syslog:

Code:
Oct 12 10:40:34 pve-02 pveproxy[207479]: worker exit
Oct 12 10:40:34 pve-02 pveproxy[207505]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1996.
Oct 12 10:40:34 pve-02 pveproxy[207480]: worker exit
Oct 12 10:40:34 pve-02 pveproxy[1049]: worker 207479 finished
Oct 12 10:40:34 pve-02 pveproxy[1049]: starting 1 worker(s)
Oct 12 10:40:34 pve-02 pveproxy[1049]: worker 207506 started
Oct 12 10:40:34 pve-02 pveproxy[1049]: worker 207480 finished
Oct 12 10:40:34 pve-02 pveproxy[1049]: starting 1 worker(s)
Oct 12 10:40:34 pve-02 pveproxy[1049]: worker 207507 started
Oct 12 10:40:34 pve-02 pveproxy[207506]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1996.
Oct 12 10:40:34 pve-02 pveproxy[207507]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1996.
Oct 12 10:40:35 pve-02 kernel: [265118.252476] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 12 10:40:35 pve-02 kernel: [265118.253581] ata2.00: BMDMA stat 0x25
Oct 12 10:40:35 pve-02 kernel: [265118.254647] ata2.00: failed command: READ DMA
Oct 12 10:40:35 pve-02 kernel: [265118.255694] ata2.00: cmd c8/00:08:c0:97:73/00:00:00:00:00/e2 tag 0 dma 4096 in
Oct 12 10:40:35 pve-02 kernel: [265118.255694]          res 51/40:00:f8:ff:ff/40:00:ff:00:00/e2 Emask 0x9 (media error)
Oct 12 10:40:35 pve-02 kernel: [265118.257857] ata2.00: status: { DRDY ERR }
Oct 12 10:40:35 pve-02 kernel: [265118.258962] ata2.00: error: { UNC }
Oct 12 10:40:35 pve-02 kernel: [265118.324676] ata2.00: configured for UDMA/133
Oct 12 10:40:35 pve-02 kernel: [265118.324691] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Oct 12 10:40:35 pve-02 kernel: [265118.324694] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
Oct 12 10:40:35 pve-02 kernel: [265118.324696] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
Oct 12 10:40:35 pve-02 kernel: [265118.324699] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 02 73 97 c0 00 00 08 00
Oct 12 10:40:35 pve-02 kernel: [265118.324700] blk_update_request: I/O error, dev sdb, sector 41129920 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Oct 12 10:40:35 pve-02 kernel: [265118.325789] ata2: EH complete
Oct 12 10:40:37 pve-02 kernel: [265120.036527] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 12 10:40:37 pve-02 kernel: [265120.037616] ata2.00: BMDMA stat 0x25
Oct 12 10:40:37 pve-02 kernel: [265120.038679] ata2.00: failed command: READ DMA
Oct 12 10:40:37 pve-02 kernel: [265120.039726] ata2.00: cmd c8/00:08:c0:97:73/00:00:00:00:00/e2 tag 0 dma 4096 in
Oct 12 10:40:37 pve-02 kernel: [265120.039726]          res 51/40:00:f8:ff:ff/40:00:ff:00:00/e2 Emask 0x9 (media error)
Oct 12 10:40:37 pve-02 kernel: [265120.041841] ata2.00: status: { DRDY ERR }
Oct 12 10:40:37 pve-02 kernel: [265120.042914] ata2.00: error: { UNC }
Oct 12 10:40:37 pve-02 kernel: [265120.108685] ata2.00: configured for UDMA/133
Oct 12 10:40:37 pve-02 kernel: [265120.108693] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
Oct 12 10:40:37 pve-02 kernel: [265120.108696] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
Oct 12 10:40:37 pve-02 kernel: [265120.108697] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
Oct 12 10:40:37 pve-02 kernel: [265120.108699] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 02 73 97 c0 00 00 08 00
Oct 12 10:40:37 pve-02 kernel: [265120.108700] blk_update_request: I/O error, dev sdb, sector 41129920 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Oct 12 10:40:37 pve-02 kernel: [265120.109779] ata2: EH complete
Oct 12 10:40:38 pve-02 kernel: [265121.820496] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 12 10:40:38 pve-02 kernel: [265121.821591] ata2.00: BMDMA stat 0x25
Oct 12 10:40:38 pve-02 kernel: [265121.822653] ata2.00: failed command: READ DMA
Oct 12 10:40:38 pve-02 kernel: [265121.823701] ata2.00: cmd c8/00:08:c0:97:73/00:00:00:00:00/e2 tag 0 dma 4096 in
Oct 12 10:40:38 pve-02 kernel: [265121.823701]          res 51/40:00:f8:ff:ff/40:00:ff:00:00/e2 Emask 0x9 (media error)
Oct 12 10:40:38 pve-02 kernel: [265121.825847] ata2.00: status: { DRDY ERR }
Oct 12 10:40:38 pve-02 kernel: [265121.826934] ata2.00: error: { UNC }
Oct 12 10:40:39 pve-02 pmxcfs[207503]: [database] crit: unable to set WAL mode: disk I/O error#010
Oct 12 10:40:39 pve-02 pmxcfs[207503]: [database] crit: unable to set WAL mode: disk I/O error#010
Oct 12 10:40:39 pve-02 kernel: [265121.892689] ata2.00: configured for UDMA/133
Oct 12 10:40:39 pve-02 kernel: [265121.892709] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Oct 12 10:40:39 pve-02 kernel: [265121.892713] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
Oct 12 10:40:39 pve-02 kernel: [265121.892714] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
Oct 12 10:40:39 pve-02 kernel: [265121.892717] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 02 73 97 c0 00 00 08 00
Oct 12 10:40:39 pve-02 kernel: [265121.892718] blk_update_request: I/O error, dev sdb, sector 41129920 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Oct 12 10:40:39 pve-02 kernel: [265121.893821] ata2: EH complete
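In a log like this, the same sector on the same device fails again and again, which is easy to miss when scrolling. One way to make it visible is to count the kernel's "I/O error, dev ..., sector ..." lines per device and sector. The following is a rough sketch; io_error_summary is a hypothetical name, and the pattern assumes the message wording shown in the sample above.

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "<device> <sector> <count>" for every distinct failing sector.
io_error_summary() {
    # Keep only lines that report a block I/O error, reduced to "dev sector".
    sed -n 's|.*I/O error, dev \([^,]*\), sector \([0-9]*\).*|\1 \2|p' |
        sort |
        uniq -c |
        # uniq -c puts the count first; move it to the end of the line.
        awk '{ print $2, $3, $1 }'
}
```

On a live system it could be fed with `journalctl -k -b 0 | io_error_summary`, or with a saved syslog file via `io_error_summary < /var/log/syslog`. Repeated hits on one device, as with sdb here, point at the drive itself rather than at Proxmox.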
 
