Proxmox GUI showing question marks / hangs and freezes

clearwater

Hi there,

Over the last week, the GUI has been showing gray question marks over every LXC, VM, and storage entry (screenshot attached). Running
Code:
systemctl restart pvestatd
brings the GUI back to normal for ~5-10 minutes, after which it reverts to showing question marks. Within the last two days, two more things have started happening:
  • Generally overnight, Proxmox freezes: the GUI becomes unreachable (the browser returns a 'page not found' error). I can still log in via SSH, but commands hang indefinitely.
  • Several times while I have been troubleshooting the above, my entire network (including independent devices on my Wi-Fi) has lost its connection to the internet.
I suspect this is a hardware problem, potentially an early sign of hard drive failure, but I am not sure, so I am hoping someone can help me narrow down the issue.

Thank you

Additional details:

Syslog does not show anything out of the ordinary, except the following, occurring roughly every 10 minutes (this server is not part of a cluster):
Code:
Sep 10 20:59:43 pve ceph-crash[8480]: WARNING:__main__:post /var/lib/ceph/crash/2019-11-16_02:55:22.99375[...] as client.crash.pve failed: [errno 2] error connecting to the cluster
Kernel Version: Linux 5.4.60-1-pve #1 SMP PVE 5.4.60-1 (Mon, 31 Aug 2020 10:36:22 +0200)
PVE Manager Version: pve-manager/6.2-11/22fb4983
  • Dell R710, dual Xeon X5675, 48 GB RAM, PERC H700 RAID controller
  • 1 x 500 GB SSD, ZFS with compression (Proxmox and VM images)
  • 2 x 2 TB 7200 RPM SAS drives, LVM RAID0 (storage)
  • 1 x 2 TB 7200 RPM drive, LVM (storage)
  • Running 2 Windows Server 2019 VMs, 2 Ubuntu VMs, and 2 Alpine Linux LXCs
 

Attachments

  • Screenshot 2020-09-10 at 8.48.37 PM.png
What's the output of systemctl status pvestatd, systemctl status pvedaemon and systemctl status pveproxy when only question marks are shown?
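If it's easier than screenshots, all three can also be captured as plain text in one go (the output file name here is just an example):
Code:
systemctl status pvestatd pvedaemon pveproxy --no-pager > pve_service_status.txt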
 
Hi Mira,

Thanks for your reply. Please see attached.

Thanks
 

Attachments

  • systemctl_status_pvedaemon.png
  • systemctl_status_pveproxy.png
  • systemctl_status_pvestatd.png
Those look fine. Could you provide your storage config (/etc/pve/storage.cfg) as well as the journal (from ~10 minutes before it happens until ~5 minutes after)?
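One way to pull that window out of the journal as text (the timestamps below are placeholders; adjust them to roughly 10 minutes before and 5 minutes after the question marks appear):
Code:
journalctl --since "2020-09-30 23:40" --until "2020-09-30 23:55" > journal_excerpt.txt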
 
Hi mira,

See below for storage.cfg output:
[screenshot of storage.cfg attached]

I will need to reboot to clear the unknown status and capture the journal entries, so I will provide those as soon as I am able.

I did have a CIFS share that I have since removed. When I removed it, the unknown status went away for a day, but it has since returned. My storage always shows the '?' status, even after a reboot.

Thank you
 
Just an update: I rebooted and recorded the journal, which I will send to you directly. Some of the more interesting output is included below.

This occurs many times:
Code:
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 Sense Key : Medium Error [current]
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 Add. Sense: Unrecovered read error
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 CDB: Read(10) 28 00 00 00 01 30 00 00 10 00
Sep 30 23:50:32 pve kernel: blk_update_request: critical medium error, dev sdc, sector 304 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 Sense Key : Medium Error [current]
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 Add. Sense: Unrecovered read error
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 CDB: Read(10) 28 00 00 00 01 30 00 00 10 00
Sep 30 23:50:32 pve kernel: blk_update_request: critical medium error, dev sdc, sector 304 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 Sense Key : Medium Error [current]
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 Add. Sense: Unrecovered read error
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 CDB: Read(10) 28 00 00 00 01 30 00 00 10 00
Sep 30 23:50:32 pve kernel: blk_update_request: critical medium error, dev sdc, sector 304 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
Sep 30 23:50:33 pve kernel: sd 3:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 30 23:50:33 pve kernel: sd 3:0:0:0: [sdc] tag#0 Sense Key : Medium Error [current]
Sep 30 23:50:33 pve kernel: sd 3:0:0:0: [sdc] tag#0 Add. Sense: Unrecovered read error
Sep 30 23:50:33 pve kernel: sd 3:0:0:0: [sdc] tag#0 CDB: Read(10) 28 00 00 00 01 30 00 00 10 00
Sep 30 23:50:33 pve kernel: blk_update_request: critical medium error, dev sdc, sector 304 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0

Then the following occurs constantly, non-stop (this node isn't part of a cluster):
Code:
Oct 01 01:34:43 pve ceph-crash[1726]: WARNING:__main__:post /var/lib/ceph/crash/2019-11-16_02:55:22.993758Z_c9eb5e75-0d5a-4c83-8755-28471689150a as client.crash.pve failed: [errno 2] error connecting to the cluster
Oct 01 01:34:43 pve ceph-crash[1726]: WARNING:__main__:post /var/lib/ceph/crash/2019-11-16_02:55:22.993758Z_c9eb5e75-0d5a-4c83-8755-28471689150a as client.crash failed: [errno 2] error connecting to the cluster
Oct 01 01:35:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:35:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:35:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:35:13 pve ceph-crash[1726]: WARNING:__main__:post /var/lib/ceph/crash/2019-11-16_02:55:22.993758Z_c9eb5e75-0d5a-4c83-8755-28471689150a as client.admin failed:
Oct 01 01:36:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:36:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:36:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:37:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:37:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:37:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:38:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:38:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:38:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:39:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:39:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:39:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:40:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:40:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:40:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:41:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:41:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:41:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:42:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:42:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:42:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:43:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:43:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:43:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:44:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:44:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:44:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:45:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:45:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:45:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:45:13 pve ceph-crash[1726]: WARNING:__main__:post /var/lib/ceph/crash/2019-11-16_02:55:22.993758Z_c9eb5e75-0d5a-4c83-8755-28471689150a as client.crash.pve failed: [errno 2] error connecting to the cluster
Oct 01 01:45:13 pve ceph-crash[1726]: WARNING:__main__:post /var/lib/ceph/crash/2019-11-16_02:55:22.993758Z_c9eb5e75-0d5a-4c83-8755-28471689150a as client.crash failed: [errno 2] error connecting to the cluster
Oct 01 01:45:43 pve ceph-crash[1726]: WARNING:__main__:post /var/lib/ceph/crash/2019-11-16_02:55:22.993758Z_c9eb5e75-0d5a-4c83-8755-28471689150a as client.admin failed:
Oct 01 01:46:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:46:01 pve systemd[1]: pvesr.service: Succeeded.

Link to full journal output: https://drive.google.com/file/d/13rWzl-VDiiPA_JwY6ZvlcqOEawInPhFe/view?usp=sharing

Edit: I also discovered that my Ceph status is showing an error: 1 filesystem is offline. I am not using Ceph, so how do I disable it? It might be causing the issue.
Screenshot 2020-10-01 at 4.56.58 PM.png
Thanks!
 
If you don't use it, you can run pveceph purge (see man pveceph for more information). This will destroy all Ceph-related data and configuration files.
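If the goal is just to silence the recurring ceph-crash warnings in the meantime, stopping and disabling the ceph-crash service (which, judging by the journal, is what posts them) should also work:
Code:
systemctl stop ceph-crash.service
systemctl disable ceph-crash.service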
Regarding the issue with /dev/sdc, I would check the disk.
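A first pass at checking the disk could use smartmontools; note that behind a PERC H700 the physical drives may only expose SMART data via the megaraid passthrough, and the device number below is only a guess:
Code:
apt install smartmontools
smartctl -a /dev/sdc
# if the controller hides SMART data, try the megaraid passthrough (adjust the number):
smartctl -d megaraid,0 -a /dev/sdc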
 
Thanks! It seems like /dev/sdc is a holdover from when I had ESXi installed, but I'm not sure how to remove it since it doesn't show up in fdisk. Do you know how I can remove it?

Code:
sdc
├─sdc1                       vfat        ESXi            5AC3-F70B
├─sdc5                       vfat                        4E43-A96C
├─sdc6                       vfat                        4E43-A96C
├─sdc7
├─sdc8                       vfat                        4E43-A96D
└─sdc9
 
You can try the following two commands. Double-check that you have the correct device letters!
Code:
wipefs -a /dev/sdc
dd if=/dev/zero of=/dev/sdc bs=1M count=200
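Afterwards, the kernel's view of the disk can be refreshed and the result checked (again, assuming /dev/sdc is still the correct device):
Code:
partprobe /dev/sdc
lsblk -f /dev/sdc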
 
