Proxmox GUI showing question marks / hangs and freezes

clearwater

Member
Nov 15, 2019
Hi there,

Over the last week, the GUI has been showing gray question marks over every LXC, VM, and storage (screenshot attached). Running
Code:
systemctl restart pvestatd
brings the GUI back to normal for ~5-10 minutes, and then it reverts to displaying question marks. Within the last two days, two more things have started to occur:
  • Generally overnight, Proxmox freezes: the GUI becomes unreachable and returns a 'page not found' error. I can still log in via SSH, but commands hang indefinitely.
  • Several times, while I was troubleshooting the above, my entire network (including unrelated devices on my Wi-Fi) lost its connection to the internet and to the local network.
I suspect this is a hardware problem, potentially an early sign of hard drive failure, but I am unsure, so I am hoping someone can help me narrow down the issue.

Thank you

Additional details:

Syslog does not show anything out of the ordinary, except the following, occurring roughly every 10 minutes (this server is not part of a cluster):
Code:
Sep 10 20:59:43 pve ceph-crash[8480]: WARNING:__main__:post /var/lib/ceph/crash/2019-11-16_02:55:22.99375[...] as client.crash.pve failed: [errno 2] error connecting to the cluster
Kernel Version: Linux 5.4.60-1-pve #1 SMP PVE 5.4.60-1 (Mon, 31 Aug 2020 10:36:22 +0200)
PVE Manager Version: pve-manager/6.2-11/22fb4983
  • Dell R710, dual Xeon X5675, 48GB RAM, PERC H700 RAID controller
  • 1 x 500GB SSD, ZFS with compression (running Proxmox and the VM images)
  • 2 x 2TB 7200RPM SAS drives, LVM RAID0 (for storage)
  • 1 x 2TB 7200RPM drive, LVM (for storage)
  • Running 2 Windows Server 2019 VMs, 2 Ubuntu VMs, 2 Alpine Linux LXCs
 

Attachments

  • Screenshot 2020-09-10 at 8.48.37 PM.png (136.1 KB)
What's the output of systemctl status pvestatd, systemctl status pvedaemon and systemctl status pveproxy when only question marks are shown?
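For reference, the commands to run while the question marks are shown:
Code:
systemctl status pvestatd
systemctl status pvedaemon
systemctl status pveproxy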
 
Hi Mira,

Thanks for your reply. Please see attached.

Thanks
 

Attachments

  • systemctl_status_pvedaemon.png (138.5 KB)
  • systemctl_status_pveproxy.png (93.9 KB)
  • systemctl_status_pvestatd.png (78.7 KB)
Those look fine. Could you provide your storage config (/etc/pve/storage.cfg) as well as the journal (from ~10 minutes before it happens until ~5 minutes after)?
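As a sketch, one way to pull that window out of the journal - the timestamps below are placeholders, so adjust them to when the question marks actually appear:
Code:
journalctl --since "2020-09-30 23:40" --until "2020-09-30 23:55" > journal_excerpt.txt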
 
Hi Mira,

See below for storage.cfg output:
(screenshot of storage.cfg attached)

I will need to reboot to remove the unknown status and obtain the journal entry, so will provide that as soon as I am able.

I did have a CIFS share, which I have since removed. When I removed it, the unknown status went away for a day, but it has since returned. My storage always shows the '?' status, even after a reboot.

Thank you
 
Just an update: I rebooted and recorded the journal - I will send it to you directly. Some of the interesting output is included below:

This occurs many times:
Code:
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 Sense Key : Medium Error [current]
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 Add. Sense: Unrecovered read error
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 CDB: Read(10) 28 00 00 00 01 30 00 00 10 00
Sep 30 23:50:32 pve kernel: blk_update_request: critical medium error, dev sdc, sector 304 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 Sense Key : Medium Error [current]
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 Add. Sense: Unrecovered read error
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 CDB: Read(10) 28 00 00 00 01 30 00 00 10 00
Sep 30 23:50:32 pve kernel: blk_update_request: critical medium error, dev sdc, sector 304 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 Sense Key : Medium Error [current]
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 Add. Sense: Unrecovered read error
Sep 30 23:50:32 pve kernel: sd 3:0:0:0: [sdc] tag#0 CDB: Read(10) 28 00 00 00 01 30 00 00 10 00
Sep 30 23:50:32 pve kernel: blk_update_request: critical medium error, dev sdc, sector 304 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
Sep 30 23:50:33 pve kernel: sd 3:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 30 23:50:33 pve kernel: sd 3:0:0:0: [sdc] tag#0 Sense Key : Medium Error [current]
Sep 30 23:50:33 pve kernel: sd 3:0:0:0: [sdc] tag#0 Add. Sense: Unrecovered read error
Sep 30 23:50:33 pve kernel: sd 3:0:0:0: [sdc] tag#0 CDB: Read(10) 28 00 00 00 01 30 00 00 10 00
Sep 30 23:50:33 pve kernel: blk_update_request: critical medium error, dev sdc, sector 304 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0

Then the following occurs constantly, non-stop (this node isn't part of a cluster):
Code:
Oct 01 01:34:43 pve ceph-crash[1726]: WARNING:__main__:post /var/lib/ceph/crash/2019-11-16_02:55:22.993758Z_c9eb5e75-0d5a-4c83-8755-28471689150a as client.crash.pve failed: [errno 2] error connecting to the cluster
Oct 01 01:34:43 pve ceph-crash[1726]: WARNING:__main__:post /var/lib/ceph/crash/2019-11-16_02:55:22.993758Z_c9eb5e75-0d5a-4c83-8755-28471689150a as client.crash failed: [errno 2] error connecting to the cluster
Oct 01 01:35:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:35:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:35:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:35:13 pve ceph-crash[1726]: WARNING:__main__:post /var/lib/ceph/crash/2019-11-16_02:55:22.993758Z_c9eb5e75-0d5a-4c83-8755-28471689150a as client.admin failed:
Oct 01 01:36:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:36:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:36:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:37:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:37:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:37:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:38:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:38:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:38:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:39:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:39:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:39:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:40:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:40:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:40:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:41:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:41:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:41:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:42:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:42:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:42:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:43:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:43:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:43:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:44:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:44:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:44:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:45:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:45:01 pve systemd[1]: pvesr.service: Succeeded.
Oct 01 01:45:01 pve systemd[1]: Started Proxmox VE replication runner.
Oct 01 01:45:13 pve ceph-crash[1726]: WARNING:__main__:post /var/lib/ceph/crash/2019-11-16_02:55:22.993758Z_c9eb5e75-0d5a-4c83-8755-28471689150a as client.crash.pve failed: [errno 2] error connecting to the cluster
Oct 01 01:45:13 pve ceph-crash[1726]: WARNING:__main__:post /var/lib/ceph/crash/2019-11-16_02:55:22.993758Z_c9eb5e75-0d5a-4c83-8755-28471689150a as client.crash failed: [errno 2] error connecting to the cluster
Oct 01 01:45:43 pve ceph-crash[1726]: WARNING:__main__:post /var/lib/ceph/crash/2019-11-16_02:55:22.993758Z_c9eb5e75-0d5a-4c83-8755-28471689150a as client.admin failed:
Oct 01 01:46:00 pve systemd[1]: Starting Proxmox VE replication runner...
Oct 01 01:46:01 pve systemd[1]: pvesr.service: Succeeded.

Link to full journal output: https://drive.google.com/file/d/13rWzl-VDiiPA_JwY6ZvlcqOEawInPhFe/view?usp=sharing

Edit: I also discovered that my Ceph status is showing an error - 1 filesystem is offline (screenshot attached: Screenshot 2020-10-01 at 4.56.58 PM.png). I am not using Ceph, so how do I disable it? It might be causing the issue.
Thanks!
 
Last edited:
If you don't use it, you can run pveceph purge (see man pveceph for more information). This will destroy all Ceph-related data and configuration files.
Regarding the issue with /dev/sdc, I would check the disk.
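A minimal sketch of both steps - note that the ceph-crash service name is inferred from the log messages above, and smartctl needs the smartmontools package (behind a PERC controller you may additionally need -d megaraid,<N>):
Code:
# destroys all Ceph data and configuration on this node
pveceph purge
# assumption: also stop the ceph-crash service that keeps re-posting the old crash report
systemctl disable --now ceph-crash.service
# check the health of the disk reporting the medium errors
smartctl -a /dev/sdc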
 
Thanks! It seems like /dev/sdc is a holdover from when I had ESXi installed, but I'm not sure how to remove it, since it doesn't show up in fdisk. Do you know how I can remove it?

Code:
sdc
├─sdc1                       vfat        ESXi            5AC3-F70B
├─sdc5                       vfat                        4E43-A96C
├─sdc6                       vfat                        4E43-A96C
├─sdc7
├─sdc8                       vfat                        4E43-A96D
└─sdc9
 
You can try the following two commands. Double-check that you have the correct device letter!
Code:
wipefs -a /dev/sdc
dd if=/dev/zero of=/dev/sdc bs=1M count=200
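Afterwards, a quick read-only check that the old signatures are gone (wipefs without options only lists what it finds, and lsblk -f shows what the kernel now sees):
Code:
wipefs /dev/sdc
lsblk -f /dev/sdc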