PVE UI and ssh stop responding

radar

Member
May 11, 2021
Hi,

I have a fresh install of a PVE 8.3 node.
Ever since the install completed, the web UI and SSH stop responding from time to time and then return to normal.
It doesn't look like a network issue, since the node responds to ping and telnet connects to port 8006.

It just became responsive again, so here are the logs from the last ~90 minutes.
Any idea what's happening here?

Thanks.
Code:
Feb 09 21:27:18 pve login[59698]: ROOT LOGIN  on '/dev/tty1'
Feb 09 22:17:01 pve CRON[68709]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Feb 09 22:17:01 pve CRON[68710]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Feb 09 22:17:01 pve CRON[68709]: pam_unix(cron:session): session closed for user root
Feb 09 22:31:27 pve smartd[587]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 112 to 111
Feb 09 22:58:47 pve pvedaemon[951]: <root@pam> successful auth for user 'root@pam'
Feb 09 22:58:51 pve pvedaemon[949]: <root@pam> successful auth for user 'root@pam'
Feb 09 22:58:51 pve pvedaemon[950]: <root@pam> starting task UPID:pve:00012A02:000D9C48:67A9251B:vncshell::root@pam:
Feb 09 22:58:51 pve pvedaemon[76290]: starting termproxy UPID:pve:00012A02:000D9C48:67A9251B:vncshell::root@pam:
Feb 09 22:58:51 pve pvedaemon[949]: <root@pam> successful auth for user 'root@pam'
Feb 09 22:58:51 pve login[76293]: pam_unix(login:session): session opened for user root(uid=0) by root(uid=0)
Feb 09 22:58:51 pve systemd-logind[588]: New session 14 of user root.
Feb 09 22:58:51 pve systemd[1]: Started session-14.scope - Session 14 of User root.
Feb 09 22:58:51 pve login[76298]: ROOT LOGIN  on '/dev/pts/0'
Feb 09 22:58:54 pve systemd[1]: session-14.scope: Deactivated successfully.
Feb 09 22:58:54 pve systemd-logind[588]: Session 14 logged out. Waiting for processes to exit.
Feb 09 22:58:54 pve systemd-logind[588]: Removed session 14.
Feb 09 22:58:54 pve pvedaemon[950]: <root@pam> end task UPID:pve:00012A02:000D9C48:67A9251B:vncshell::root@pam: OK
Feb 09 22:58:56 pve pvedaemon[950]: <root@pam> starting task UPID:pve:00012A26:000D9E61:67A92520:vncproxy:101:root@pam:
Feb 09 22:58:56 pve pvedaemon[76326]: starting lxc termproxy UPID:pve:00012A26:000D9E61:67A92520:vncproxy:101:root@pam:
Feb 09 22:58:56 pve pvedaemon[949]: <root@pam> successful auth for user 'root@pam'
Feb 09 22:59:20 pve pvedaemon[950]: <root@pam> end task UPID:pve:00012A26:000D9E61:67A92520:vncproxy:101:root@pam: OK
Feb 09 22:59:20 pve pvedaemon[76452]: starting termproxy UPID:pve:00012AA4:000DA785:67A92538:vncshell::root@pam:
Feb 09 22:59:20 pve pvedaemon[950]: <root@pam> starting task UPID:pve:00012AA4:000DA785:67A92538:vncshell::root@pam:
Feb 09 22:59:20 pve pvedaemon[949]: <root@pam> successful auth for user 'root@pam'
Feb 09 22:59:20 pve login[76455]: pam_unix(login:session): session opened for user root(uid=0) by root(uid=0)
Feb 09 22:59:20 pve systemd-logind[588]: New session 16 of user root.
Feb 09 22:59:20 pve systemd[1]: Started session-16.scope - Session 16 of User root.
Feb 09 22:59:20 pve login[76460]: ROOT LOGIN  on '/dev/pts/0'
Feb 09 22:59:22 pve systemd[1]: session-16.scope: Deactivated successfully.
Feb 09 22:59:22 pve systemd-logind[588]: Session 16 logged out. Waiting for processes to exit.
Feb 09 22:59:22 pve systemd-logind[588]: Removed session 16.
Feb 09 22:59:22 pve pvedaemon[950]: <root@pam> end task UPID:pve:00012AA4:000DA785:67A92538:vncshell::root@pam: OK
 
Hello radar! Thanks for providing the journal. Could you please check the journal over a longer period of time (using journalctl's --since parameter) and tell us if you find anything unusual? The part you provided doesn't show anything unusual... well, almost:

Code:
Feb 09 22:31:27 pve smartd[587]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 112 to 111

My guess is that the reported value is not actually in degrees Celsius, despite the attribute name. However, if the drive really were that hot, it would explain the issues you are describing, since the drive would slow itself down to avoid overheating further. Can you please provide the output of the following command: smartctl -A /dev/sda
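
As a minimal sketch of the --since suggestion above: the helper name journal_since is made up for illustration, and filtering to warnings and worse is an assumption about what is worth reading first, not something requested in this thread.

```shell
# Hypothetical helper: print journal entries of priority "warning" or worse
# since a given point in time.
# $1: a time spec understood by journalctl, e.g. "2 days ago" or "2025-02-08"
journal_since() {
    journalctl --since "$1" --priority warning --no-pager
}
```

For example, `journal_since "2 days ago"` run as root on the node. Dropping the --priority filter gives the full journal for the same window.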
 
Thank you for your response.
I managed to get the journal for a longer period, but the host stopped responding before I could scp it. So I'll try again later today.
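
One possible way to make the copy less likely to be interrupted is to write the journal to a compressed file on the node first, so that the transfer itself is short. This is only a sketch: the helper name save_journal and the output path are illustrative assumptions.

```shell
# Hypothetical helper: dump the journal since a given time into a gzip file.
# $1: time spec for journalctl (e.g. "3 days ago")
# $2: output path (any writable location on the node)
save_journal() {
    journalctl --since "$1" --no-pager | gzip > "$2"
}
```

For example, `save_journal "3 days ago" /root/journal.txt.gz`, followed by an scp of that single file while the host is responsive.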

In the meantime, here is the output of smartctl, where I don't see anything to worry about.
Code:
root@pve:~# smartctl -A /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.4-2-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   180   175   021    Pre-fail  Always       -       3983
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2317
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   032   032   000    Old_age   Always       -       49995
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       870
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       146
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2170
194 Temperature_Celsius     0x0022   113   089   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
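
In the table above, attribute 194 has a normalized VALUE of 113 and a RAW_VALUE of 34. The earlier smartd line ("changed from 112 to 111") appears to track the normalized column, not the raw reading. A small sketch for pulling both numbers out of the smartctl -A output follows; the helper name smart_attr is made up for illustration, and it assumes the column layout shown above.

```shell
# Hypothetical helper: print the normalized VALUE and the RAW_VALUE of one
# SMART attribute. Reads `smartctl -A` output on stdin.
# $1: attribute ID, e.g. 194
smart_attr() {
    # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    awk -v id="$1" '$1 == id { print $2 ": normalized=" $4 " raw=" $10 }'
}
```

Usage: `smartctl -A /dev/sda | smart_attr 194`.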
 
The drive S.M.A.R.T. data looks fine, like you said. Some questions:
  1. Did you have any success with the journal? This should hopefully give us the most useful information.
  2. Is this a new server? If not, have you used Proxmox VE on it before?
  3. What hardware does the server have?
  4. Please provide the output of pveversion -v
There are some things that you could try, but I would like to have some more information from the journal first, if possible.
 
Hi,
Sorry for the long delay.
1- Yes, I got the journal. It's attached here.
2- Yes, it's a new server (but an old computer) with a fresh Proxmox install. I have never used Proxmox on it before.
3- Pretty old: an Intel i5 with 8 GB of RAM.
4- Attached too.
Thank you very much.
 

Thanks for the information. I quickly looked over the logs but couldn't find anything unusual, except that the server sometimes reboots after you log in. Are you triggering the reboot yourself?

Also, do you have the same issues when you are physically in front of the server? I'm wondering whether your issues might be caused by an IP address conflict.
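
A possible way to test for an address conflict is duplicate address detection with arping. This is a sketch under assumptions: it relies on the iputils version of arping being installed (it may not be present by default), and the helper name, interface name and IP address are placeholders.

```shell
# Hypothetical helper: probe whether another machine answers for an IP address.
# $1: network interface to send from (placeholder, e.g. vmbr0)
# $2: the IP address to check (placeholder)
# With iputils arping, -D exits with status 0 when no reply was received.
check_dup_ip() {
    arping -D -I "$1" -c 3 "$2"
}
```

For example, `check_dup_ip vmbr0 192.0.2.10 && echo "no duplicate seen"`, run on the node with its own address.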
 
That's a very good question. Actually, I don't often connect directly to the server, but when I did, it was working perfectly.
The only thing that makes me think it's not a network issue is that telnet always connects on this specific port while the UI does not respond.
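
A successful telnet connection only shows that the TCP handshake on port 8006 completes. It does not show that the web service behind the port is answering requests. The following sketch tells the two cases apart with curl; the helper name probe_ui and the 5-second timeout are illustrative assumptions.

```shell
# Hypothetical helper: request the web UI over HTTPS and print the HTTP status.
# $1: host name or IP address of the node
# --insecure is used because a fresh install has a self-signed certificate.
# On a timeout curl prints 000 and exits with a non-zero status.
probe_ui() {
    curl --insecure --silent --show-error --max-time 5 \
         --output /dev/null --write-out '%{http_code}\n' "https://$1:8006/"
}
```

If telnet connects but `probe_ui <node address>` times out during a hang, the service is accepting connections without answering them. In that case, `systemctl status pveproxy pvedaemon` from the physical console would be a reasonable next thing to look at.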