Node Stops Responding, but SSH Still Works

uberdome

Member
Mar 19, 2019
I have a 3-node cluster running an older Proxmox release, 6.4-13 (I'm still running several old containers that don't work with 7 out of the box).

One of the nodes (Node 2) has partially stopped responding twice in the past week. When that happens, I can still connect to the cluster GUI from either Node 1 or Node 3, where Node 2 shows up with a small gray "?". The containers on Node 2 still mostly work, and I can SSH into Node 2, but I'm not sure what to check to find out why it stops responding. Manually rebooting the node brings it back to full function.

Any suggestions on where to start troubleshooting?

Thank you, Chris
 
Thank you for the suggestion. When I run journalctl after the problem occurs, I basically see this on repeat:

Code:
Mar 25 13:10:48 dmo-pve2 rsyslogd[652]: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check
Mar 25 13:10:48 dmo-pve2 rsyslogd[652]: file '8' write error: Read-only file system [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
Mar 25 13:10:48 dmo-pve2 rsyslogd[652]: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check
Mar 25 13:10:48 dmo-pve2 rsyslogd[652]: file '8' write error: Read-only file system [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
Mar 25 13:10:48 dmo-pve2 rsyslogd[652]: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check
Mar 25 13:10:48 dmo-pve2 rsyslogd[652]: file '8' write error: Read-only file system [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
Mar 25 13:10:48 dmo-pve2 rsyslogd[652]: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check
Mar 25 13:10:48 dmo-pve2 rsyslogd[652]: file '8' write error: Read-only file system [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
Mar 25 13:10:48 dmo-pve2 rsyslogd[652]: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check
Mar 25 13:10:48 dmo-pve2 rsyslogd[652]: file '8' write error: Read-only file system [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
Mar 25 13:10:48 dmo-pve2 rsyslogd[652]: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check
Mar 25 13:10:48 dmo-pve2 rsyslogd[652]: file '8' write error: Read-only file system [v8.1901.0 try https://www.rsyslog.com/e/2027 ]

After a reboot, the logs do not look like this.

I can't seem to get the last container to back up or to migrate to another node; both the backup and the migration attempts fail. I'm working on that.
 
Can you please provide us with more logs? The logs you posted above only show rsyslogd failing to write because the file system is read-only. I would check the storage on that node.
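
As a starting point for checking the storage, here is a minimal Python sketch that lists disk file systems currently mounted read-only. It assumes a Linux host where /proc/mounts is readable; the function name read_only_mounts and the set of file system types are illustrative choices, not anything Proxmox provides.

```python
# File system types treated as on-disk storage (an assumption; adjust for your setup).
DISK_FS_TYPES = {"ext2", "ext3", "ext4", "xfs", "btrfs"}


def read_only_mounts(mounts_text, fs_types=DISK_FS_TYPES):
    """Return (device, mountpoint, fstype) for disk file systems mounted read-only.

    mounts_text is the content of /proc/mounts: one mount per line with
    whitespace-separated fields (device, mountpoint, fstype, options, ...).
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # Compare whole options, so "errors=remount-ro" is not mistaken for "ro".
        if fstype in fs_types and "ro" in options.split(","):
            found.append((device, mountpoint, fstype))
    return found
```

On the affected node this could be called as read_only_mounts(open("/proc/mounts").read()). A root file system that shows up here while the node is misbehaving would be consistent with the "Read-only file system" errors from rsyslogd.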
 
The logs are completely full of those messages, and after a reboot they look basically fine. After a few tries, I was able to capture the transition.

It looks like I have a drive going bad:

Code:
Mar 25 18:03:39 dmo-pve2 kernel: sd 0:0:0:0: [sda] tag#26 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Mar 25 18:03:39 dmo-pve2 kernel: sd 0:0:0:0: [sda] tag#26 Sense Key : Medium Error [current]
Mar 25 18:03:39 dmo-pve2 kernel: sd 0:0:0:0: [sda] tag#26 Add. Sense: Unrecovered read error - auto reallocate failed
Mar 25 18:03:39 dmo-pve2 kernel: sd 0:0:0:0: [sda] tag#26 CDB: Read(10) 28 00 20 63 7c 60 00 00 08 00
Mar 25 18:03:39 dmo-pve2 kernel: blk_update_request: I/O error, dev sda, sector 543390818 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Mar 25 18:03:39 dmo-pve2 kernel: ata1: EH complete
Mar 25 18:03:40 dmo-pve2 kernel: EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
Mar 25 18:03:40 dmo-pve2 kernel: EXT4-fs (dm-1): Remounting filesystem read-only
Mar 25 18:03:40 dmo-pve2 rsyslogd[662]: file '8' write error: Read-only file system [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
Mar 25 18:03:40 dmo-pve2 rsyslogd[662]: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this
Mar 25 18:03:40 dmo-pve2 rsyslogd[662]: file '8' write error: Read-only file system [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
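
Because the rsyslogd messages quickly bury the interesting lines, a small filter can help find the transition. The following Python sketch picks out the three kinds of kernel messages seen in the excerpt above (medium error, block I/O error, ext4 read-only remount). The patterns are derived only from these sample lines, and the names PATTERNS and find_disk_events are hypothetical; other kernels or file systems may word their messages differently.

```python
import re

# Patterns taken from the log excerpt above; an assumption, not an exhaustive list.
PATTERNS = [
    ("medium_error", re.compile(r"\[(?P<dev>\w+)\].*Sense Key : Medium Error")),
    ("io_error", re.compile(r"I/O error, dev (?P<dev>\w+), sector \d+")),
    ("remount_ro", re.compile(r"EXT4-fs \((?P<dev>[\w-]+)\): Remounting filesystem read-only")),
]

# Syslog-style timestamp at the start of a line, e.g. "Mar 25 18:03:39".
TIMESTAMP = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s")


def find_disk_events(lines):
    """Return (timestamp, kind, device) for each line matching a known pattern."""
    events = []
    for line in lines:
        ts = TIMESTAMP.match(line)
        for kind, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                events.append((ts.group(1) if ts else None, kind, match.group("dev")))
                break  # at most one event per line
    return events
```

The input could be the lines of saved journalctl output. In the excerpt above, the events would show the error on sda one second before dm-1 is remounted read-only.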

I had this post typed but not yet sent for some reason. I have one CT that won't migrate and there was a problem with its backup. I'm trying to see if I can get this moved before replacing the drive.
 
Your disk is likely failing.
 
