WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!

proxwolfe · Jan 9, 2023

Hi,

I just came back from a vacation and wanted to check on my VMs.

One is not reachable. When I try to log in via the console (in the PVE GUI), it does not connect, and when I open the log, I get the message above.

The other VMs are running fine, but I still can't log in via the console, and I see the same error message for every single VM I try.

First of all, I am wondering which remote host is meant here: I am running a three-node cluster. When I log in to the GUI of the node on which the VMs are actually running, I can connect to the console. The error only comes up when I am logged in to the GUI of one of the other nodes and try to access the console from there.

This leads me to suspect that "remote host" refers to the other node and not the individual VM. Is that right?

Secondly, is there any legitimate way the host identification of one of the nodes would change by itself? Or could I have accidentally triggered this (prior to my vacation)?

If not, how do I rule out that someone else is doing something nasty?

Thanks!

Moayad · Jan 9, 2023

Hi,

Can you try to renew the PVE certificates using the below command:

Bash:

pvecm updatecerts --force

If the issue still occurs, please check the Syslog for any interesting messages that appear when you try to log in to the PVE web GUI.
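
A minimal sketch for narrowing the Syslog down to SSH host key problems. It assumes the relevant messages reach the journal at all (they may only show up in the task log), and the function name `filter_hostkey_messages` is made up for illustration. The patterns are the usual OpenSSH client messages.

```shell
#!/usr/bin/env bash
# Print only the log lines that mention an SSH host key problem.
# Reads log text on stdin, so it works with journalctl output or a saved file.
# Hypothetical helper name; the patterns are standard OpenSSH messages.
filter_hostkey_messages() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E 'REMOTE HOST IDENTIFICATION HAS CHANGED|Host key verification failed|Offending .*key in' || true
}

# Example on a PVE node:
#   journalctl -b0 --no-pager | filter_hostkey_messages
```

An "Offending ... key in FILE:LINE" line, if present, names the known hosts file and the entry that no longer matches.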

proxwolfe · Jan 9, 2023

Thank you Moayad for the suggestion.

I tried this on the first node (while logged in at another node):

(re)generate node files
generate new node certificate
Can't use an undefined value as a symbol reference at /usr/share/perl5/PVE/Cluster/Setup.pm line 496.

Tried on the second node (while logged in on this node):

(re)generate node files
generate new node certificate
merge authorized SSH keys and known hosts

Tried on the third node (while logged in at another node):

(re)generate node files
generate new node certificate
merge authorized SSH keys and known hosts

Tried on the first node again (this time while logged in at the first node):

Can't get to the shell: "Connection failed (Error 500: unable to open file '/var/tmp/pve-reserved-ports.tmp.1223' - Read-only file system)"

So it seems to be an issue of the first node. But what?

Thanks

bbgeek17 · Jan 9, 2023

proxwolfe said:
So it seems to be an issue of the first node. But what?

It seems like you had an outage and may have either file system corruption or a disk issue. Plug "linux read only file system" into Google for troubleshooting steps.
At a high level, examine:
df; mount; uptime; journalctl -b0
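
In the `mount` output, the thing to look for is an `ro` option on the root file system. A small sketch of that check, assuming the usual `/proc/mounts` layout (device, mount point, type, options); the function name `is_mounted_readonly` is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) if the given mount point is currently mounted read-only.
# Arguments: mount point, optional mounts table (defaults to /proc/mounts).
# Hypothetical helper; /proc/mounts fields: device mountpoint fstype options ...
is_mounted_readonly() {
    local mountpoint="$1" table="${2:-/proc/mounts}"
    awk -v mp="$mountpoint" '
        $2 == mp { opts = $4 }          # keep the last matching entry
        END {
            n = split(opts, o, ",")
            # Compare whole options so "errors=remount-ro" does not count as "ro".
            for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
            exit 1
        }
    ' "$table"
}

# Example:
#   is_mounted_readonly / && echo "root file system is read-only"
```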

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

proxwolfe · Jan 9, 2023

I don't think there would have been a power outage - not sure whether you mean any other sort of outage.

The particular node has been up for 8 days (while the other two nodes have been up for 33 and 45 days respectively). However, this coincides with a system update I ran prior to going on vacation on this one node (probably not the best time to do this...). So I guess that explains the different uptime.

The / file system is not full; it has plenty of space. Mount also does not show anything suspicious.

However journalctl yields this:

Jan 05 19:18:43 tx1330m2-1 smartd[680]: Device: /dev/sda [SAT], 2 Currently unreadable (pending) sectors
Jan 05 19:40:46 tx1330m2-1 pmxcfs[908]: [dcdb] notice: data verification successful
Jan 05 19:47:05 tx1330m2-1 ceph-mon[1010]: [205B blob data]
Jan 05 19:47:05 tx1330m2-1 ceph-mon[1010]: PutCF( prefix = paxos key = '4859972' value size = 2440)
Jan 05 19:47:05 tx1330m2-1 ceph-mon[1010]: PutCF( prefix = paxos key = 'pending_v' value size = 8)
Jan 05 19:47:05 tx1330m2-1 ceph-mon[1010]: PutCF( prefix = paxos key = 'pending_pn' value size = 8)
Jan 05 19:47:10 tx1330m2-1 kernel: sd 0:0:0:0: [sda] tag#20 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=40s
Jan 05 19:47:10 tx1330m2-1 kernel: sd 0:0:0:0: [sda] tag#20 CDB: Write(10) 2a 00 01 2e 7f 28 00 00 08 00
Jan 05 19:47:10 tx1330m2-1 kernel: blk_update_request: I/O error, dev sda, sector 19824424 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jan 05 19:47:10 tx1330m2-1 kernel: EXT4-fs warning (device dm-2): ext4_end_bio:344: I/O error 10 writing to inode 786450 starting block 249317)
Jan 05 19:47:10 tx1330m2-1 kernel: sd 0:0:0:0: [sda] tag#13 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=40s
Jan 05 19:47:10 tx1330m2-1 kernel: sd 0:0:0:0: [sda] tag#13 CDB: Write(10) 2a 00 04 b2 f1 b8 00 00 10 00
Jan 05 19:47:10 tx1330m2-1 kernel: blk_update_request: I/O error, dev sda, sector 78836152 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
Jan 05 19:47:10 tx1330m2-1 kernel: EXT4-fs warning (device dm-2): ext4_end_bio:344: I/O error 10 writing to inode 134688 starting block 7625783)
Jan 05 19:47:10 tx1330m2-1 kernel: Buffer I/O error on device dm-2, logical block 7625783
Jan 05 19:47:10 tx1330m2-1 kernel: EXT4-fs warning (device dm-2): ext4_end_bio:344: I/O error 10 writing to inode 134688 starting block 7625784)
Jan 05 19:47:10 tx1330m2-1 kernel: sd 0:0:0:0: [sda] tag#14 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=39s
Jan 05 19:47:10 tx1330m2-1 kernel: sd 0:0:0:0: [sda] tag#14 CDB: Write(10) 2a 00 02 4c 6a 10 00 00 18 00
Jan 05 19:47:10 tx1330m2-1 kernel: blk_update_request: I/O error, dev sda, sector 38562320 op 0x1:(WRITE) flags 0x0 phys_seg 3 prio class 0
Jan 05 19:47:10 tx1330m2-1 kernel: EXT4-fs warning (device dm-2): ext4_end_bio:344: I/O error 10 writing to inode 134661 starting block 2591554)
Jan 05 19:47:10 tx1330m2-1 kernel: Buffer I/O error on device dm-2, logical block 2591554
Jan 05 19:47:10 tx1330m2-1 kernel: EXT4-fs warning (device dm-2): ext4_end_bio:344: I/O error 10 writing to inode 134661 starting block 2591555)
Jan 05 19:47:10 tx1330m2-1 kernel: sd 0:0:0:0: [sda] tag#15 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=36s
Jan 05 19:47:10 tx1330m2-1 kernel: sd 0:0:0:0: [sda] tag#15 CDB: Write(10) 2a 00 02 d5 d3 d8 00 00 80 00
Jan 05 19:47:10 tx1330m2-1 kernel: blk_update_request: I/O error, dev sda, sector 47567832 op 0x1:(WRITE) flags 0x800 phys_seg 16 prio class 0
Jan 05 19:47:10 tx1330m2-1 kernel: Buffer I/O error on device dm-2, logical block 249317
Jan 05 19:47:10 tx1330m2-1 kernel: Buffer I/O error on device dm-2, logical block 7625784
Jan 05 19:47:10 tx1330m2-1 kernel: Buffer I/O error on device dm-2, logical block 2591555
Jan 05 19:47:10 tx1330m2-1 kernel: Buffer I/O error on device dm-2, logical block 2591556
Jan 05 19:47:10 tx1330m2-1 kernel: Aborting journal on device dm-2-8.
Jan 05 19:47:10 tx1330m2-1 kernel: EXT4-fs error (device dm-2): ext4_journal_check_start:83: comm cfs_loop: Detected aborted journal
Jan 05 19:47:10 tx1330m2-1 kernel: EXT4-fs error (device dm-2): ext4_journal_check_start:83: comm rs:main Q:Reg: Detected aborted journal
Jan 05 19:47:10 tx1330m2-1 kernel: EXT4-fs error (device dm-2): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
Jan 05 19:47:10 tx1330m2-1 kernel: EXT4-fs error (device dm-2): ext4_journal_check_start:83: comm log: Detected aborted journal
Jan 05 19:47:10 tx1330m2-1 kernel: EXT4-fs error (device dm-2): ext4_journal_check_start:83: comm safe_timer: Detected aborted journal
Jan 05 19:47:10 tx1330m2-1 kernel: EXT4-fs error (device dm-2): ext4_journal_check_start:83: comm log: Detected aborted journal
Jan 05 19:47:10 tx1330m2-1 kernel: EXT4-fs (dm-2): Remounting filesystem read-only

The two unreadable sectors show up from the beginning of the journal, so I don't know how long they have been there. But on Jan 5 the file system was remounted read-only.
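
For a long journal like the excerpt above, a short summary can be easier to read than the raw lines. A sketch, assuming kernel messages of the form "I/O error, dev sda, ..." and "Remounting filesystem read-only" as shown in the excerpt; the function name `summarize_io_errors` is made up for illustration.

```shell
#!/usr/bin/env bash
# Summarise block-layer trouble in kernel log text read from stdin:
# the number of "I/O error, dev X" lines per device, plus every
# "Remounting filesystem read-only" line. Hypothetical helper name.
summarize_io_errors() {
    awk '
        /I\/O error, dev / {
            # The device name is the word after "dev" and ends with a comma.
            for (i = 1; i < NF; i++)
                if ($i == "dev") { d = $(i + 1); sub(/,$/, "", d); count[d]++; break }
        }
        /Remounting filesystem read-only/ { remount[++r] = $0 }
        END {
            for (d in count) printf "%s: %d I/O errors\n", d, count[d]
            for (i = 1; i <= r; i++) print "remounted read-only: " remount[i]
        }
    '
}

# Example:
#   journalctl -k -b0 --no-pager | summarize_io_errors
```

The "Buffer I/O error on device dm-2" lines are deliberately not counted here, since they report the same failed writes one layer further up.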

/dev/sda is a relatively new SSD (no wearout yet). It only holds the OS. VMs etc. reside on other disks.

So what is the best course of action? I can't interpret the error messages and don't know how bad this is.

Should I power down the node and reinstall PVE on a new drive? Or can the node be saved somehow as is?

Thanks!

bbgeek17 · Jan 9, 2023

proxwolfe said:
not sure whether you mean any other sort of outage

proxwolfe said:
Jan 05 19:47:10 tx1330m2-1 kernel: Buffer I/O error on device dm-2, logical block 249317
Jan 05 19:47:10 tx1330m2-1 kernel: Buffer I/O error on device dm-2, logical block 7625784
Jan 05 19:47:10 tx1330m2-1 kernel: Buffer I/O error on device dm-2, logical block 2591555
Jan 05 19:47:10 tx1330m2-1 kernel: Buffer I/O error on device dm-2, logical block 2591556

proxwolfe said:
Jan 05 19:47:10 tx1330m2-1 kernel: EXT4-fs (dm-2): Remounting filesystem read-only

A hardware failure is an "outage". Your disk/cable/PCI/MB has failed and caused the root filesystem to become r/o. That's pretty catastrophic to system operation.
Sometimes new things fail right out of the box.

proxwolfe said:
Should I power down the node and reinstall PVE on a new drive? Or can the node be saved somehow as is?

You should check whether the old disk is still under warranty and get a new one. As for what to do with the server: you can try one of the many disk duplication methods to see if the content is salvageable, or reinstall. Whatever you think is the best use of your time.

proxwolfe · Jan 9, 2023

Then I will reinstall. I think that will be quicker than trying to duplicate a potentially corrupted file system from a defective disk.

But before I do, I need to resolve the read-only file system issue: given that I have a cluster, I want to migrate the VMs from the failing node to one of the other nodes to allow uninterrupted operation while rebuilding the node.

However, when I try to migrate from the GUI I get this error:

starting worker failed: unable to create output file '/var/log/pve/tasks/F/UPID:node1:xxxxxxxx:xxxxxxxx:xxxxxxxx:qmigrate:102:root@pam:' - Read-only file system (500)

bbgeek17 · Jan 9, 2023

Your system is in a state where it's impossible to predict what will and will not work.
You can try to remount it RW and hope that will last through migration: https://askubuntu.com/questions/175739/how-do-i-remount-a-filesystem-as-read-write
You can try to reboot and hope for the best.
You can boot from a rescue disk and manually copy VM images and config.

You will just need to try things and troubleshoot on the fly.
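
A rough sketch of the remount attempt. The remount itself is the standard `mount -o remount,rw /`; it may well be refused while the ext4 journal is in an aborted state. The helper `can_write_dir` is a made-up name. It only probes whether a file can be created in a directory such as the task log directory from the error message, so you know whether a migration task could even start.

```shell
#!/usr/bin/env bash
# Succeed if a file can actually be created (and removed again) in the
# given directory. Hypothetical helper; it creates a short-lived probe file.
can_write_dir() {
    local dir="$1" probe
    probe=$(mktemp "$dir/.write-probe.XXXXXX" 2>/dev/null) || return 1
    rm -f -- "$probe"
}

# Example on the affected node (run as root):
#   mount -o remount,rw /
#   can_write_dir /var/log/pve/tasks || echo "still not writable, migration tasks cannot start"
```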

proxwolfe · Jan 9, 2023

Yeah, just tried remounting to rw but that failed (forgot to capture the error message).

Then I lost the connection to the GUI (also via the other nodes).

So I logged into the node's console (which was running over with error messages) and rebooted.

Node came back up as if nothing had happened. I used this opportunity to update the certs on this node as well (which was my original problem...) and that worked as well. I can now access the consoles of all the VMs again. So that's a win.

I will watch that one node closely (given that it isn't hosting any mission critical VMs and that I run regular backups, the risk of data loss is manageable).

Thanks for all your help!

ollioddi · Nov 2, 2023

Moayad said:
Hi,

Can you try to renew the PVE certificates using the below command:

Bash:

pvecm updatecerts --force

If the issue still occurs, please check the Syslog for any interesting messages that appear when you try to log in to the PVE web GUI.

This fixed my issue! Thanks.

I had reinstalled a node from scratch and joined it into my cluster. I ran the command once on the "new" node. Just want to let others know.
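
A reinstall generates new SSH host keys, so the entry the other nodes still hold for that node name no longer matches. A minimal sketch for checking that by hand, assuming a plain (unhashed) known hosts file; the function names are hypothetical, and the paths in the example (`/etc/ssh/ssh_known_hosts`, `/etc/ssh/ssh_host_rsa_key.pub`) and the node name `node1` are assumptions that may differ between PVE versions.

```shell
#!/usr/bin/env bash
# Print the "keytype base64key" pairs recorded for a host in a known_hosts
# style file. Only plain entries are handled (no hashed names, no @markers);
# the first field may be a comma-separated list of names and addresses.
recorded_host_keys() {
    local host="$1" file="$2"
    awk -v h="$host" '
        /^[[:space:]]*(#|$)/ { next }       # skip comments and blank lines
        {
            n = split($1, names, ",")
            for (i = 1; i <= n; i++)
                if (names[i] == h) { print $2, $3; break }
        }
    ' "$file"
}

# Succeed if the given public key line ("keytype base64key [comment]")
# matches one of the keys recorded for the host.
host_key_is_current() {
    local host="$1" file="$2" pubkey_line="$3" offered
    offered=$(printf '%s\n' "$pubkey_line" | awk '{ print $1, $2 }')
    recorded_host_keys "$host" "$file" | grep -qxF -- "$offered"
}

# Example, run on the reinstalled node (paths and node name are assumptions):
#   host_key_is_current node1 /etc/ssh/ssh_known_hosts "$(cat /etc/ssh/ssh_host_rsa_key.pub)" \
#       || echo "cluster still has a stale host key for node1"
```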
