DMESG errors on migrated VMs and CTs, then VMs crash

helojunkie · Apr 1, 2021

Hello Community -

We built a new three-node Proxmox cluster and started migrating VMs and CTs from an existing single-node Proxmox unit to the cluster. I keep having issues with the migrated guests crashing with weird filesystem and mount errors. They run perfectly fine on the old Proxmox unit, and we are seeing this issue across multiple nodes of the cluster, so I do not think it is hardware related. *NEW* installs on the cluster do not have these issues. I was hoping someone could shed some light on this. It is happening to every migrated VM: eventually the system goes into RO mode and crashes, generally requiring a manual fsck on the next reboot.

Code:

[Wed Mar 31 11:33:38 2021] EXT4-fs (sda1): mounted filesystem without journal. Opts: (null)
[Wed Mar 31 11:33:38 2021] IPv6: ADDRCONF(NETDEV_UP): ens18: link is not ready
[Wed Mar 31 11:33:38 2021] e1000: ens18 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[Wed Mar 31 11:33:38 2021] IPv6: ADDRCONF(NETDEV_CHANGE): ens18: link becomes ready
[Wed Mar 31 11:33:38 2021] cgroup: new mount options do not match the existing superblock, will be ignored
[Wed Mar 31 12:49:09 2021] sd 2:0:0:0: [sda] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[Wed Mar 31 12:49:09 2021] sd 2:0:0:0: [sda] tag#14 CDB: Write(10) 2a 00 03 1d 6c a8 00 00 10 00
[Wed Mar 31 12:49:09 2021] blk_update_request: I/O error, dev sda, sector 52260008
[Wed Mar 31 12:49:09 2021] sd 2:0:0:0: [sda] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[Wed Mar 31 12:49:09 2021] sd 2:0:0:0: [sda] tag#13 CDB: Write(10) 2a 00 03 17 06 78 00 00 08 00
[Wed Mar 31 12:49:09 2021] blk_update_request: I/O error, dev sda, sector 51840632
[Wed Mar 31 12:49:09 2021] Aborting journal on device dm-0-8.
[Wed Mar 31 12:49:09 2021] Buffer I/O error on dev dm-0, logical block 6292175, lost async page write
[Wed Mar 31 12:49:09 2021] sd 2:0:0:0: [sda] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[Wed Mar 31 12:49:09 2021] sd 2:0:0:0: [sda] tag#12 CDB: Write(10) 2a 00 03 17 01 60 00 00 08 00
[Wed Mar 31 12:49:09 2021] blk_update_request: I/O error, dev sda, sector 51839328
[Wed Mar 31 12:49:09 2021] Buffer I/O error on dev dm-0, logical block 6292012, lost async page write
[Wed Mar 31 12:49:09 2021] sd 2:0:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[Wed Mar 31 12:49:09 2021] sd 2:0:0:0: [sda] tag#11 CDB: Write(10) 2a 00 02 23 e3 a0 00 00 08 00
[Wed Mar 31 12:49:09 2021] blk_update_request: I/O error, dev sda, sector 35906464
[Wed Mar 31 12:49:09 2021] EXT4-fs warning (device dm-0): ext4_end_bio:330: I/O error -5 writing to inode 1583870 (offset 0 size 0 starting block 4300404)
[Wed Mar 31 12:49:09 2021] Buffer I/O error on device dm-0, logical block 4300404
[Wed Mar 31 12:49:09 2021] sd 2:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[Wed Mar 31 12:49:09 2021] sd 2:0:0:0: [sda] tag#10 CDB: Write(10) 2a 00 00 2c 26 c8 00 00 18 00
[Wed Mar 31 12:49:09 2021] blk_update_request: I/O error, dev sda, sector 2893512
[Wed Mar 31 12:49:09 2021] EXT4-fs warning (device dm-0): ext4_end_bio:330: I/O error -5 writing to inode 1583870 (offset 0 size 0 starting block 173785)
[Wed Mar 31 12:49:09 2021] Buffer I/O error on device dm-0, logical block 173785
[Wed Mar 31 12:49:09 2021] Buffer I/O error on device dm-0, logical block 173786
[Wed Mar 31 12:49:09 2021] Buffer I/O error on device dm-0, logical block 173787
[Wed Mar 31 12:49:09 2021] EXT4-fs error (device dm-0): ext4_journal_check_start:56: Detected aborted journal
[Wed Mar 31 12:49:09 2021] EXT4-fs (dm-0): Remounting filesystem read-only
[Wed Mar 31 12:50:49 2021] ata3.00: exception Emask 0x0 SAct 0x1000000 SErr 0x0 action 0x6 frozen
[Wed Mar 31 12:50:49 2021] ata3.00: failed command: READ FPDMA QUEUED
[Wed Mar 31 12:50:49 2021] ata3.00: cmd 60/00:c0:58:69:7b/01:00:00:00:00/40 tag 24 ncq 131072 in
                                    res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[Wed Mar 31 12:50:49 2021] ata3.00: status: { DRDY }
[Wed Mar 31 12:50:49 2021] ata3: hard resetting link
[Wed Mar 31 12:51:02 2021] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[Wed Mar 31 12:51:02 2021] ata3.00: configured for UDMA/100
[Wed Mar 31 12:51:02 2021] ata3.00: device reported invalid CHS sector 0
[Wed Mar 31 12:51:02 2021] ata3: EH complete
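
For anyone reading a log like the one above, the `CDB: Write(10)` lines can be decoded by hand to confirm that the failing sectors are exactly the ones the timed-out commands targeted. Below is a minimal Python sketch, assuming the standard 10-byte WRITE(10) layout (opcode 0x2a, 4-byte big-endian LBA in bytes 2-5, 2-byte transfer length in bytes 7-8). The function name `decode_write10` is made up for illustration and is not part of any Proxmox or kernel tooling.

```python
def decode_write10(cdb_hex: str) -> tuple[int, int]:
    """Return (start_lba, block_count) for a SCSI WRITE(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2..5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length in blocks
    return lba, count


# CDBs copied from the dmesg output above, paired with the sector that
# blk_update_request reported on the following line.
samples = {
    "2a 00 03 1d 6c a8 00 00 10 00": 52260008,
    "2a 00 03 17 06 78 00 00 08 00": 51840632,
    "2a 00 03 17 01 60 00 00 08 00": 51839328,
}

for cdb_hex, reported_sector in samples.items():
    lba, count = decode_write10(cdb_hex)
    print(f"lba={lba} blocks={count} matches_log={lba == reported_sector}")
```

In each case the decoded LBA equals the sector in the `I/O error` line that follows it, so these look like plain writes to the guest's disk that timed out (`DRIVER_TIMEOUT`) rather than errors at scattered, unrelated locations.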

mira · Apr 2, 2021

Please post the output of pveversion -v of all nodes in the cluster. Please also mention which of those are affected by this problem.
In addition, how are the disks connected? Via a RAID Controller, HBA or directly on the mainboard? Does it differ in any way between old nodes and new ones?

helojunkie · Apr 2, 2021

Hi Mira -

I have a three-node cluster. Two of the nodes are Dell R820s, each with 4 x CPU, 256GB RAM, 2 x HGST 12Gb/s SAS SSDs (OS) in a ZFS mirror on an LSI 3008 IT-mode HBA, and 2 x Intel P4500 Series NVMe drives (4TB each), also in a ZFS mirror; these plug directly into the motherboard PCI slots as they are NVMe. The third node in the cluster is an HP Z820 with 2 x Xeon CPUs and 256GB RAM; it is only used for testing VMs and CTs before deployment and to keep the cluster in quorum. All nodes are connected via a private 10GbE Corosync network for backend traffic and another separate 10GbE network for front-end traffic.

All servers have a common shared NAS space (NFS via TrueNAS) for ISO and backups. These two Dell servers were doing other duty before moving them to proxmox and never had any hardware issues over the years. Before conversion to Proxmox they were running Ubuntu with zero issues FWIW.

The servers we were running on also used ZFS-mirrored SSD drives; they were HP Z820s with 2 x CPU and 256GB RAM. We have run those for almost 5 years without a single problem. However, they were starting to get loaded down and were also not in a cluster. They were also running version 5.4-15 of Proxmox. I still have that system running if you need more information on it. Those SSD drives were connected to an LSI 2308 SAS controller in IT mode.

So proxmox01 and proxmox02 are the Dell R820s, and both have the issue: I take a vzdump backup, restore it to the node, it comes up and runs with no issues, and then at some point it goes into RO mode and crashes. Rebooting requires a manual fsck to get it back up and running. I eventually had to migrate everything back to the old server. Last night I migrated a few less important VMs back to the new cluster, but with almost no load. One thing to note is that even running just a couple of low-end Linux CTs (6 x 8GB RAM each) and two Win10 Pro VMs (with 32GB memory), the servers are showing 300GB+ of RAM in use, which seems very, very high to me. Maybe a memory leak in 6.3-2, or is that normal? With ALL of these VMs on my one Z820 running 5.4-15, memory utilization was about 80GB; now with just three running on one node it is over 128GB, so something seems off there as well.

Hopefully this is enough information. I really enjoy Proxmox and am a paid subscriber, as I will be on these nodes, but not if I cannot get them stable for production use. Please let me know if you need any additional information, and thank you again for your help.

proxmox01

Code:

root@proxmox01:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.73-1-pve)
pve-manager: 6.3-2 (running version: 6.3-2/22f57405)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-6
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-1
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

proxmox02

Code:

root@proxmox02:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.73-1-pve)
pve-manager: 6.3-2 (running version: 6.3-2/22f57405)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-6
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-1
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

proxmox03

Code:

root@proxmox03:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.73-1-pve)
pve-manager: 6.3-2 (running version: 6.3-2/22f57405)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-6
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-1
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
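
Comparing three long `pveversion -v` listings by eye is error-prone, so here is a rough Python sketch that parses each listing into a package-to-version map and reports only the packages whose versions differ between nodes. The helper names `parse_pveversion` and `differing_packages` are illustrative assumptions, and the second sample listing is deliberately altered (hypothetical) to show what a mismatch would look like.

```python
from typing import Dict, Optional


def parse_pveversion(text: str) -> Dict[str, str]:
    """Map package name -> version string from `pveversion -v` output."""
    versions = {}
    for line in text.splitlines():
        name, sep, version = line.partition(": ")
        # Skip blank lines and the shell prompt line (it contains '@').
        if sep and "@" not in name:
            versions[name.strip()] = version.strip()
    return versions


def differing_packages(outputs: Dict[str, str]) -> Dict[str, Dict[str, Optional[str]]]:
    """Return {package: {node: version}} for packages that are not identical on all nodes."""
    parsed = {node: parse_pveversion(text) for node, text in outputs.items()}
    packages = set().union(*parsed.values())
    diff = {}
    for pkg in sorted(packages):
        per_node = {node: parsed[node].get(pkg) for node in parsed}
        if len(set(per_node.values())) > 1:
            diff[pkg] = per_node
    return diff


# Trimmed sample input. The proxmox01 lines are copied from the listing above;
# the "example-node" lines are HYPOTHETICAL and only illustrate a mismatch.
outputs = {
    "proxmox01": "root@proxmox01:~# pveversion -v\n"
                 "pve-manager: 6.3-2 (running version: 6.3-2/22f57405)\n"
                 "zfsutils-linux: 0.8.5-pve1\n",
    "example-node": "pve-manager: 6.3-3 (hypothetical)\n"
                    "zfsutils-linux: 0.8.5-pve1\n",
}

for pkg, per_node in differing_packages(outputs).items():
    print(pkg, per_node)
```

Fed the three full listings above, this should report no differences, since the listings appear identical line for line apart from the prompt.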

Here you can see I only have a total of 8 running VMs and I am using over 300GB of memory!

Screen Shot 2021-04-02 at 7.24.32 AM.png

mira · Apr 2, 2021

Ah, so this happens to VMs that were restored from a backup.
The dmesg messages are from the host? Is it part of the ZFS?

helojunkie · Apr 2, 2021

The dmesg is from the guest VM, not the Proxmox server, and yes, this happens to a guest that is stopped, backed up on one Proxmox server, and then restored to the new Proxmox cluster and restarted. All Proxmox servers store (and run) the images off locally attached NVMe or SSD ZFS storage. I back up the images (vzdump, stopped, not snapshots) to NFS-backed storage.

mira · Apr 6, 2021

Is there anything in the journal on the cluster nodes during such a time?
Also try updating to the latest version.

What OS and which version is running in the VMs? Are Windows VMs also affected?

helojunkie · Apr 6, 2021

Everything is the latest version; these are brand new installs. All OSs are affected across both cluster nodes, but only on migrated VMs and CTs. If I create a guest from scratch on the node, the problem does not happen. Where is the journal? I will take a look.

mira · Apr 7, 2021

There are newer versions of many packages available.

You can either use journalctl -b to view the whole journal since the last boot, or take a look at the syslogs (/var/log/syslog.*).
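
To narrow the journal down to the minutes around a failure instead of reading everything since boot, `journalctl` also accepts `--since` and `--until`. The following is a small Python sketch that turns a timestamp in the format printed in the dmesg output earlier in the thread into such a command. The function name `journalctl_window` and the five-minute default are arbitrary choices for illustration, and it assumes the guest and node clocks roughly agree, which may not hold exactly.

```python
import shlex
from datetime import datetime, timedelta
from typing import List


def journalctl_window(dmesg_stamp: str, minutes: int = 5) -> List[str]:
    """Build a journalctl command covering +/- `minutes` around a dmesg-style timestamp."""
    # Timestamp format as printed in the log above, e.g. "Wed Mar 31 12:49:09 2021".
    when = datetime.strptime(dmesg_stamp, "%a %b %d %H:%M:%S %Y")
    delta = timedelta(minutes=minutes)
    fmt = "%Y-%m-%d %H:%M:%S"  # a format journalctl accepts for --since/--until
    return [
        "journalctl", "--no-pager",
        "--since", (when - delta).strftime(fmt),
        "--until", (when + delta).strftime(fmt),
    ]


# First I/O error in the posted log; print a copy-pasteable command line.
cmd = journalctl_window("Wed Mar 31 12:49:09 2021")
print(shlex.join(cmd))
```

The printed command is meant to be run on each cluster node; the sketch only builds the command line and does not execute it.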
