Ubuntu VM keeps crashing

novafreak69

I have an Ubuntu 20.04.6 LTS VM:
(attached screenshot: VM config.JPG)
Host is a Dell PowerEdge r820
4 X Intel(R) Xeon(R) CPU E5-4640 0 @ 2.40GHz
48 X 16 GB dual-rank 1333 MHz = 768 GB RAM


Proxmox - 5.11.22-4-pve #1 SMP PVE 5.11.22-8

The syslog on the host:
Aug 17 23:16:24 novafreakvm kernel: mce: [Hardware Error]: Machine check events logged
Aug 17 23:16:24 novafreakvm kernel: Memory failure: 0x6887b73: Sending SIGBUS to kvm:1873892 due to hardware memory corruption
Aug 17 23:16:24 novafreakvm kernel: Memory failure: 0x6887b73: dirty LRU page still referenced by 1 users
Aug 17 23:16:24 novafreakvm kernel: Memory failure: 0x6887b73: recovery action for dirty LRU page: Failed
Aug 17 23:16:38 novafreakvm pvestatd[1734]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout
Aug 17 23:16:39 novafreakvm pvestatd[1734]: status update time (6.631 seconds)

From the guest, journalctl -b -1 -e shows:
Aug 17 17:43:48 plex snapd[809]: storehelpers.go:769: cannot refresh: snap has no updates available: "core", "core20", "lxd"
Aug 17 18:17:01 plex CRON[18148]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 17 18:17:01 plex CRON[18149]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 17 18:17:01 plex CRON[18148]: pam_unix(cron:session): session closed for user root
Aug 17 19:17:01 plex CRON[19278]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 17 19:17:01 plex CRON[19279]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 17 19:17:01 plex CRON[19278]: pam_unix(cron:session): session closed for user root
Aug 17 19:38:18 plex systemd[1]: Starting Daily apt download activities...
Aug 17 19:38:19 plex systemd[1]: apt-daily.service: Succeeded.
Aug 17 19:38:19 plex systemd[1]: Finished Daily apt download activities.
Aug 17 19:38:48 plex snapd[809]: storehelpers.go:769: cannot refresh: snap has no updates available: "core", "core20", "lxd"
Aug 17 20:17:01 plex CRON[20494]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 17 20:17:01 plex CRON[20495]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 17 20:17:01 plex CRON[20494]: pam_unix(cron:session): session closed for user root
Aug 17 20:32:53 plex systemd[1]: Starting Message of the Day...
Aug 17 20:32:55 plex 50-motd-news[20829]: * Strictly confined Kubernetes makes edge and IoT secure. Learn how MicroK8s
Aug 17 20:32:55 plex 50-motd-news[20829]: just raised the bar for easy, resilient and secure K8s cluster deployment.
Aug 17 20:32:55 plex 50-motd-news[20829]: https://ubuntu.com/engage/secure-kubernetes-at-the-edge
Aug 17 20:32:55 plex systemd[1]: motd-news.service: Succeeded.
Aug 17 20:32:56 plex systemd[1]: Finished Message of the Day.
Aug 17 21:17:01 plex CRON[21668]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 17 21:17:01 plex CRON[21669]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 17 21:17:01 plex CRON[21668]: pam_unix(cron:session): session closed for user root
Aug 17 22:17:01 plex CRON[22818]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 17 22:17:01 plex CRON[22819]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 17 22:17:01 plex CRON[22818]: pam_unix(cron:session): session closed for user root
Aug 17 23:11:53 plex systemd[1]: Starting Refresh fwupd metadata and update motd...
Aug 17 23:11:53 plex systemd[1]: fwupd-refresh.service: Main process exited, code=exited, status=1/FAILURE
Aug 17 23:11:53 plex systemd[1]: fwupd-refresh.service: Failed with result 'exit-code'.
Aug 17 23:11:53 plex systemd[1]: Failed to start Refresh fwupd metadata and update motd.
Aug 17 23:17:01 plex CRON[23968]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 17 23:17:01 plex CRON[23969]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 17 23:17:01 plex CRON[23968]: pam_unix(cron:session): session closed for user root
Aug 18 00:00:10 plex systemd[1]: Starting Rotate log files...
Aug 18 00:00:10 plex systemd[1]: Starting Daily man-db regeneration...
Aug 18 00:00:10 plex rsyslogd[807]: [origin software="rsyslogd" swVersion="8.2001.0" x-pid="807" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Aug 18 00:00:10 plex systemd[1]: logrotate.service: Succeeded.
Aug 18 00:00:10 plex systemd[1]: Finished Rotate log files.
Aug 18 00:00:11 plex systemd[1]: man-db.service: Succeeded.
Aug 18 00:00:11 plex systemd[1]: Finished Daily man-db regeneration.
Aug 18 00:15:49 plex systemd[1]: Starting Message of the Day...
Aug 18 00:15:50 plex 50-motd-news[25203]: * Strictly confined Kubernetes makes edge and IoT secure. Learn how MicroK8s
Aug 18 00:15:50 plex 50-motd-news[25203]: just raised the bar for easy, resilient and secure K8s cluster deployment.
Aug 18 00:15:50 plex 50-motd-news[25203]: https://ubuntu.com/engage/secure-kubernetes-at-the-edge
Aug 18 00:15:50 plex systemd[1]: motd-news.service: Succeeded.
Aug 18 00:15:50 plex systemd[1]: Finished Message of the Day.
Aug 18 00:17:01 plex CRON[25237]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 18 00:17:01 plex CRON[25238]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 18 00:17:01 plex CRON[25237]: pam_unix(cron:session): session closed for user root
Aug 18 01:17:01 plex CRON[26426]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 18 01:17:01 plex CRON[26427]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 18 01:17:01 plex CRON[26426]: pam_unix(cron:session): session closed for user root
Aug 18 02:13:48 plex snapd[809]: storehelpers.go:769: cannot refresh: snap has no updates available: "core", "core20", "lxd"
Aug 18 02:17:01 plex CRON[32886]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 18 02:17:01 plex CRON[32887]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 18 02:17:01 plex CRON[32886]: pam_unix(cron:session): session closed for user root
Aug 18 03:10:01 plex CRON[33963]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 18 03:10:01 plex CRON[33964]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Aug 18 03:10:01 plex CRON[33963]: pam_unix(cron:session): session closed for user root
Aug 18 03:17:01 plex CRON[34093]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 18 03:17:01 plex CRON[34094]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 18 03:17:01 plex CRON[34093]: pam_unix(cron:session): session closed for user root

I have checked the host and found no errors in its logs. I have 3 VMs running on this host; one of them is running the exact same Ubuntu at the same patch level and does not crash. And what I mean by "crash" is that the VM is not even accessible via the console; I have to stop it, because a reboot fails in Proxmox.


Any help or push in the right direction would be greatly appreciated.

Thank you in advance.
 
Hi,

Code:
Aug 17 23:16:24 novafreakvm kernel: mce: [Hardware Error]: Machine check events logged
Aug 17 23:16:24 novafreakvm kernel: Memory failure: 0x6887b73: Sending SIGBUS to kvm:1873892 due to hardware memory corruption
Aug 17 23:16:24 novafreakvm kernel: Memory failure: 0x6887b73: dirty LRU page still referenced by 1 users
Aug 17 23:16:24 novafreakvm kernel: Memory failure: 0x6887b73: recovery action for dirty LRU page: Failed
This is pretty indicative of what is happening. There seems to be some bad memory.

Run memtest86+ (reboot, then select it in the bootloader menu) and let it run for at least a few hours. With 768GB, this can take a while to get even through one pass. If it reports some bad memory areas, take note of them - these can then be marked as bad so that Linux won't use them anymore.
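For illustration, a minimal sketch of how that marking could be done: memtest86+ can display defective regions as BadRAM address/mask patterns, and GRUB can pass them to the kernel via GRUB_BADRAM so those pages are never used. The pair below is purely illustrative, not taken from your system:
Code:
# /etc/default/grub -- replace the pair with the BadRAM values memtest86+ actually reports
GRUB_BADRAM="0x6887b73000,0xfffffffff000"

# then rebuild the GRUB config and reboot
update-grub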
 
Could it still be a hardware memory problem when only one of my Linux VMs is exhibiting this behavior?
 
No memory errors
Well, at least that can be ruled out now. Thanks for taking the time; more often than not it simply is some faulty RAM.

Now... can you try installing rasdaemon? It's a simple daemon that can decode and properly display information about an MCE, i.e. what caused it.
Code:
apt install rasdaemon

And then simply let the system run until this occurs again. After that, recorded errors can be viewed with
Code:
ras-mc-ctl --error-count
ras-mc-ctl --summary

There is also the Machine-check exception page on the Arch Linux wiki, which goes into a bit more detail about MCEs and rasdaemon.

Aug 17 23:16:24 novafreakvm kernel: Memory failure: 0x6887b73: Sending SIGBUS to kvm:1873892 due to hardware memory corruption
Can you also check - the next time this occurs - whether the memory addresses are the same (or at least have some common pattern)? If yes, then I'd simply exclude the region from usage by the kernel using the memmap= kernel parameter.
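A minimal sketch of what that could look like on the host, assuming GRUB is used for booting. The address is only an example, derived from pfn 0x6887b73 in the log above (0x6887b73 * 4 KiB = 0x6887b73000); take the real value from your own logs:
Code:
# /etc/default/grub on the host -- memmap=<size>$<start> reserves that physical range
# (append it to your existing options; the "$" usually needs escaping, often written
# as \\\$, so it survives both shell and GRUB parsing)
GRUB_CMDLINE_LINUX_DEFAULT="quiet memmap=4K\\\$0x6887b73000"

# rebuild the GRUB config and reboot
update-grub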
 
Can you also check - the next time this occurs - whether the memory addresses are the same (or at least have some common pattern)? If yes, then I'd simply exclude the region from usage by the kernel using the memmap= kernel parameter.
Aug 22 23:13:01 novafreakvm systemd[1]: Finished Proxmox VE replication runner.
Aug 22 23:13:08 novafreakvm kernel: mce: [Hardware Error]: Machine check events logged
Aug 22 23:13:08 novafreakvm kernel: Memory failure: 0xd9ebcb: Sending SIGBUS to kvm:2288 due to hardware memory corruption
Aug 22 23:13:08 novafreakvm kernel: Memory failure: 0xd9ebcb: recovery action for dirty LRU page: Recovered
Aug 22 23:13:16 novafreakvm pvestatd[1759]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout

That rules out the memory location being the common factor.
 
An update... I thought that I had disabled backups for that VM, but it turned out that I hadn't. Since stopping backups, the server has not crashed...
 
Crashed again....
Aug 25 23:47:50 novafreakvm kernel: mce: [Hardware Error]: Machine check events logged
Aug 25 23:47:50 novafreakvm kernel: Memory failure: 0x644e057: Sending SIGBUS to kvm:1065790 due to hardware memory corruption
Aug 25 23:47:50 novafreakvm kernel: Memory failure: 0x644e057: recovery action for dirty LRU page: Recovered
Aug 25 23:47:50 novafreakvm QEMU[1065790]: kvm: warning: Guest MCE Memory Error at QEMU addr 0x7f9091441000 and GUEST addr 0x4615241000 of type BUS_MCEERR_AO injected
Aug 25 23:47:57 novafreakvm kernel: mce: [Hardware Error]: Machine check events logged
Aug 25 23:47:57 novafreakvm kernel: Memory failure: 0x642cceb: Sending SIGBUS to kvm:1065790 due to hardware memory corruption
Aug 25 23:47:57 novafreakvm kernel: Memory failure: 0x642cceb: recovery action for dirty LRU page: Recovered
Aug 25 23:47:57 novafreakvm kernel: Memory failure: 0x642cceb: already hardware poisoned
Aug 25 23:47:57 novafreakvm kernel: x86/PAT: reserve_ram_pages_type failed [mem 0x642cceb000-0x642ccebfff], track 0x2, req 0x2
Aug 25 23:47:57 novafreakvm kernel: Could not invalidate pfn=0x642cceb from 1:1 map
Aug 25 23:48:00 novafreakvm systemd[1]: Starting Proxmox VE replication runner...
Aug 25 23:48:01 novafreakvm systemd[1]: pvesr.service: Succeeded.
Aug 25 23:48:01 novafreakvm systemd[1]: Finished Proxmox VE replication runner.
Aug 25 23:48:09 novafreakvm pvestatd[1759]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout

And I am not sure that rasdaemon is going to work on a VM:

# ras-mc-ctl --error-count
ras-mc-ctl: Error: No DIMMs found in /sys or new sysfs EDAC interface not found.

# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

DBD::SQLite::db prepare failed: no such table: devlink_event at /usr/sbin/ras-mc-ctl line 1181.
Can't call method "execute" on an undefined value at /usr/sbin/ras-mc-ctl line 1182.
 
And I am not sure that rasdaemon is going to work on a VM

You have to install it on the host.
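A quick way to check whether the host kernel actually exposes the memory-controller information rasdaemon needs is to look at the EDAC sysfs interface (a sketch, assuming the standard EDAC layout; sb_edac is the EDAC driver for Sandy Bridge-EP Xeons such as the E5-4640):
Code:
# on the Proxmox host, not inside the guest
ls /sys/devices/system/edac/mc/     # should list mc0, mc1, ... once an EDAC driver is loaded
modprobe sb_edac                    # load the Sandy Bridge-EP driver if nothing is listed
ras-mc-ctl --status                 # reports whether the RAS/EDAC drivers are loaded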

Since stopping backups... the server has not crashed...
Crashed again....

Was any backup running at that time?
Because of this recent report with the same server: [1]

Proxmox - 5.11.22-4-pve #1 SMP PVE 5.11.22-8

Quite old, btw...

[1] https://forum.proxmox.com/threads/p...g-during-backup-on-dell-poweredge-r820.132465
 
No backups running at the time.

Thank you for pointing out that my PVE was quite old... I found an error in my repo.

Updating now...
 

Attachments: proxmox Specs.JPG (39.7 KB)
