Ubuntu VM keeps crashing

novafreak69

I have an Ubuntu 20.04.6 LTS VM:
(attached screenshot: VM config.JPG)
Host is a Dell PowerEdge r820
4 X Intel(R) Xeon(R) CPU E5-4640 0 @ 2.40GHz
48 X 16 GB dual-rank 1333 MHz = 768 GB RAM


Proxmox - 5.11.22-4-pve #1 SMP PVE 5.11.22-8

The syslog on the host:
Aug 17 23:16:24 novafreakvm kernel: mce: [Hardware Error]: Machine check events logged
Aug 17 23:16:24 novafreakvm kernel: Memory failure: 0x6887b73: Sending SIGBUS to kvm:1873892 due to hardware memory corruption
Aug 17 23:16:24 novafreakvm kernel: Memory failure: 0x6887b73: dirty LRU page still referenced by 1 users
Aug 17 23:16:24 novafreakvm kernel: Memory failure: 0x6887b73: recovery action for dirty LRU page: Failed
Aug 17 23:16:38 novafreakvm pvestatd[1734]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout
Aug 17 23:16:39 novafreakvm pvestatd[1734]: status update time (6.631 seconds)

From the guest, journalctl -b -1 -e shows:
Aug 17 17:43:48 plex snapd[809]: storehelpers.go:769: cannot refresh: snap has no updates available: "core", "core20", "lxd"
Aug 17 18:17:01 plex CRON[18148]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 17 18:17:01 plex CRON[18149]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 17 18:17:01 plex CRON[18148]: pam_unix(cron:session): session closed for user root
Aug 17 19:17:01 plex CRON[19278]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 17 19:17:01 plex CRON[19279]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 17 19:17:01 plex CRON[19278]: pam_unix(cron:session): session closed for user root
Aug 17 19:38:18 plex systemd[1]: Starting Daily apt download activities...
Aug 17 19:38:19 plex systemd[1]: apt-daily.service: Succeeded.
Aug 17 19:38:19 plex systemd[1]: Finished Daily apt download activities.
Aug 17 19:38:48 plex snapd[809]: storehelpers.go:769: cannot refresh: snap has no updates available: "core", "core20", "lxd"
Aug 17 20:17:01 plex CRON[20494]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 17 20:17:01 plex CRON[20495]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 17 20:17:01 plex CRON[20494]: pam_unix(cron:session): session closed for user root
Aug 17 20:32:53 plex systemd[1]: Starting Message of the Day...
Aug 17 20:32:55 plex 50-motd-news[20829]: * Strictly confined Kubernetes makes edge and IoT secure. Learn how MicroK8s
Aug 17 20:32:55 plex 50-motd-news[20829]: just raised the bar for easy, resilient and secure K8s cluster deployment.
Aug 17 20:32:55 plex 50-motd-news[20829]: https://ubuntu.com/engage/secure-kubernetes-at-the-edge
Aug 17 20:32:55 plex systemd[1]: motd-news.service: Succeeded.
Aug 17 20:32:56 plex systemd[1]: Finished Message of the Day.
Aug 17 21:17:01 plex CRON[21668]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 17 21:17:01 plex CRON[21669]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 17 21:17:01 plex CRON[21668]: pam_unix(cron:session): session closed for user root
Aug 17 22:17:01 plex CRON[22818]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 17 22:17:01 plex CRON[22819]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 17 22:17:01 plex CRON[22818]: pam_unix(cron:session): session closed for user root
Aug 17 23:11:53 plex systemd[1]: Starting Refresh fwupd metadata and update motd...
Aug 17 23:11:53 plex systemd[1]: fwupd-refresh.service: Main process exited, code=exited, status=1/FAILURE
Aug 17 23:11:53 plex systemd[1]: fwupd-refresh.service: Failed with result 'exit-code'.
Aug 17 23:11:53 plex systemd[1]: Failed to start Refresh fwupd metadata and update motd.
Aug 17 23:17:01 plex CRON[23968]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 17 23:17:01 plex CRON[23969]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 17 23:17:01 plex CRON[23968]: pam_unix(cron:session): session closed for user root
Aug 18 00:00:10 plex systemd[1]: Starting Rotate log files...
Aug 18 00:00:10 plex systemd[1]: Starting Daily man-db regeneration...
Aug 18 00:00:10 plex rsyslogd[807]: [origin software="rsyslogd" swVersion="8.2001.0" x-pid="807" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Aug 18 00:00:10 plex systemd[1]: logrotate.service: Succeeded.
Aug 18 00:00:10 plex systemd[1]: Finished Rotate log files.
Aug 18 00:00:11 plex systemd[1]: man-db.service: Succeeded.
Aug 18 00:00:11 plex systemd[1]: Finished Daily man-db regeneration.
Aug 18 00:15:49 plex systemd[1]: Starting Message of the Day...
Aug 18 00:15:50 plex 50-motd-news[25203]: * Strictly confined Kubernetes makes edge and IoT secure. Learn how MicroK8s
Aug 18 00:15:50 plex 50-motd-news[25203]: just raised the bar for easy, resilient and secure K8s cluster deployment.
Aug 18 00:15:50 plex 50-motd-news[25203]: https://ubuntu.com/engage/secure-kubernetes-at-the-edge
Aug 18 00:15:50 plex systemd[1]: motd-news.service: Succeeded.
Aug 18 00:15:50 plex systemd[1]: Finished Message of the Day.
Aug 18 00:17:01 plex CRON[25237]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 18 00:17:01 plex CRON[25238]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 18 00:17:01 plex CRON[25237]: pam_unix(cron:session): session closed for user root
Aug 18 01:17:01 plex CRON[26426]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 18 01:17:01 plex CRON[26427]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 18 01:17:01 plex CRON[26426]: pam_unix(cron:session): session closed for user root
Aug 18 02:13:48 plex snapd[809]: storehelpers.go:769: cannot refresh: snap has no updates available: "core", "core20", "lxd"
Aug 18 02:17:01 plex CRON[32886]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 18 02:17:01 plex CRON[32887]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 18 02:17:01 plex CRON[32886]: pam_unix(cron:session): session closed for user root
Aug 18 03:10:01 plex CRON[33963]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 18 03:10:01 plex CRON[33964]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Aug 18 03:10:01 plex CRON[33963]: pam_unix(cron:session): session closed for user root
Aug 18 03:17:01 plex CRON[34093]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 18 03:17:01 plex CRON[34094]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 18 03:17:01 plex CRON[34093]: pam_unix(cron:session): session closed for user root

I have checked the host and found no errors in its logs. I have 3 VMs running on this host; one of them is running the exact same Ubuntu at the same patch level and does not crash. And what I mean by "crash" is that the VM is not even accessible via the console; I have to stop it, because a reboot fails in Proxmox.


Any help or push in the right direction would be greatly appreciated.

Thank you in advance.
 
Hi,

Code:
Aug 17 23:16:24 novafreakvm kernel: mce: [Hardware Error]: Machine check events logged
Aug 17 23:16:24 novafreakvm kernel: Memory failure: 0x6887b73: Sending SIGBUS to kvm:1873892 due to hardware memory corruption
Aug 17 23:16:24 novafreakvm kernel: Memory failure: 0x6887b73: dirty LRU page still referenced by 1 users
Aug 17 23:16:24 novafreakvm kernel: Memory failure: 0x6887b73: recovery action for dirty LRU page: Failed
This is pretty indicative of what is happening. There seems to be some bad memory.

Run memtest86+ (reboot, then select it in the bootloader menu) and let it run for at least a few hours. With 768GB, this can take a while to get even through one pass. If it reports some bad memory areas, take note of them - these can then be marked as bad so that Linux won't use them anymore.
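For illustration, a minimal sketch of how that marking could be done: memtest86+ can display defective regions as BadRAM address/mask patterns, and GRUB can pass them to the kernel via GRUB_BADRAM so those pages are never used. The pair below is purely illustrative, not taken from your system:
Code:
# /etc/default/grub -- replace the pair with the BadRAM values memtest86+ actually reports
GRUB_BADRAM="0x6887b73000,0xfffffffff000"

# then rebuild the GRUB config and reboot
update-grub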
 
Could it still be a hardware memory problem when only one of my Linux VMs is exhibiting this behavior?
 
No memory errors
Well, at least that can be ruled out now. Thanks for taking the time; more often than not it simply is some faulty RAM.

Now... can you try installing rasdaemon? It's a simple daemon that can decode and properly display information about an MCE, i.e. what caused it.
Code:
apt install rasdaemon

And then simply let the system run until this occurs again. After that, recorded errors can be viewed with
Code:
ras-mc-ctl --error-count
ras-mc-ctl --summary

There is also the Machine-check exception page on the Arch Linux wiki, which goes into a bit more detail about MCEs and rasdaemon.

Aug 17 23:16:24 novafreakvm kernel: Memory failure: 0x6887b73: Sending SIGBUS to kvm:1873892 due to hardware memory corruption
Can you also check - the next time this occurs - whether the memory addresses are the same (or at least have some common pattern)? If yes, then I'd simply exclude the region from usage by the kernel using the memmap= kernel parameter.
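A minimal sketch of what that could look like on the host, assuming GRUB is used for booting. The address is only an example, derived from pfn 0x6887b73 in the log above (0x6887b73 * 4 KiB = 0x6887b73000); take the real value from your own logs:
Code:
# /etc/default/grub on the host -- memmap=<size>$<start> reserves that physical range
# (append it to your existing options; the "$" usually needs escaping, often written
# as \\\$, so it survives both shell and GRUB parsing)
GRUB_CMDLINE_LINUX_DEFAULT="quiet memmap=4K\\\$0x6887b73000"

# rebuild the GRUB config and reboot
update-grub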
 
Can you also check - the next time this occurs - whether the memory addresses are the same (or at least have some common pattern)? If yes, then I'd simply exclude the region from usage by the kernel using the memmap= kernel parameter.
Aug 22 23:13:01 novafreakvm systemd[1]: Finished Proxmox VE replication runner.
Aug 22 23:13:08 novafreakvm kernel: mce: [Hardware Error]: Machine check events logged
Aug 22 23:13:08 novafreakvm kernel: Memory failure: 0xd9ebcb: Sending SIGBUS to kvm:2288 due to hardware memory corruption
Aug 22 23:13:08 novafreakvm kernel: Memory failure: 0xd9ebcb: recovery action for dirty LRU page: Recovered
Aug 22 23:13:16 novafreakvm pvestatd[1759]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout

That rules out the memory location being the common factor.
 
An update... I thought that I had disabled backups for that VM, but it turned out that I hadn't. Since stopping backups, the server has not crashed...
 
Crashed again....
Aug 25 23:47:50 novafreakvm kernel: mce: [Hardware Error]: Machine check events logged
Aug 25 23:47:50 novafreakvm kernel: Memory failure: 0x644e057: Sending SIGBUS to kvm:1065790 due to hardware memory corruption
Aug 25 23:47:50 novafreakvm kernel: Memory failure: 0x644e057: recovery action for dirty LRU page: Recovered
Aug 25 23:47:50 novafreakvm QEMU[1065790]: kvm: warning: Guest MCE Memory Error at QEMU addr 0x7f9091441000 and GUEST addr 0x4615241000 of type BUS_MCEERR_AO injected
Aug 25 23:47:57 novafreakvm kernel: mce: [Hardware Error]: Machine check events logged
Aug 25 23:47:57 novafreakvm kernel: Memory failure: 0x642cceb: Sending SIGBUS to kvm:1065790 due to hardware memory corruption
Aug 25 23:47:57 novafreakvm kernel: Memory failure: 0x642cceb: recovery action for dirty LRU page: Recovered
Aug 25 23:47:57 novafreakvm kernel: Memory failure: 0x642cceb: already hardware poisoned
Aug 25 23:47:57 novafreakvm kernel: x86/PAT: reserve_ram_pages_type failed [mem 0x642cceb000-0x642ccebfff], track 0x2, req 0x2
Aug 25 23:47:57 novafreakvm kernel: Could not invalidate pfn=0x642cceb from 1:1 map
Aug 25 23:48:00 novafreakvm systemd[1]: Starting Proxmox VE replication runner...
Aug 25 23:48:01 novafreakvm systemd[1]: pvesr.service: Succeeded.
Aug 25 23:48:01 novafreakvm systemd[1]: Finished Proxmox VE replication runner.
Aug 25 23:48:09 novafreakvm pvestatd[1759]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout

And I am not sure that rasdaemon is going to work on a VM:

# ras-mc-ctl --error-count
ras-mc-ctl: Error: No DIMMs found in /sys or new sysfs EDAC interface not found.

# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

DBD::SQLite::db prepare failed: no such table: devlink_event at /usr/sbin/ras-mc-ctl line 1181.
Can't call method "execute" on an undefined value at /usr/sbin/ras-mc-ctl line 1182.
 
And I am not sure that rasdaemon is going to work on a VM

You have to install it on the host.
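A quick way to check whether the host kernel actually exposes the memory-controller information rasdaemon needs is to look at the EDAC sysfs interface (a sketch, assuming the standard EDAC layout; sb_edac is the EDAC driver for Sandy Bridge-EP Xeons such as the E5-4640):
Code:
# on the Proxmox host, not inside the guest
ls /sys/devices/system/edac/mc/     # should list mc0, mc1, ... once an EDAC driver is loaded
modprobe sb_edac                    # load the Sandy Bridge-EP driver if nothing is listed
ras-mc-ctl --status                 # reports whether the RAS/EDAC drivers are loaded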

Since stopping backups... the server has not crashed...
Crashed again....

Was any backup running at that time?
Because of this recent report with the same server: [1]

Proxmox - 5.11.22-4-pve #1 SMP PVE 5.11.22-8

Quite old, btw...

[1] https://forum.proxmox.com/threads/p...g-during-backup-on-dell-poweredge-r820.132465
 
No backups running at the time.

Thank you for pointing out that my PVE was quite old... I found an error in my repo.

Updating now...
 

Attachments: proxmox Specs.JPG (39.7 KB)
