Corrupted Ubuntu VMs over time

n4mobile

Member
Oct 19, 2022
We have 6 servers in our datacenter, mostly ProLiant DL380 Gen8 and Gen9 with RAID cards, running Proxmox 8.3.1 over RAID 50 on SAS drives. One of our servers has 8 SATA SSDs.

Roughly every month, 4 of our VMs get corrupted at almost the same time, on different PVE hosts.
They are all Ubuntu 22.04+; our VMs with Ubuntu 18 or 20, or Windows, don't get corrupted. They reside on local-lvm in raw format.

I was wondering if anybody encountered something similar.
 

Attachments

  • Capture d’écran, le 2025-02-17 à 08.08.30.png
I don't have anything in the iLO logs.
Is there somewhere specific on the host I could check?
 
I filtered a lot of logs from the journals, and from a Zabbix server that froze due to corruption around 4 AM on March 1st, I figured that maybe a cron job could be at fault.

Honestly, I'm not sure what to look for.
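For reference, the filtering I did was essentially a time-window export of the journal plus a grep for error-ish keywords. The two sample lines below just stand in for real journal output; the dates and pattern are examples, not the exact ones I used:

```shell
#!/bin/sh
# In practice the window came from something like:
#   journalctl --since "2025-03-01 03:00" --until "2025-03-01 05:00" > /tmp/window.log
# Two sample lines stand in for the exported window here.
cat > /tmp/window.log <<'EOF'
Mar 01 03:58:12 pve1 kernel: blk_update_request: I/O error, dev sda, sector 1234
Mar 01 04:00:09 pve1 pmxcfs[1852]: [status] notice: received log
EOF
# Keep only lines that look like storage trouble
grep -E 'error|corrupt|blk_update|I/O' /tmp/window.log
```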


46879: Mar 01 01:28:34 pve1 pmxcfs[1852]: [dcdb] notice: data verification successful
46900: Mar 01 01:33:04 pve1 pvestatd[410294]: status update time (5.116 seconds)
46923: Mar 01 01:38:04 pve1 pvestatd[410294]: status update time (5.016 seconds)
46940: Mar 01 01:42:04 pve1 pvestatd[410294]: status update time (5.021 seconds)
46967: Mar 01 01:48:04 pve1 pvestatd[410294]: status update time (5.126 seconds)
46970: Mar 01 01:48:41 pve1 corosync[1940]: [TOTEM ] Retransmit List: e7ceb5
46971: Mar 01 01:48:41 pve1 corosync[1940]: [TOTEM ] Retransmit List: e7ceb6
46972: Mar 01 01:48:41 pve1 corosync[1940]: [TOTEM ] Retransmit List: e7ceb7
46973: Mar 01 01:48:41 pve1 corosync[1940]: [TOTEM ] Retransmit List: e7ceb8
47027: Mar 01 02:00:45 pve1 systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
47028: Mar 01 02:00:46 pve1 systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
47029: Mar 01 02:00:46 pve1 systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
47030: Mar 01 02:00:46 pve1 systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
47102: Mar 01 02:17:01 pve1 CRON[4109819]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
47103: Mar 01 02:17:01 pve1 CRON[4109820]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
47104: Mar 01 02:17:01 pve1 CRON[4109819]: pam_unix(cron:session): session closed for user root
47155: Mar 01 02:28:34 pve1 pmxcfs[1852]: [dcdb] notice: data verification successful
47216: Mar 01 02:39:39 pve1 pmxcfs[1852]: [status] notice: received log
47217: Mar 01 02:39:46 pve1 pmxcfs[1852]: [status] notice: received log
47250: Mar 01 02:47:03 pve1 pvedaemon[3856471]: worker exit
47251: Mar 01 02:47:03 pve1 pvedaemon[2029]: worker 3856471 finished
47252: Mar 01 02:47:03 pve1 pvedaemon[2029]: starting 1 worker(s)
47253: Mar 01 02:47:03 pve1 pvedaemon[2029]: worker 4143059 started
47354: Mar 01 03:10:01 pve1 CRON[4168393]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
47355: Mar 01 03:10:01 pve1 CRON[4168394]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
47356: Mar 01 03:10:01 pve1 CRON[4168393]: pam_unix(cron:session): session closed for user root
47361: Mar 01 03:11:03 pve1 pmxcfs[1852]: [status] notice: received log
47362: Mar 01 03:11:11 pve1 pmxcfs[1852]: [status] notice: received log
47389: Mar 01 03:17:01 pve1 CRON[4176149]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
47390: Mar 01 03:17:01 pve1 CRON[4176153]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
47391: Mar 01 03:17:01 pve1 CRON[4176149]: pam_unix(cron:session): session closed for user root
47442: Mar 01 03:28:34 pve1 pmxcfs[1852]: [dcdb] notice: data verification successful
47485: Mar 01 03:38:29 pve1 systemd[1]: Starting pve-daily-update.service - Daily PVE download activities...
47486: Mar 01 03:38:31 pve1 pveupdate[6117]: <root@pam> starting task UPID:pve1:0000183C:2C551590:67C2C787:aptupdate::root@pam:
47488: Mar 01 03:38:35 pve1 pveupdate[6204]: update new package list: /var/lib/pve-manager/pkgupdates
47490: Mar 01 03:38:38 pve1 pveupdate[6117]: <root@pam> end task UPID:pve1:0000183C:2C551590:67C2C787:aptupdate::root@pam: OK
47491: Mar 01 03:38:38 pve1 pveupdate[6117]: ACME config found for node, but no custom certificate exists. Skipping ACME renewal until initial certificate has been deployed.
47492: Mar 01 03:38:39 pve1 pveupdate[6117]: cleanup removed 669 task logs
47493: Mar 01 03:38:39 pve1 systemd[1]: pve-daily-update.service: Deactivated successfully.
47494: Mar 01 03:38:39 pve1 systemd[1]: Finished pve-daily-update.service - Daily PVE download activities.
47495: Mar 01 03:38:39 pve1 systemd[1]: pve-daily-update.service: Consumed 7.638s CPU time.
47521: Mar 01 03:44:37 pve1 pveproxy[2041]: worker 3959477 finished
47522: Mar 01 03:44:37 pve1 pveproxy[2041]: starting 1 worker(s)
47523: Mar 01 03:44:37 pve1 pveproxy[2041]: worker 13561 started
47524: Mar 01 03:44:38 pve1 pveproxy[13560]: got inotify poll request in wrong process - disabling inotify
47526: Mar 01 03:44:40 pve1 pveproxy[13560]: worker exit
47539: Mar 01 03:47:39 pve1 pveproxy[3959476]: worker exit
47540: Mar 01 03:47:39 pve1 pveproxy[2041]: worker 3959476 finished
47541: Mar 01 03:47:39 pve1 pveproxy[2041]: starting 1 worker(s)
47542: Mar 01 03:47:39 pve1 pveproxy[2041]: worker 16895 started
47551: Mar 01 03:49:16 pve1 pveproxy[3959478]: worker exit
47552: Mar 01 03:49:16 pve1 pveproxy[2041]: worker 3959478 finished
47553: Mar 01 03:49:16 pve1 pveproxy[2041]: starting 1 worker(s)
47554: Mar 01 03:49:16 pve1 pveproxy[2041]: worker 18691 started
47602: Mar 01 04:00:09 pve1 pmxcfs[1852]: [status] notice: received log
47674: Mar 01 04:17:01 pve1 CRON[49754]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
47675: Mar 01 04:17:01 pve1 CRON[49755]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
47676: Mar 01 04:17:01 pve1 CRON[49754]: pam_unix(cron:session): session closed for user root
47727: Mar 01 04:28:34 pve1 pmxcfs[1852]: [dcdb] notice: data verification successful
47882: Mar 01 05:03:44 pve1 corosync[1940]: [KNET ] link: host: 5 link: 0 is down
47883: Mar 01 05:03:44 pve1 corosync[1940]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
47884: Mar 01 05:03:44 pve1 corosync[1940]: [KNET ] host: host: 5 has no active links
47885: Mar 01 05:03:47 pve1 corosync[1940]: [KNET ] rx: host: 5 link: 0 is up
47886: Mar 01 05:03:47 pve1 corosync[1940]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
47887: Mar 01 05:03:47 pve1 corosync[1940]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
47888: Mar 01 05:03:47 pve1 corosync[1940]: [KNET ] pmtud: Global data MTU changed to: 1397
47947: Mar 01 05:17:01 pve1 CRON[115807]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
47948: Mar 01 05:17:01 pve1 CRON[115808]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
47949: Mar 01 05:17:01 pve1 CRON[115807]: pam_unix(cron:session): session closed for user root
47986: Mar 01 05:25:32 pve1 pmxcfs[1852]: [status] notice: received log
47987: Mar 01 05:25:41 pve1 pmxcfs[1852]: [status] notice: received log
48002: Mar 01 05:28:34 pve1 pmxcfs[1852]: [dcdb] notice: data verification successful
48025: Mar 01 05:33:57 pve1 pvedaemon[4026973]: worker exit
48026: Mar 01 05:33:58 pve1 pvedaemon[2029]: worker 4026973 finished
48027: Mar 01 05:33:58 pve1 pvedaemon[2029]: starting 1 worker(s)
48028: Mar 01 05:33:58 pve1 pvedaemon[2029]: worker 134407 started
48047: Mar 01 05:37:54 pve1 pvedaemon[4029201]: worker exit
48048: Mar 01 05:37:54 pve1 pvedaemon[2029]: worker 4029201 finished
48049: Mar 01 05:37:54 pve1 pvedaemon[2029]: starting 1 worker(s)
48050: Mar 01 05:37:54 pve1 pvedaemon[2029]: worker 138775 started
 
Hi,
47355: Mar 01 03:10:01 pve1 CRON[4168394]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
note that this will short-circuit: /run/systemd/system exists (assuming this is your PVE host), so the latter half of the command is never executed, e.g.
Code:
test -e /run/systemd/system || echo "foo"
should not print anything on your host.
 
Hi,

note that this will short-circuit: /run/systemd/system exists (assuming this is your PVE host), so the latter half of the command is never executed, e.g.
Code:
test -e /run/systemd/system || echo "foo"
should not print anything on your host.
I ran `test -e /run/systemd/system || echo "foo"` and got nothing.
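That matches the short-circuit behaviour: with `cmd1 || cmd2`, the shell runs `cmd2` only when `cmd1` exits non-zero. A quick sanity check (paths here are just examples):

```shell
#!/bin/sh
# The right-hand side of || runs only when the left side fails...
test -e /definitely/absent/path || echo "left failed, right ran"
# ...and is skipped when the left side succeeds (prints nothing)
test -e / || echo "never printed"
```

So on a systemd host, the e2scrub_all cron entry is a deliberate no-op; the systemd timer is supposed to handle it instead.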