[SOLVED] pvestatd.service crashing at midnight

Elfy

Well-Known Member
Dec 29, 2016
Hi all,

This is an ongoing issue that I have been dealing with. On one of my nodes, pvestatd.service crashes intermittently at or very near midnight. Sometimes only the daemon crashes; other times the node becomes unresponsive until a hard reboot.
1XIOoqi.png


The main difference between the two nodes is that the crashing node is an AMD Ryzen system, which should be running the latest AGESA BIOS. Both nodes are running Proxmox 5.2-6; however, this crash has been happening since at least 5.2-0. I realize that many things could be causing this crash, so any help narrowing it down is much appreciated.

Here are the logs right around the time of crashing:
Node 1 (the crashing node):
Code:
Aug 02 23:16:54 Orion rrdcached[1784]: flushing old values
Aug 02 23:16:54 Orion rrdcached[1784]: rotating journals
Aug 02 23:16:54 Orion rrdcached[1784]: started new journal /var/lib/rrdcached/journal/rrd.journal.1533273414.037578
Aug 02 23:16:54 Orion rrdcached[1784]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1533266214.037600
Aug 02 23:16:54 Orion pmxcfs[1817]: [dcdb] notice: data verification successful
Aug 02 23:17:01 Orion CRON[9493]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 02 23:17:01 Orion CRON[9494]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 02 23:17:01 Orion CRON[9493]: pam_unix(cron:session): session closed for user root
Aug 02 23:48:49 Orion systemd[1]: Starting Daily apt download activities...
Aug 02 23:48:49 Orion systemd[1]: Started Daily apt download activities.
Aug 02 23:48:49 Orion systemd[1]: apt-daily.timer: Adding 2h 40min 51.115682s random time.
Aug 02 23:48:49 Orion systemd[1]: apt-daily.timer: Adding 8h 4min 39.312242s random time.
Aug 03 00:16:54 Orion rrdcached[1784]: flushing old values
Aug 03 00:16:54 Orion rrdcached[1784]: rotating journals
Aug 03 00:16:54 Orion rrdcached[1784]: started new journal /var/lib/rrdcached/journal/rrd.journal.1533277014.037604
Aug 03 00:16:54 Orion rrdcached[1784]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1533269814.037569
Aug 03 00:16:54 Orion pmxcfs[1817]: [dcdb] notice: data verification successful
Aug 03 00:17:01 Orion CRON[13785]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 03 00:17:01 Orion CRON[13786]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 03 00:17:01 Orion CRON[13785]: pam_unix(cron:session): session closed for user root
Aug 03 00:39:50 Orion systemd[1]: pvestatd.service: Main process exited, code=killed, status=11/SEGV
Aug 03 00:39:50 Orion kernel: pvestatd[1966]: segfault at c ip 00005627fb7b9cde sp 00007ffe9c5046b0 error 4 in perl[5627fb676000+1e6000]
Aug 03 00:39:51 Orion systemd[1]: pvestatd.service: Unit entered failed state.
Aug 03 00:39:51 Orion systemd[1]: pvestatd.service: Failed with result 'signal'.
Aug 03 01:16:54 Orion rrdcached[1784]: flushing old values
Aug 03 01:16:54 Orion rrdcached[1784]: rotating journals
Aug 03 01:16:54 Orion rrdcached[1784]: started new journal /var/lib/rrdcached/journal/rrd.journal.1533280614.037596
Aug 03 01:16:54 Orion rrdcached[1784]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1533273414.037578
Aug 03 01:16:54 Orion pmxcfs[1817]: [dcdb] notice: data verification successful
Aug 03 01:17:01 Orion CRON[16801]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 03 01:17:01 Orion CRON[16802]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 03 01:17:01 Orion CRON[16801]: pam_unix(cron:session): session closed for user root
Aug 03 02:16:54 Orion rrdcached[1784]: flushing old values
Aug 03 02:16:54 Orion rrdcached[1784]: rotating journals
Aug 03 02:16:54 Orion rrdcached[1784]: started new journal /var/lib/rrdcached/journal/rrd.journal.1533284214.037602
Aug 03 02:16:54 Orion rrdcached[1784]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1533277014.037604
Aug 03 02:16:54 Orion pmxcfs[1817]: [dcdb] notice: data verification successful
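
The kernel line in the log above packs several fields into one message. As a rough aid for reading it, here is a minimal Python sketch that decodes such a line, assuming the usual x86 page-fault error-code bits (bit 0: page present, bit 1: write access, bit 2: user mode). The function name `decode_segfault` and the returned keys are made up for this illustration and are not part of any Proxmox tool.

```python
import re

# Matches the kernel's segfault report as it appears in the log above.
# All numeric fields in this message are hexadecimal.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return the decoded fields of a kernel segfault line, or None if absent."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)
    return {
        "object": m["obj"],                      # binary or library that faulted
        "fault_address": int(m["addr"], 16),     # address that was accessed
        # Offset of the faulting instruction relative to the mapping base.
        "offset_in_object": int(m["ip"], 16) - int(m["base"], 16),
        "page_present": bool(err & 1),           # False: the page was not mapped
        "access": "write" if err & 2 else "read",
        "user_mode": bool(err & 4),
    }


line = ("Aug 03 00:39:50 Orion kernel: pvestatd[1966]: segfault at c "
        "ip 00005627fb7b9cde sp 00007ffe9c5046b0 error 4 "
        "in perl[5627fb676000+1e6000]")
info = decode_segfault(line)
print(info["object"], hex(info["fault_address"]), hex(info["offset_in_object"]))
```

For the line logged on Orion this gives a fault address of 0xc, a read access from user mode to a page that was not mapped, at offset 0x143cde into the perl binary. That pattern looks like a near-null pointer dereference, but on its own it does not say whether the cause is a software bug or unstable hardware.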


Node 2 (the stable node):
Code:
Aug 02 23:16:54 Wash pmxcfs[6092]: [dcdb] notice: data verification successful
Aug 02 23:17:01 Wash CRON[20020]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 02 23:17:01 Wash CRON[20021]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 02 23:17:01 Wash CRON[20020]: pam_unix(cron:session): session closed for user root
Aug 02 23:49:38 Wash rrdcached[1667]: flushing old values
Aug 02 23:49:38 Wash rrdcached[1667]: rotating journals
Aug 02 23:49:38 Wash rrdcached[1667]: started new journal /var/lib/rrdcached/journal/rrd.journal.1533275378.649650
Aug 02 23:49:38 Wash rrdcached[1667]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1533268178.649595
Aug 03 00:16:54 Wash pmxcfs[6092]: [dcdb] notice: data verification successful
Aug 03 00:17:01 Wash CRON[5371]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 03 00:17:01 Wash CRON[5372]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 03 00:17:01 Wash CRON[5371]: pam_unix(cron:session): session closed for user root
Aug 03 00:49:38 Wash rrdcached[1667]: flushing old values
Aug 03 00:49:38 Wash rrdcached[1667]: rotating journals
Aug 03 00:49:38 Wash rrdcached[1667]: started new journal /var/lib/rrdcached/journal/rrd.journal.1533278978.649663
Aug 03 00:49:38 Wash rrdcached[1667]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1533271778.649650
Aug 03 01:16:54 Wash pmxcfs[6092]: [dcdb] notice: data verification successful
Aug 03 01:17:01 Wash CRON[23367]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 03 01:17:01 Wash CRON[23368]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 03 01:17:01 Wash CRON[23367]: pam_unix(cron:session): session closed for user root
Aug 03 01:49:38 Wash rrdcached[1667]: flushing old values
Aug 03 01:49:38 Wash rrdcached[1667]: rotating journals
Aug 03 01:49:38 Wash rrdcached[1667]: started new journal /var/lib/rrdcached/journal/rrd.journal.1533282578.649650
Aug 03 01:49:38 Wash rrdcached[1667]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1533275378.649650
Aug 03 02:06:21 Wash audit[5324]: AVC apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=5324 comm="(certbot)" flags="rw, rslave"
Aug 03 02:06:21 Wash kernel: audit: type=1400 audit(1533283581.298:234): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=5324 comm="(certbot)" flags="rw, rslave"
Aug 03 02:16:54 Wash pmxcfs[6092]: [dcdb] notice: data verification successful
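
To check how closely the crashes cluster around midnight over several nights, the journal lines can be filtered for the pvestatd exit message. The following is only a sketch, assuming the short syslog-style format shown in the logs above; `crash_times` and its return shape are hypothetical names chosen for this example.

```python
import re

# Matches the systemd line that reports pvestatd being killed by a signal,
# in the "Mon DD HH:MM:SS host unit[pid]: message" format used above.
CRASH_RE = re.compile(
    r"^(?P<date>\w{3} \d{2}) (?P<hh>\d{2}):(?P<mm>\d{2}):(?P<ss>\d{2}) "
    r"(?P<host>\S+) systemd\[\d+\]: "
    r"pvestatd\.service: Main process exited, code=killed"
)


def crash_times(lines):
    """Return (host, date, minutes_from_midnight) for each pvestatd crash.

    Minutes are signed: 23:50 is -10 and 00:39 is 39.
    """
    hits = []
    for line in lines:
        m = CRASH_RE.match(line)
        if m is None:
            continue
        minutes = int(m["hh"]) * 60 + int(m["mm"])
        if minutes >= 12 * 60:
            # Afternoon and evening times count as "before midnight".
            minutes -= 24 * 60
        hits.append((m["host"], m["date"], minutes))
    return hits


sample = [
    "Aug 03 00:39:50 Orion systemd[1]: pvestatd.service: Main process exited, code=killed, status=11/SEGV",
    "Aug 03 00:39:51 Orion systemd[1]: pvestatd.service: Unit entered failed state.",
]
print(crash_times(sample))
```

Fed with the Node 1 log above, this reports a single crash on Orion roughly 39 minutes after midnight. Running it over a longer export of the journal would show whether the other crashes fall into the same window.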

One day I took a photo of the console on a hard crash. Here is the wall:
PgGsf2F.jpg


Thank you to anyone who takes the time to look this over. I couldn't get by without the support of this amazing community!

-Matt
 
While it may not be directly related, try disabling all power-saving options in the BIOS and disabling CPU scaling. At least on my workstation (also a Ryzen), this produced a stable environment.
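
One way to see what the kernel is currently doing with CPU frequency scaling is to read the cpufreq governor of each core from sysfs. The sketch below is a hedged example, not an official Proxmox procedure: it assumes the standard Linux cpufreq sysfs layout under `/sys/devices/system/cpu`, and `read_governors` is a hypothetical helper name. What exactly counts as "disabled" scaling is also an assumption here; a fixed governor such as `performance` is one common interpretation.

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Return a mapping like {"cpu0": "performance", ...}.

    The result is empty if the cpufreq interface is not available,
    for example when no cpufreq driver is loaded.
    """
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        # gov_file is .../cpuN/cpufreq/scaling_governor, so go up two levels.
        cpu_name = gov_file.parent.parent.name
        governors[cpu_name] = gov_file.read_text().strip()
    return governors
```

Calling `read_governors()` on the node before and after the change gives a quick confirmation that every core reports the governor you expect.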
 
Thanks Alwin. I'm back from the weekend, so I will give this a try tonight!
 
Alwin, your suggestion appears to have given me a stable system so far. I disabled CPU scaling and verified that all power saving is disabled in the BIOS. Two days of uptime without any issue. I'll keep an eye out and report back here if I see this crash happen again.

Thanks!
-Matt
 
