Proxmox 5.1 crashing every few hours

krystofr

Two issues, and I'm not sure whether they are related:

1) I have Proxmox set up to back up to an NFS drive each night. For the past few days the NFS drive was full, so the backup failed.
However, when the backup fails my VM becomes inaccessible via the web. Pingdom reports the sites down from 3:09 to 3:14.

These are the syslog entries:

Code:
Apr 25 02:59:01 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 02:59:01 willow CRON[5319]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 03:00:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 03:00:01 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 03:00:01 willow CRON[8304]: (root) CMD (vzdump 403 --storage ns3046957.ip-********.eu --mailnotification always --mailto *****@*****.** --mode snapshot --compress lzo --quiet 1)
Apr 25 03:00:01 willow CRON[8305]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 03:00:02 willow vzdump[8304]: <root@pam> starting task UPID:willow:000020B8:05EBC740:5ADFE122:vzdump::root@pam:
Apr 25 03:00:02 willow vzdump[8376]: INFO: starting new backup job: vzdump 403 --compress lzo --quiet 1 --mode snapshot --mailnotification always --storage ns*******.ip-*******.eu --mailto ******@******.**
Apr 25 03:00:02 willow vzdump[8376]: INFO: Starting Backup of VM 403 (qemu)
Apr 25 03:00:02 willow qm[8389]: <root@pam> update VM 403: -lock backup
Apr 25 03:01:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 03:01:01 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 03:01:01 willow CRON[12916]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 03:02:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 03:02:01 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 03:02:01 willow CRON[18611]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 03:03:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 03:03:01 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 03:03:01 willow CRON[24558]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 03:04:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 03:04:01 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 03:04:01 willow CRON[30340]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 03:04:04 willow smartd[2970]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Temperature_Case changed from 74 to 73
Apr 25 03:04:10 willow rrdcached[4451]: flushing old values
Apr 25 03:04:10 willow rrdcached[4451]: rotating journals
Apr 25 03:04:10 willow rrdcached[4451]: started new journal /var/lib/rrdcached/journal/rrd.journal.1524621850.794202
Apr 25 03:04:10 willow rrdcached[4451]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1524614650.794246
Apr 25 03:05:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 03:05:01 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 03:05:02 willow CRON[3605]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 03:06:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 03:06:01 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 03:06:01 willow CRON[10216]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 03:07:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 03:07:01 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 03:07:01 willow CRON[14752]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 03:07:45 willow pvestatd[4781]: got timeout
Apr 25 03:07:55 willow pvestatd[4781]: got timeout
Apr 25 03:07:55 willow pvestatd[4781]: unable to activate storage 'ns*******.ip**********.eu' - directory '/mnt/pve/ns******.ip.*******.eu' does not exist or is unreachable
Apr 25 03:08:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 03:08:01 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 03:08:01 willow CRON[17639]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 03:08:05 willow pvestatd[4781]: got timeout
Apr 25 03:08:05 willow pvestatd[4781]: unable to activate storage 'ns********.ip-*****.eu' - directory '/mnt/pve/ns******.ip-******.eu' does not exist or is unreachable
Apr 25 03:08:15 willow pvestatd[4781]: got timeout
Apr 25 03:08:15 willow pvestatd[4781]: unable to activate storage 'ns*******.ip-*******.eu' - directory '/mnt/pve/ns******.ip-******.eu' does not exist or is unreachable
Apr 25 03:08:25 willow pvestatd[4781]: got timeout
Apr 25 03:08:25 willow pvestatd[4781]: unable to activate storage 'ns******.ip-******.eu' - directory '/mnt/pve/ns******.ip-******.eu' does not exist or is unreachable
Apr 25 03:08:35 willow pvestatd[4781]: got timeout
Apr 25 03:08:35 willow pvestatd[4781]: unable to activate storage 'ns******.ip-*******.eu' - directory '/mnt/pve/ns******.ip-******.eu' does not exist or is unreachable
Apr 25 03:08:45 willow pvestatd[4781]: got timeout
Apr 25 03:08:45 willow pvestatd[4781]: unable to activate storage 'ns******.ip-******.eu' - directory '/mnt/pve/ns*******.ip-******.eu' does not exist or is unreachable
Apr 25 03:08:55 willow pvestatd[4781]: got timeout
Apr 25 03:08:55 willow pvestatd[4781]: unable to activate storage 'ns******.ip-******.eu.eu' - directory '/mnt/pve/ns******.ip-******.eu' does not exist or is unreachable
Apr 25 03:09:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 03:09:01 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 03:09:01 willow CRON[19588]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 03:09:25 willow pvestatd[4781]: got timeout
Apr 25 03:09:35 willow pvestatd[4781]: got timeout
Apr 25 03:09:45 willow pvestatd[4781]: got timeout
Apr 25 03:09:55 willow pvestatd[4781]: got timeout
Apr 25 03:10:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 03:10:01 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 03:10:01 willow CRON[23525]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 03:10:06 willow pvestatd[4781]: got timeout
Apr 25 03:10:06 willow pvestatd[4781]: unable to activate storage 'ns******.ip-******.eu' - directory '/mnt/pve/ns******.ip-******.eu' does not exist or is unreachable
Apr 25 03:10:15 willow pvestatd[4781]: got timeout
Apr 25 03:10:15 willow pvestatd[4781]: unable to activate storage 'ns******.ip-******.eu' - directory '/mnt/pve/ns******.ip-******.eu' does not exist or is unreachable
Apr 25 03:10:25 willow pvestatd[4781]: got timeout
Apr 25 03:10:25 willow pvestatd[4781]: unable to activate storage 'ns******.ip-******.eu' - directory '/mnt/pve/ns******.ip******.eu' does not exist or is unreachable
Apr 25 03:10:35 willow pvestatd[4781]: got timeout
Apr 25 03:10:35 willow pvestatd[4781]: unable to activate storage 'ns******.ip-*******.eu' - directory '/mnt/pve/ns******.ip******eu' does not exist or is unreachable
Apr 25 03:10:45 willow pvestatd[4781]: got timeout
Apr 25 03:10:45 willow pvestatd[4781]: unable to activate storage 'ns******.ip-******.eu' - directory '/mnt/pve/ns******7.ip-******.eu' does not exist or is unreachable
Apr 25 03:10:59 willow vzdump[8376]: ERROR: Backup of VM 403 failed - vma_queue_write: write error - Broken pipe
Apr 25 03:11:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 03:11:01 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 03:11:01 willow vzdump[8376]: INFO: Backup job finished with errors
Apr 25 03:11:01 willow CRON[25244]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 03:11:01 willow vzdump[8376]: job errors
Apr 25 03:11:01 willow postfix/pickup[28911]: 63B1D4DA71: uid=0 from=<******@*****.**>
Apr 25 03:11:01 willow vzdump[8304]: <root@pam> end task UPID:willow:000020B8:05EBC740:5ADFE122:vzdump::root@pam: job errors
Apr 25 03:11:01 willow postfix/cleanup[25245]: 63B1D4DA71: message-id=<20180425021101.63B1D4DA71@willow>
Apr 25 03:11:01 willow postfix/qmgr[4658]: 63B1D4DA71: from=<******@******.**>, size=6072, nrcpt=1 (queue active)
Apr 25 03:11:03 willow postfix/smtp[25257]: 63B1D4DA71: to=<******@******.**>, relay=mail.ethica.io[164.132.17.220]:25, delay=1.7, delays=0.04/0.01/1.3/0.35, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as C9763813A8FD)
Apr 25 03:11:03 willow postfix/qmgr[4658]: 63B1D4DA71: removed
Apr 25 03:12:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 03:12:00 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 03:12:01 willow CRON[27943]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 03:13:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 03:13:01 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 03:13:01 willow CRON[30487]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 03:14:00 willow systemd[1]: Starting Proxmox VE replica

The only entries around that time in the messages log are:
Code:
Apr 25 03:00:02 willow vzdump[8304]: <root@pam> starting task UPID:willow:000020B8:05EBC740:5ADFE122:vzdump::root@pam:
Apr 25 03:00:02 willow qm[8389]: <root@pam> update VM 403: -lock backup

Why would a failing backup cause sites to go down?

2) Proxmox has also started crashing and restarting every few hours, whether or not the backup is running. This started at the same time as the backups began failing for lack of space.

I have fixed the backup situation by making more room, so that issue should be resolved. I can't believe it would cause a crash when the backup is not running, but it is strange that both started at about the same time.
Any ideas?

These are the syslog entries for the latest crash:

Code:
Apr 25 12:46:00 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 12:46:01 willow CRON[32024]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 12:47:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 12:47:00 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 12:47:01 willow CRON[2142]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 12:48:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 12:48:00 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 12:48:01 willow CRON[4983]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 12:49:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 12:49:00 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 12:49:01 willow CRON[8397]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 12:50:00 willow systemd[1]: Starting Proxmox VE replication runner...
Apr 25 12:50:00 willow systemd[1]: Started Proxmox VE replication runner.
Apr 25 12:50:01 willow CRON[10742]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)
Apr 25 12:53:46 willow systemd-modules-load[1619]: Inserted module 'iscsi_tcp'
Apr 25 12:53:46 willow systemd-udevd[1657]: Network interface NamePolicy= disabled on kernel command line, ignoring.
Apr 25 12:53:46 willow systemd-modules-load[1619]: Inserted module 'ib_iser'
Apr 25 12:53:46 willow systemd-modules-load[1619]: Inserted module 'vhost_net'
Apr 25 12:53:46 willow systemd-sysctl[1690]: Couldn't write '0' to 'net/ipv6/conf/vmbr0/autoconf', ignoring: No such file or directory
Apr 25 12:53:46 willow systemd-sysctl[1690]: Couldn't write '0' to 'net/ipv6/conf/vmbr0/accept_ra', ignoring: No such file or directory
Apr 25 12:53:46 willow systemd-sysctl[1690]: Couldn't write '0' to 'net/ipv6/conf/vmbr0/accept_ra_defrtr', ignoring: No such file or directory
Apr 25 12:53:46 willow kernel: [    0.000000] random: get_random_bytes called from start_kernel+0x42/0x4fd with crng_init=0
Apr 25 12:53:46 willow kernel: [    0.000000] Linux version 4.13.16-2-pve (root@nora) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.13.16-47 (Mon, 9 Apr 2018 09:58:12 +0200) ()
Apr 25 12:53:46 willow kernel: [    0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.13.16-2-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs rootdelay=15 rootdelay=15 noquiet nosplash net.ifnames=0 biosdevname=0
Apr 25 12:53:46 willow kernel: [    0.000000] KERNEL supported cpus:
Apr 25 12:53:46 willow kernel: [    0.000000]   Intel GenuineIntel
Apr 25 12:53:46 willow kernel: [    0.000000]   AMD AuthenticAMD
Apr 25 12:53:46 willow kernel: [    0.000000]   Centaur CentaurHauls
Apr 25 12:53:46 willow kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Apr 25 12:53:46 willow kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Apr 25 12:53:46 willow kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Apr 25 12:53:46 willow kernel: [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Apr 25 12:53:46 willow kernel: [    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Apr 25 12:53:46 willow kernel: [    0.000000] e820: BIOS-provided physical RAM map:


pveversion -v
Code:
pveversion -v
proxmox-ve: 5.1-42 (running kernel: 4.13.16-2-pve)
pve-manager: 5.1-51 (running version: 5.1-51/96be5354)
pve-kernel-4.13: 5.1-44
pve-kernel-4.13.16-2-pve: 4.13.16-47
pve-kernel-4.13.16-1-pve: 4.13.16-46
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.13.13-3-pve: 4.13.13-34
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.13.13-1-pve: 4.13.13-31
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.13.8-2-pve: 4.13.8-28
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.10.17-4-pve: 4.10.17-24
pve-kernel-4.10.17-3-pve: 4.10.17-23
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.10.17-1-pve: 4.10.17-18
corosync: 2.4.2-pve4
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-30
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-18
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-2
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-15
pve-cluster: 5.0-25
pve-container: 2.0-21
pve-docs: 5.1-17
pve-firewall: 3.0-8
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-2
qemu-server: 5.0-25
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.7-pve1~bpo9

There are no entries in the messages log in the run-up to the crash, only the normal boot messages afterwards.

I really need help with this one, please.
 
After freeing up space on the NFS backup drive so that the backup process can complete, the KVM guests stay available during the backup, and the server no longer crashes during the day.
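
To guard against the drive filling up again, a free-space check could run before the nightly job. This is a minimal sketch, not part of my actual setup: the function name, the 50 GiB threshold, and the use of `/` as a stand-in path are all assumptions for illustration. In practice the path would be the storage's mount point under /mnt/pve/. Be aware that on an NFS mount that has stopped responding, the check itself may block.

```python
import shutil


def has_room_for_backup(mount_point, required_bytes):
    """Return True if the filesystem at mount_point has at least required_bytes free."""
    usage = shutil.disk_usage(mount_point)  # named tuple: total, used, free (in bytes)
    return usage.free >= required_bytes


# "/" is only a stand-in so the sketch runs anywhere; the threshold is hypothetical.
required = 50 * 1024**3
if has_room_for_backup("/", required):
    print("enough free space, backup can start")
else:
    print("not enough free space, skip the backup and send an alert")
```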

I can't imagine how a failed backup process could crash the server every few hours, but the crashes only began when the backups started failing, and they have stopped now that the backups complete.

Any ideas on how I could trace this fault further?
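
One thing that can at least be automated is pinning down when the host went silent before each reboot. The replication runner logs every minute, so a longer silence in syslog stands out. This is a minimal sketch, assuming the timestamp format shown in the syslog excerpts (which carries no year, so the year is passed in) and a hypothetical two-minute threshold. The function name is my own. It only locates the gap; it does not explain the crash.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, year=2018, min_gap=timedelta(minutes=2)):
    """Return (last_seen, next_seen) timestamp pairs that are more than min_gap apart."""
    gaps = []
    previous = None
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # blank or truncated line
        try:
            # Syslog lines start with e.g. "Apr 25 12:50:01"
            stamp = datetime.strptime(
                f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # line does not start with a timestamp
        if previous is not None and stamp - previous > min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps


# Two lines taken from the crash excerpt above
sample = [
    "Apr 25 12:50:01 willow CRON[10742]: (root) CMD (/usr/local/rtm/bin/rtm 23 > /dev/null 2> /dev/null)",
    "Apr 25 12:53:46 willow systemd-modules-load[1619]: Inserted module 'iscsi_tcp'",
]
for last_seen, next_seen in find_log_gaps(sample):
    print(f"silent from {last_seen} until {next_seen}")
```

Run over the whole of syslog, this would list every window in which the host logged nothing, which can then be compared with the backup schedule and the Pingdom reports.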
 
