Virtual machines failed to autostart

  • Thread starter Thread starter fluke
  • Start date Start date
F

fluke

Guest
I had a power failure, UPS also failed (lucky me) and hosts went down, but when power was restored, machines did not autostart, although 'Start at boot' is checked. I had to start them manually through the web interface. Where can I check, what went wrong? VM's are on DRBD.
 
Any hint in syslog?

Not really, I can see that VM's are tried to be started before drbd is initialized. Probably when a 'dirty' shutdown occurs, drbd comes up to date more slowly and that's why VM's are not started.

Code:
Jun 16 11:07:01 proxmox1 qm[3007]: VM 102 start
Jun 16 11:07:02 proxmox1 qm[3007]: VM 101 start
Jun 16 11:07:02 proxmox1 qm[3007]: VM 108 start
Jun 16 11:07:02 proxmox1 qm[3007]: VM 105 start
Jun 16 11:07:02 proxmox1 qm[3007]: VM 107 start
Jun 16 11:07:02 proxmox1 qm[3007]: VM 113 start
Jun 16 11:07:03 proxmox1 kernel: drbd: initialized. Version: 8.3.4 (api:88/proto:86-91)
Jun 16 11:07:03 proxmox1 kernel: drbd: GIT-hash: 70a645ae080411c87b4482a135847d69dc90a6a2 build by root@oahu, 2010-01-15 11:39:31
Jun 16 11:07:03 proxmox1 kernel: drbd: registered as block device major 147
Jun 16 11:07:03 proxmox1 kernel: drbd: minor_table @ 0xffff88041e720100
Jun 16 11:07:03 proxmox1 kernel: block drbd0: Starting worker thread (from cqueue [3086])
Jun 16 11:07:03 proxmox1 kernel: block drbd0: disk( Diskless -> Attaching ) 
Jun 16 11:07:03 proxmox1 kernel: block drbd0: Found 57 transactions (3507 active extents) in activity log.
Jun 16 11:07:03 proxmox1 kernel: block drbd0: Method to ensure write ordering: barrier
Jun 16 11:07:03 proxmox1 kernel: block drbd0: max_segment_size ( = BIO size ) = 32768
Jun 16 11:07:03 proxmox1 kernel: block drbd0: drbd_bm_resize called with capacity == 1499964992
Jun 16 11:07:03 proxmox1 kernel: block drbd0: resync bitmap: bits=187495624 words=2929620
Jun 16 11:07:03 proxmox1 kernel: block drbd0: size = 715 GB (749982496 KB)
Jun 16 11:07:03 proxmox1 kernel: block drbd0: recounting of set bits took additional 1 jiffies
Jun 16 11:07:03 proxmox1 kernel: block drbd0: 8064 MB (2064384 bits) marked out-of-sync by on disk bit-map.
Jun 16 11:07:03 proxmox1 kernel: block drbd0: Marked additional 5492 MB as out-of-sync based on AL.
Jun 16 11:07:03 proxmox1 kernel: block drbd0: disk( Attaching -> Inconsistent ) 
Jun 16 11:07:03 proxmox1 kernel: block drbd1: Starting worker thread (from cqueue [3086])
Jun 16 11:07:03 proxmox1 kernel: block drbd1: disk( Diskless -> Attaching ) 
Jun 16 11:07:03 proxmox1 kernel: block drbd1: Found 57 transactions (1489 active extents) in activity log.
Jun 16 11:07:03 proxmox1 kernel: block drbd1: Method to ensure write ordering: barrier
Jun 16 11:07:03 proxmox1 kernel: block drbd1: max_segment_size ( = BIO size ) = 32768
Jun 16 11:07:03 proxmox1 kernel: block drbd1: drbd_bm_resize called with capacity == 1751224256
Jun 16 11:07:03 proxmox1 kernel: block drbd1: resync bitmap: bits=218903032 words=3420360
Jun 16 11:07:03 proxmox1 kernel: block drbd1: size = 835 GB (875612128 KB)
Jun 16 11:07:03 proxmox1 kernel: block drbd1: recounting of set bits took additional 1 jiffies
Jun 16 11:07:03 proxmox1 kernel: block drbd1: 780 MB (199680 bits) marked out-of-sync by on disk bit-map.
Jun 16 11:07:03 proxmox1 kernel: block drbd1: Marked additional 5048 MB as out-of-sync based on AL.
Jun 16 11:07:03 proxmox1 kernel: block drbd1: disk( Attaching -> Inconsistent ) 
Jun 16 11:07:03 proxmox1 kernel: padlock: VIA PadLock Hash Engine not detected.
Jun 16 11:07:03 proxmox1 modprobe: FATAL: Error inserting padlock_sha (/lib/modules/2.6.32-1-pve/kernel/drivers/crypto/padlock-sha.ko): No such device
Jun 16 11:07:03 proxmox1 kernel: block drbd0: conn( StandAlone -> Unconnected ) 
Jun 16 11:07:03 proxmox1 kernel: block drbd0: Starting receiver thread (from drbd0_worker [3089])
Jun 16 11:07:03 proxmox1 kernel: block drbd0: receiver (re)started
Jun 16 11:07:03 proxmox1 kernel: block drbd0: conn( Unconnected -> WFConnection ) 
Jun 16 11:07:03 proxmox1 kernel: block drbd1: conn( StandAlone -> Unconnected ) 
Jun 16 11:07:03 proxmox1 kernel: block drbd1: Starting receiver thread (from drbd1_worker [3100])
Jun 16 11:07:03 proxmox1 kernel: block drbd1: receiver (re)started
Jun 16 11:07:03 proxmox1 kernel: block drbd1: conn( Unconnected -> WFConnection ) 
Jun 16 11:07:03 proxmox1 kernel: block drbd0: Handshake successful: Agreed network protocol version 91
Jun 16 11:07:03 proxmox1 kernel: block drbd1: Handshake successful: Agreed network protocol version 91
Jun 16 11:07:03 proxmox1 kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Jun 16 11:07:03 proxmox1 kernel: block drbd0: conn( WFConnection -> WFReportParams ) 
Jun 16 11:07:03 proxmox1 kernel: block drbd0: Starting asender thread (from drbd0_receiver [3155])
Jun 16 11:07:03 proxmox1 kernel: block drbd0: data-integrity-alg: <not-used>
Jun 16 11:07:03 proxmox1 kernel: block drbd0: drbd_sync_handshake:
Jun 16 11:07:03 proxmox1 kernel: block drbd0: self 1D92E2542B92FB98:0000000000000000:32E0C65B5FE1F931:F475C4D29D7AA246 bits:3470336 flags:0
Jun 16 11:07:03 proxmox1 kernel: block drbd0: peer 8351FF986991EC6F:1D92E2542B92FB98:3D34032FE00A0DA3:32E0C65B5FE1F931 bits:2057440 flags:0
Jun 16 11:07:03 proxmox1 kernel: block drbd0: uuid_compare()=-1 by rule 50
Jun 16 11:07:03 proxmox1 kernel: block drbd0: Becoming sync target due to disk states.
Jun 16 11:07:03 proxmox1 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
Jun 16 11:07:03 proxmox1 kernel: block drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
Jun 16 11:07:03 proxmox1 kernel: block drbd1: conn( WFConnection -> WFReportParams ) 
Jun 16 11:07:03 proxmox1 kernel: block drbd1: Starting asender thread (from drbd1_receiver [3160])
Jun 16 11:07:03 proxmox1 kernel: block drbd1: data-integrity-alg: <not-used>
Jun 16 11:07:03 proxmox1 kernel: block drbd1: drbd_sync_handshake:
Jun 16 11:07:03 proxmox1 kernel: block drbd1: self 02CCC78D46DED73A:0000000000000000:F1EB684061ECFD4F:B9A3251BC9BDBA4E bits:1491968 flags:0
Jun 16 11:07:03 proxmox1 kernel: block drbd1: peer 27CF7E0E3A355763:02CCC78D46DED73A:1CA78F580330B48D:F1EB684061ECFD4F bits:185496 flags:0
Jun 16 11:07:03 proxmox1 kernel: block drbd1: uuid_compare()=-1 by rule 50
Jun 16 11:07:03 proxmox1 kernel: block drbd1: Becoming sync target due to disk states.
Jun 16 11:07:03 proxmox1 kernel: block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
Jun 16 11:07:03 proxmox1 kernel: block drbd0: role( Secondary -> Primary ) 
Jun 16 11:07:03 proxmox1 kernel: block drbd1: role( Secondary -> Primary ) 
Jun 16 11:07:03 proxmox1 /usr/sbin/cron[3224]: (CRON) INFO (pidfile fd = 3)
Jun 16 11:07:03 proxmox1 /usr/sbin/cron[3225]: (CRON) STARTUP (fork ok)
Jun 16 11:07:03 proxmox1 /usr/sbin/cron[3225]: (CRON) INFO (Running @reboot jobs)
Jun 16 11:07:03 proxmox1 /USR/SBIN/CRON[3234]: (root) CMD (test -x /usr/lib/atsar/atsadc && (LOGDIR=/var/log/atsar; CURDAY=`date +%d`; find $LOGDIR/atsa$CURDAY -mtime +2 -type f -exec rm {} \; 2> /dev/null; /usr/lib/atsar/atsadc $LOGDIR/atsa$CURDAY))
Jun 16 11:07:04 proxmox1 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID ) 
Jun 16 11:07:04 proxmox1 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Jun 16 11:07:04 proxmox1 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Jun 16 11:07:04 proxmox1 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) 
Jun 16 11:07:04 proxmox1 kernel: block drbd0: Began resync as SyncTarget (will sync 13881344 KB [3470336 bits set]).
Jun 16 11:07:04 proxmox1 kernel: block drbd1: conn( WFBitMapT -> WFSyncUUID ) 
Jun 16 11:07:04 proxmox1 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1
Jun 16 11:07:04 proxmox1 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
Jun 16 11:07:04 proxmox1 kernel: block drbd1: conn( WFSyncUUID -> SyncTarget ) 
Jun 16 11:07:04 proxmox1 kernel: block drbd1: Began resync as SyncTarget (will sync 5967872 KB [1491968 bits set]).
Jun 16 11:07:06 proxmox1 proxwww[3281]: Starting new child 3281
Jun 16 11:07:06 proxmox1 proxwww[3282]: Starting new child 3282
Jun 16 11:07:20 proxmox1 apcupsd[3077]: apcupsd 3.14.4 (18 May 2008) debian startup succeeded
Jun 16 11:07:20 proxmox1 apcupsd[3077]: NIS server startup succeeded
Jun 16 11:08:01 proxmox1 pvemirror[2981]: starting cluster syncronization
Jun 16 11:08:11 proxmox1 pvemirror[2981]: syncing vzlist from '192.168.7.1' failed: 500 read timeout
Jun 16 11:08:12 proxmox1 pvemirror[2981]: syncing templates
Jun 16 11:08:12 proxmox1 pvemirror[2981]: cluster syncronization finished (10.67 seconds (files 0.00, config 0.00))
Jun 16 11:08:54 proxmox1 kernel: block drbd1: Resync done (total 109 sec; paused 0 sec; 54748 K/sec)
Jun 16 11:08:54 proxmox1 kernel: block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) 
Jun 16 11:08:54 proxmox1 kernel: block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
Jun 16 11:08:54 proxmox1 kernel: block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 0 (0x0)

The rest of syslog is pvemirror sync stuff and cron jobs.
 
Last edited by a moderator:
I made a test today by turning UPS off and on. The machines fail to start with the following error:

Code:
Starting Qemu VM 102
Volume group "drbdvg-1" not found
command '/sbin/vgchange -aly drbdvg-1' failed with exit code 5

So proxmox tries to start the VM's too early. VM's on local storage start without a problem.
Could also someone point me where those types of messages are logged, because I couldn't find them in either /var/log/syslog and /var/log/messages. I had to be on the console to read those errors, which makes remote troubleshooting difficult.
 
Such errors should be in syslog (if not it is a bug).
They are not logged in syslog.

Code:
proxmox1:~# pveversion -v
pve-manager: 1.5-5 (pve-manager/1.5/4627)
running kernel: 2.6.32-1-pve
proxmox-ve-2.6.32: 1.5-4
pve-kernel-2.6.32-1-pve: 2.6.32-4
pve-kernel-2.6.24-10-pve: 2.6.24-21
pve-kernel-2.6.24-8-pve: 2.6.24-16
qemu-server: 1.1-11
pve-firmware: 1.0-3
libpve-storage-perl: 1.0-8
vncterm: 0.9-2
vzctl: 3.0.23-1pve7
vzdump: 1.2-5
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.11.1-2
ksm-control-daemon: 1.0-2
 
Without modification, pve starts before drbd (hence before lvm is up). "Start at boot" has no storage available (VMs) to boot, therefore nothing starts.

Below run on both drbd members causes drbd to start before pve and shutdown after pve.

Code:
update-rc.d -f drbd remove
update-rc.d drbd defaults 19 32
Needs more testing, but gets me past the immediate problem. Now VMs do boot at start.




I made a test today by turning UPS off and on. The machines fail to start with the following error:

Code:
Starting Qemu VM 102
Volume group "drbdvg-1" not found
command '/sbin/vgchange -aly drbdvg-1' failed with exit code 5

So proxmox tries to start the VM's too early. VM's on local storage start without a problem.
Could also someone point me where those types of messages are logged, because I couldn't find them in either /var/log/syslog and /var/log/messages. I had to be on the console to read those errors, which makes remote troubleshooting difficult.