Live Migration causing Read-Only File System

benafischer94

New Member
Sep 9, 2014
12
0
1
I've been having some issues when doing live migration. I have a three-node cluster running on Dell PE R710s, all sharing a Drobo iSCSI LVM for hosting the VMs. I have been testing online migration and noticing that within about half an hour after a live migration the virtual machines (all Ubuntu 14.04 LTS Server) drop to read-only mode. So far the only way I have found to recover from this state is to log on to the web GUI, stop the machine, and then start it again. A restart will work for a short period of time and then drop back to read-only; once I do a full stop and start it works happily with no issues.

I'm pretty new to clustering as a whole, but this seems like a strange issue. I'm unsure whether it is a bug with Proxmox not pausing the VM properly, causing I/Os to go through so the guest thinks there is a disk issue, or something else entirely.

I'm fairly certain it's not an issue with the drives, since I only see the problem after a live migration: no trouble if a VM stays on one host for any amount of time, and no trouble with cold migrations.

So is there something I'm doing wrong? Or do I need to configure something on the VMs so they don't drop to a read-only file system?
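For reference, this is roughly how I've been confirming the state inside an affected guest before stopping it (just a sketch; the exact dmesg strings depend on the guest kernel and filesystem):

```shell
# Inside an affected guest: confirm the root FS really is mounted
# read-only and look for the filesystem error that triggered it.
# /proc/mounts always exists; the dmesg patterns are examples.
grep ' / ' /proc/mounts
dmesg | grep -iE 'ext4|aborted journal|i/o error' | tail -n 20
# 'mount -o remount,rw /' may bring it back briefly, but the underlying
# error usually forces it read-only again (matching the restart behaviour).
```
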

Code:
root@PM-9W199P1:~# pveversion -v
proxmox-ve-2.6.32: 3.2-126 (running kernel: 2.6.32-29-pve)
pve-manager: 3.2-4 (running version: 3.2-4/e24a91c1)
pve-kernel-2.6.32-29-pve: 2.6.32-126
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-16
pve-firmware: 1.1-3
libpve-common-perl: 3.0-18
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve5
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1

I have also confirmed that all cluster nodes are running the same version by comparing the above. It is identical on all three.

Any help would be greatly appreciated
 
Hi,
any hints in the logfiles of the nodes (/var/log/syslog)?
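For example, something like this pulls the migration-related events out of the noise (the sample lines only mimic the log format; run the grep against the real /var/log/syslog on each node):

```shell
# Filter migration-related events; the sample file below imitates the
# Proxmox syslog format so the pattern can be demonstrated end to end.
cat <<'EOF' > /tmp/syslog.sample
Sep 10 09:42:56 node1 pvedaemon[2645]: <root@pam> starting task UPID:node1:qmigrate:104:root@pam:
Sep 10 09:43:08 node1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:43:15 node2 qm[941717]: resume VM 104: UPID:node2:qmresume:104:root@pam:
EOF
grep -E 'qmigrate|qmstart|qmresume|EXT4-fs error' /tmp/syslog.sample
```
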

BTW, your version is outdated - I guess you don't have an enterprise subscription and haven't enabled the no-subscription repository?!
Code:
cat /etc/apt/sources.list.d/pve-no-subscription.list 
# PVE pve-no-subscription repository provided by proxmox.com, NOT recommended for production use
deb http://download.proxmox.com/debian wheezy pve-no-subscription
Udo
 
Hi Udo,

Here's the log ten minutes before the migration when all is ticking away happily:

Code:
Sep 10 09:30:18 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:30:18 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:30:38 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:30:38 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:30:58 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:30:58 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:31:02 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:31:15 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:31:15 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:31:18 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:31:18 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:31:25 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:31:38 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:31:38 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:31:40 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:31:43 PM-BMH89P1 pvedaemon[2644]: <root@pam> successful auth for user 'root@pam'
Sep 10 09:31:58 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:31:58 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:32:01 PM-BMH89P1 pvedaemon[197305]: stop VM 104: UPID:PM-BMH89P1:000302B9:0281A518:541052D1:qmstop:104:root@pam:
Sep 10 09:32:01 PM-BMH89P1 pvedaemon[2645]: <root@pam> starting task UPID:PM-BMH89P1:000302B9:0281A518:541052D1:qmstop:104:root@pam:
Sep 10 09:32:01 PM-BMH89P1 kernel: vmbr0: port 2(tap104i0) entering disabled state
Sep 10 09:32:01 PM-BMH89P1 kernel: vmbr0: port 2(tap104i0) entering disabled state
Sep 10 09:32:01 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:32:02 PM-BMH89P1 ntpd[2211]: Deleting interface #7 tap104i0, fe80::3084:cbff:fefc:b36c#123, interface stats: received=0, sent=0, dropped=0, active_time=419972 secs
Sep 10 09:32:02 PM-BMH89P1 ntpd[2211]: peers refreshed
Sep 10 09:32:02 PM-BMH89P1 pvedaemon[2645]: <root@pam> end task UPID:PM-BMH89P1:000302B9:0281A518:541052D1:qmstop:104:root@pam: OK
Sep 10 09:32:10 PM-BMH89P1 pvedaemon[2644]: <root@pam> starting task UPID:PM-BMH89P1:000302CC:0281A8A3:541052DA:qmstart:104:root@pam:
Sep 10 09:32:10 PM-BMH89P1 pvedaemon[197324]: start VM 104: UPID:PM-BMH89P1:000302CC:0281A8A3:541052DA:qmstart:104:root@pam:
Sep 10 09:32:10 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:32:10 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:32:11 PM-BMH89P1 kernel: device tap104i0 entered promiscuous mode
Sep 10 09:32:11 PM-BMH89P1 kernel: vmbr0: port 2(tap104i0) entering forwarding state
Sep 10 09:32:11 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:32:11 PM-BMH89P1 pvedaemon[2644]: <root@pam> end task UPID:PM-BMH89P1:000302CC:0281A8A3:541052DA:qmstart:104:root@pam: OK
Sep 10 09:32:14 PM-BMH89P1 ntpd[2211]: Listen normally on 9 tap104i0 fe80::2066:8eff:fee8:edc UDP 123
Sep 10 09:32:14 PM-BMH89P1 ntpd[2211]: peers refreshed
Sep 10 09:32:15 PM-BMH89P1 pvedaemon[2645]: <root@pam> successful auth for user 'root@pam'
Sep 10 09:32:22 PM-BMH89P1 kernel: tap104i0: no IPv6 routers present
Sep 10 09:32:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:32:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:32:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:32:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:33:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:33:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:33:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:33:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:33:35 PM-BMH89P1 pmxcfs[2289]: [dcdb] notice: data verification successful
Sep 10 09:33:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:33:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:33:50 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:33:50 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:33:52 PM-BMH89P1 pvedaemon[2644]: <root@pam> successful auth for user 'root@pam'
Sep 10 09:34:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:34:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:34:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:34:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:34:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:34:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:35:03 PM-BMH89P1 init: Trying to re-exec init
Sep 10 09:35:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:35:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:35:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:35:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:35:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:35:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:36:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:36:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:36:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:36:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:36:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:36:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:37:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:37:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:37:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:37:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:37:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:37:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:38:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:38:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:38:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:38:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:38:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:38:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:39:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:39:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:39:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:39:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:39:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:39:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:40:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:40:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:40:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:40:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through

Initial Host after the migration:
Code:
Sep 10 09:40:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:40:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:40:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:40:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:41:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:41:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:41:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:41:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:41:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:41:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:42:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:42:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:42:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:42:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:42:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:42:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:42:56 PM-BMH89P1 pvedaemon[2645]: <root@pam> starting task UPID:PM-BMH89P1:00031879:0282A4F1:54105560:qmigrate:104:root@pam:
Sep 10 09:42:57 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:42:58 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:43:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:43:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:43:15 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:43:15 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:43:15 PM-BMH89P1 kernel: vmbr0: port 2(tap104i0) entering disabled state
Sep 10 09:43:15 PM-BMH89P1 kernel: vmbr0: port 2(tap104i0) entering disabled state
Sep 10 09:43:15 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:43:15 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:43:16 PM-BMH89P1 ntpd[2211]: Deleting interface #9 tap104i0, fe80::2066:8eff:fee8:edc#123, interface stats: received=0, sent=0, dropped=0, active_time=662 secs
Sep 10 09:43:16 PM-BMH89P1 ntpd[2211]: peers refreshed
Sep 10 09:43:18 PM-BMH89P1 pvedaemon[2645]: <root@pam> end task UPID:PM-BMH89P1:00031879:0282A4F1:54105560:qmigrate:104:root@pam: OK
Sep 10 09:43:25 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:43:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:43:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:43:44 PM-BMH89P1 rrdcached[2267]: flushing old values
Sep 10 09:43:44 PM-BMH89P1 rrdcached[2267]: rotating journals
Sep 10 09:43:44 PM-BMH89P1 rrdcached[2267]: started new journal /var/lib/rrdcached/journal/rrd.journal.1410356624.297883
Sep 10 09:43:44 PM-BMH89P1 rrdcached[2267]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1410349424.297701
Sep 10 09:43:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:43:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:44:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:44:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:44:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:44:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:44:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:44:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:45:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:45:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:45:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:45:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:45:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:45:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:46:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:46:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:46:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:46:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:46:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:46:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:47:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:47:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:47:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:47:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:47:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:47:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:48:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:48:08 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:48:20 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:48:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:48:28 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:48:29 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:48:30 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:48:35 PM-BMH89P1 pmxcfs[2289]: [status] notice: received log
Sep 10 09:48:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:48:48 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through

New host after migration:
Code:
Sep 10 09:40:58 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:41:18 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:41:18 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:41:38 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:41:38 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:41:58 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:41:58 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:42:18 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:42:18 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:42:38 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:42:38 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:42:56 PM-9W199P1 pmxcfs[171297]: [status] notice: received log
Sep 10 09:42:57 PM-9W199P1 qm[941637]: <root@pam> starting task UPID:PM-9W199P1:000E5E46:049ECB68:54105561:qmstart:104:root@pam:
Sep 10 09:42:57 PM-9W199P1 qm[941638]: start VM 104: UPID:PM-9W199P1:000E5E46:049ECB68:54105561:qmstart:104:root@pam:
Sep 10 09:42:57 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:42:57 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:42:58 PM-9W199P1 kernel: device tap104i0 entered promiscuous mode
Sep 10 09:42:58 PM-9W199P1 kernel: vmbr0: port 3(tap104i0) entering forwarding state
Sep 10 09:42:58 PM-9W199P1 qm[941637]: <root@pam> end task UPID:PM-9W199P1:000E5E46:049ECB68:54105561:qmstart:104:root@pam: OK
Sep 10 09:43:02 PM-9W199P1 ntpd[2373]: Listen normally on 13 tap104i0 fe80::6888:d4ff:fe6d:c49 UDP 123
Sep 10 09:43:02 PM-9W199P1 ntpd[2373]: peers refreshed
Sep 10 09:43:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:43:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:43:09 PM-9W199P1 kernel: tap104i0: no IPv6 routers present
Sep 10 09:43:15 PM-9W199P1 qm[941716]: <root@pam> starting task UPID:PM-9W199P1:000E5E95:049ED24B:54105573:qmresume:104:root@pam:
Sep 10 09:43:15 PM-9W199P1 qm[941717]: resume VM 104: UPID:PM-9W199P1:000E5E95:049ED24B:54105573:qmresume:104:root@pam:
Sep 10 09:43:15 PM-9W199P1 qm[941716]: <root@pam> end task UPID:PM-9W199P1:000E5E95:049ED24B:54105573:qmresume:104:root@pam: OK
Sep 10 09:43:15 PM-9W199P1 pvedaemon[713987]: <root@pam> end task UPID:PM-9W199P1:000E4649:049DF5BD:5410533E:vncproxy:104:root@pam: OK
Sep 10 09:43:15 PM-9W199P1 pvedaemon[606979]: <root@pam> starting task UPID:PM-9W199P1:000E5E96:049ED288:54105573:vncproxy:104:root@pam:
Sep 10 09:43:15 PM-9W199P1 pvedaemon[941718]: starting vnc proxy UPID:PM-9W199P1:000E5E96:049ED288:54105573:vncproxy:104:root@pam:
Sep 10 09:43:18 PM-9W199P1 pmxcfs[171297]: [status] notice: received log
Sep 10 09:43:25 PM-9W199P1 pvedaemon[941718]: command '/bin/nc -l -p 5900 -w 10 -c '/usr/sbin/qm vncproxy 104 2>/dev/null'' failed: exit code 1
Sep 10 09:43:25 PM-9W199P1 pvedaemon[606979]: <root@pam> end task UPID:PM-9W199P1:000E5E96:049ED288:54105573:vncproxy:104:root@pam: command '/bin/nc -l -p 5900 -w 10 -c '/usr/sbin/qm vncproxy 104 2>/dev/null'' failed: exit code 1
Sep 10 09:43:28 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:43:28 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:43:48 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:43:48 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:44:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:44:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:44:28 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:44:28 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:44:48 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:44:48 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:45:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:45:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:45:28 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:45:28 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:45:48 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:45:48 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:46:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:46:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:46:28 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:46:28 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:46:48 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:46:48 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:47:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:47:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:47:28 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:47:28 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:47:48 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:47:48 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:48:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:48:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:48:20 PM-9W199P1 pvedaemon[713987]: <root@pam> successful auth for user 'root@pam'
Sep 10 09:48:28 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:48:28 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:48:29 PM-9W199P1 pvedaemon[942296]: starting vnc proxy UPID:PM-9W199P1:000E60D8:049F4D35:541056AD:vncproxy:104:root@pam:
Sep 10 09:48:29 PM-9W199P1 pvedaemon[606979]: <root@pam> starting task UPID:PM-9W199P1:000E60D8:049F4D35:541056AD:vncproxy:104:root@pam:
Sep 10 09:48:30 PM-9W199P1 pvedaemon[713987]: <root@pam> successful auth for user 'root@pam'
Sep 10 09:48:35 PM-9W199P1 pvedaemon[606979]: <root@pam> successful auth for user 'root@pam'
Sep 10 09:48:48 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:48:48 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:49:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:49:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:49:28 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:49:28 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:49:31 PM-9W199P1 fenced[171063]: fencing node PM-CR8B9P1 still retrying
Sep 10 09:49:48 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:49:48 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Sep 10 09:50:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:50:08 PM-9W199P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through

As soon as I notice it go into read-only I will post the output again. I'm assuming the sdb errors are nothing to worry about, since the VMs aren't located on any local disks but rather on the iSCSI LVM.
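For what it's worth, here is how I tried to check what sdb actually is; sd devices cover iSCSI LUNs as well as local disks, so the /sys path behind the device shows which controller or transport it hangs off (a sketch only, device names differ per node):

```shell
# Map each sd* block device to the hardware path behind it: an iSCSI LUN
# resolves through an iSCSI session/host path, a local disk through the
# SAS/SATA controller. Prints nothing if no sd* devices exist.
for d in /sys/block/sd*; do
  [ -e "$d" ] || continue          # glob didn't match anything
  echo "$(basename "$d") -> $(readlink -f "$d/device")"
done
```
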

As for not using the other repo: I stayed away because of the big disclaimer about not using it in production, since that is exactly what we want to do, but at this time a subscription isn't possible...

Thanks for the help!
 
Hi Udo,

Here's the log ten minutes before the migration when all is ticking away happily:

Code:
Sep 10 09:30:18 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Got wrong page
Sep 10 09:30:18 PM-BMH89P1 kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
"Got wrong page" does not look good to me!
As soon as I notice it go into read-only I will post the output again. I'm assuming the sdb errors are nothing to worry about, since the VMs aren't located on any local disks but rather on the iSCSI LVM.
Also, an iSCSI disk appears to the system as a normal disk like sdX.
Check with
Code:
pvs
vgs
As for not using the other repo: I stayed away because of the big disclaimer about not using it in production, since that is exactly what we want to do, but at this time a subscription isn't possible...
And what do you think you are running?? An outdated version which was also in the no-subscription repository - but some time ago...

I would not say that you are safer with the version from the ISO...

Udo
 
And what do you think you are running?? An outdated version which was also in the no-subscription repository - but some time ago...

I would not say that you are safer with the version from the ISO...

That comes from my misunderstanding during the initial setup, then. I was under the assumption that the ISO releases were more stable but a version behind the subscription version, and that the other repo was more of a nightly build.

And for the iSCSI it was again a misunderstanding of how the disks are presented to the system, as this is my first outing with iSCSI...

After digging some more I may have found my issue. It appears that Drobos don't play nicely with LVM. So back to the drawing board for shared storage.

Thanks again for all the help!
 
After digging some more I may have found my issue. It appears that Drobos don't play nicely with LVM. So back to the drawing board for shared storage.
Hi,
hmm, I don't think that is the right reason. An iSCSI device should work with any kind of data. What is the difference whether I write data for a filesystem or for a logical volume?
Block size/syncing/caching?!

Have you tested your network connection? Is the MTU the same on both systems?
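e.g. something like this (only a sketch; the ping target is a placeholder):

```shell
# Show the MTU of every interface on the node; compare across nodes and
# against the switch and Drobo settings. Reads /sys, no extra tools needed.
for i in /sys/class/net/*; do
  printf '%s mtu %s\n' "$(basename "$i")" "$(cat "$i/mtu")"
done
# Then verify the path end-to-end with a non-fragmenting ping sized for
# the MTU (1472 = 1500 - 28 bytes IP/ICMP headers; 8972 for a 9000 jumbo
# MTU). Replace DROBO_IP with the target's address:
# ping -M do -s 1472 -c 3 DROBO_IP
```
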

Udo
 
From some of the Drobo forums I found, it seems that something in their firmware doesn't play well with LVM and is known to cause corruption when used. I checked the MTU on the access switch the Drobo is connected to, on the access switch the servers are connected to, and on the core between them, and they are all the same. No network issues are being reported and everything else seems to be working properly.

I also have another machine running some DBs against another Drobo with an ext3 FS, and it hasn't had any issues; that server is connected to the same switch as the Proxmox cluster, and its Drobo is connected to the same switch as the other one. So it doesn't seem to be a networking issue. I may reach out to Data Robotics to see if they offer any information on the issue between LVM and Drobos.