KVM losing disk drives

PretoX

Hi guys,

# pveversion -v
proxmox-ve: 4.4-76 (running kernel: 4.4.21-1-pve)
pve-manager: 4.4-2 (running version: 4.4-2/80259e05)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.4.35-1-pve: 4.4.35-76
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.13-2-pve: 4.4.13-58
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.2.8-1-pve: 4.2.8-41
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.4.10-1-pve: 4.4.10-54
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-84
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-70
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-9
pve-container: 1.0-89
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-2
lxcfs: 2.0.5-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
Sorry, I have no text logs yet, only screenshots.
My problem is that sometimes, once every 2-3 days, VMs crash with the messages shown in the attached screenshots:
1.png 2.png 3.png 4.png 5.png
First: we got messages about e1000 errors, so I changed the NICs to RTL ones and the errors disappeared.
Second: from what I can see, at least the swap drive was lost. I also see the CPU loaded with I/O wait before the VM panics, which also points to a drive issue; changing the scheduler away from cfq didn't help.
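For reference, this is roughly how the scheduler change was done inside the guest (sda is just an example device, and deadline is only one possible choice):

cat /sys/block/sda/queue/scheduler               # list available schedulers, current one shown in brackets
echo deadline > /sys/block/sda/queue/scheduler   # switch away from cfq at runtime

This change is not persistent; the elevator= kernel boot parameter would be needed to make it stick across reboots.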

And yesterday a FreeBSD VM crashed:
6.png
We are running on ZFS and using pve-zsync for backups. My thought was that pve-zsync was overloading the HW RAID, since we run backups over bonded interfaces at maximum backup speed, so I lowered it to 20 MB/s, but still got a crash.

I could suspect a CloudLinux 6.8/CentOS 6 kernel 2.6 issue, as I saw reports about kernel 2.6 misbehaving under KVM, but they are pretty old, several years by now, and marked as solved.

Please help.

I just thought: could changing the storage type from ZFS to qcow2 files help?
 
FreeBSD seems to be complaining that disk IO is taking too long.

On Linux I see the OOM killer trying to free up RAM, so that VM was likely doing lots of disk IO to swap. I have disabled swap on nearly every VM, because when they start using swap it causes IO issues and performance goes to crap. On Linux VMs I've added zram, and if I see OOM events I add RAM or change settings to prevent the excess usage. If your Proxmox host is using swap, that can cause all sorts of strange issues, especially for VMs whose RAM got swapped to disk. My software testers complained every Monday because the idle testing servers got swapped to disk over the weekend and response times were horrible. Removing swap eliminated all sorts of random performance issues, but I do have the luxury of most servers having 64 GB of RAM or more.
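Roughly, turning swap off inside a Linux guest comes down to the following; the sed pattern is just one way to comment out the fstab entry:

swapoff -a                                  # turn off all active swap devices immediately
sed -i '/\sswap\s/ s/^/#/' /etc/fstab       # comment out the swap line so it stays off after reboot
free -m                                     # confirm the Swap line now shows 0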

Looks like maybe your storage is not performing at the level required by your VMs.
Maybe the sync process is partly to blame, maybe ZFS needs tuning, or maybe your hardware is simply not fast enough.

What does pveperf output when you run it against your zfs storage?
 
What does pveperf output when you run it against your zfs storage?

Thank you for your reply. This is a brand-new Dell server. And no, my VMs do not use swap, although it is enabled.
RAID-6 + ZFS with an SSD cache

# pveperf /VM-DATA
CPU BOGOMIPS: 153621.60
REGEX/SECOND: 1841587
HD SIZE: 2794.25 GB (VM-DATA)
FSYNCS/SECOND: 2079.90

which is only slightly less than the SSD drives I have

# pveperf
CPU BOGOMIPS: 153621.60
REGEX/SECOND: 2037255
HD SIZE: 155.36 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND: 2244.64

Just got another VM crash; this time I have the full log. No backups were running during the crash. Part of the logs:
<4>[249920.223451] sd 2:0:1:0: [sdb] ABORT operation started
<4>[249920.223508] sd 2:0:1:0: ABORT operation failed.
<4>[249920.223617] sd 2:0:1:0: [sdb] DEVICE RESET operation started
<6>[249920.223978] scsi target2:0:1: control msgout: c.
<5>[249920.224039] scsi target2:0:1: has been reset
<4>[249920.224101] sd 2:0:1:0: DEVICE RESET operation complete.
<4>[249982.003452] sd 2:0:0:0: [sda] ABORT operation started
<4>[249982.003462] sd 2:0:0:0: ABORT operation failed.
<4>[249982.003699] sd 2:0:0:0: [sda] DEVICE RESET operation started
<4>[249982.003728] sd 2:0:0:0: DEVICE RESET operation complete.
<6>[249982.004219] scsi target2:0:0: control msgout: c.
<5>[249982.004307] scsi target2:0:0: has been reset
<4>[249982.004821] sd 2:0:0:0: [sda] BUS RESET operation started
<4>[249982.007173] sd 2:0:0:0: BUS RESET operation complete.
<4>[249982.007196] sym0: SCSI BUS reset detected.
<5>[249982.017121] sym0: SCSI BUS has been reset.
<4>[249982.018274] sym0: unknown interrupt(s) ignored, ISTAT=0x1 DSTAT=0x80 SIST=0x0
<4>[249992.012966] ------------[ cut here ]------------


I'll upload the full dump to CloudLinux support; maybe they can help. If the crashdump would help here, I can upload it too.
 
Maybe you can just post the vm.conf files.
agent: 1
balloon: 0
boot: cdn
bootdisk: scsi0
cores: 4
ide2: none,media=cdrom
memory: 32768
name: plesk2
net0: rtl8139=62:64:64:35:xx:xx,bridge=vmbr0
net1: rtl8139=3A:35:33:30:xx:xx,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
parent: Test
protection: 1
scsi0: VM-DATA:vm-102-disk-1,size=100G
scsi1: VM-DATA:vm-102-disk-2,size=200G
smbios1: uuid=31f1b79b-d7fe-4fc2-8baf-239acad853b9
sockets: 2
tablet: 0

I tried switching ballooning off, but got a crash again last night.

I also opened a support ticket with the CloudLinux support team, but I'm pretty sure it's a Proxmox+KVM issue...
 
It's not specific to btrfs; simply accessing the hard drive, even via dd, produces errors.
I just happen to have reads/writes on this btrfs file system, so it shows up as btrfs failing.
 
So the CloudLinux guys proposed a new kernel:

We have released a kernel that includes the needed patch, please consider updating and rebooting to start using it:

yum install kernel-2.6.32-673.26.1.lve1.4.21.el6 kmod-lve-1.4-21.el6 --enablerepo=cloudlinux-updates-testing

But it didn't help. For now I blame this: https://github.com/zfsonlinux/zfs/issues/4345
Testing limits for the ARC now.
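Roughly what that ARC testing looks like on the host; the 8 GiB figure is just an example value, not one taken from this setup:

echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max               # cap the ARC at 8 GiB at runtime
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf   # make the limit persistent
update-initramfs -u                                                    # rebuild the initramfs so it also applies at boot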
 
I recently converted a VM to VirtIO SCSI and it crashed during a vzdump backup. The KVM process vanished.

was this by chance with the 4.4.35-1 kernel?

@PretoX: so if I understand correctly, this is an issue with VMs using outdated kernels?
 
was this by chance with the 4.4.35-1 kernel?

@PretoX: so if I understand correctly, this is an issue with VMs using outdated kernels?

No, my last post was with a new test kernel built according to my CloudLinux support case. They made kernel updates based on our crashdumps

VM crashed again today on 4.4.35-2-pve
This was not happening on 4.3

# pveversion -v
proxmox-ve: 4.4-78 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-5 (running version: 4.4-5/c43015a5)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.13-2-pve: 4.4.13-58
pve-kernel-4.4.35-2-pve: 4.4.35-78
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.2.8-1-pve: 4.2.8-41
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.4.10-1-pve: 4.4.10-54
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-85
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-71
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-10
pve-container: 1.0-90
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.6-5
lxcfs: 2.0.5-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
 
No, my last post was with a new test kernel built according to my CloudLinux support case. They made kernel updates based on our crashdumps

the patch you linked has been included in vanilla linux since 4.3.. so you are talking about the guest kernel here?

VM crashed again today on 4.4.35-2-pve
This was not happening on 4.3

I assume you mean PVE 4.3 here? (this "closeness" of kernel and PVE version numbers can sometimes be confusing, sorry).

If so, I would suggest testing with an older qemu version (e.g., 2.6.x) to narrow down possible culprits.

Do you see anything in the host logs? Are you monitoring the memory and I/O situation on the host?
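
For the QEMU downgrade test, something along these lines should work on the host; the version string below is only a placeholder, use whatever apt-cache policy actually lists:

apt-cache policy pve-qemu-kvm                      # show the versions available from the configured repositories
apt-get install pve-qemu-kvm=<older-2.6.x-build>   # install a specific older build picked from that list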
 
the patch you linked has been included in vanilla linux since 4.3.. so you are talking about the guest kernel here?



I assume you mean PVE 4.3 here? (this "closeness" of kernel and PVE version numbers can sometimes be confusing, sorry).

If so, I would suggest testing with an older qemu version (e.g., 2.6.x) to narrow down possible culprits.

Do you see anything in the host logs? Are you monitoring the memory and I/O situation on the host?

Yes, sorry, the patch is for the guest CloudLinux kernel.

Yes, I do monitor RAM (about 90 GB of 190 GB is used), and IO delay sits at around 16% while the VM goes down.
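A rough sketch of the host-side monitoring, using standard tools (nothing specific to this box):

zpool iostat -v 5    # per-vdev IOPS and bandwidth, refreshed every 5 seconds
iostat -xm 5         # per-device utilisation and await times (from the sysstat package)
free -m              # host memory and swap usage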

As far as I remember the VM was still stable on Nov 23, 2016; I'm not sure which kernel was the latest at that time.

Here's the VM bootup log; during these messages the VM is completely stuck.
Feb 1 10:13:22 PM1-BNE2 pvedaemon[35464]: start VM 102: UPID:pM1-BNE2:00008A88:04249E8F:58912822:qmstart:102:pretox@pam:
Feb 1 10:13:22 PM1-BNE2 systemd[1]: Starting 102.scope.
Feb 1 10:13:22 PM1-BNE2 systemd[1]: Started 102.scope.
Feb 1 10:13:22 PM1-BNE2 kernel: [695059.736954] device tap102i0 entered promiscuous mode
Feb 1 10:13:22 PM1-BNE2 kernel: [695059.742538] vmbr0: port 28(tap102i0) entered forwarding state
Feb 1 10:13:22 PM1-BNE2 kernel: [695059.742561] vmbr0: port 28(tap102i0) entered forwarding state
Feb 1 10:13:23 PM1-BNE2 kernel: [695060.176056] device tap102i1 entered promiscuous mode
Feb 1 10:13:23 PM1-BNE2 kernel: [695060.181259] vmbr0: port 29(tap102i1) entered forwarding state
Feb 1 10:13:23 PM1-BNE2 kernel: [695060.181281] vmbr0: port 29(tap102i1) entered forwarding state
Feb 1 10:13:25 PM1-BNE2 kernel: [695062.317207] kvm: zapping shadow pages for mmio generation wraparound
Feb 1 10:13:25 PM1-BNE2 kernel: [695062.319280] kvm: zapping shadow pages for mmio generation wraparound
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.288111] kvm [35472]: vcpu0 unhandled rdmsr: 0xce
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.427009] kvm [35472]: vcpu1 unhandled rdmsr: 0xce
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.458670] kvm [35472]: vcpu2 unhandled rdmsr: 0xce
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.490319] kvm [35472]: vcpu3 unhandled rdmsr: 0xce
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.522057] kvm [35472]: vcpu4 unhandled rdmsr: 0xce
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.553692] kvm [35472]: vcpu5 unhandled rdmsr: 0xce
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.585286] kvm [35472]: vcpu6 unhandled rdmsr: 0xce
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.617008] kvm [35472]: vcpu7 unhandled rdmsr: 0xce
 
