BUG: GFS2 issue - kernel panic when deleting on a fresh new filesystem.

Code:
root@34compiler:~# mount -t gfs2 /dev/vdb1 /gfs/
root@34compiler:~# mkfs.gfs2 -p lock_nolock -t test:testfs -j 1 /dev/vdb1 
It appears to contain an existing filesystem (gfs2)
This will destroy any data on /dev/vdb1
Are you sure you want to proceed? [y/n]y
Device:                    /dev/vdb1
Block size:                4096
Device size:               32.00 GB (8388347 blocks)
Filesystem size:           32.00 GB (8388346 blocks)
Journals:                  1
Resource groups:           129
Locking protocol:          "lock_nolock"
Lock table:                "test:testfs"
UUID:                      55618364-f327-8437-fdc5-2ad07bcba095
root@34compiler:~# umount /gfs 
root@34compiler:~# mount -t gfs2 /dev/vdb1 /gfs/
root@34compiler:~# cd /gfs/
root@34compiler:/gfs# ls
root@34compiler:/gfs# touch test
root@34compiler:/gfs# rm test 
root@34compiler:/gfs# ls
root@34compiler:/gfs# dd if=/dev/zero of=test.file count=1 bs=100M
1+0 records in
1+0 records out
104857600 bytes (105 MB) copied, 0.855919 s, 123 MB/s
root@34compiler:/gfs# ls
test.file
root@34compiler:/gfs# rm test.file 
root@34compiler:/gfs# ls
root@34compiler:/gfs# 
root@34compiler:/gfs# uname -a
Linux 34compiler 3.10.0-11-pve #1 SMP Tue Jul 21 08:59:46 CEST 2015 x86_64 GNU/Linux
Saw it, updated to the latest and tried again. It still works.

I use a VM with a virtual disk.
But now I also tried your way with a zeroed file, and that works too...
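For anyone who wants to repeat the "zeroed file" variant, a minimal sketch looks like this (the image path, size and loop device are examples, not taken from this thread):
Code:
# create a zeroed backing file and expose it as a block device
dd if=/dev/zero of=/root/gfs2-test.img bs=1M count=2048
losetup /dev/loop0 /root/gfs2-test.img
# format and mount it like a real disk, then repeat the touch/rm test
mkfs.gfs2 -p lock_nolock -t test:testfs -j 1 /dev/loop0
mount -t gfs2 /dev/loop0 /mnt
# cleanup afterwards
umount /mnt
losetup -d /dev/loop0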
 
I don't know what to say. I will try to figure out what's wrong. And I will try with a real device...

STFU.
 
Could the problem be inherent in the pool itself, in such a way that it needs to be recreated with the newer gfs-tools that come with the new kernel?
 
Haha, sorry, I was a bit confused, no worries... :D

Yeah, I have to say that I used my self-compiled gfs2-utils 3.1.8 to test it now. On the new install I use gfs2-utils 3.1.3 (the one available from our repo) and test with that, but if I remember correctly I already tested it a week ago with 3.1.3 on the 3.10 kernel.
 
So, 2.6.32 is still confirmed broken; it doesn't work.
I updated it to 3.10 and installed the gfs2-utils from the Proxmox repo, version 3.1.3
Code:
root@34a:/gfs# gfs2_edit -v
gfs2_edit version master (built Mar 15 2013 08:54:07)
and tried it again on a virtual disk; everything works without problems... I tried formatting, mounting, writing and deleting.
No errors, everything as expected.
 

I've tried again with a real device (I used a 32 GB USB pen) and the issue doesn't show up.
But with the zeroed file it is still there. I've delayed the printk messages and this is the panic:

[attached photo of the panic: 20150804_131534.jpg]

Seems solved with a real device. I've already mounted the production FS without issues. I'm still a bit scared...
 
I could finally reproduce your error!! I used the old gfs-utils again to make a GFS2 partition on a zeroed file and got the panic.
I think the older gfs-utils have some problem with a file-backed filesystem...
 

Is it possible to get a more up-to-date version of gfs-utils? Version 3.1.3 is full of bugs! I saw many errors in the locking protocol when trying to double-mount the filesystem (it locked the whole cluster!!!).
Maybe I can downgrade to 3.1.0. I just need to figure out how...
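If a downgrade is really wanted, the usual apt way is to install an explicit version; the exact version string below is only an example, check what apt-cache policy lists for your repos:
Code:
# see which versions apt knows about
apt-cache policy gfs2-utils
# install a specific (older) version explicitly
apt-get install gfs2-utils=3.1.0-1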

Ty for the answer.
 
Hello, I wanted to test the latest gfs2-utils package from git, so I built it.
Local locks are OK, but I could not mount with DLM; the server just hangs...

Anybody had a problem like this?

Code:
root@node01:~# pveversion
pve-manager/3.4-9/4b51d87a (running kernel: 3.10.0-11-pve)
root@node01:~# aptitude show gfs2-utils
Package: gfs2-utils
New: yes
State: installed
Automatically installed: no
Version: 3.1.8-1
Priority: optional
Section: admin
Maintainer: Proxmox Support Team

Architecture: amd64
Uncompressed Size: 886 k
Depends: libblkid1 (>= 2.17.2), libc6 (>= 2.10), libncurses5 (>= 5.5-5~), libtinfo5, zlib1g (>= 1:1.2.3.3), psmisc, python, corosync
Description: Global file system 2 tools
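For reference, building gfs2-utils from git is basically the usual autotools flow; the clone URL and the install step below are assumptions, not something confirmed in this thread:
Code:
git clone https://pagure.io/gfs2-utils.git   # assumed upstream location
cd gfs2-utils
./autogen.sh
./configure
make
make install        # or build a proper .deb instead, to keep dpkg consistent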
 
What does the log say about dlm when you're trying to mount it?
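For example, something along these lines usually pulls out the relevant entries (the paths are typical for this setup, not confirmed here):
Code:
dmesg | grep -i dlm
grep -i dlm /var/log/syslog /var/log/cluster/*.log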
I am not sure which log I need to look at. Here are the parts that I think are relevant.

Boot of the node:
/var/log/boot
Code:
Sat Aug 15 20:00:14 2015: Starting cluster: 
Sat Aug 15 20:00:14 2015:    Checking if cluster has been disabled at boot... [  OK  ]
Sat Aug 15 20:00:14 2015:    Checking Network Manager... [  OK  ]
Sat Aug 15 20:00:14 2015:    Global setup... [  OK  ]
Sat Aug 15 20:00:14 2015:    Loading kernel modules... [  OK  ]
Sat Aug 15 20:00:14 2015:    Mounting configfs... [  OK  ]
Sat Aug 15 20:00:14 2015:    Starting cman... [  OK  ]
Sat Aug 15 20:00:19 2015:    Waiting for quorum... [  OK  ]
Sat Aug 15 20:00:19 2015:    Starting fenced... [  OK  ]
Sat Aug 15 20:00:19 2015:    Starting dlm_controld... [  OK  ]
Sat Aug 15 20:00:20 2015:    Tuning DLM kernel config... [  OK  ]
Sat Aug 15 20:00:20 2015:    Unfencing self... [  OK  ]
Sat Aug 15 20:00:20 2015:    Joining fence domain... [  OK  ]
/var/log/syslog
Code:
Aug 15 20:00:14 node01 kernel: [   13.379420] DLM installed
Aug 15 20:00:19 node01 fenced[3550]: fenced 1364188437 started
Aug 15 20:00:19 node01 dlm_controld[3565]: dlm_controld 1364188437 started...
Aug 15 20:00:21 node01 kernel: [   20.448509] dlm: Using TCP for communications
Aug 15 20:00:21 node01 kernel: [   20.479404] dlm: connecting to 2
cluster/corosync.log
Code:
Aug 15 20:00:15 corosync [CPG   ] chosen downlist: sender r(0) ip(10.0.0.101) ; members(old:0 left:0)
Aug 15 20:00:15 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Aug 15 20:00:15 corosync [CLM   ] CLM CONFIGURATION CHANGE
Aug 15 20:00:15 corosync [CLM   ] New Configuration:
Aug 15 20:00:15 corosync [CLM   ]       r(0) ip(10.0.0.101) 
Aug 15 20:00:15 corosync [CLM   ] Members Left:
Aug 15 20:00:15 corosync [CLM   ] Members Joined:
Aug 15 20:00:15 corosync [CLM   ] CLM CONFIGURATION CHANGE
Aug 15 20:00:15 corosync [CLM   ] New Configuration:
Aug 15 20:00:15 corosync [CLM   ]       r(0) ip(10.0.0.101) 
Aug 15 20:00:15 corosync [CLM   ]       r(0) ip(10.0.0.102) 
Aug 15 20:00:15 corosync [CLM   ]       r(0) ip(10.0.0.103) 
Aug 15 20:00:15 corosync [CLM   ] Members Left:
Aug 15 20:00:15 corosync [CLM   ] Members Joined:
Aug 15 20:00:15 corosync [CLM   ]       r(0) ip(10.0.0.102) 
Aug 15 20:00:15 corosync [CLM   ]       r(0) ip(10.0.0.103) 
Aug 15 20:00:15 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 15 20:00:15 corosync [CMAN  ] quorum regained, resuming activity
Aug 15 20:00:15 corosync [QUORUM] This node is within the primary component and will provide service.
Aug 15 20:00:15 corosync [QUORUM] Members[2]: 1 2
Aug 15 20:00:15 corosync [QUORUM] Members[2]: 1 2
Aug 15 20:00:15 corosync [QUORUM] Members[3]: 1 2 3
Aug 15 20:00:15 corosync [QUORUM] Members[3]: 1 2 3
Aug 15 20:00:15 corosync [CPG   ] chosen downlist: sender r(0) ip(10.0.0.102) ; members(old:2 left:0)
Aug 15 20:00:15 corosync [MAIN  ] Completed service synchronization, ready to provide service.
cluster/rgmanager.log
Code:
Aug 15 20:00:21 rgmanager I am node #1
Aug 15 20:00:24 rgmanager Services Initialized
Aug 15 20:00:25 rgmanager State change: Local UP
Aug 15 20:00:25 rgmanager State change: node02 UP
When the mount command is given:
/var/log/syslog
Code:
Aug 15 20:31:07 node01 kernel: [ 1867.510883] GFS2 installed
Aug 15 20:31:07 node01 kernel: [ 1867.511537] GFS2: fsid=cloud:ssdraid: Trying to join cluster "lock_dlm", "cloud:ssdraid"
Aug 15 20:31:07 node01 kernel: [ 1867.526602] GFS2: fsid=cloud:ssdraid: dlm lockspace ops not used
Aug 15 20:31:07 node01 kernel: [ 1867.526606] GFS2: fsid=cloud:ssdraid: Joined cluster. Now mounting FS...
The terminal where I issued the mount command is active and connected. The server is running (no panic or anything else).
But the command did not return to the prompt; it was just waiting...
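Before poking at it, one way to see where a hanging mount is blocked in the kernel is something like this (it assumes a single mount process and a kernel that exposes /proc/<pid>/stack; the sysrq dump lands in dmesg):
Code:
# where is the hanging mount sleeping in the kernel?
cat /proc/$(pidof mount)/stack
# or dump every blocked (uninterruptible) task to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 50
Note that attaching strace (ptrace) to a task blocked in a syscall can itself interrupt that syscall, which is probably why mount bails out with "Interrupted system call" further down instead of reporting the underlying problem.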

So I connected in another SSH session and tried to run strace on mount's PID, and strace stopped immediately with the following output:
Code:
open("/usr/share/locale/locale.alias", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=2570, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1a58b40000
read(3, "# Locale name alias data base.\n#"..., 4096) = 2570
read(3, "", 4096)                       = 0
close(3)                                = 0
munmap(0x7f1a58b40000, 4096)            = 0
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
write(2, "mount: Interrupted system call", 30) = 30
write(2, "\n", 1)                       = 1
exit_group(32)                          = ?
Process 12300 detached
And the first terminal returned to the prompt with:
Code:
mount: Interrupted system call
The only log entry I could find was:
/var/log/syslog
Code:
Aug 15 20:49:15 node01 kernel: [ 2956.539473] dlm: ssdraid: group leave failed -512 0
Aug 15 20:49:15 node01 dlm_controld[3565]: open "/sys/kernel/dlm/ssdraid/control" error -1 2
Aug 15 20:49:15 node01 dlm_controld[3565]: open "/sys/kernel/dlm/ssdraid/event_done" error -1 2
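Those "error -1 2" lines look like open() returning -1 with errno 2 (ENOENT), i.e. the ssdraid lockspace directory never showed up under sysfs. While the mount is hanging, the DLM state can be checked with something like this (commands from the cman/dlm_controld stack in use here):
Code:
# does the kernel have the lockspace?
ls /sys/kernel/dlm/
# what do the userspace daemons think?
dlm_tool ls
group_tool ls        # cman-era view of the fence/dlm/gfs groups
cman_tool status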
Also, I want to note that my company doesn't have any subscription. We are using Proxmox internally for testing purposes.
I just wanted to report this because this is how open source projects get better.
You don't have to do anything if you don't want to.

PS: there is a problem in your forum software.
My messages are saved as a single line if I don't use the HTML br tag; line breaks are not working.
I am using Linux with the Chromium v44 web browser.
 
Avoid using HTTPS and your problem is solved. This has been an issue for several years.
 
The build from git for PVE3.4 is currently broken; with the software stack available we cannot use gfs2-utils version 3.1.8.
We will roll it back to a working state. In the meantime, use the packages from our repos, which should be stable and working on the 3.10 kernel from PVE3.4.

The master branch, which targets PVE4 (currently in beta), works fine as far as I've tested it.
Thanks for notifying us.
 
I have been testing Proxmox 3.4 (no subscription) for a few months now with GFS2. I get the same issue.
kernel 2.6.32-37
gfs2-utils 3.1.3-1
After mounting the FS I can initially do a file delete, but only once. After that I get an I/O error, even with a simple ls of the directory.
syslog shows "kernel: GFS2: fsid= Number of entries corrupt in dirip i_entries g.offset"
It runs for weeks without any issues as long as nobody deletes anything in the FS.
I tried upgrading the kernel to pve-kernel-3.10.0-11-pve, but after reboot the FC firmware fails:
"kernel: [ 761.478011] qla2xxx [0000:08:03.0]-803b:3: Firmware ready **** FAILED ****."
I recreated the zone mapping on the switch - no effect. I downgraded the kernel back to 2.6.32-37 and
everything went back to normal, with the same GFS2 bug.

I really think GFS2 is a practical way to implement HA. I hope this gets fixed soon.
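For what it's worth, once the FS gets into that state an offline, read-only check can at least show what is corrupted (the device path below is only an example, and the FS must be unmounted on all nodes first):
Code:
# check only, change nothing; run from one node with the FS unmounted everywhere
fsck.gfs2 -n /dev/mapper/vg_san-gfs2_lv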
 
GFS2 and HA are a bit of a problem, as GFS2 may introduce a single point of failure to your system, and such a failure and HA aren't good friends :)

Yes, AFAIK deleting isn't the real problem, "only" the trigger; writing causes the issue. If it were a simple fix it would be done already, but for now it looks like the newer OpenVZ kernel won't fix it anytime soon.

The 3.10 kernel works for sure with GFS2, but you'll lose OpenVZ container capability.
Also, GFS2 on PVE4 (currently in beta) works as expected.

Did you install the latest QLogic drivers? What does modinfo qla2xxx output?
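For reference, the loaded driver version and the firmware blobs it asks for can be checked like this (field names can vary a little between driver builds):
Code:
modinfo qla2xxx | grep -iE '^(version|firmware)'
# on Debian-based systems the firmware files normally live here
ls /lib/firmware/ | grep -i '^ql'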
 
