BUG: GFS2 issue - kernel panic when deleting on a fresh new filesystem.

Code:
root@34compiler:~# mount -t gfs2 /dev/vdb1 /gfs/
root@34compiler:~# mkfs.gfs2 -p lock_nolock -t test:testfs -j 1 /dev/vdb1 
It appears to contain an existing filesystem (gfs2)
This will destroy any data on /dev/vdb1
Are you sure you want to proceed? [y/n]y
Device:                    /dev/vdb1
Block size:                4096
Device size:               32.00 GB (8388347 blocks)
Filesystem size:           32.00 GB (8388346 blocks)
Journals:                  1
Resource groups:           129
Locking protocol:          "lock_nolock"
Lock table:                "test:testfs"
UUID:                      55618364-f327-8437-fdc5-2ad07bcba095
root@34compiler:~# umount /gfs 
root@34compiler:~# mount -t gfs2 /dev/vdb1 /gfs/
root@34compiler:~# cd /gfs/
root@34compiler:/gfs# ls
root@34compiler:/gfs# touch test
root@34compiler:/gfs# rm test 
root@34compiler:/gfs# ls
root@34compiler:/gfs# dd if=/dev/zero of=test.file count=1 bs=100M
1+0 records in
1+0 records out
104857600 bytes (105 MB) copied, 0.855919 s, 123 MB/s
root@34compiler:/gfs# ls
test.file
root@34compiler:/gfs# rm test.file 
root@34compiler:/gfs# ls
root@34compiler:/gfs# 
root@34compiler:/gfs# uname -a
Linux 34compiler 3.10.0-11-pve #1 SMP Tue Jul 21 08:59:46 CEST 2015 x86_64 GNU/Linux
Saw it, updated to the latest and tried again. It still works.

I use a VM with a virtual disk.
But now I also tried your way with a zeroed file, and that works too...
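For anyone who wants to repeat the "zeroed file" variant, a minimal sketch looks like this (the image path, size and loop device are examples, not taken from this thread):
Code:
# create a zeroed backing file and expose it as a block device
dd if=/dev/zero of=/root/gfs2-test.img bs=1M count=2048
losetup /dev/loop0 /root/gfs2-test.img
# format and mount it like a real disk, then repeat the touch/rm test
mkfs.gfs2 -p lock_nolock -t test:testfs -j 1 /dev/loop0
mount -t gfs2 /dev/loop0 /mnt
# cleanup afterwards
umount /mnt
losetup -d /dev/loop0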
 
I don't know what to say. I will try to figure out what's wrong. And I will try with a real device...

STFU.
 
Could the problem be inherent in the pool itself, in such a way that it needs to be recreated with the newer gfs-tools that come with the new kernel?
 
Haha, sorry, I was a bit confused, no worries... :D

Yeah, I have to say that I used my self-compiled gfs2-utils 3.1.8 to test it now. On the new install I use gfs2-utils 3.1.3 (the one available from our repo) and test with that, but if I remember correctly I already tested it a week ago with 3.1.3 on the 3.10 kernel.
 
So, 2.6.32 is still confirmed broken; it doesn't work.
I updated it to 3.10 and installed the gfs2-utils from the Proxmox repo, version 3.1.3
Code:
root@34a:/gfs# gfs2_edit -v
gfs2_edit version master (built Mar 15 2013 08:54:07)
and tried it again on a virtual disk; everything works without problems... I tried formatting, mounting, writing and deleting.
No errors, everything as expected.
 

I've tried again with a real device (I used a 32 GB USB pen) and the issue doesn't show up.
But with the zeroed file it is still there. I've delayed the printk messages and this is the panic:

[attached photo of the panic: 20150804_131534.jpg]

Seems solved with a real device. I've already mounted the production FS without issues. I'm still a bit scared...
 
I could finally reproduce your error!! I used the old gfs-utils again to make a GFS2 partition on a zeroed file and got the panic.
I think the older gfs-utils have some problem with a file-backed filesystem...
 

Is it possible to get a more up-to-date version of gfs-utils? Version 3.1.3 is full of bugs! I saw many errors in the locking protocol when trying to double-mount the filesystem (it locked the whole cluster!!!).
Maybe I can downgrade to 3.1.0. I just need to figure out how...
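If a downgrade is really wanted, the usual apt way is to install an explicit version; the exact version string below is only an example, check what apt-cache policy lists for your repos:
Code:
# see which versions apt knows about
apt-cache policy gfs2-utils
# install a specific (older) version explicitly
apt-get install gfs2-utils=3.1.0-1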

Ty for the answer.
 
Hello, I wanted to test the latest gfs2-utils package from git, so I built it.
Local locks are OK, but I could not mount with DLM; the server just hangs...

Anybody had a problem like this?

Code:
root@node01:~# pveversion
pve-manager/3.4-9/4b51d87a (running kernel: 3.10.0-11-pve)
root@node01:~# aptitude show gfs2-utils
Package: gfs2-utils
New: yes
State: installed
Automatically installed: no
Version: 3.1.8-1
Priority: optional
Section: admin
Maintainer: Proxmox Support Team

Architecture: amd64
Uncompressed Size: 886 k
Depends: libblkid1 (>= 2.17.2), libc6 (>= 2.10), libncurses5 (>= 5.5-5~), libtinfo5, zlib1g (>= 1:1.2.3.3), psmisc, python, corosync
Description: Global file system 2 tools
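For reference, building gfs2-utils from git is basically the usual autotools flow; the clone URL and the install step below are assumptions, not something confirmed in this thread:
Code:
git clone https://pagure.io/gfs2-utils.git   # assumed upstream location
cd gfs2-utils
./autogen.sh
./configure
make
make install        # or build a proper .deb instead, to keep dpkg consistent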
 
What does the log say about dlm when you're trying to mount it?
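For example, something along these lines usually pulls out the relevant entries (the paths are typical for this setup, not confirmed here):
Code:
dmesg | grep -i dlm
grep -i dlm /var/log/syslog /var/log/cluster/*.log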
I am not sure which log I need to look at. Here are the parts that I think are relevant.

Boot of the node:
/var/log/boot
Code:
Sat Aug 15 20:00:14 2015: Starting cluster: 
Sat Aug 15 20:00:14 2015:    Checking if cluster has been disabled at boot... [  OK  ]
Sat Aug 15 20:00:14 2015:    Checking Network Manager... [  OK  ]
Sat Aug 15 20:00:14 2015:    Global setup... [  OK  ]
Sat Aug 15 20:00:14 2015:    Loading kernel modules... [  OK  ]
Sat Aug 15 20:00:14 2015:    Mounting configfs... [  OK  ]
Sat Aug 15 20:00:14 2015:    Starting cman... [  OK  ]
Sat Aug 15 20:00:19 2015:    Waiting for quorum... [  OK  ]
Sat Aug 15 20:00:19 2015:    Starting fenced... [  OK  ]
Sat Aug 15 20:00:19 2015:    Starting dlm_controld... [  OK  ]
Sat Aug 15 20:00:20 2015:    Tuning DLM kernel config... [  OK  ]
Sat Aug 15 20:00:20 2015:    Unfencing self... [  OK  ]
Sat Aug 15 20:00:20 2015:    Joining fence domain... [  OK  ]
/var/log/syslog
Code:
Aug 15 20:00:14 node01 kernel: [   13.379420] DLM installed
Aug 15 20:00:19 node01 fenced[3550]: fenced 1364188437 started
Aug 15 20:00:19 node01 dlm_controld[3565]: dlm_controld 1364188437 started...
Aug 15 20:00:21 node01 kernel: [   20.448509] dlm: Using TCP for communications
Aug 15 20:00:21 node01 kernel: [   20.479404] dlm: connecting to 2
cluster/corosync.log
Code:
Aug 15 20:00:15 corosync [CPG   ] chosen downlist: sender r(0) ip(10.0.0.101) ; members(old:0 left:0)
Aug 15 20:00:15 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Aug 15 20:00:15 corosync [CLM   ] CLM CONFIGURATION CHANGE
Aug 15 20:00:15 corosync [CLM   ] New Configuration:
Aug 15 20:00:15 corosync [CLM   ]       r(0) ip(10.0.0.101) 
Aug 15 20:00:15 corosync [CLM   ] Members Left:
Aug 15 20:00:15 corosync [CLM   ] Members Joined:
Aug 15 20:00:15 corosync [CLM   ] CLM CONFIGURATION CHANGE
Aug 15 20:00:15 corosync [CLM   ] New Configuration:
Aug 15 20:00:15 corosync [CLM   ]       r(0) ip(10.0.0.101) 
Aug 15 20:00:15 corosync [CLM   ]       r(0) ip(10.0.0.102) 
Aug 15 20:00:15 corosync [CLM   ]       r(0) ip(10.0.0.103) 
Aug 15 20:00:15 corosync [CLM   ] Members Left:
Aug 15 20:00:15 corosync [CLM   ] Members Joined:
Aug 15 20:00:15 corosync [CLM   ]       r(0) ip(10.0.0.102) 
Aug 15 20:00:15 corosync [CLM   ]       r(0) ip(10.0.0.103) 
Aug 15 20:00:15 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 15 20:00:15 corosync [CMAN  ] quorum regained, resuming activity
Aug 15 20:00:15 corosync [QUORUM] This node is within the primary component and will provide service.
Aug 15 20:00:15 corosync [QUORUM] Members[2]: 1 2
Aug 15 20:00:15 corosync [QUORUM] Members[2]: 1 2
Aug 15 20:00:15 corosync [QUORUM] Members[3]: 1 2 3
Aug 15 20:00:15 corosync [QUORUM] Members[3]: 1 2 3
Aug 15 20:00:15 corosync [CPG   ] chosen downlist: sender r(0) ip(10.0.0.102) ; members(old:2 left:0)
Aug 15 20:00:15 corosync [MAIN  ] Completed service synchronization, ready to provide service.
cluster/rgmanager.log
Code:
Aug 15 20:00:21 rgmanager I am node #1
Aug 15 20:00:24 rgmanager Services Initialized
Aug 15 20:00:25 rgmanager State change: Local UP
Aug 15 20:00:25 rgmanager State change: node02 UP
When the mount command is given:
/var/log/syslog
Code:
Aug 15 20:31:07 node01 kernel: [ 1867.510883] GFS2 installed
Aug 15 20:31:07 node01 kernel: [ 1867.511537] GFS2: fsid=cloud:ssdraid: Trying to join cluster "lock_dlm", "cloud:ssdraid"
Aug 15 20:31:07 node01 kernel: [ 1867.526602] GFS2: fsid=cloud:ssdraid: dlm lockspace ops not used
Aug 15 20:31:07 node01 kernel: [ 1867.526606] GFS2: fsid=cloud:ssdraid: Joined cluster. Now mounting FS...
The terminal where I issued the mount command is active and connected. The server is running (no panic or anything else).
But the command did not return to the prompt; it was just waiting...
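Before poking at it, one way to see where a hanging mount is blocked in the kernel is something like this (it assumes a single mount process and a kernel that exposes /proc/<pid>/stack; the sysrq dump lands in dmesg):
Code:
# where is the hanging mount sleeping in the kernel?
cat /proc/$(pidof mount)/stack
# or dump every blocked (uninterruptible) task to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 50
Note that attaching strace (ptrace) to a task blocked in a syscall can itself interrupt that syscall, which is probably why mount bails out with "Interrupted system call" further down instead of reporting the underlying problem.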

So I connected in another SSH session and tried to run strace on mount's PID, and strace stopped immediately with the following output:
Code:
open("/usr/share/locale/locale.alias", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=2570, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1a58b40000
read(3, "# Locale name alias data base.\n#"..., 4096) = 2570
read(3, "", 4096)                       = 0
close(3)                                = 0
munmap(0x7f1a58b40000, 4096)            = 0
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
write(2, "mount: Interrupted system call", 30) = 30
write(2, "\n", 1)                       = 1
exit_group(32)                          = ?
Process 12300 detached
And the first terminal returned to the prompt with:
Code:
mount: Interrupted system call
The only log entry I could find was:
/var/log/syslog
Code:
Aug 15 20:49:15 node01 kernel: [ 2956.539473] dlm: ssdraid: group leave failed -512 0
Aug 15 20:49:15 node01 dlm_controld[3565]: open "/sys/kernel/dlm/ssdraid/control" error -1 2
Aug 15 20:49:15 node01 dlm_controld[3565]: open "/sys/kernel/dlm/ssdraid/event_done" error -1 2
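Those "error -1 2" lines look like open() returning -1 with errno 2 (ENOENT), i.e. the ssdraid lockspace directory never showed up under sysfs. While the mount is hanging, the DLM state can be checked with something like this (commands from the cman/dlm_controld stack in use here):
Code:
# does the kernel have the lockspace?
ls /sys/kernel/dlm/
# what do the userspace daemons think?
dlm_tool ls
group_tool ls        # cman-era view of the fence/dlm/gfs groups
cman_tool status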
Also, I want to note that my company doesn't have any subscription. We are using Proxmox internally for testing purposes.
I just wanted to report this because this is how open source projects get better.
You don't have to do anything if you don't want to.

PS: there is a problem in your forum software.
My messages are saved as a single line if I don't use the HTML br tag; line breaks are not working.
I am using Linux with the Chromium v44 web browser.
 
Avoid using HTTPS and your problem is solved. This has been an issue for several years.
 
The build from git for PVE3.4 is currently broken; with the software stack available we cannot use gfs2-utils version 3.1.8.
We will roll it back to a working state. In the meantime, use the packages from our repos, which should be stable and working on the 3.10 kernel from PVE3.4.

The master branch, which targets PVE4 (currently in beta), works fine as far as I've tested it.
Thanks for notifying us.
 
I have been testing Proxmox 3.4 (no subscription) for a few months now with GFS2. I get the same issue.
kernel 2.6.32-37
gfs2-utils 3.1.3-1
After mounting the FS I can initially do a file delete, but only once. After that I get an I/O error, even with a simple ls of the directory.
syslog shows "kernel: GFS2: fsid= Number of entries corrupt in dirip i_entries g.offset"
It runs for weeks without any issues as long as nobody deletes anything in the FS.
I tried upgrading the kernel to pve-kernel-3.10.0-11-pve, but after reboot the FC firmware fails:
"kernel: [ 761.478011] qla2xxx [0000:08:03.0]-803b:3: Firmware ready **** FAILED ****."
I recreated the zone mapping on the switch - no effect. I downgraded the kernel back to 2.6.32-37 and
everything went back to normal, with the same GFS2 bug.

I really think GFS2 is a practical way to implement HA. I hope this gets fixed soon.
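For what it's worth, once the FS gets into that state an offline, read-only check can at least show what is corrupted (the device path below is only an example, and the FS must be unmounted on all nodes first):
Code:
# check only, change nothing; run from one node with the FS unmounted everywhere
fsck.gfs2 -n /dev/mapper/vg_san-gfs2_lv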
 
GFS2 and HA are a bit of a problem, as GFS2 may introduce a single point of failure to your system, and such a failure and HA aren't good friends :)

Yes, AFAIK deleting isn't the real problem, "only" the trigger; writing causes the issue. If it were a simple fix it would be done already, but for now it looks like the newer OpenVZ kernel won't fix it anytime soon.

The 3.10 kernel works for sure with GFS2, but you'll lose OpenVZ container capability.
Also, GFS2 on PVE4 (currently in beta) works as expected.

Did you install the latest QLogic drivers? What does modinfo qla2xxx output?
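For reference, the loaded driver version and the firmware blobs it asks for can be checked like this (field names can vary a little between driver builds):
Code:
modinfo qla2xxx | grep -iE '^(version|firmware)'
# on Debian-based systems the firmware files normally live here
ls /lib/firmware/ | grep -i '^ql'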
 
