pmxcfs serious bug

Faye · Jan 5, 2012

Hi,

Today I ran into a problem which took a lot of manual hackery to recover from. Hopefully you can fix it quickly and get the fix into the wild.

High-level steps:
create node1
create several containers and vms
create node2
cluster
fail containers and vms onto node2
restart node2
config filesystem will not mount on node2. Manually running pmxcfs shows that it is working with a .conf file which was migrated from the older node and says that the parent is not a directory.

The problem appears to be that, when recreating the config filesystem, pmxcfs performs an ordered walk of the tree by inode. The inodes for nodes/, node2/, pve.., qemu.. etc. are all greater in sequence than the inodes for container1, 2 etc., so the directories have not yet been created when their files are loaded. The hack is to manually find some free inodes in the sqlite db and move your directories there, fixing up parents as you go. Then copy the db to the other node(s) manually before restarting everything again.

# pveversion -v
pve-manager: 2.0-18 (pve-manager/2.0/16283a5a)
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 2.0-55
pve-kernel-2.6.32-6-pve: 2.6.32-55
lvm2: 2.02.88-2pve1
clvm: 2.02.88-2pve1
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-1
libqb: 0.6.0-1
redhat-cluster-pve: 3.1.8-3
pve-cluster: 1.0-17
qemu-server: 2.0-13
pve-firmware: 1.0-14
libpve-common-perl: 1.0-11
libpve-access-control: 1.0-5
libpve-storage-perl: 2.0-9
vncterm: 1.0-2
vzctl: 3.0.29-3pve8
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-1
ksm-control-daemon: 1.1-1

Also, I recommend adding the cluster IPs/hosts to the hosts files on all the other cluster members, in case you're running DNS on a VM that itself runs on the cluster...

Thanks

tom · Jan 5, 2012

Do you run a two-node cluster? Did you configure the special two-node setup, and if so, how? If not, it looks like you lost quorum, and that seems to be the cause of your issues.

Faye · Jan 5, 2012

I followed the instructions in "Setting up a cluster": added the first node, added the second node, tada. I am not expecting HA, just common control.

I disagree with your statement, since I managed to fix it by correcting the ordering of directories and config by inode.
Feel free to try it out, I think it is reproducible.

tom · Jan 5, 2012

Please post the error logs (dmesg). There must be a reason why pmxcfs does not mount. I assume it's because you did not set the special flag for a two-node setup in cluster.conf.

Faye · Jan 5, 2012

root@p02v01:~# pmxcfs
critical: [database] parent is not a directory (inode = 00000000000002DE, parent = 000000000000054F, name = '111.conf') (database.c:401:bdb_backend_load_index)
critical: [database] DB load failed (database.c:445:bdb_backend_load_index)
critical: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db' (pmxcfs.c:763:main)
notice: exit proxmox configuration filesystem (-1)
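
The log prints inode numbers in hexadecimal, while the sqlite shell output posted later in this thread shows them in decimal. A minimal sketch of the conversion, using only the two values from the log line above (the variable names are illustrative):

```python
# Inode values copied from the "parent is not a directory" log line (hexadecimal).
log_fields = {"inode": "00000000000002DE", "parent": "000000000000054F"}

# Convert to the decimal form that the sqlite shell prints for the tree table.
decimal_fields = {key: int(value, 16) for key, value in log_fields.items()}

print(decimal_fields)  # {'inode': 734, 'parent': 1359}
```

So the failing entry is inode 734 and its parent is inode 1359, which are the numbers to look for in config.db.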

tom · Jan 5, 2012

pls post full error log from dmesg.

Faye · Jan 5, 2012

Oops missed this bit too.

sqlite> select * from tree where type=4;
2|0|2|0|1322118286|4|priv|
3|0|3|0|1322118286|4|nodes|
4|3|4|0|1322118286|4|sfop01v01|
5|4|5|0|1322118286|4|qemu-server|
6|4|6|0|1322118286|4|openvz|
7|4|7|0|1322118286|4|priv|
44|2|44|0|1322517100|4|lock|
1357|3|1357|1|1325632127|4|p02v01|
1358|1357|1358|1|1325632127|4|qemu-server|
1359|1357|1359|1|1325632127|4|openvz|
1360|1357|1360|1|1325632127|4|priv|
1597|3|1597|1|1325706813|4|p01v01|
1598|1597|1598|1|1325706823|4|openvz|
1601|1597|1601|1|1325706823|4|priv|
1606|1597|1606|1|1325706823|4|qemu-server|
sqlite> select * from tree where name like "111.conf"
...> ;
734|1359|1607|1|1325711212|8|111.conf|ONBOOT="no"

Changing this file's inode to 1607 (matching its version) meant that this file passed, but the load then failed on the next one.
Updating the directory inodes to be earlier in sequence than their files fixed all the files.

It then mounted.
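
Before renumbering anything by hand, it may help to list every entry whose parent directory has a higher inode than the entry itself. The following is a sketch only: the column names come from the SELECT statement quoted in this thread, but the column types and the PRIMARY KEY are assumptions, and the sample rows are a small subset of the output above loaded into an in-memory database.

```python
import sqlite3

# Assumption: column types and the primary key are guesses; only the column
# names are taken from the SELECT statement quoted in this thread.
SCHEMA = """
CREATE TABLE tree (
    inode INTEGER PRIMARY KEY, parent INTEGER, version INTEGER,
    writer INTEGER, mtime INTEGER, type INTEGER, name TEXT, data BLOB
)
"""

# Entries whose parent exists in the table but has a higher inode.
OUT_OF_ORDER = """
SELECT c.inode, c.parent, c.name
FROM tree AS c
JOIN tree AS p ON p.inode = c.parent
WHERE p.inode > c.inode
ORDER BY c.inode
"""


def find_out_of_order(conn):
    """Return (inode, parent, name) for entries that sort before their parent."""
    return conn.execute(OUT_OF_ORDER).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO tree VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (3, 0, 3, 0, 1322118286, 4, "nodes", None),
        (1357, 3, 1357, 1, 1325632127, 4, "p02v01", None),
        (1359, 1357, 1359, 1, 1325632127, 4, "openvz", None),
        (734, 1359, 1607, 1, 1325711212, 8, "111.conf", 'ONBOOT="no"'),
    ],
)

suspects = find_out_of_order(conn)
print(suspects)  # [(734, 1359, '111.conf')]
```

To try the same query on real data, connect to a copy of config.db instead of ":memory:" and skip the CREATE TABLE and INSERT steps.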

Faye · Jan 5, 2012

I don't have time to do that right now, will go hunting later.

Faye · Jan 5, 2012

It's pretty clear to me from the code why it fails. It processes the tree row by row and expects parent nodes to exist first. That is fine on one box, where you must create the directory before you can populate it, but it fails once you have more than one node and a node directory was created AFTER other items.

Sorry I didn't have time to waste trawling logs, hope this helps.

/* source lines 71-72 */
static const char *sql_load_all =
    "SELECT inode, parent, version, writer, mtime, type, name, data FROM tree;";

/* source lines 325-346 */
sqlite3_stmt *stmt = bdb->stmt_load_all;

while ((rc = sqlite3_step(stmt)) == SQLITE_ROW) {

    memdb_tree_entry_t *te;

    guint64 inode = sqlite3_column_int64(stmt, 0);
    const char *name = (const char *)sqlite3_column_text(stmt, 6);
    int namelen = sqlite3_column_bytes(stmt, 6);
    if (name == NULL || namelen == 0) {
        cfs_critical("inode has no name (inode = %016zX)", inode);
        goto fail;
    }
    te = g_malloc0(sizeof(memdb_tree_entry_t) + namelen + 1);
    strcpy(te->name, name);

    te->inode = inode;
    te->parent = sqlite3_column_int64(stmt, 1);
    te->version = sqlite3_column_int64(stmt, 2);
    te->writer = sqlite3_column_int64(stmt, 3) & 0x0ffffffff;
    te->mtime = sqlite3_column_int64(stmt, 4) & 0x0ffffffff;
    te->type = sqlite3_column_int64(stmt, 5) & 255;

    /* source lines 388-404 */
    if (!(pte = g_hash_table_lookup(index, &te->parent))) {

        /* allocate placeholder (type == 0)
         * this is simply replaced if we find a real inode later
         */
        pte = g_malloc0(sizeof(memdb_tree_entry_t));
        pte->inode = te->parent;
        pte->data.entries = g_hash_table_new(g_str_hash, g_str_equal);
        g_hash_table_replace(index, &pte->inode, pte);

    } else if (pte->type != DT_DIR) {
        cfs_critical("parent is not a directory "
                     "(inode = %016zX, parent = %016zX, name = '%s')",
                     te->inode, te->parent, te->name);
        memdb_tree_entry_free(te);
        goto fail;
    }
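
To make the failure mode concrete, here is a small Python model of the excerpt above next to one possible order-independent alternative. This is a sketch under stated assumptions, not pmxcfs code and not necessarily how the problem was fixed upstream: inode 0 is assumed to be an implicit root directory, the handling after source line 404 is guessed, and the row for 110.conf (inode 733) is hypothetical, added only so that a placeholder already exists when 111.conf is read.

```python
DT_DIR = 4      # directory type value seen in the thread's tree rows
DT_REG = 8      # regular file type value seen in the thread's tree rows
ROOT_INODE = 0  # assumption: inode 0 is the implicit root directory


class LoadError(Exception):
    """Raised when the tree cannot be rebuilt from the rows."""


def new_root():
    return {"type": DT_DIR, "name": "", "children": {}}


def load_single_pass(rows):
    """Model of the quoted loop: the parent is checked as each row arrives."""
    index = {ROOT_INODE: new_root()}
    for inode, parent, entry_type, name in rows:
        pte = index.get(parent)
        if pte is None:
            # Placeholder with type 0, as in the quoted excerpt.
            pte = {"type": 0, "name": None, "children": {}}
            index[parent] = pte
        elif pte["type"] != DT_DIR:
            raise LoadError(
                f"parent is not a directory "
                f"(inode={inode}, parent={parent}, name={name!r})"
            )
        # Assumption (not in the excerpt): a real row takes over the children
        # collected by an earlier placeholder for the same inode.
        old = index.get(inode)
        children = old["children"] if old is not None else {}
        index[inode] = {"type": entry_type, "name": name, "children": children}
        pte["children"][name] = inode
    return index


def load_two_pass(rows):
    """Alternative sketch: register every inode first, link parents afterwards."""
    index = {ROOT_INODE: new_root()}
    for inode, parent, entry_type, name in rows:
        index[inode] = {"type": entry_type, "name": name, "children": {}}
    for inode, parent, entry_type, name in rows:
        pte = index.get(parent)
        if pte is None or pte["type"] != DT_DIR:
            raise LoadError(f"bad parent {parent} for inode {inode}")
        pte["children"][name] = inode
    return index


# Rows in inode order: (inode, parent, type, name). 733 is hypothetical.
rows = [
    (3, 0, DT_DIR, "nodes"),
    (733, 1359, DT_REG, "110.conf"),
    (734, 1359, DT_REG, "111.conf"),
    (1357, 3, DT_DIR, "p02v01"),
    (1359, 1357, DT_DIR, "openvz"),
]

try:
    load_single_pass(rows)
    single_pass_error = None
except LoadError as exc:
    single_pass_error = str(exc)

tree = load_two_pass(rows)
print(single_pass_error)
print(tree[1359]["children"])
```

In this model the first file under directory 1359 only creates the placeholder; the second file finds a parent of type 0 and is rejected. The two-pass variant does not depend on row order, at the cost of walking the rows twice.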

Faye · Jan 5, 2012

I suppose the question is,
if (!(pte = g_hash_table_lookup(index, &te->parent))) {

Why does that return NULL when the item is in the db? It looks like that is what causes the problem. Ordering the entries works around the issue.

Well, it's because the lookup runs against the entries that have already been loaded (sorry, I'm a newbie at debugging glib; most of my time is spent with Java devs). So, back to the problem: if the parent has not been loaded yet, we create a placeholder of type 0, and if a later entry then finds that placeholder as its parent, it is not a directory and we fail. There's no way in the code to deal with out-of-order items, and once you have more than one node that becomes really easy to create. E.g. you have a 3-node cluster and extend it to 4; you migrate a machine to the new node, you stop it (or it stops), and it won't start back up. You could simulate it with the following:

inode:parent:version:type:name
1:2:2:1:"test"
2:0:1:4:"dir"

or perhaps even:

2:1:2:1:"test"
1:0:1:4:"dir"

but more likely:

1:0:1:4:"dir"
2:1:2:1:"file"
3:0:3:4:"dir2"

Now migrate from dir to dir2 and you may well see the issue.

Faye · Jan 5, 2012

Test...
sqlite> insert into tree values (1,0,1,4,"dir");
--my first node
sqlite> insert into tree values (2,1,2,1,"file");
--my first guest
sqlite> insert into tree values (3,0,3,4,"dir2");
--my second node

sqlite> select * from tree;
1|0|1|4|dir
2|1|2|1|file
3|0|3|4|dir2

sqlite> update tree set parent=3,version=4 where node=2;
--migrate my guest
sqlite> select * from tree;

1|0|1|4|dir
2|3|4|1|file
3|0|3|4|dir2

This is why it fails. There's an assumption that updates reorder the result, perhaps?

sqlite> update tree set name="workingfile" where node=2;
sqlite> select * from tree;
1|0|1|4|dir
2|3|4|1|workingfile
3|0|3|4|dir2
sqlite> update tree set node=4 where node=2;
sqlite> select * from tree;
1|0|1|4|dir
4|3|4|1|workingfile
3|0|3|4|dir2
sqlite> delete from tree where node=4;
sqlite> insert into tree values (2,1,5,1,"I bet you think I'm going where 4 was");
sqlite> select * from tree;
1|0|1|4|dir
3|0|3|4|dir2
2|1|5|1|I bet you think I'm going where 4 was

Sorry, had to add these, they made me laugh. It does look as though it's ordered by first insert time, but I bet as soon as you rely on that it will change to become something more inscrutable.
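
SQLite does not define the order of rows returned by a SELECT without ORDER BY, so the ordering seen above should not be relied on. The sketch below replays a shortened version of the experiment with Python's sqlite3 module and makes the ordering explicit. Column names other than "node" are assumptions, since the transcript only shows values, and the name of the re-inserted row is shortened.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: only the column name "node" appears in the transcript above;
# the other names are made up for this sketch. No primary key is declared,
# so SQLite assigns a hidden rowid to each row.
conn.execute("CREATE TABLE tree (node, parent, version, type, name)")
conn.executemany(
    "INSERT INTO tree VALUES (?, ?, ?, ?, ?)",
    [(1, 0, 1, 4, "dir"), (2, 1, 2, 1, "file"), (3, 0, 3, 4, "dir2")],
)

# Renumbering a row changes the node column but not the hidden rowid.
conn.execute("UPDATE tree SET node = 4 WHERE node = 2")
# Deleting and re-inserting gives the new row a fresh, higher rowid.
conn.execute("DELETE FROM tree WHERE node = 4")
conn.execute("INSERT INTO tree VALUES (2, 1, 5, 1, 'late insert')")

by_rowid = conn.execute(
    "SELECT rowid, node, name FROM tree ORDER BY rowid"
).fetchall()
by_node = conn.execute(
    "SELECT node, name FROM tree ORDER BY node"
).fetchall()

print(by_rowid)  # [(1, 1, 'dir'), (3, 3, 'dir2'), (4, 2, 'late insert')]
print(by_node)   # [(1, 'dir'), (2, 'late insert'), (3, 'dir2')]
```

The hidden rowid accounts for the "first insert time" ordering in this test table. Note that even an explicit ORDER BY on the inode column would not help the loader, because a file can still have a lower inode than its parent directory.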

dietmar · Jan 5, 2012

Faye said:
create node1
create several containers and vms
create node2
cluster
fail containers and vms onto node2
restart node2
config filesystem will not mount on node2. Manually running pmxcfs shows that it is working with a .conf file which was migrated from the older node and says that the parent is not a directory.

It would be great to have a low level test case for that. Do you think you can create a test case in check_memdb.c?

Faye · Jan 5, 2012

I'm going to look into what's required. It's been a while since I called myself a developer, but who knows, I may not be entirely rusty. I've got a snapshot right now, I'll clone the repo and take a look. I have quite a lot on locally, but this is valuable so I'll see what I can do for the project. No promises yet!

Faye

dietmar · Jan 6, 2012

Faye said:
I'm going to look into what's required. It's been a while since I called myself a developer, but who knows, I may not be entirely rusty. I've got a snapshot right now, I'll clone the repo and take a look. I have quite a lot on locally, but this is valuable so I'll see what I can do for the project. No promises yet!

Also read http://pve.proxmox.com/wiki/Developer_Documentation before you start.

dietmar · Jan 9, 2012

OK, I have uploaded a fix - please can you test?

https://git.proxmox.com/?p=pve-cluster.git;a=summary

Faye · Jan 9, 2012

Trying to build from source:

dpkg-checkbuilddeps: Unmet build dependencies: libsqlite3-dev libfuse-dev libcorosync-pve-dev libqb-dev libglib2.0-dev librrd-dev check
dpkg-buildpackage: warning: Build dependencies/conflicts unsatisfied; aborting.
dpkg-buildpackage: warning: (Use -d flag to override.)
make: *** [pve-cluster_1.0-19_amd64.deb] Error 3
# apt-get install libsqlite3-dev libfuse-dev libcorosync-pve-dev libqb-dev libglib2.0-dev librrd-dev check
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package libcorosync-pve-dev
E: Unable to locate package libqb-dev

What repos do I need to add to get these please?

Thanks

dietmar · Jan 9, 2012

Faye said:
What repos do I need to add to get these please?

Oh, they are missing in our repository (another bug).

Faye · Jan 10, 2012

dietmar said:
Oh, they are missing in our repository (another bug).

heh. Let me know when I can test and I'll gladly do so.
