Cluster master failed... And now?

copymaster

Member
Nov 25, 2009
183
0
16
I know the solution for the case where a node has to become master: pveca -m.

In my cluster, every now and then the servers in the cluster become inaccessible, and I think this is related to the error described below.

But in my situation the master does not seem to be "really" down. I get several errors in /var/log/kern.log indicating (I think) a problem with a SATA drive. Here is the error log I already posted:

Code:
Jan 12 08:04:05 Donald kernel: sd 7:0:0:0: [sdb] 2147518464 512-byte hardware sectors (1099529 MB)
Jan 12 08:04:05 Donald kernel: sd 7:0:0:0: [sdb] Write Protect is off
Jan 12 08:04:05 Donald kernel: sd 7:0:0:0: [sdb] Mode Sense: bd 00 00 08
Jan 12 08:04:05 Donald kernel: sd 7:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Jan 12 08:05:07 Donald kernel: ata2.00: qc timeout (cmd 0xa0)
Jan 12 08:05:07 Donald kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 12 08:05:07 Donald kernel: ata2.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jan 12 08:05:07 Donald kernel:         cdb 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
Jan 12 08:05:07 Donald kernel:         res 51/20:03:00:00:00/00:00:00:00:00/a0 Emask 0x5 (timeout)
Jan 12 08:05:07 Donald kernel: ata2.00: status: { DRDY ERR }
Jan 12 08:05:12 Donald kernel: ata2: port is slow to respond, please be patient (Status 0xd0)
Jan 12 08:05:17 Donald kernel: ata2: device not ready (errno=-16), forcing hardreset
Jan 12 08:05:17 Donald kernel: ata2: soft resetting link
Jan 12 08:05:18 Donald kernel: ata2.01: NODEV after polling detection
Jan 12 08:05:18 Donald kernel: ata2.00: configured for UDMA/25
Jan 12 08:05:18 Donald kernel: ata2: EH complete
Jan 12 08:07:00 Donald kernel: ata2.00: qc timeout (cmd 0xa0)
Jan 12 08:07:00 Donald kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 12 08:07:00 Donald kernel: ata2.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jan 12 08:07:00 Donald kernel:         cdb 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
Jan 12 08:07:00 Donald kernel:         res 51/20:03:00:00:00/00:00:00:00:00/a0 Emask 0x5 (timeout)
Jan 12 08:07:00 Donald kernel: ata2.00: status: { DRDY ERR }
Jan 12 08:07:05 Donald kernel: ata2: port is slow to respond, please be patient (Status 0xd0)
Jan 12 08:07:10 Donald kernel: ata2: device not ready (errno=-16), forcing hardreset
Jan 12 08:07:10 Donald kernel: ata2: soft resetting link
Jan 12 08:07:10 Donald kernel: ata2.01: NODEV after polling detection
Jan 12 08:07:11 Donald kernel: ata2.00: configured for UDMA/25
Jan 12 08:07:11 Donald kernel: ata2: EH complete
I googled this error and it seems that a fresh installation may cure the server's pain. But now the question:

What is the procedure if I want to reinstall a cluster master?
Say I want to shut the master down, reinstall it, and then bring it back as master again.

Is that possible? And what is the right way to do it?
 


  1. Promote a node to master.
  2. Install the new server (check the hardware/disk system).
  3. Join the new one to the cluster.
  4. Then promote that node to master again.
  5. Tell all other servers to sync from this master (see the command sketch below).
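
On Proxmox VE 1.x this maps roughly to the pveca commands below. Take it as a hedged sketch only: the IP addresses are illustrative, so double-check each step against your own cluster before running anything.

Code:
# On one of the surviving nodes: promote it to master
pveca -m

# Reinstall the failed server, then on the freshly installed machine:
# join the cluster, pointing -h at the current master (IP illustrative)
pveca -a -h 192.168.0.71

# Once it is back in sync and should become master again, still on that machine:
pveca -m

# On every other node: sync the configuration from the restored master (IP illustrative)
pveca -s -h 192.168.0.70

# Check the cluster state on each node
pveca -l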
 
Thanks, I did as you advised...

I promoted one node to be master. Then the former master disappeared.
I reinstalled the server and brought it up again, then added it back into the cluster.
Then I promoted it to master again with pveca -m.

Then the node which had been master in the meantime disappeared.

I tried to tell the nodes to sync from the master... Now DISASTER!
I think the SSH keys are completely mixed up and my cluster is broken.

I have 3 servers, and here is the pveca -l output from each of them:

Donald (192.168.0.70)
Code:
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
 3 : 192.168.0.72    N     S    2 days 03:56   1.26    59%     2%
 4 : 192.168.0.70    M     A           00:22   0.03     2%     1%
Tick (192.168.0.71) was master during reinstallation of Donald
Code:
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
 2 : 192.168.0.71    M     A    2 days 03:57   0.64    61%     1%
 3 : 192.168.0.72    N     S    2 days 03:56   1.32    59%     2%
 4 : 192.168.0.70    N     A           00:22   0.01     2%     1%
Trick (192.168.0.72)
Code:
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
 1 : 192.168.0.70    M     ERROR: 500 Can't connect to 127.0.0.1:50000 (connect: Connection refused)

 2 : 192.168.0.71    N     A    2 days 03:58   0.46    61%     1%
 3 : 192.168.0.72    N     S    2 days 03:57   1.64    59%     2%
When I log into the new master, I have no access to the "Storage" section, which is essential because that's where all the VMs are.

And if I choose "Cluster" I can only see the master and Tick as a node, which is in state NOSYNC.

HELP please!

A pveca -s on node Trick gives:
Code:
syncing master configuration from '192.168.0.70'
syncing master configuration from '192.168.0.70' failed (rsync --rsh=ssh -l root -o BatchMode=yes -lpgoq 192.168.0.70:/etc/pve/* /etc/cron.d/vzdump /etc/pve/master/ --exclude *~) : command 'rsync --rsh=ssh -l root -o BatchMode=yes -lpgoq 192.168.0.70:/etc/pve/* /etc/cron.d/vzdump /etc/pve/master/ --exclude *~' failed with exit code 255:
Host key verification failed.
rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
rsync error: unexplained error (code 255) at io.c(635) [receiver=3.0.3]
 

Did you install the old server with the same IP/name? Then the other servers have a wrong known_hosts file.

Files to check (on all servers):
/etc/pve/cluster.cfg
/root/.ssh/known_hosts

Fix the known_hosts file (remove the old entry, or just delete the whole file). If you can ssh from each server to every other one without specifying a password, the cluster communication works again.
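
A minimal sketch of that cleanup, assuming the reinstalled master is 192.168.0.70 (substitute your own hosts):

Code:
# Remove the stale host key entry for the reinstalled server
ssh-keygen -R 192.168.0.70

# ...or, more drastically, delete the whole file and let it be repopulated
# rm /root/.ssh/known_hosts

# Then test passwordless root ssh between all nodes and accept the new host keys
ssh root@192.168.0.70 hostname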
 
OK, I removed the known_hosts files from all nodes and ssh'ed into each server from every other one.

The connection works again, and I can manually do a pveca -s.

But the cluster.cfg is different on every host.
The new master only shows one node.

Can I just edit the cluster.cfg so that all nodes show up, and copy this config to all other nodes?

By the way: the "Storage" section in the web interface still doesn't work.
 

The cluster.cfg has to be equal on all hosts.

Storage: I did not understand this, details?

Also post your /etc/pve/storage.cfg (from the master); this file is synchronized to all nodes.
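
For orientation, a default /etc/pve/storage.cfg with just the local directory storage looks roughly like this (illustrative only; the actual entries depend on the storages that were defined):

Code:
dir: local
        path /var/lib/vz
        content images,iso,vztmpl,rootdir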
 
Hi Tom,

I got it back up again in the meantime. I just copied a working storage.cfg to all nodes and synced the cluster.cfg.

After that I ssh'ed from each machine into every other one and ran a pveca -s -h <master>.
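
In terms of commands, that was roughly the following (a sketch only; 192.168.0.70 is the master and the node IPs are taken from the listings above):

Code:
# On the master: push the known-good config files to the other nodes
scp /etc/pve/storage.cfg /etc/pve/cluster.cfg root@192.168.0.71:/etc/pve/
scp /etc/pve/storage.cfg /etc/pve/cluster.cfg root@192.168.0.72:/etc/pve/

# On each node: sync the remaining configuration from the master
pveca -s -h 192.168.0.70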

I think that's it. Unfortunately this didn't solve my problem with the ata2 error.
I changed all drives but the error remains; I reinstalled, and the error remains.

I think the only part of the server that can be the culprit is the storage controller...
I will have a look at that today...