
MHA Failover Process Analysis

DBA Team

March 2013

Document Revision History

Date Version Description Author Reviewer

2013-03-27 邱伟胜


Contents

1. MHA Scenario
2. MHA Failover Process
2.1 Phase 1: Configuration Check Phase
2.2 Phase 2: Dead Master Shutdown Phase
2.3 Phase 3: Master Recovery Phase
2.4 Phase 4: Slaves Recovery Phase
2.5 Phase 5: New master cleanup phase


1. MHA Scenario

In the cluster below, the master and its slaves were deliberately made inconsistent by hand. For example, table qwsh holds four rows on the master, while 10.0.0.75 holds only one:

10.0.0.13 (current master)
 +--10.0.0.74
 +--10.0.0.11
 +--10.0.0.75

Server      Role                       Table   Column   Rows
10.0.0.13   Master                     qwsh    Aa int   1,2,3,4
10.0.0.11   Slave                      qwsh    Aa int   1,2,3
10.0.0.74   Slave (candidate master)   qwsh    Aa int   1,2
10.0.0.75   Slave                      qwsh    Aa int   1

2. MHA Failover Process

The following walks through a manual failover in detail:

2.1 Phase 1: Configuration Check Phase

This phase mainly checks the state of every node:

First, whether each server is dead or alive;

Second, which slaves are primary candidates for the new master, and similar settings.

2.2 Phase 2: Dead Master Shutdown Phase

First, check whether the dead master can still be reached over ssh.

Second, take the dead master out of service, for example by disabling its VIP or shutting down the host.


2.3 Phase 3: Master Recovery Phase

2.3.1 Phase 3.1: Getting Latest Slaves Phase

Based on how far each slave has replicated, identify the latest slaves (mysql-bin.000034:250773) and the oldest slaves (mysql-bin.000034:250405).
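The comparison behind "latest" and "oldest" can be sketched as follows. This is a minimal illustration, not MHA's own code: the helper names coord_key and latest_and_oldest are hypothetical, and the per-host positions are the ones reported in the failover log quoted in this document.

```python
# Replication coordinates each slave has received from the dead master,
# as reported in the failover log quoted in this document.
received = {
    "10.0.0.11": ("mysql-bin.000034", 250773),
    "10.0.0.74": ("mysql-bin.000034", 250589),
    "10.0.0.75": ("mysql-bin.000034", 250405),
}


def coord_key(coord):
    """Order coordinates by binlog sequence number, then by byte offset."""
    log_file, pos = coord
    return (int(log_file.rsplit(".", 1)[1]), pos)


def latest_and_oldest(received):
    """Return (latest hosts, newest coord, oldest hosts, oldest coord)."""
    newest = max(received.values(), key=coord_key)
    oldest = min(received.values(), key=coord_key)
    latest_hosts = sorted(h for h, c in received.items() if c == newest)
    oldest_hosts = sorted(h for h, c in received.items() if c == oldest)
    return latest_hosts, newest, oldest_hosts, oldest


latest_hosts, newest, oldest_hosts, oldest = latest_and_oldest(received)
print("latest:", latest_hosts, newest)
print("oldest:", oldest_hosts, oldest)
```

Comparing the numeric suffix of the file name before the offset matters once slaves sit in different binlog files; in this scenario all three are in mysql-bin.000034, so only the offset differs.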

2.3.2 Phase 3.2: Saving Dead Master's Binlog Phase

If the dead master is still reachable over ssh, save the binlog events that lie between the latest slave's position and the end of the master's binlog (starting from mysql-bin.000034:250773):

save_binary_logs --command=save --start_file=mysql-bin.000034 \
  --start_pos=250773 --binlog_dir=/data/mysql/arch \
  --output_file=/var/tmp/saved_master_binlog_from_10.0.0.13_3306_20130325143805.binlog \
  --handle_raw_binlog=1 --disable_log_bin=0 \
  --manager_version=0.55

The corresponding binlog content is as follows:

[root@db-13 ~]# mysqlbinlog /var/tmp/saved_master_binlog_from_10.0.0.13_3306_20130325143805.binlog
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#130325 10:40:31 server id 1 end_log_pos 107 Start: binlog v 4, server v 5.5.27-log created 130325 10:40:31 at startup
ROLLBACK/*!*/;
BINLOG '
H7lPUQ8BAAAAZwAAAGsAAAAAAAQANS41LjI3LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAfuU9REzgNAAgAEgAEBAQEEgAAVAAEGggAAAAICAgCAA==
'/*!*/;
# at 107
#130325 14:18:47 server id 1 end_log_pos 250841 Query thread_id=21 exec_time=0 error_code=0
SET TIMESTAMP=1364192327/*!*/;
SET @@session.pseudo_thread_id=21/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=0/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 175
#130325 14:18:47 server id 1 end_log_pos 250930 Query thread_id=21 exec_time=0 error_code=0
use test/*!*/;
SET TIMESTAMP=1364192327/*!*/;
insert into qwsh values(4)
/*!*/;
# at 264
#130325 14:18:47 server id 1 end_log_pos 250957 Xid = 2425
COMMIT/*!*/;
# at 291
#130325 14:19:42 server id 1 end_log_pos 250976 Stop
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
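One way to sanity-check a saved binlog like the one above is to read the event headers out of the mysqlbinlog text. The sketch below is illustrative only: the regular expression and the helper name event_headers are assumptions of this example, not part of MHA, and the sample text is an abbreviated copy of the output above (the format header and the session SET lines are left out).

```python
import re

# Abbreviated copy of the mysqlbinlog output above, starting after the
# format description header.
DUMP = """\
# at 107
#130325 14:18:47 server id 1 end_log_pos 250841 Query thread_id=21 exec_time=0 error_code=0
BEGIN
/*!*/;
# at 175
#130325 14:18:47 server id 1 end_log_pos 250930 Query thread_id=21 exec_time=0 error_code=0
use test/*!*/;
SET TIMESTAMP=1364192327/*!*/;
insert into qwsh values(4)
/*!*/;
# at 264
#130325 14:18:47 server id 1 end_log_pos 250957 Xid = 2425
COMMIT/*!*/;
# at 291
#130325 14:19:42 server id 1 end_log_pos 250976 Stop
"""

# Matches lines such as "#130325 14:18:47 server id 1 end_log_pos 250841 Query".
HEADER = re.compile(
    r"^#\d{6}\s+[\d:]+\s+server id (\d+)\s+end_log_pos (\d+)\s+(\w+)"
)


def event_headers(text):
    """Return (server_id, end_log_pos, event_type) for each event header line."""
    events = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            events.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return events


events = event_headers(DUMP)
for event in events:
    print(event)
```

In this sample every end_log_pos is greater than the --start_pos of 250773, which is consistent with the saved file holding only events that the latest slave had not yet received.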

2.3.3 Phase 3.3: Determining New Master Phase

Check whether the latest slave holds all the relay logs needed to repair the other slaves (oldest position: mysql-bin.000034:250405). Then select the new master according to the candidate rules, which take settings such as candidate_master=1 and no_master=1 into account:

apply_diff_relay_logs --command=find --latest_mlf=mysql-bin.000034 \
  --latest_rmlp=250773 --target_mlf=mysql-bin.000034 \
  --target_rmlp=250405 --server_id=3 --workdir=/var/tmp \
  --timestamp=20130325143805 --manager_version=0.55 \
  --relay_log_info=/data/mysql/data/relay-log.info \
  --relay_dir=/data/mysql/data/
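The effect of candidate_master=1 and no_master=1 on the choice can be sketched as below. This is a deliberately simplified reading of the candidate rules, not MHA's actual selection code, which applies further checks. The Slave class and pick_new_master function are hypothetical names introduced for this example; the positions come from the failover log in this document.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    host: str
    pos: int                       # received position in mysql-bin.000034
    candidate_master: bool = False
    no_master: bool = False


def pick_new_master(slaves):
    """Simplified election: never pick a no_master host, prefer
    candidate_master hosts, otherwise fall back to the most advanced slave."""
    eligible = [s for s in slaves if not s.no_master]
    if not eligible:
        raise RuntimeError("no slave is eligible to become master")
    candidates = [s for s in eligible if s.candidate_master]
    pool = candidates or eligible
    return max(pool, key=lambda s: s.pos)


slaves = [
    Slave("10.0.0.11", 250773),
    Slave("10.0.0.74", 250589, candidate_master=True),
    Slave("10.0.0.75", 250405),
]
print(pick_new_master(slaves).host)
```

With these inputs the sketch picks 10.0.0.74 even though 10.0.0.11 is further ahead, which matches this walkthrough: the candidate is promoted and then brought up to date from the latest slave.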

2.3.4 Phase 3.4: New Master Diff Log Generation Phase

Compare the candidate master with the latest slave to decide whether a differential log must be generated. (10.0.0.74 received relay logs up to mysql-bin.000034:250589; the latest slave, 10.0.0.11, received up to mysql-bin.000034:250773.)

apply_diff_relay_logs --command=generate_and_send --scp_user=root \
  --scp_host=10.0.0.74 --latest_mlf=mysql-bin.000034 \
  --latest_rmlp=250773 --target_mlf=mysql-bin.000034 \
  --target_rmlp=250589 --server_id=3 \
  --diff_file_readtolatest=/var/tmp/relay_from_read_to_latest_10.0.0.74_3306_20130325143805.binlog \
  --workdir=/var/tmp \
  --timestamp=20130325143805 --handle_raw_binlog=1 --disable_log_bin=0 \
  --manager_version=0.55 \
  --relay_log_info=/data/mysql/data/relay-log.info \
  --relay_dir=/data/mysql/data/

The corresponding binlog content is as follows:

[root@db-11 ~]# mysqlbinlog /var/tmp/relay_from_read_to_latest_10.0.0.74_3306_20130325143805.binlog
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#130325 11:03:52 server id 3 end_log_pos 107 Start: binlog v 4, server v 5.5.27-log created 130325 11:03:52
BINLOG '
mL5PUQ8DAAAAZwAAAGsAAAAAAAQANS41LjI3LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAVAAEGggAAAAICAgCAA==
'/*!*/;
# at 107
#700101 8:00:00 server id 1 end_log_pos 0 Rotate to mysql-bin.000034 pos: 107
# at 150
#130325 10:40:31 server id 1 end_log_pos 0 Start: binlog v 4, server v 5.5.27-log created 130325 10:40:31
BINLOG '
H7lPUQ8BAAAAZwAAAAAAAAAAAAQANS41LjI3LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAVAAEGggAAAAICAgCAA==
'/*!*/;
# at 253
#130325 14:12:19 server id 1 end_log_pos 250657 Query thread_id=21 exec_time=0 error_code=0
SET TIMESTAMP=1364191939/*!*/;
SET @@session.pseudo_thread_id=21/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=0/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 321
#130325 14:12:19 server id 1 end_log_pos 250746 Query thread_id=21 exec_time=0 error_code=0
use test/*!*/;
SET TIMESTAMP=1364191939/*!*/;
insert into qwsh values(3)
/*!*/;
# at 410
#130325 14:12:19 server id 1 end_log_pos 250773 Xid = 2424
COMMIT/*!*/;
# at 437
#130325 14:12:36 server id 3 end_log_pos 250938 Stop
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;


2.3.5 Phase 3.5: Master Log Apply Phase

First, wait until all relay logs are applied.

Second, merge the logs from the latest slave and the dead master. Because some log events may be incomplete, they are checked during the merge. The manager then reports: All apply target binary logs are concatinated at /var/tmp/total_binlog_for_10.0.0.74_3306.20130325143805.binlog .

The corresponding log content is as follows:

[mysql@db-74 ~]$ mysqlbinlog /var/tmp/total_binlog_for_10.0.0.74_3306.20130325143805.binlog
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#130325 11:03:52 server id 3 end_log_pos 107 Start: binlog v 4, server v 5.5.27-log created 130325 11:03:52
BINLOG '
mL5PUQ8DAAAAZwAAAGsAAAAAAAQANS41LjI3LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAVAAEGggAAAAICAgCAA==
'/*!*/;
# at 107
#700101 8:00:00 server id 1 end_log_pos 0 Rotate to mysql-bin.000034 pos: 107
# at 150
#130325 10:40:31 server id 1 end_log_pos 0 Start: binlog v 4, server v 5.5.27-log created 130325 10:40:31
BINLOG '
H7lPUQ8BAAAAZwAAAAAAAAAAAAQANS41LjI3LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAVAAEGggAAAAICAgCAA==
'/*!*/;
# at 253
#130325 14:12:19 server id 1 end_log_pos 250657 Query thread_id=21 exec_time=0 error_code=0
SET TIMESTAMP=1364191939/*!*/;
SET @@session.pseudo_thread_id=21/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=0/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 321
#130325 14:12:19 server id 1 end_log_pos 250746 Query thread_id=21 exec_time=0 error_code=0
use test/*!*/;
SET TIMESTAMP=1364191939/*!*/;
insert into qwsh values(3)
/*!*/;
# at 410
#130325 14:12:19 server id 1 end_log_pos 250773 Xid = 2424
COMMIT/*!*/;
# at 437
#130325 14:12:36 server id 3 end_log_pos 250938 Stop
# at 456
#130325 14:18:47 server id 1 end_log_pos 250841 Query thread_id=21 exec_time=0 error_code=0
SET TIMESTAMP=1364192327/*!*/;
BEGIN
/*!*/;
# at 524
#130325 14:18:47 server id 1 end_log_pos 250930 Query thread_id=21 exec_time=0 error_code=0
SET TIMESTAMP=1364192327/*!*/;
insert into qwsh values(4)
/*!*/;
# at 613
#130325 14:18:47 server id 1 end_log_pos 250957 Xid = 2425
COMMIT/*!*/;
# at 640
#130325 14:19:42 server id 1 end_log_pos 250976 Stop
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;


Third, record the new master's binlog file and position:

All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.0.0.74', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000003', MASTER_LOG_POS=475, MASTER_USER='repl', MASTER_PASSWORD='xxx';

Fourth, execute the master IP activate script.

Fifth, set read_only=0 on the new master.
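The recorded file and position are what every remaining slave later uses to re-point itself at the new master. As a rough sketch, the statement quoted above can be assembled from those recorded values; change_master_sql is a hypothetical helper written for this illustration, not an MHA function, and the argument values are the ones shown in the log.

```python
def change_master_sql(host, port, log_file, log_pos, user, password):
    """Build the CHANGE MASTER TO statement a slave runs to follow the new master.

    Plain string formatting is used for readability only; the inputs are
    assumed to be trusted values taken from the manager's own log.
    """
    return (
        f"CHANGE MASTER TO MASTER_HOST='{host}', MASTER_PORT={int(port)}, "
        f"MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={int(log_pos)}, "
        f"MASTER_USER='{user}', MASTER_PASSWORD='{password}';"
    )


sql = change_master_sql("10.0.0.74", 3306, "mysql-bin.000003", 475, "repl", "xxx")
print(sql)
```

Note that the coordinates refer to the new master's own binlog (mysql-bin.000003:475), not to the dead master's mysql-bin.000034, because the slaves will now read from 10.0.0.74.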

2.4 Phase 4: Slaves Recovery Phase

2.4.1 Phase 4.1: Starting Parallel Slave Diff Log Generation Phase

For each slave, determine whether its relay logs lag behind those of the latest slave. If they do, run the following command on the latest slave to generate a differential relay log file and copy it to that slave with scp:

(Server 10.0.0.75 received relay logs up to: mysql-bin.000034:250405. Need to get diffs from the latest slave(10.0.0.11) up to: mysql-bin.000034:250773)

apply_diff_relay_logs --command=generate_and_send --scp_user=root \
  --scp_host=10.0.0.75 --latest_mlf=mysql-bin.000034 \
  --latest_rmlp=250773 --target_mlf=mysql-bin.000034 \
  --target_rmlp=250405 --server_id=3 \
  --diff_file_readtolatest=/var/tmp/relay_from_read_to_latest_10.0.0.75_3306_20130325143805.binlog \
  --workdir=/var/tmp \
  --timestamp=20130325143805 --handle_raw_binlog=1 --disable_log_bin=0 \
  --manager_version=0.55 \
  --relay_log_info=/data/mysql/data/relay-log.info \
  --relay_dir=/data/mysql/data/
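The per-slave decision in this phase amounts to comparing each slave's received position with that of the latest slave. The sketch below illustrates the idea for the two remaining slaves; diff_plan is a hypothetical helper, and the positions are offsets in mysql-bin.000034 taken from the log above.

```python
# Received positions in mysql-bin.000034 for the slaves handled in this phase.
LATEST_HOST = "10.0.0.11"
received = {"10.0.0.11": 250773, "10.0.0.75": 250405}


def diff_plan(received, latest_host):
    """Map each slave to the (start, end) master offsets it still lacks,
    or to None when it is already level with the latest slave."""
    latest_pos = received[latest_host]
    plan = {}
    for host, pos in received.items():
        plan[host] = (pos, latest_pos) if pos < latest_pos else None
    return plan


plan = diff_plan(received, LATEST_HOST)
for host, gap in sorted(plan.items()):
    if gap is None:
        print(f"{host}: up to date, no diff file needed")
    else:
        start, end = gap
        print(f"{host}: needs relay log events from {start} to {end}")
```

The (start, end) pair corresponds to the --target_rmlp and --latest_rmlp arguments in the command above. Because each slave's range is independent of the others, the diff files can be generated in parallel, as the phase name suggests.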

2.4.2 Phase 4.2: Starting Parallel Slave Log Apply Phase

First, wait until all relay logs are applied.

Second, check whether the slave already has the latest relay log, then merge the required logs and apply them.

10.0.0.11 has the latest relay log, so only the dead master's saved binlog is applied:


apply_diff_relay_logs --command=apply --slave_user='root' \
  --slave_host=10.0.0.11 --slave_ip=10.0.0.11 --slave_port=3306 \
  --apply_files=/var/tmp/saved_master_binlog_from_10.0.0.13_3306_20130325143805.binlog \
  --workdir=/var/tmp --target_version=5.5.27-log \
  --timestamp=20130325143805 --handle_raw_binlog=1 --disable_log_bin=0 \
  --manager_version=0.55 --slave_pass=xxx

10.0.0.75 does not have the latest relay log, so its differential relay log must be merged with the dead master's binlog:

apply_diff_relay_logs --command=apply --slave_user='root' \
  --slave_host=10.0.0.75 --slave_ip=10.0.0.75 --slave_port=3306 \
  --apply_files=/var/tmp/relay_from_read_to_latest_10.0.0.75_3306_20130325143805.binlog,/var/tmp/saved_master_binlog_from_10.0.0.13_3306_20130325143805.binlog \
  --workdir=/var/tmp --target_version=5.5.27-log \
  --timestamp=20130325143805 --handle_raw_binlog=1 --disable_log_bin=0 \
  --manager_version=0.55 --slave_pass=xxx

The content of the merged log is as follows:

[mysql@db-75 data]$ mysqlbinlog /var/tmp/total_binlog_for_10.0.0.75_3306.20130325143805.binlog
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#130325 11:03:52 server id 3 end_log_pos 107 Start: binlog v 4, server v 5.5.27-log created 130325 11:03:52
BINLOG '
mL5PUQ8DAAAAZwAAAGsAAAAAAAQANS41LjI3LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAVAAEGggAAAAICAgCAA==
'/*!*/;
# at 107
#700101 8:00:00 server id 1 end_log_pos 0 Rotate to mysql-bin.000034 pos: 107
# at 150
#130325 10:40:31 server id 1 end_log_pos 0 Start: binlog v 4, server v 5.5.27-log created 130325 10:40:31
BINLOG '
H7lPUQ8BAAAAZwAAAAAAAAAAAAQANS41LjI3LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAVAAEGggAAAAICAgCAA==
'/*!*/;
# at 253
#130325 14:09:57 server id 1 end_log_pos 250473 Query thread_id=21 exec_time=0 error_code=0
SET TIMESTAMP=1364191797/*!*/;
SET @@session.pseudo_thread_id=21/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=0/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 321
#130325 14:09:57 server id 1 end_log_pos 250562 Query thread_id=21 exec_time=0 error_code=0
use test/*!*/;
SET TIMESTAMP=1364191797/*!*/;
insert into qwsh values(2)
/*!*/;
# at 410
#130325 14:09:57 server id 1 end_log_pos 250589 Xid = 2423
COMMIT/*!*/;
# at 437
#130325 14:12:19 server id 1 end_log_pos 250657 Query thread_id=21 exec_time=0 error_code=0
SET TIMESTAMP=1364191939/*!*/;
BEGIN
/*!*/;
# at 505
#130325 14:12:19 server id 1 end_log_pos 250746 Query thread_id=21 exec_time=0 error_code=0
SET TIMESTAMP=1364191939/*!*/;
insert into qwsh values(3)
/*!*/;
# at 594
#130325 14:12:19 server id 1 end_log_pos 250773 Xid = 2424
COMMIT/*!*/;
# at 621
#130325 14:12:36 server id 3 end_log_pos 250938 Stop
# at 640
#130325 14:18:47 server id 1 end_log_pos 250841 Query thread_id=21 exec_time=0 error_code=0
SET TIMESTAMP=1364192327/*!*/;
BEGIN
/*!*/;
# at 708
#130325 14:18:47 server id 1 end_log_pos 250930 Query thread_id=21 exec_time=0 error_code=0
SET TIMESTAMP=1364192327/*!*/;
insert into qwsh values(4)
/*!*/;
# at 797
#130325 14:18:47 server id 1 end_log_pos 250957 Xid = 2425
COMMIT/*!*/;
# at 824
#130325 14:19:42 server id 1 end_log_pos 250976 Stop
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

Third, execute CHANGE MASTER on each slave so that it replicates from the new master.
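The difference between the two apply commands in this phase comes down to which files are passed to --apply_files and in what order. The following sketch reproduces that choice. The function apply_file_list and its keyword defaults are assumptions made for this example; the file name patterns are copied from the commands shown in this document.

```python
def apply_file_list(host, has_latest_relay_log, *, port=3306,
                    dead_master="10.0.0.13", dead_master_port=3306,
                    timestamp="20130325143805", workdir="/var/tmp"):
    """Return the files to apply on a slave, in the order they must be applied."""
    saved = (f"{workdir}/saved_master_binlog_from_"
             f"{dead_master}_{dead_master_port}_{timestamp}.binlog")
    if has_latest_relay_log:
        # Already level with the latest slave: only the dead master's tail is missing.
        return [saved]
    # Otherwise catch up to the latest slave first, then apply the master's tail.
    diff = f"{workdir}/relay_from_read_to_latest_{host}_{port}_{timestamp}.binlog"
    return [diff, saved]


print(",".join(apply_file_list("10.0.0.11", True)))
print(",".join(apply_file_list("10.0.0.75", False)))
```

The order matters: the differential relay log covers events up to 250773 and the saved master binlog starts at 250773, so applying them in reverse would replay events out of sequence.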

2.5 Phase 5: New master cleanup phase

Resetting slave info on the new master