web-dev-qa-db-ja.com

ノードがgaleraクラスターに参加できません。状態転送中に失敗する

ノードがクラスターに参加しているように見えて失敗します... rsyncとxtrabackupの両方を試してみましたが、状態の転送中に失敗しました。本当にシンプルなものが足りないので指をかけられないようです。どうぞよろしくお願いいたします。

ノードに関する詳細情報

マスター-10.XXX.XXX.161 node1-10.XXX.XXX.160

インストールされたパッケージ:MariaDB-compat MariaDB-common MariaDB-devel MariaDB-shared MariaDB-client MariaDB-test MariaDB-Galera-server(v5.5.29-1)galera(v23.2.4-1.rhel6)percona-xtrabackup(v2.1.6 -702.rhel6)

ノード1の構成

[mysqld]
wsrep_cluster_address = gcomm://10.XXX.XXX.161
wsrep_provider = /usr/lib64/galera/libgalera_smm.so
wsrep_provider_options = gcache.size=4G; gcache.page_size=1G
wsrep_cluster_name = galera_cluster
default_storage_engine = InnoDB
innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = 1
wsrep_sst_method = xtrabackup
wsrep_sst_auth = root:rootpassword
wsrep_node_name=1

マスターの設定

[mysqld]
wsrep_cluster_address = gcomm://
wsrep_provider = /usr/lib64/galera/libgalera_smm.so
wsrep_provider_options = gcache.size=4G; gcache.page_size=1G
wsrep_cluster_name = galera_cluster
default_storage_engine = InnoDB
innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = 1
wsrep_sst_method = rsync
wsrep_slave_threads = 4
wsrep_sst_auth = root:rootpassword
wsrep_node_name = 2

node1ログファイル

131203 16:09:03 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
131203 16:09:03 mysqld_safe WSREP: Running position recovery with --log_error=/tmp/tmp.f2EedjRjbQ
131203 16:09:08 mysqld_safe WSREP: Recovered position 359350ee-5c63-11e3-0800-6673d15135cd:2188
131203 16:09:08 [Note] WSREP: wsrep_start_position var submitted: '359350ee-5c63-11e3-0800-6673d15135cd:2188'
131203 16:09:08 [Note] WSREP: Read nil XID from storage engines, skipping position init
131203 16:09:08 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
131203 16:09:08 [Note] WSREP: wsrep_load(): Galera 23.2.4(r147) by Codership Oy <[email protected]]]> loaded succesfully.
131203 16:09:08 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
131203 16:09:08 [Note] WSREP: Reusing existing '/var/lib/mysql//galera.cache'.
131203 16:09:08 [Note] WSREP: Passing config to GCS: base_Host = 10.XXX.XXX.160; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 1G; gcache.size = 4G; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
131203 16:09:08 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
131203 16:09:08 [Note] WSREP: wsrep_sst_grab()
131203 16:09:08 [Note] WSREP: Start replication
131203 16:09:08 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
131203 16:09:08 [Note] WSREP: protonet asio version 0
131203 16:09:08 [Note] WSREP: backend: asio
131203 16:09:08 [Note] WSREP: GMCast version 0
131203 16:09:08 [Note] WSREP: (8814b4ba-5c67-11e3-0800-91035d554a96, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
131203 16:09:08 [Note] WSREP: (8814b4ba-5c67-11e3-0800-91035d554a96, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
131203 16:09:08 [Note] WSREP: EVS version 0
131203 16:09:08 [Note] WSREP: PC version 0
131203 16:09:08 [Note] WSREP: gcomm: connecting to group 'galera_cluster', peer '10.XXX.XXX.161:'
131203 16:09:09 [Note] WSREP: declaring 7a9a87e8-5c67-11e3-0800-8cb6cba8f65a stable
131203 16:09:09 [Note] WSREP: Node 7a9a87e8-5c67-11e3-0800-8cb6cba8f65a state prim
131203 16:09:09 [Note] WSREP: view(view_id(PRIM,7a9a87e8-5c67-11e3-0800-8cb6cba8f65a,2) memb {
     7a9a87e8-5c67-11e3-0800-8cb6cba8f65a,
     8814b4ba-5c67-11e3-0800-91035d554a96,
} joined {
} left {
} partitioned {
})
131203 16:09:09 [Note] WSREP: gcomm: connected
131203 16:09:09 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
131203 16:09:09 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
131203 16:09:09 [Note] WSREP: Opened channel 'galera_cluster'
131203 16:09:09 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
131203 16:09:09 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
131203 16:09:09 [Note] WSREP: Waiting for SST to complete.
131203 16:09:09 [Note] WSREP: STATE EXCHANGE: sent state msg: 8861cdd5-5c67-11e3-0800-cc70fcc5f515
131203 16:09:09 [Note] WSREP: STATE EXCHANGE: got state msg: 8861cdd5-5c67-11e3-0800-cc70fcc5f515 from 0 (2)
131203 16:09:09 [Note] WSREP: STATE EXCHANGE: got state msg: 8861cdd5-5c67-11e3-0800-cc70fcc5f515 from 1 (1)
131203 16:09:09 [Note] WSREP: Quorum results:
     version    = 2,
     component  = PRIMARY,
     conf_id    = 1,
     members    = 1/2 (joined/total),
     act_id     = 2521,
     last_appl. = -1,
     protocols  = 0/4/2 (gcs/repl/appl),
     group UUID = 359350ee-5c63-11e3-0800-6673d15135cd
131203 16:09:09 [Note] WSREP: Flow-control interval: [23, 23]
131203 16:09:09 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 2521)
131203 16:09:09 [Note] WSREP: State transfer required:
     Group state: 359350ee-5c63-11e3-0800-6673d15135cd:2521
     Local state: 00000000-0000-0000-0000-000000000000:-1
131203 16:09:09 [Note] WSREP: New cluster view: global state: 359350ee-5c63-11e3-0800-6673d15135cd:2521, view# 2: Primary, number of nodes: 2, my index: 1, protocol version 2
131203 16:09:09 [Warning] WSREP: Gap in state sequence. Need state transfer.
131203 16:09:11 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'joiner' --address '10.XXX.XXX.160' --auth 'root:rootpassword' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '13175''
131203 16:09:11 [Note] WSREP: Prepared SST request: xtrabackup|10.162.143.160:4444/xtrabackup_sst
131203 16:09:11 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131203 16:09:11 [Note] WSREP: Assign initial position for certification: 2521, protocol version: 2
131203 16:09:11 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (359350ee-5c63-11e3-0800-6673d15135cd): 1 (Operation not permitted)
      at galera/src/replicator_str.cpp:prepare_for_IST():442. IST will be unavailable.
131203 16:09:11 [Note] WSREP: Node 1 (1) requested state transfer from '*any*'. Selected 0 (2)(SYNCED) as donor.
131203 16:09:11 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 2525)
131203 16:09:11 [Note] WSREP: Requesting state transfer: success, donor: 0
tar: dbexport/db.opt: Cannot open: Permission denied
tar: Exiting with failure status due to previous errors
131203 16:10:22 [Note] WSREP: 0 (2): State transfer to 1 (1) complete.
131203 16:10:22 [Note] WSREP: Member 0 (2) synced with group.
WSREP_SST: [ERROR] Error while getting st data from donor node:  0, 2 (20131203 16:10:22.379)
131203 16:10:22 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup --role 'joiner' --address '10.XXX.XXX.160' --auth 'root:rootpassword' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '13175': 32 (Broken pipe)
131203 16:10:22 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
131203 16:10:22 [ERROR] WSREP: SST failed: 32 (Broken pipe)
131203 16:10:22 [ERROR] Aborting
131203 16:10:24 [Note] WSREP: Closing send monitor...
131203 16:10:24 [Note] WSREP: Closed send monitor.
131203 16:10:24 [Note] WSREP: gcomm: terminating thread
131203 16:10:24 [Note] WSREP: gcomm: joining thread
131203 16:10:24 [Note] WSREP: gcomm: closing backend
131203 16:10:25 [Note] WSREP: view(view_id(NON_PRIM,7a9a87e8-5c67-11e3-0800-8cb6cba8f65a,2) memb {
     8814b4ba-5c67-11e3-0800-91035d554a96,
} joined {
} left {
} partitioned {
     7a9a87e8-5c67-11e3-0800-8cb6cba8f65a,
})
131203 16:10:25 [Note] WSREP: view((empty))
131203 16:10:25 [Note] WSREP: gcomm: closed
131203 16:10:25 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
131203 16:10:25 [Note] WSREP: Flow-control interval: [16, 16]
131203 16:10:25 [Note] WSREP: Received NON-PRIMARY.
131203 16:10:25 [Note] WSREP: Shifting JOINER -> OPEN (TO: 2607)
131203 16:10:25 [Note] WSREP: Received self-leave message.
131203 16:10:25 [Note] WSREP: Flow-control interval: [0, 0]
131203 16:10:25 [Note] WSREP: Received SELF-LEAVE. Closing connection.
131203 16:10:25 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 2607)
131203 16:10:25 [Note] WSREP: RECV thread exiting 0: Success
131203 16:10:25 [Note] WSREP: recv_thread() joined.
131203 16:10:25 [Note] WSREP: Closing slave action queue.
131203 16:10:25 [Note] WSREP: Service disconnected.
131203 16:10:25 [Note] WSREP: rollbacker thread exiting
131203 16:10:26 [Note] WSREP: Some threads may fail to exit.
131203 16:10:26 [Note] /usr/sbin/mysqld: Shutdown complete
Error in my_thread_global_end(): 2 threads didn't exit
131203 16:10:31 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

マスターログファイル

131203 16:08:47 [Warning] Recovery from master pos 103358630 and file mysql-bin.000131.
131203 16:08:47 [Note] Event Scheduler: Loaded 0 events
131203 16:08:47 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131203 16:08:47 [Note] WSREP: Assign initial position for certification: 2497, protocol version: 2
131203 16:08:47 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.29-MariaDB-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server, wsrep_23.7.3.rXXXX
131203 16:08:47 [Note] WSREP: Synchronized with group, ready for connections
131203 16:08:47 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131203 16:09:09 [Note] WSREP: declaring 8814b4ba-5c67-11e3-0800-91035d554a96 stable
131203 16:09:09 [Note] WSREP: Node 7a9a87e8-5c67-11e3-0800-8cb6cba8f65a state prim
131203 16:09:09 [Note] WSREP: view(view_id(PRIM,7a9a87e8-5c67-11e3-0800-8cb6cba8f65a,2) memb {
     7a9a87e8-5c67-11e3-0800-8cb6cba8f65a,
     8814b4ba-5c67-11e3-0800-91035d554a96,
} joined {
} left {
} partitioned {
})
131203 16:09:09 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
131203 16:09:09 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 8861cdd5-5c67-11e3-0800-cc70fcc5f515
131203 16:09:09 [Note] WSREP: STATE EXCHANGE: sent state msg: 8861cdd5-5c67-11e3-0800-cc70fcc5f515
131203 16:09:09 [Note] WSREP: STATE EXCHANGE: got state msg: 8861cdd5-5c67-11e3-0800-cc70fcc5f515 from 0 (2)
131203 16:09:09 [Note] WSREP: STATE EXCHANGE: got state msg: 8861cdd5-5c67-11e3-0800-cc70fcc5f515 from 1 (1)
131203 16:09:09 [Note] WSREP: Quorum results:
     version    = 2,
     component  = PRIMARY,
     conf_id    = 1,
     members    = 1/2 (joined/total),
     act_id     = 2521,
     last_appl. = 2517,
     protocols  = 0/4/2 (gcs/repl/appl),
     group UUID = 359350ee-5c63-11e3-0800-6673d15135cd
131203 16:09:09 [Note] WSREP: Flow-control interval: [23, 23]
131203 16:09:09 [Note] WSREP: New cluster view: global state: 359350ee-5c63-11e3-0800-6673d15135cd:2521, view# 2: Primary, number of nodes: 2, my index: 0, protocol version 2
131203 16:09:09 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131203 16:09:09 [Note] WSREP: Assign initial position for certification: 2521, protocol version: 2
131203 16:09:11 [Note] WSREP: Node 1 (1) requested state transfer from '*any*'. Selected 0 (2)(SYNCED) as donor.
131203 16:09:11 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 2525)
131203 16:09:11 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131203 16:09:11 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'donor' --address '10.XXX.XXX.160:4444/xtrabackup_sst' --auth 'root:rootpassword' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --gtid '359350ee-5c63-11e3-0800-6673d15135cd:2525''
131203 16:09:11 [Note] WSREP: sst_donor_thread signaled with 0
131203 16:10:20 [Note] WSREP: Provider paused at 359350ee-5c63-11e3-0800-6673d15135cd:2604
131203 16:10:22 [Note] WSREP: Provider resumed.
131203 16:10:22 [Note] WSREP: 0 (2): State transfer to 1 (1) complete.
131203 16:10:22 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 2606)
131203 16:10:22 [Note] WSREP: Member 0 (2) synced with group.
131203 16:10:22 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 2606)
131203 16:10:22 [Note] WSREP: Synchronized with group, ready for connections
131203 16:10:22 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131203 16:10:25 [Note] WSREP: Node 7a9a87e8-5c67-11e3-0800-8cb6cba8f65a state prim
131203 16:10:25 [Note] WSREP: view(view_id(PRIM,7a9a87e8-5c67-11e3-0800-8cb6cba8f65a,3) memb {
     7a9a87e8-5c67-11e3-0800-8cb6cba8f65a,
} joined {
} left {
} partitioned {
     8814b4ba-5c67-11e3-0800-91035d554a96,
})
131203 16:10:25 [Note] WSREP: forgetting 8814b4ba-5c67-11e3-0800-91035d554a96 (tcp://10.XXX.XXX.160:4567)
131203 16:10:25 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
131203 16:10:25 [Note] WSREP: STATE_EXCHANGE: sent state UUID: b5dda52e-5c67-11e3-0800-4b2360dd84f9
131203 16:10:25 [Note] WSREP: STATE EXCHANGE: sent state msg: b5dda52e-5c67-11e3-0800-4b2360dd84f9
131203 16:10:25 [Note] WSREP: STATE EXCHANGE: got state msg: b5dda52e-5c67-11e3-0800-4b2360dd84f9 from 0 (2)
131203 16:10:25 [Note] WSREP: Quorum results:
     version    = 2,
     component  = PRIMARY,
     conf_id    = 2,
     members    = 1/1 (joined/total),
     act_id     = 2607,
     last_appl. = 2597,
     protocols  = 0/4/2 (gcs/repl/appl),
     group UUID = 359350ee-5c63-11e3-0800-6673d15135cd
131203 16:10:25 [Note] WSREP: Flow-control interval: [16, 16]
131203 16:10:25 [Note] WSREP: New cluster view: global state: 359350ee-5c63-11e3-0800-6673d15135cd:2607, view# 3: Primary, number of nodes: 1, my index: 0, protocol version 2
131203 16:10:25 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131203 16:10:25 [Note] WSREP: Assign initial position for certification: 2607, protocol version: 2
131203 16:10:30 [Note] WSREP:  cleaning up 8814b4ba-5c67-11e3-0800-91035d554a96 (tcp://10..XXX.XXX.160:4567)
2
nimesh

ここでも同様の問題がありました。ノード2と3はNode 1でパーティション分割され、それらを起動しようとしたときにSSTが失敗しました(大量のDB更新があったため、300M gcacheはすでにロールオーバーされていましたおよびISTは不可能でした)。

解決策はノード1を再起動することで、これによりWSREPノードがクリアされました。次に、(非常に長い)SSTを実行して問題なく起動したノード2を起動します。次に、同様の方法でノード3を起動し、3つすべてを起動してメンバーにします。

ただし、このソリューションは、メインノードが再起動されると、サービスが短時間(1分未満)停止することを意味し、許容できない場合があります。

再起動せずに、残っているノードでダウンしたノードの「パーティション化された」ステータスを一掃する方法をまだ見つけていません。

あなたと同じように、WSREPにはPercona Xtrabackupを使用しています。私は問題がデータベース接続を必要とするこれにあると信じています。 rsyncを使用している場合、SSTは期待どおりに完了することがあります。

1
Steve Shipway

Galeraクラスターはクォーラムベースのシステムであり、クォーラムアルゴリズムがプライマリコンポーネントの選択に失敗すると、Galeraクラスターはスプリットブレイン条件の影響を受けます。これは、たとえば、メインスイッチに障害が発生した場合に、バックアップスイッチのないクラスタで発生する可能性があります。ただし、最も可能性の高いスプリットブレイン状況は、2ノードクラスタで単一ノードに障害が発生した場合です。したがって、Galeraクラスターの最小構成は3ノードであることを強くお勧めします。以下の状態転送セクションでは、3が最小推奨ノード数である理由をもう1つ検討します。

だから、あなたが持っていることを確認してください

wsrep_cluster_address = gcomm://192.168.1.1,192.168.1.2,192.168.1.3 #where node1 = 192.168.1.1およびnode2 = 192.168.1.2など.

すべてのノードで、この変更後にすべてのノードを1つずつ再起動します

典型的なmy.cnf Galeraクラスターの場合、次のようになります。 http://www.codership.com/wiki/doku.php?id=mysql_galera_configuration

この情報がお役に立てば幸いです

1
Mahesh Patil