Pacemaker failure-timeout does not reset the fail count

Question

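I am using Pacemaker 1.1.13 and Corosync 2.3.4 on CentOS 7.

I have a problem with a master/slave resource. My resource has these meta attributes:

migration-threshold=1

failure-timeout=10s

However, when the resource goes down, only one attempt is made to start it. According to the documentation, the attribute failure-timeout=10s should reset the failcount every 10 seconds, but that does not happen, so the resource is never started again.

Do you know anything about this problem? Maybe I am doing something wrong? Below is my "pcs status" output.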

Cluster Name: webcluster
Corosync Nodes:
 10.121.100.101 10.121.100.102
Pacemaker Nodes:
 pm-node1 pm-node2
Resources:
 Master: Services-master
  Meta Attrs: failure-timeout=10s
  Group: Services
   Meta Attrs: migration-threshold=1
   Resource: Test (class=ocf provider=scooty type=test)
    Operations: start interval=0s timeout=20 (Test-start-interval-0s)
                stop interval=0s timeout=20 (Test-stop-interval-0s)
                monitor interval=10 role=Master timeout=20 (Test-monitor-interval-10)
                monitor interval=11 role=Slave timeout=20 (Test-monitor-interval-11)
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Resources Defaults:
 migration-threshold: 1
 failure-timeout: 10
Operations Defaults:
 No defaults set
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: webcluster
 dc-version: 1.1.13-10.el7_2.4-44eb2dd
 have-watchdog: false
 last-lrm-refresh: 1475145002
 no-quorum-policy: ignore
 start-failure-is-fatal: false
 stonith-enabled: false

Matt Kereczman · Accepted Answer

Depending on the type of failure, failure-timeout may not be enough to clean it up. Failures of start and stop operations are considered "fatal" and are not cleaned up automatically by failure-timeout.

If you are running into failed start operations, you can set the cluster property start-failure-is-fatal=false. For failed stop operations, a fencing/STONITH device is the only way to recover.
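As a rough sketch of what that could look like with pcs: the commands below set the cluster property, show the current fail count, and clear the failure history by hand. The resource name Test comes from the configuration in the question; the run wrapper, the DRY_RUN and RESOURCE variables, and the function names are only illustrative helpers and are not part of pcs. By default the script only prints the commands, so nothing changes until DRY_RUN=0 is set on a cluster node. Note that the posted configuration already lists start-failure-is-fatal: false, so the first command would change nothing there.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a cluster node (as root) to really run them.
DRY_RUN="${DRY_RUN:-1}"

# Resource name taken from the question; override with RESOURCE=... if needed.
RESOURCE="${RESOURCE:-Test}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Treat a failed start as an ordinary failure rather than a fatal one.
allow_start_retries() {
    run pcs property set start-failure-is-fatal=false
}

# Show the current fail count recorded for the resource.
show_failcount() {
    run pcs resource failcount show "$RESOURCE"
}

# Clear the failure history manually instead of waiting for failure-timeout.
clear_failures() {
    run pcs resource cleanup "$RESOURCE"
}

allow_start_retries
show_failcount
clear_failures
```

Running pcs resource cleanup by hand is a quick way to check whether a stale fail count is what keeps the resource from starting.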

I hope this helps.