How should I recover from a partial failure of oVirt and GlusterFS?

Question

I manage a three-node oVirt 4.3.7 cluster with a hosted engine appliance. The nodes are also GlusterFS nodes. The systems are:

ovirt1 (node at 192.168.40.193)
ovirt2 (node at 192.168.40.194)
ovirt3 (node at 192.168.40.195)
ovirt-engine (engine at 192.168.40.196)

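The services ovirt-ha-agent and ovirt-ha-broker are continually restarting on ovirt1 and ovirt3, which does not look healthy (I first noticed the problem because the logs for these services were filling up on those systems).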

All indications from the GUI console are that ovirt-engine is running on ovirt3. I tried to migrate ovirt-engine to ovirt2, but the migration failed without further explanation.

Users can create, start, and stop VMs on all three nodes without problems.

I see the following output from gluster-eventsapi status and hosted-engine --vm-status on each node:

ovirt1:

[root@ovirt1 ~]# gluster-eventsapi status
Webhooks: http://ovirt-engine.low.mdds.tcs-sec.com:80/ovirt-engine/services/glusterevents
+---------------+-------------+-----------------------+
| NODE          | NODE STATUS | GLUSTEREVENTSD STATUS |
+---------------+-------------+-----------------------+
| 192.168.5.194 | UP          | OK                    |
| 192.168.5.195 | UP          | OK                    |
| localhost     | UP          | OK                    |
+---------------+-------------+-----------------------+
[root@ovirt1 ~]# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.

ovirt2:

[root@ovirt2 ~]# gluster-eventsapi status
Webhooks: http://ovirt-engine.low.mdds.tcs-sec.com:80/ovirt-engine/services/glusterevents
+---------------+-------------+-----------------------+
| NODE          | NODE STATUS | GLUSTEREVENTSD STATUS |
+---------------+-------------+-----------------------+
| 192.168.5.195 | UP          | OK                    |
| 192.168.5.193 | UP          | OK                    |
| localhost     | UP          | OK                    |
+---------------+-------------+-----------------------+
[root@ovirt2 ~]# hosted-engine --vm-status

--== Host ovirt2.low.mdds.tcs-sec.com (id: 1) status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : ovirt2.low.mdds.tcs-sec.com
Host ID : 1
Engine status : {"reason": "vm not running on this Host", "health": "bad", "vm": "down_unexpected", "detail": "unknown"}
Score : 0
stopped : False
Local maintenance : False
crc32 : e564d06b
local_conf_timestamp : 9753700
Host timestamp : 9753700
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=9753700 (Wed Mar 25 17:45:50 2020)
    Host-id=1
    score=0
    vm_conf_refresh_time=9753700 (Wed Mar 25 17:45:50 2020)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUnexpectedlyDown
    stopped=False
    timeout=Thu Apr 23 21:29:10 1970

--== Host ovirt3.low.mdds.tcs-sec.com (id: 3) status ==--

conf_on_shared_storage : True
Status up-to-date : False
Hostname : ovirt3.low.mdds.tcs-sec.com
Host ID : 3
Engine status : unknown stale-data
Score : 3400
stopped : False
Local maintenance : False
crc32 : 620c8566
local_conf_timestamp : 1208310
Host timestamp : 1208310
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1208310 (Mon Dec 16 21:14:24 2019)
    Host-id=3
    score=3400
    vm_conf_refresh_time=1208310 (Mon Dec 16 21:14:24 2019)
    conf_on_shared_storage=True
    maintenance=False
    state=GlobalMaintenance
    stopped=False

ovirt3:

[root@ovirt3 ~]# gluster-eventsapi status
Webhooks: http://ovirt-engine.low.mdds.tcs-sec.com:80/ovirt-engine/services/glusterevents
+---------------+-------------+-----------------------+
| NODE          | NODE STATUS | GLUSTEREVENTSD STATUS |
+---------------+-------------+-----------------------+
| 192.168.5.193 | DOWN        | NOT OK: N/A           |
| 192.168.5.194 | UP          | OK                    |
| localhost     | UP          | OK                    |
+---------------+-------------+-----------------------+
[root@ovirt3 ~]# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.
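The per-host listings printed by hosted-engine --vm-status are long, and a small filter can condense them to one line per host. The following is only a sketch: the function name summarize_vm_status is invented for illustration, and it assumes the output keeps the Hostname, Score and state= lines in the order shown for ovirt2 above.

```shell
# Hypothetical helper (not part of oVirt): condense `hosted-engine --vm-status`
# output read from stdin into one line per host.
summarize_vm_status() {
    awk '
        /^Hostname/    { host = $NF }     # last field is the host name
        /^Score/       { score = $NF }    # last field is the HA score
        /^[ \t]*state=/ {
            # strip the leading whitespace and the "state=" prefix
            sub(/^[ \t]*state=/, "")
            print host, "score=" score, "state=" $0
        }
    '
}
```

It would be used as `hosted-engine --vm-status | summarize_vm_status` on a host where the command returns the full listing (ovirt2 in this case).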

The steps I have taken so far:

Noted that the logs for the ovirt-ha-agent and ovirt-ha-broker services were not rotating properly on nodes ovirt1 and ovirt3. The logs show the same failure on both nodes. broker.log repeats this statement frequently:

MainThread::WARNING::2020-03-25 18:03:28,846::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: [Errno 5] Input/output error: '/rhev/data-center/mnt/glusterSD/ovirt2:_engine/182a4a94-743f-4941-89c1-dc2008ae1cf5/ha_agent/hosted-engine.lockspace'

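Saw that the RHEV documentation suggests running hosted-engine --vm-status to understand the problem. Its output (above) suggests that ovirt1 is not fully part of the cluster.
Asked on the oVirt forum yesterday, but since I am new there, my question needs moderator review, which has not happened yet. (If this cluster's users were not all suddenly working from home and suddenly dependent on it, I would not worry about waiting a few days.)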
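To see how often the broker warning quoted in the first step recurs, and when it last appeared, something like the following could be used. This is a sketch under assumptions: the function name is invented, and it relies on the `::`-separated log format of the line quoted above, with the timestamp in the third field.

```shell
# Hypothetical helper: count the "Can't connect vdsm storage" warnings in a
# broker log and report the timestamp of the most recent one.
storage_error_summary() {
    log="$1"
    pattern="Can't connect vdsm storage"
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    count=$(grep -c "$pattern" "$log" || true)
    # fields are separated by "::"; the third field is the timestamp
    last=$(grep "$pattern" "$log" | tail -n 1 | awk -F'::' '{ print $3 }')
    echo "occurrences=$count last=$last"
}
```

For example, `storage_error_summary /var/log/ovirt-hosted-engine-ha/broker.log`, assuming the log lives in that usual default location. Running it again after a change shows whether the warning is still being logged.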

How should I recover from this situation? (I think I need to recover something in the GlusterFS cluster first, but I cannot find any hints, or I lack the vocabulary to form the right query.)

Update: after restarting glusterd on ovirt3, the GlusterFS cluster looks healthy, but the behavior of the oVirt services has not changed.
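One way to double-check the storage from each host's point of view is to combine the standard gluster status commands with a direct read of the lockspace file named in the broker warning. The sketch below makes two assumptions: the helper names are invented, and the volume is taken to be called engine, which is inferred from the ovirt2:_engine mount path rather than stated anywhere in the question.

```shell
# Hypothetical helper: try to read the first block of a file and report
# whether the read succeeded.
check_readable() {
    if dd if="$1" of=/dev/null bs=4096 count=1 2>/dev/null; then
        echo "OK: $1"
    else
        echo "FAILED: $1"
        return 1
    fi
}

# Path copied from the broker.log warning in the question.
LOCKSPACE='/rhev/data-center/mnt/glusterSD/ovirt2:_engine/182a4a94-743f-4941-89c1-dc2008ae1cf5/ha_agent/hosted-engine.lockspace'

# Hypothetical wrapper, meant to be run by hand on each node.
# ASSUMPTION: the gluster volume is named "engine".
storage_checks() {
    gluster peer status
    gluster volume status engine
    gluster volume heal engine info
    check_readable "$LOCKSPACE"
}
```

Calling `storage_checks` on ovirt1 and ovirt3 would show whether the Input/output error can be reproduced outside the HA broker.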

Randall · Answer

The steps needed to recover from the situation above came down to running the following on ovirt3:

hosted-engine --vm-shutdown
hosted-engine --reinitialize-lockspace
hosted-engine --vm-start

This brought ovirt-engine up on ovirt2. I then restarted the ovirt-ha-broker.service and ovirt-ha-agent.service services on ovirt3.
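The whole sequence can be wrapped in a small script with a dry-run switch, so the commands can be reviewed before anything is executed. This is a sketch rather than a tested procedure: the run and recover_hosted_engine names and the DRY_RUN variable are invented here, and the script does not wait for the engine VM to finish shutting down, so pauses between steps may be needed in practice.

```shell
# Hypothetical wrapper: print each command, and execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# The recovery steps from this answer, in order, stopping at the first failure.
recover_hosted_engine() {
    run hosted-engine --vm-shutdown &&
    run hosted-engine --reinitialize-lockspace &&
    run hosted-engine --vm-start &&
    run systemctl restart ovirt-ha-broker.service ovirt-ha-agent.service
}
```

`DRY_RUN=1 recover_hosted_engine` only prints the four commands; `recover_hosted_engine` on its own runs them on the current host (ovirt3 in this case).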