How can I automatically restart OpenSolaris/OmniOS services/daemons that log notices, but no critical errors, to syslog?

Question

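The following problem can be seen either as a CIFS/AD-related problem (the specific view) or as a question about service restarts, error handling, and log parsing (the general view). I present both areas here, but I would be glad to get an answer to either one (just skip the part that does not interest you).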

Specific situation: idmap does not periodically rescan for domain controllers

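A Windows Server 2008-compatible Active Directory usually has multiple domain controllers for high availability. If all of these servers are unavailable at the same time and an OmniOS (r151018) file server with an active kernel SMB/CIFS server (successfully joined to the domain and working as expected) boots, the following happens: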

The idmap service tries to reach a DC for 60 seconds and then gives up...

    root@omnios:/root# tail -n 20 /var/svc/log/system-idmap:default.log
    @ Tue Sep 6 10:19:42 2016
    Global Catalog servers not configured/discoverable
    Domain controller servers not configured/discoverable
    created thread ID 3 - 1 threads currently active
    [ Sep 6 10:19:42 Method "start" exited with status 0. ]
    @ Tue Sep 6 10:19:53 2016
    created thread ID 4 - 2 threads currently active
    getdcname wait begin
    @ Tue Sep 6 10:19:57 2016
    DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
    @ Tue Sep 6 10:20:08 2016
    getdcname timeout
    @ Tue Sep 6 10:20:12 2016
    DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
    @ Tue Sep 6 10:20:27 2016
    DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
    @ Tue Sep 6 10:20:42 2016
    DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
    Domain discovery took 60 sec. Check the DNS configuration.

...but there is no critical failure:

    root@omnios:/root# svcs -xv idmap
    svc:/system/idmap:default (Native Identity Mapping Service)
     State: online since Tue Sep 6 10:19:42 2016
       See: man -M /usr/share/man -s 1M idmapd
       See: man -M /usr/share/man -s 1M idmap
       See: /var/svc/log/system-idmap:default.log
    Impact: None.

After that, smbd complains in syslog every minute that it cannot find a DC:

    smbd[525]: [ID 510351 daemon.notice] smb_locate_dc status 0xc0000233
    smbd[525]: [ID 199031 daemon.notice] smbd_dc_update: testdomain.intranet: locate failed

This persists even after the DCs are back online and reachable. Restarting idmap with svcadm restart idmap fixes it immediately. Of course, this should not have to be done by hand, because such outages can happen unplanned.

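How can I script the idmap restart so that it happens automatically on these events? I tried to use SMF, but that seems to work only for crashed services, whereas idmap reports no problem at all (and smbd reports only a notice). Another possibility would be to watch the log file constantly and grep for the event, but that seems inefficient to me. I also tried lowering config/rediscovery_interval to 60 seconds, but the value seems to be ignored (or does not apply here).
Alternatively, what would be a solution that removes the problem itself? Unfortunately, I found nothing usable apart from posts confirming that a full reboot fixes the problem (because idmap is restarted then as well).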

Edit: Output of svccfg -s idmap listprop. The only change I made was to config/rediscovery_interval (default 3600); the IDs were removed manually afterwards.

    config                          application
    config/id_cache_timeout         count     86400
    config/list_size_limit          count     0
    config/name_cache_timeout       count     604800
    config/preferred_dc             astring
    config/stability                astring   Unstable
    config/use_ads                  boolean   true
    config/use_lsa                  boolean   true
    config/value_authorization      astring   solaris.smf.value.idmap
    config/machine_uuid             astring   [...]
    config/machine_sid              astring   [...]
    config/rediscovery_interval     count     60
    config/domain_name              astring   testdomain.intranet
    debug                           application
    debug/all                       integer   0
    debug/config                    integer   0
    debug/discovery                 integer   0
    debug/dns                       integer   0
    debug/ldap                      integer   0
    debug/mapping                   integer   0
    debug/stability                 astring   Unstable
    debug/value_authorization       astring   solaris.smf.value.idmap
    rpcbind                         dependency
    rpcbind/entities                fmri      svc:/network/rpc/bind
    rpcbind/grouping                astring   require_all
    rpcbind/restart_on              astring   restart
    rpcbind/type                    astring   service
    filesystem-minimal              dependency
    filesystem-minimal/entities     fmri      svc:/system/filesystem/minimal
    filesystem-minimal/grouping     astring   require_all
    filesystem-minimal/restart_on   astring   error
    filesystem-minimal/type         astring   service
    manifestfiles                   framework
    manifestfiles/lib_svc_manifest_system_idmap_xml astring /lib/svc/manifest/system/idmap.xml
    general                         framework
    general/action_authorization    astring   solaris.smf.manage.idmap
    general/entity_stability        astring   Unstable
    general/single_instance         boolean   true
    general/value_authorization     astring   solaris.smf.manage.idmap
    start                           method
    start/exec                      astring   /usr/lib/idmapd
    start/timeout_seconds           count     60
    start/type                      astring   method
    stop                            method
    stop/exec                       astring   :kill
    stop/timeout_seconds            count     60
    stop/type                       astring   method
    refresh                         method
    refresh/exec                    astring   ":kill -HUP"
    refresh/timeout_seconds         count     60
    refresh/type                    astring   method
    tm_common_name                  template
    tm_common_name/C                ustring   "Native Identity Mapping Service"
    tm_man_idmapd1M                 template
    tm_man_idmapd1M/manpath         astring   /usr/share/man
    tm_man_idmapd1M/section         astring   1M
    tm_man_idmapd1M/title           astring   idmapd
    tm_man_idmap1M                  template
    tm_man_idmap1M/manpath          astring   /usr/share/man
    tm_man_idmap1M/section          astring   1M
    tm_man_idmap1M/title            astring   idmap

General problem: how can I react efficiently to syslog messages when the process itself appears to be running fine?

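This problem can be generalized to the question of how to monitor log files on Solaris in the most efficient way. I searched and found several tools, such as swatch, logsurfer, logwatcher, or a plain old cron job that runs every minute and is hooked up to a simple script reading the dmesg output.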

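Are these the only possible ways to do this, or are there better solutions?
- Is there a way to tell SMF that it should act on specific notices about some process, even if no critical state has occurred?
- I came across FMA, the Fault Manager, but it seems to work only for critical situations and not for mere notices (or user-specified strings). Is that correct?
If that is the only way, what would you suggest using, and why?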

Jim Klimov · Answer

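Thank you for describing this problem in such detail in your question. I hit it recently as well, except that in my case the MSAD controller is a VM running on the (post-)OpenSolaris host, and as Win/AD support dwindles at that site, it is the only surviving replica. So early in boot (and if idmap is part of the dependency tree leading to multi-user-server), the VM is not running yet, idmap fails to connect just as you posted, and smbd complains. I think that is still better than a server that cannot boot at all because it had problems connecting to AD.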

As a direct answer to your question, I think there are other logging daemons :) For example, rsyslog can trigger a command when a logged message matches a pattern you have configured.

Since the SMF svc logs are not really sent to syslog, I think you would have to write an ad hoc script for this situation that checks the specific log and restarts the service.
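As a rough illustration of such an ad hoc script, here is a minimal sketch meant to be run periodically (for example from cron). It looks only at lines appended to a log since the previous run and restarts idmap if any of them match a pattern. Everything in it is an assumption for illustration: the function names, the state file path, and the choice of /var/adm/messages with the smbd "locate failed" notice as the trigger. The log file and pattern can be pointed at the SMF service log instead.

```shell
#!/bin/sh
# Sketch only: paths, pattern and names below are illustrative assumptions.
LOGFILE="${LOGFILE:-/var/adm/messages}"
PATTERN="${PATTERN:-smbd_dc_update: .*locate failed}"
STATEFILE="${STATEFILE:-/var/run/idmap-kick.lines}"
RESTART_CMD="${RESTART_CMD:-svcadm restart idmap}"

# Print the lines appended to log $1 since the line count saved in state
# file $2, then save the current count. If the log shrank (rotation),
# start again from the top.
new_lines() {
    log=$1
    state=$2
    total=$(grep -c '' "$log")
    seen=0
    if [ -r "$state" ]; then
        seen=$(cat "$state")
    fi
    if [ "$seen" -gt "$total" ]; then
        seen=0
    fi
    echo "$total" > "$state"
    if [ "$seen" -eq "$total" ]; then
        return 0
    fi
    sed -n "$((seen + 1)),${total}p" "$log"
}

# Restart the service if any new log line matches the pattern.
check_and_restart() {
    if new_lines "$LOGFILE" "$STATEFILE" | grep "$PATTERN" >/dev/null; then
        $RESTART_CMD
    fi
}
```

A cron entry would then source this file and call check_and_restart every few minutes. Because smbd stops complaining once idmap has rediscovered a DC, the restarts stop on their own; while the DCs are still down, the script simply restarts idmap again on each run.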

You could also run a delayed action (for example, an init script that does echo "svcadm restart idmap" | at now + 10min) so that idmap always gets kicked some time after boot. I think this is what I will do here.
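A minimal sketch of that delayed kick, wrapped in a function that an init script could call. The function name, the AT_CMD override, and the default delay are assumptions for illustration; the timespec uses the "now + N minutes" form of at(1).

```shell
#!/bin/sh
# Sketch only: names and defaults are illustrative assumptions.
AT_CMD="${AT_CMD:-at}"

# Queue a one-off "svcadm restart idmap" N minutes from now via at(1).
queue_idmap_restart() {
    minutes=${1:-10}
    echo "svcadm restart idmap" | $AT_CMD now + "$minutes" minutes
}
```

Calling queue_idmap_restart 10 from a late boot script would schedule the restart ten minutes after that script runs. The delay is a guess about how long the DC needs to come up, which is the weakness of this approach.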

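Finally, I could add an action to the VM startup script that kicks idmap once the VM has come up (so that it does not rely purely on hard-coded timing). Because idmap is needed very early in boot, this would not be part of the SMF dependency tree, or at least not a strict dependency, but rather something like restart_on.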
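One way to avoid hard-coded timing without touching the VM scripts is to poll until the DC is discoverable and only then restart idmap. This is a swapped-in technique, not the VM startup hook described above. The sketch below probes the same DNS SRV name that appears in the idmap log; it assumes dig is installed, and the function names and the PROBE_CMD and RESTART_CMD overrides are hypothetical.

```shell
#!/bin/sh
# Sketch only: names below are illustrative assumptions; requires dig.
PROBE_CMD="${PROBE_CMD:-dc_srv_resolves}"
RESTART_CMD="${RESTART_CMD:-svcadm restart idmap}"

# Succeed if the DC locator SRV record for domain $1 resolves to anything.
dc_srv_resolves() {
    dig +short -t SRV "_ldap._tcp.dc._msdcs.$1" | grep . >/dev/null
}

# Probe up to $2 times, sleeping $3 seconds between attempts. Restart the
# service on the first successful probe; return 1 if none succeeds.
wait_then_restart() {
    domain=$1
    tries=$2
    pause=$3
    i=0
    while [ "$i" -lt "$tries" ]; do
        if $PROBE_CMD "$domain"; then
            $RESTART_CMD
            return 0
        fi
        i=$((i + 1))
        sleep "$pause"
    done
    return 1
}
```

For example, wait_then_restart testdomain.intranet 60 30 would probe every 30 seconds for up to half an hour. A successful SRV lookup only shows that DNS answers, not that the DC itself accepts connections, so treat it as an approximation.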