How do admins generalize alerting when events don't occur?

Question

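Users often ask me to be responsible for knowing when an event does not occur.

I have always had to build custom, brittle solutions using cron'd shell scripts and lots of date edge-case testing.

Centralized logging should allow for a better, more maintainable way to get a handle on what did not happen in the last N hours. Something like logstash notification and nagios alerting.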

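Update

toppledwagon's answer was incredibly helpful. o O (Light. Bulb.) I now have a dozen batch jobs being checked for freshness. I wanted to do his thorough answer justice and follow up with how I implemented his ideas.

I configured jenkins to emit syslog messages; logstash catches them and sends status updates to nagios via nsca. I also use check_mk to keep everything DRY and organized in nagios.

Logstash Filter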

    filter {
      if [type] == "syslog" {
        grok {
          match => [
            "message", '%{SYSLOGBASE} job="%{DATA:job}"(?: repo="%{DATA:repo}")?$',
            "message", "%{SYSLOGLINE}"
          ]
          break_on_match => true
        }
        date {
          match => [ "timestamp", "MMM d HH:mm:ss", "MMM dd HH:mm:ss" ]
        }
      }
    }

The magic is in the two pattern pairs in grok's match parameter combined with break_on_match => true. Logstash tries each pattern in turn until one of them matches.
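The first-match-wins behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not how grok is implemented: the two regexes are simplified stand-ins for the SYSLOGBASE and SYSLOGLINE patterns, and `first_match` is a hypothetical helper name.

```python
import re

# Simplified stand-ins for the two grok patterns. These regexes are
# assumptions for illustration, not the real SYSLOGBASE/SYSLOGLINE definitions.
PATTERNS = [
    # Specific pattern: program, then job="..." and an optional repo="..."
    re.compile(r'^(?P<program>\w+): job="(?P<job>[^"]*)"(?: repo="(?P<repo>[^"]*)")?$'),
    # Generic fallback: program, then any message text
    re.compile(r'^(?P<program>\w+): (?P<message>.*)$'),
]


def first_match(line):
    """Try each pattern in order and stop at the first one that matches."""
    for pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            # Drop groups that did not participate (for example a missing repo)
            return {k: v for k, v in match.groupdict().items() if v is not None}
    return None  # roughly the case grok would tag as _grokparsefailure
```

Listing the specific pattern first matters: a line with job="..." gets its job and repo fields extracted, while every other syslog line still parses through the generic fallback.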

Logstash Output

We use the logstash nagios_nsca output plugin to let icinga know that we saw the jenkins job in syslog.

    output {
      if [type] == "syslog"
        and [program] == "jenkins"
        and [job] == "Install on Cluster"
        and "_grokparsefailure" not in [tags] {
        nagios_nsca {
          host => "icinga.example.com"
          port => 5667
          send_nsca_config => "/etc/send_nsca.cfg"
          message_format => "%{job} %{repo}"
          nagios_host => "jenkins"
          nagios_service => "deployed %{repo}"
          nagios_status => "2"
        }
      } # if type=syslog, program=jenkins, job="Install on Cluster"
    } # output

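icinga (nagios)

Finally, we have arrived at icinga (nagios) by way of nsca. Now we need a passive service check defined for every job whose failure to run on time we want to notice. That can be a lot of jobs, so let's use check_mk to transform Python lists of jobs into nagios object definitions.

check_mk is cool like that.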

/etc/check_mk/conf.d/freshness.mk

    # check_mk requires local variables be prefixed with '_'

    _dailies = [ 'newyork' ]
    _day_stale = 86400 * 1.5

    _weeklies = [ 'atlanta', 'denver', ]
    _week_stale = 86400 * 8

    _monthlies = [ 'stlouis' ]
    _month_stale = 86400 * 32

    _service_opts = [
        ("active_checks_enabled",  "0"),
        ("passive_checks_enabled", "1"),
        ("check_freshness",        "1"),
        ("notification_period",    "workhours"),
        ("contacts",               "root"),
        ("check_period",           "workhours"),
    ]

    # Define a new command 'check-periodicaly' that sets the service to UNKNOWN.
    # It is called once the freshness threshold (e.g. _week_stale seconds) has
    # passed since the service last checked in.
    extra_nagios_conf += """
    define command {
        command_name check-periodicaly
        command_line $USER1$/check_dummy 3 $ARG1$
    }

    """

    # Loop through all passive checks and assign the new check-periodicaly command to them.
    for _repo in _dailies + _weeklies + _monthlies:
        _service_name = 'deployed %s' % _repo
        legacy_checks += [(('check-periodicaly', _service_name, False), ['lead'])]

    # Look before you leap - python needs the list defined before appending to it.
    # We can't assume it already exists because it may be defined earlier.
    if "freshness_threshold" not in extra_service_conf:
        extra_service_conf["freshness_threshold"] = []

    # Some check_mk wizardry to set when the check has passed its expiration date.
    # Results in (691200, ALL_HOSTS, [ 'deployed atlanta', 'deployed denver' ]) for weeklies, etc.
    extra_service_conf["freshness_threshold"] += [
        (_day_stale,   ALL_HOSTS, ["deployed %s" % _x for _x in _dailies]   ),
        (_week_stale,  ALL_HOSTS, ["deployed %s" % _x for _x in _weeklies]  ),
        (_month_stale, ALL_HOSTS, ["deployed %s" % _x for _x in _monthlies] ),
    ]

    # Now we assign all the other nagios directives listed in _service_opts
    for _k, _v in _service_opts:
        if _k not in extra_service_conf:
            extra_service_conf[_k] = []

        extra_service_conf[_k] += [(_v, ALL_HOSTS, ["deployed "])]
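To see what the freshness_threshold rules expand to, here is a small standalone sketch that can be run outside check_mk. `ALL_HOSTS` is stubbed with a placeholder value and `freshness_rules` is a hypothetical helper, so treat this as a way to check the arithmetic rather than as check_mk code. It also truncates the daily threshold to an integer, which the config above does not do.

```python
# Stand-in for check_mk's ALL_HOSTS constant (an assumption so this runs standalone)
ALL_HOSTS = ["@all"]

DAY = 86400  # seconds in a day


def freshness_rules(dailies, weeklies, monthlies):
    """Build (threshold_seconds, hosts, service_names) tuples like freshness.mk does."""
    schedule = [
        (int(DAY * 1.5), dailies),   # a day and a half
        (DAY * 8, weeklies),         # a week plus one day of slack
        (DAY * 32, monthlies),       # a month plus a little slack
    ]
    return [
        (threshold, ALL_HOSTS, ["deployed %s" % repo for repo in repos])
        for threshold, repos in schedule
    ]


rules = freshness_rules(["newyork"], ["atlanta", "denver"], ["stlouis"])
for threshold, _hosts, services in rules:
    print("%d seconds (%.1f days): %s" % (threshold, threshold / DAY, services))
```

Each threshold is deliberately a bit longer than the job's period, so a job that runs slightly late does not immediately go stale.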

toppledwagon · Accepted Answer

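Set up passive checks in nagios for the various events. Then, at the end of each event, send a passive check result to nagios (either via a wrapper script or built into the event itself). If no passive result has been received within freshness_threshold seconds, nagios runs check_command locally. The check_command is set up as a simple shell script that returns critical along with the service description.

I don't have code examples handy, but if there is interest I can add some.

EDIT ONE: added code examples.

This assumes you have done the basic setup for NSCA and send_nsca (make sure the password and encryption_method in send_nsca.cfg on the client match the password and decryption_method in nsca.cfg on the nagios server, then start the nsca daemon on the nagios server).

First, we define a template that other passive checks can use. This goes into services.cfg.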

    define service {
        name                    standard-passive-service-template
        active_checks_enabled   0
        passive_checks_enabled  1
        check_freshness         1
        max_check_attempts      1
        normal_check_interval   10
        retry_check_interval    5
        contact_groups          sysadmins
        notification_interval   0
        notification_options    w,u,c,r
        notification_period     24x7
        check_period            24x7
        check_command           check_failed!$SERVICEDESC$
        register                0
    }

This says that if no notification has been received, run check_failed with $SERVICEDESC$ as its argument. Let's define the check_failed command in commands.cfg.

    define command {
        command_name    check_failed
        command_line    /usr/lib/nagios/plugins/check_failed $ARG1$
    }

Here is the /usr/lib/nagios/plugins/check_failed script.

    #!/bin/bash
    /bin/echo "No update from $*. Is NSCA running?"
    exit 2

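Exiting with 2 makes this service critical as far as nagios is concerned (see below for all of the nagios service states). Sourcing /usr/lib/nagios/plugins/utils.sh is another way; then you could exit $STATE_CRITICAL. But the above works even if you don't have that file.

This adds the "Is NSCA running?" hint because either the service did not check in properly OR NSCA itself failed. The latter is more common than one might think. If several passive checks raise this alert at the same time, check for problems with NSCA.

Now we need a passive check to accept the results. In this example, I have a specially crafted cron job that knows about all of the different types of RAID controllers in our environment. When it runs, it sends a notification to this passive check. In this example, I don't want to be woken up in the middle of the night (edit notification_period as needed).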

    define service {
        use                     standard-passive-service-template
        hostgroup_name          all
        service_description     raidcheck
        notification_period     daytime
        flap_detection_enabled  1
        freshness_threshold     7500    ; 125 minutes
        notification_options    c
        is_volatile             0
        servicegroups           raidcheck
    }

Now there is the cron job that sends information back to the nagios server. Here is the line in /etc/cron.d/raidcheck:

    0 * * * * root /usr/local/bin/raidcheck --cron | /usr/sbin/send_nsca -H nagios -to 1000 >> /dev/null 2>&1

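See man send_nsca for the options, but the important parts are that 'nagios' is the name of my nagios server, and the string that the script prints at the end. send_nsca expects a line on stdin of the following form (Perl here):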

    print "$hostname\t$check\t$state\t$status_info\n";

$hostname is obvious, $check in this case is 'raidcheck', $state is the nagios service state (0 = OK, 1 = warning, 2 = critical, 3 = unknown, 4 = dependent), and $status_info is an optional message to send as status information.
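For jobs that are not written in Perl, the same line can be built and submitted from Python. This is a minimal sketch under a few assumptions: `nsca_line` and `submit` are hypothetical helper names, the send_nsca path and the server name 'nagios' are taken from the examples in this answer, and only states 0 to 3 are accepted (the 'dependent' value is left out).

```python
import subprocess

# The four service states a passive result normally carries
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def nsca_line(hostname, check, state, status_info=""):
    """Return one tab-separated, newline-terminated passive check result."""
    if state not in (OK, WARNING, CRITICAL, UNKNOWN):
        raise ValueError("unexpected service state: %r" % (state,))
    # Tabs and newlines inside the message would break the field layout,
    # so collapse all whitespace runs into single spaces.
    status_info = " ".join(status_info.split())
    return "%s\t%s\t%d\t%s\n" % (hostname, check, state, status_info)


def submit(line, server="nagios", send_nsca="/usr/sbin/send_nsca"):
    """Pipe the line to send_nsca on stdin and return its exit code."""
    result = subprocess.run([send_nsca, "-H", server], input=line.encode())
    return result.returncode
```

A job would call something like `submit(nsca_line(hostname, "raidcheck", OK, "all arrays healthy"))` as its last step, so that a job which crashes early simply never checks in.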

Now we can test the check from the command line on the client:

    echo -e "$HOSTNAME\traidcheck\t2\tUh oh, raid degraded (just kidding..)" | /usr/sbin/send_nsca -H nagios

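This gives us a nagios passive check that expects to be updated every freshness_threshold seconds. If the check is not updated, check_command (check_failed in this case) is run. The example above is for a nagios 2.X install, but would likely work (maybe with minor modifications) on nagios 3.X.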
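The freshness logic can be modelled in a few lines of Python. This is a simplified sketch of the behaviour described here, not of the nagios scheduler itself (which has more details than this), and `service_state` is a hypothetical function name.

```python
CRITICAL = 2


def service_state(last_update, now, freshness_threshold, last_state, service_desc):
    """Return (state, output) for a freshness-checked passive service.

    last_update and now are timestamps in seconds; last_state is the state
    carried by the most recent passive result.
    """
    age = now - last_update
    if age > freshness_threshold:
        # Stale: nagios would run check_command, modelled here on check_failed
        return CRITICAL, "No update from %s. Is NSCA running?" % service_desc
    # Fresh: the last submitted result still stands
    return last_state, "last result is %d seconds old" % age
```

With the raidcheck values from above (hourly cron job, threshold of 7500 seconds), one missed run still leaves the service fresh, while two missed runs in a row turn it critical.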

b0ti · Answer

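There are different types of "event not occurring": it can be either conditional or unconditional. For example:

A user authentication failure that is not followed by a successful login indicates that the user forgot their password (or a brute-force attempt).
No user authentications during the day: the user did not come to work.

If you are after the first case and need an open source tool, SEC has the PairWithWindow rule and nxlog has the Absence rule (note that I am affiliated with the latter).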

The second type is simpler, and both tools can handle it easily.
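To make the conditional case concrete, here is a minimal Python sketch of "failure not followed by success within a window". It only illustrates the idea and is not how SEC or nxlog implement their rules; the event tuple layout and the function name `unresolved_failures` are assumptions for this example.

```python
def unresolved_failures(events, window, now):
    """Find login failures that were not followed by a success in time.

    events: iterable of (timestamp, user, kind) sorted by timestamp, where
            kind is "fail" or "success" and timestamps are in seconds.
    window: seconds to wait for a success after the first failure.
    now:    current time, used to expire failures after the last event.
    Returns a list of (user, failure_timestamp) alerts.
    """
    pending = {}  # user -> timestamp of the earliest unanswered failure
    alerts = []

    def expire(current_time):
        # Any pending failure older than the window becomes an alert
        for user, failed_at in list(pending.items()):
            if current_time - failed_at > window:
                alerts.append((user, failed_at))
                del pending[user]

    for timestamp, user, kind in events:
        expire(timestamp)
        if kind == "fail":
            pending.setdefault(user, timestamp)
        elif kind == "success":
            pending.pop(user, None)  # the awaited event arrived in time
    expire(now)
    return alerts
```

The `now` argument is the important design point: detecting an absence needs a clock, because the alert has to fire even when no further events arrive.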