PostgreSQL could not fork autovacuum worker process: Cannot allocate memory

Question

Every few days I run into a problem where Postgres crashes and goes into recovery mode. The Postgres log looks like this:

... Lots of this for 5-10 minutes
2015-09-24 10:07:27 GMT LOG: could not fork autovacuum worker process: Cannot allocate memory
2015-09-24 10:07:28 GMT LOG: could not fork autovacuum worker process: Cannot allocate memory
2015-09-24 10:07:29 GMT LOG: could not fork autovacuum worker process: Cannot allocate memory
2015-09-24 10:07:30 GMT LOG: could not fork autovacuum worker process: Cannot allocate memory
2015-09-24 10:07:32 GMT LOG: server process (PID 16244) was terminated by signal 9: Killed
2015-09-24 10:07:32 GMT DETAIL: Failed process was running: SELECT 1
2015-09-24 10:07:32 GMT LOG: terminating any other active server processes
2015-09-24 10:07:32 GMT WARNING: terminating connection because of crash of another server process
2015-09-24 10:07:32 GMT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2015-09-24 10:07:32 GMT HINT: In a moment you should be able to reconnect to the database and repeat your command.
2015-09-24 10:07:32 GMT WARNING: terminating connection because of crash of another server process
2015-09-24 10:07:32 GMT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
.... for some time repeats this log:
2015-09-24 10:07:33 GMT HINT: In a moment you should be able to reconnect to the database and repeat your command.
2015-09-24 10:07:33 GMT WARNING: terminating connection because of crash of another server process
2015-09-24 10:07:33 GMT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2015-09-24 10:07:33 GMT HINT: In a moment you should be able to reconnect to the database and repeat your command.
.... then
2015-09-24 10:07:33 GMT FATAL: the database system is in recovery mode
2015-09-24 10:07:33 GMT FATAL: the database system is in recovery mode
2015-09-24 10:07:33 GMT FATAL: the database system is in recovery mode
2015-09-24 10:07:33 GMT FATAL: the database system is in recovery mode
2015-09-24 10:07:33 GMT FATAL: the database system is in recovery mode
2015-09-24 10:07:33 GMT FATAL: the database system is in recovery mode
2015-09-24 10:07:33 GMT FATAL: the database system is in recovery mode

This database runs on a Digital Ocean droplet with 2 GB of RAM. My postgresql.conf looks like this (I have left the commented-out settings in, in case you need to see which defaults are in use):

max_connections = 250
shared_buffers = 768mb
temp_buffers = 8MB
#work_mem = 1MB  # min 64kB
#maintenance_work_mem = 16MB  # min 1MB
#max_stack_depth = 2MB  # min 100kB

# - Cost-Based Vacuum Delay -

#vacuum_cost_delay = 0ms  # 0-100 milliseconds
#vacuum_cost_page_hit = 1  # 0-10000 credits
#vacuum_cost_page_miss = 10  # 0-10000 credits
#vacuum_cost_page_dirty = 20  # 0-10000 credits
#vacuum_cost_limit = 200  # 1-10000 credits

# - Background Writer -

#bgwriter_delay = 200ms  # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100  # 0-1000 max buffers written/round
#bgwriter_lru_multiplier = 2.0  # 0-10.0 multipler on buffers scanned/round

# - Asynchronous Behavior -

#effective_io_concurrency = 1  # 1-1000. 0 disables prefetching

#------------------------------------------------------------------------------
# WRITE AHEAD LOG
#------------------------------------------------------------------------------

# - Settings -

wal_level = 'hot_standby'  # minimal, archive, or hot_standby
    # (change requires restart)
#fsync = on  # turns forced synchronization on or off
#synchronous_commit = on  # synchronization level; on, off, or local
#wal_sync_method = fsync  # the default is the first option
    # supported by the operating system:
    #   open_datasync
    #   fdatasync (default on Linux)
    #   fsync
    #   fsync_writethrough
    #   open_sync
#full_page_writes = on  # recover from partial page writes
#wal_buffers = -1  # min 32kB, -1 sets based on shared_buffers
    # (change requires restart)
#wal_writer_delay = 200ms  # 1-10000 milliseconds
#commit_delay = 0  # range 0-100000, in microseconds
#commit_siblings = 5  # range 1-1000

# - Checkpoints -

#checkpoint_segments = 3  # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min  # range 30s-1h
#checkpoint_completion_target = 0.5  # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s  # 0 disables

# - Archiving -

archive_mode = on  # allows archiving to be done
    # (change requires restart)
archive_command = 'cd .'  # command to use to archive a logfile segment
#archive_timeout = 0  # force a logfile segment switch after this
    # number of seconds; 0 disables

#------------------------------------------------------------------------------
# REPLICATION
#------------------------------------------------------------------------------

# - Master Server -

# These settings are ignored on a standby server

max_wal_senders = 1  # max number of walsender processes
    # (change requires restart)
#wal_sender_delay = 1s  # walsender cycle time, 1-10000 milliseconds
wal_keep_segments = 100  # in logfile segments, 16MB each; 0 disables
#vacuum_defer_cleanup_age = 0  # number of xacts by which cleanup is delayed
#replication_timeout = 60s  # in milliseconds; 0 disables
#synchronous_standby_names = ''  # standby servers that provide sync rep
    # comma-separated list of application_name
    # from standby(s); '*' = all

# - Standby Servers -

# These settings are ignored on a master server

hot_standby = on  # "on" allows queries during recovery
    # (change requires restart)
#max_standby_archive_delay = 30s  # max delay before canceling queries
    # when reading WAL from archive;
    # -1 allows indefinite delay
#max_standby_streaming_delay = 30s  # max delay before canceling queries
    # when reading streaming WAL;
    # -1 allows indefinite delay
#wal_receiver_status_interval = 10s  # send replies at least this often
    # 0 disables
#hot_standby_feedback = off  # send info from standby to prevent
    # query conflicts

Any help is greatly appreciated!

dezso · Accepted Answer

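Hardware problem...

This log entry:

The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.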

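can be the result of at least two different underlying problems. The first is a bad executable or faulty hardware; this is why I suggested moving the database to a better place (whatever that may be).

You are currently using a Digital Ocean droplet, which is (as I have just checked) a virtual private server. To me, at least, that does not necessarily mean separate hardware. If the hardware is shared, other users are likely to be affected as well, and the provider will deal with the problem quickly. Hopefully the configuration there rules out other systems adversely affecting yours.

So much for clouds and shared hosting :) As the comments above show, the root cause of your problem is something that can be solved.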

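...or a memory handling problem?

The second (and, I think, more common) reason for this error is memory pressure. If you run short of memory (which is possible here too; see the calculation below), the operating system may kill some processes in order to give memory to others. If the OS allows memory overcommit, the chances of this are much higher than without it.

Here is what the PostgreSQL documentation has to say about this: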

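In Linux 2.4 and later, the default virtual memory behavior is not optimal for PostgreSQL. Because of the way that the kernel implements memory overcommit, the kernel might terminate the PostgreSQL postmaster (the master server process) if the memory demands of either PostgreSQL or another process cause the system to run out of virtual memory.

If this happens, you will see a kernel message that looks like this (consult your system documentation and configuration on where to look for such a message):
Out of Memory: Killed process 12345 (postgres).
This indicates that the postgres process has been terminated due to memory pressure. Although existing database connections will continue to function normally, no new connections will be accepted. To recover, PostgreSQL will need to be restarted.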
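
To check whether the OOM killer was involved, one can search the kernel log for messages like the one quoted above. This is only a minimal sketch: the helper name find_oom_events is made up for illustration, and it assumes a Linux system where the kernel log is available through dmesg (reading it may require root).

```shell
# find_oom_events: filter kernel log text on stdin for OOM-killer messages.
# The helper name is hypothetical; the patterns match the usual kernel wording.
find_oom_events() {
    grep -i -E 'out of memory|oom-killer|killed process'
}

# Usage (reading the kernel ring buffer may require root):
#   dmesg | find_oom_events
```

If this prints nothing around the time of a crash, the "signal 9" in the PostgreSQL log probably came from somewhere else.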

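Further down, the documentation explains how to change this behaviour. An interesting and important point is that the OOM killer cannot be disabled completely, because it is what keeps the OS running for as long as possible. One option is therefore to set the overcommit behaviour to strict: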

sysctl -w vm.overcommit_memory=2

(or edit sysctl.conf and reload it with sysctl -p).
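
Before changing anything, it can help to see which overcommit mode is currently active. The following is a rough sketch, assuming a Linux system that exposes /proc/sys/vm/overcommit_memory; the helper name describe_overcommit and the file name under /etc/sysctl.d are assumptions chosen for illustration.

```shell
# Map the numeric vm.overcommit_memory mode to its meaning.
describe_overcommit() {
    case "$1" in
        0) echo "heuristic overcommit (kernel default)" ;;
        1) echo "always overcommit" ;;
        2) echo "strict accounting (no overcommit)" ;;
        *) echo "unknown" ;;
    esac
}

# Show the mode currently in effect (readable without root on Linux).
if [ -r /proc/sys/vm/overcommit_memory ]; then
    describe_overcommit "$(cat /proc/sys/vm/overcommit_memory)"
fi

# To persist strict mode across reboots (as root); the file name is an assumption:
#   echo 'vm.overcommit_memory = 2' > /etc/sysctl.d/90-overcommit.conf
#   sysctl --system
```

In strict mode the kernel limits total commitments to swap plus a percentage of RAM (vm.overcommit_ratio, 50 by default), so it is worth checking CommitLimit and Committed_AS in /proc/meminfo after switching.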

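Alternatively, you can set the OOM score adjustment of the postmaster process to the lowest possible value, so that the OOM killer is unlikely to pick it when looking for a victim. This has to be done in a startup script owned by root; editing the script already in use seems appropriate. This is what you need: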

echo -1000 > /proc/self/oom_score_adj
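
After restarting PostgreSQL through the edited script, you may want to confirm that the adjustment actually reached the postmaster. A small sketch, assuming the standard postmaster.pid file in the data directory (its first line holds the postmaster PID); the helper name postmaster_pid and the data directory path in the usage line are assumptions.

```shell
# postmaster_pid DATADIR: print the postmaster PID, which PostgreSQL stores
# on the first line of DATADIR/postmaster.pid.
postmaster_pid() {
    head -n 1 "$1/postmaster.pid"
}

# Usage (the data directory path is an assumption; adjust it to your install):
#   cat "/proc/$(postmaster_pid /var/lib/postgresql/9.4/main)/oom_score_adj"
```

Keep in mind that child processes inherit the adjustment, so backends started by a protected postmaster are protected as well unless the value is reset for them.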

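See the linked documentation page for more; each solution has fine details that you need to check.

It is good to know that the OOM killer wakes up only when both physical memory and swap space are exhausted. A cheap fix is to add swap space, but relying on swap is usually too slow for normal database operation. Depending on your use case, though, it may be a solution.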

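Note that both approaches need root access to the OS.

An approach that may work without root access

If you can rule out hardware problems as the root cause and you have no root access, you can still work around the problem. This is not a bulletproof solution, but it can reduce the chance of the problem recurring.

Let's quickly check how much memory the original setup uses.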

max_connections = 250
shared_buffers = 768MB
temp_buffers = 8MB
# work_mem = 1MB  # a commented-out value means it is at the default -
    # in 9.4 it is 4MB

You have 2 GB of physical memory.

Let's count the usage (a worst-case calculation):

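shared_buffers is always in use: 768MB
work_mem and temp_buffers are allocated per session (that is, per connection), and max_connections is 250: (4MB + 8MB) * 250 = 3000MB. (Strictly speaking, work_mem applies per sort or hash operation, so a single query can use it more than once; one allocation per connection is a simplification.)
Of course, it is very unlikely that every connection uses all of this space. Also, as you say in a comment, you never have more than 70 connections in use at a time, so the number goes down to 840MB.
maintenance_work_mem and (optionally) autovacuum_work_mem may consume more on top of that. You seem to have them at the default value, that is, 64MB.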

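The sum of all these is 1672MB. What is left for everything else is 2048MB - 1672MB = 376MB. I checked how much a Linux server installation needs, taking Ubuntu as an example. Its documentation says 192MiB is enough for a minimal setup, so with your settings you could survive. Apparently, other processes (all of them consuming memory) are running there as well, and from time to time they exhaust the RAM.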
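
The arithmetic above can be reproduced with a few lines of shell. This is just a sketch of the same simplified model; the variable names are chosen here for illustration and all values are in MB.

```shell
# All values in MB, taken from the calculation above.
shared_buffers=768
work_mem=4               # assumed 9.4 default, as in the answer
temp_buffers=8
active_connections=70    # peak reported by the asker
maintenance_work_mem=64
physical_ram=2048

per_connection=$(( work_mem + temp_buffers ))
sessions=$(( per_connection * active_connections ))
total=$(( shared_buffers + sessions + maintenance_work_mem ))
remaining=$(( physical_ram - total ))

echo "sessions=${sessions}MB total=${total}MB remaining=${remaining}MB"
# prints: sessions=840MB total=1672MB remaining=376MB
```

Changing active_connections to 250 shows why the full worst case cannot fit into 2 GB.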

To avoid this, lower the settings above. Depending on the size of your database and your typical queries, you may be able to lower any of them. Check what each one is used for before changing it.
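
To get a feel for how much a change buys, the same simplified worst-case model can be wrapped in a small function and fed different values. The helper name pg_budget_mb and the lowered values below are hypothetical examples, not recommendations; all numbers are in MB.

```shell
# pg_budget_mb SHARED_BUFFERS WORK_MEM TEMP_BUFFERS CONNECTIONS MAINT_WORK_MEM
# Prints the rough worst-case memory use in MB, using the same simplified
# model as the calculation above.
pg_budget_mb() {
    echo $(( $1 + ($2 + $3) * $4 + $5 ))
}

# Original settings with all 250 connections in use.
pg_budget_mb 768 4 8 250 64    # prints 3832

# Hypothetical lowered settings: 512MB shared_buffers, 100 connections.
pg_budget_mb 512 4 8 100 64    # prints 1776
```

With the original settings the worst case (3832MB) is far beyond the 2048MB available, while the hypothetical lowered values would fit and leave 272MB for everything else. Whether such values suit your workload is a separate question.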