I have a web server container running on App Engine that serves a REST API. It is a fairly standard nginx + PHP-FPM setup that talks over a TCP socket (for some reason Unix sockets are not working). The DB connection is also a TCP socket, running over Google Cloud VPN.
I am getting at most 25% availability out of the API. Most of the time a request ends in a 504 Gateway Timeout after the maximum time (App Engine's nginx proxy is set to 60 seconds). If PHP-FPM hits its own timeout first (request_terminate_timeout), the request ends in a 502 Bad Gateway instead.
I am trying to work out whether this is App Engine's nginx being configured badly, my own nginx configuration, or the PHP-FPM configuration. Nginx should be either closing or reusing its sockets, but it does not appear to be doing either.
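For reference, the Unix-socket variant that was tried and did not work would look roughly like this (the socket path is an assumption):

# nginx side
fastcgi_pass unix:/var/run/php-fpm.sock;

; PHP-FPM pool side
listen = /var/run/php-fpm.sock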
Sieging a particular endpoint (25 users) for a few minutes shows the following.
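The invocation was roughly along these lines (25 concurrent users as described above; the duration and host are assumptions):

siege -c 25 -t 5M https://api.example.com/path/to/rest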
HTTP/1.1 504 60.88 secs: 176 bytes ==> GET /path/to/rest
...15 lines...
HTTP/1.1 504 61.23 secs: 176 bytes ==> GET /path/to/rest
HTTP/1.1 200 57.54 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 57.68 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 504 60.76 secs: 176 bytes ==> GET /path/to/rest
...15 lines...
HTTP/1.1 504 61.06 secs: 176 bytes ==> GET /path/to/rest
HTTP/1.1 200 33.35 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 32.97 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 36.61 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 39.00 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 42.47 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 48.51 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 56.45 secs: 13143 bytes ==> GET /path/to/rest
# Another run
HTTP/1.1 200 7.65 secs: 13143 bytes ==> GET /path/to/rest
...10 lines...
HTTP/1.1 200 8.20 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 47.15 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 47.15 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 8.30 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 504 61.15 secs: 176 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.46 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.33 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.25 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 53.63 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 48.40 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.31 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.97 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.27 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.26 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.99 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 60.08 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 60.56 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.83 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 60.85 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.99 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 58.99 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 52.40 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 52.21 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.61 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 52.65 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.13 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.96 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.48 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.81 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.89 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.26 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.80 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.44 secs: 166 bytes ==> GET /path/to/rest
This happens even with a single user:
HTTP/1.1 502 55.43 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.71 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.54 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.21 secs: 166 bytes ==> GET /path/to/rest
Nginx logs for each case:
# 200
Normal logging i.e. [notice] GET /path/to/rest (param1, param2) ...
# 502
[error] 1059#0: *1395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", Host: "api.example.com"
# 504
[error] 34#0: *326 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", Host: "api.example.com"
netstat -t looks like the snapshots below.
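To watch the states change while a siege is running, something along these lines is handy (the filter is illustrative):

watch -n 1 'netstat -t | grep -E "9000|postgresql"'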
# Before starting
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:33971 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34072 ESTABLISHED
# During the siege
tcp 0 0 localhost:56144 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34565 ESTABLISHED
tcp 0 0 5c2ad0938ce9:53073 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:33971 ESTABLISHED
tcp 0 0 localhost:56148 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:53071 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34580 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34584 ESTABLISHED
tcp 0 0 localhost:56106 localhost:9000 ESTABLISHED
tcp 0 0 localhost:56191 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34566 ESTABLISHED
tcp 0 0 localhost:56113 localhost:9000 ESTABLISHED
tcp 0 0 localhost:56150 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34591 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34574 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34072 ESTABLISHED
tcp 0 0 5c2ad0938ce9:53102 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:53051 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34572 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56146 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56117 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56179 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56160 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56168 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56170 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56111 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56115 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56123 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56109 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56113 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56140 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56181 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56121 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56191 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56119 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56142 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56106 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56110 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56144 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56148 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56150 ESTABLISHED
# A minute or so after ending the siege
tcp 0 0 5c2ad0938ce9:53319 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34578 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34576 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34570 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34565 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:33971 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34580 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34584 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34566 ESTABLISHED
tcp 0 0 localhost:56396 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34591 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34574 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34072 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34572 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56396 ESTABLISHED
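Note the nonzero Recv-Q (the 8 in the second column) on several of the localhost:9000 sockets: data has arrived on the PHP-FPM side of those connections but no worker has read it. A quick filter along these lines lists only the sockets with queued data (field positions assume the netstat layout above):

netstat -t | awk '$1 == "tcp" && ($2 > 0 || $3 > 0)'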
The nginx configuration:

user www-data;
worker_processes auto;
worker_cpu_affinity auto;

events {
    worker_connections 512;
}

http {
    server_tokens off;
    fastcgi_ignore_client_abort off;
    keepalive_timeout 650;
    keepalive_requests 10000;
    gzip on;
    ..more gzip settings..

    server {
        charset utf-8;
        client_max_body_size 512M;
        listen 8080;
        rewrite_log on;
        root /app/web;
        index index.php;

        location / {
            try_files $uri /index.php?$args;
        }

        location ~ \.php$ {
            fastcgi_pass 127.0.0.1:9000;
            include /etc/nginx/fastcgi_params;
            fastcgi_keep_conn off;
            # fastcgi_read_timeout is not set, so the nginx default of 60s applies
            fastcgi_param SCRIPT_FILENAME $document_root/$fastcgi_script_name;
        }
    }

    include /etc/nginx/conf.d/*.conf; # There are no extra conf files
}
The PHP-FPM pool configuration:

[www]
user = www-data
group = www-data
listen = 127.0.0.1:9000
pm = ondemand                     ; spawn workers only when requests come in
pm.process_idle_timeout = 10s     ; kill a worker after 10s of idling
request_terminate_timeout = 45    ; kill a worker whose request runs longer than 45s
Disabling keepalive is a bad idea: App Engine constantly polls the container with health checks, and that creates a lot of dead TIME_WAIT sockets (I tried it).
Before request_terminate_timeout was set there were a lot of TIME_WAIT sockets instead of CLOSE_WAIT ones. Setting request_terminate_timeout = 45 helps in a way, because the worker process gets killed and starts serving 200s again once it respawns. A shorter terminate timeout produces more 502s and fewer 504s.
process_idle_timeout is ignored because the sockets are not technically idle.
Setting fastcgi_keep_conn on has no measurable effect on nginx's behaviour.
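That is probably to be expected with the configuration above: with fastcgi_pass pointing straight at 127.0.0.1:9000 there is no connection cache, so nginx closes the connection after each request regardless. Actually reusing connections to PHP-FPM would normally also need an upstream block with keepalive; a sketch (the upstream name is made up):

upstream php_fpm {
    server 127.0.0.1:9000;
    keepalive 16;            # cache up to 16 idle connections per worker process
}

location ~ \.php$ {
    fastcgi_pass php_fpm;
    fastcgi_keep_conn on;    # ask PHP-FPM not to close the connection after the response
    include /etc/nginx/fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}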
It turned out that the problem was related to the container configuration, not to the application. Setting the MTU to a value that suits Google's cloud network (lowering it from 1500 to 1430) makes the application's queries run without any problems.

This was found by narrowing the problem down to only those requests that opened a socket to the database over Google Cloud VPN (see the postgresql entries in the netstat output above). We happened to have VPN routing to a second VPN, where only the first hop carried the high-MTU traffic, so that DB connection worked perfectly fine.
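How the MTU actually gets lowered depends on how the container network is set up; two common approaches are sketched below (the interface name and file path are the usual Docker defaults and may differ):

# per interface, inside the container or on the host
ip link set dev eth0 mtu 1430

# for Docker-managed networks, in /etc/docker/daemon.json (restart the daemon afterwards)
{
    "mtu": 1430
}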