How can I determine the HTTP status without downloading the complete page?

Question

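Using Ubuntu, I want to check the HTTP status of websites. I have used the curl and wget commands for this. The problem is that these commands seem to download the complete page, then search it for the headers and display them on the screen. For example: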

$ curl -I trafficinviter.com
HTTP/1.1 200 OK
Date: Mon, 02 Jan 2017 14:13:14 GMT
Server: Apache
X-Pingback: http://trafficinviter.com/xmlrpc.php
Link: <http://trafficinviter.com/>; rel=shortlink
Set-Cookie: wpfront-notification-bar-landingpage=1
Content-Type: text/html; charset=UTF-8

The same thing happens with the wget command: the complete page is downloaded, which consumes bandwidth unnecessarily.

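What I am looking for is a way to get the HTTP status code without actually downloading the page, so that I can save bandwidth. I have tried curl, but I am not sure whether it downloads the complete page or only the headers to my system in order to get the status code.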

AlexP · Accepted Answer

curl -I fetches only the HTTP headers; it does not download the whole page. From man curl:

-I, --head (HTTP/FTP/FILE) Fetch the HTTP-header only! HTTP-servers feature the command HEAD which this uses to get nothing but the header of a document. When used on an FTP or FILE file, curl displays the file size and last modification time only.
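If only the numeric code is wanted rather than the full header dump, curl's -w (--write-out) option can print it on its own. The following is a minimal sketch, not part of the original answer; the function name http_status is hypothetical.

```shell
#!/bin/sh
# http_status is a hypothetical helper name, not something curl provides.
#   -s                    silence the progress meter
#   -I                    send a HEAD request, so no body is transferred
#   -o /dev/null          discard the header dump
#   -w '%{http_code}\n'   print only the numeric status code
http_status() {
    curl -s -I -o /dev/null -w '%{http_code}\n' "$1"
}
```

Calling http_status trafficinviter.com would print just the code of the first response. Adding -L makes curl follow redirects and report the code of the final response instead. curl prints 000 when it received no HTTP response at all, for example when the host cannot be reached.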

Another option is to install lynx and use lynx -head -dump.

The HEAD request is specified by the HTTP 1.1 protocol (RFC 2616):

9.4 HEAD The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
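To make the quoted definition concrete, here is a rough sketch of what a HEAD request looks like on the wire. The helper name build_head_request and the host example.com are assumptions for illustration only; they do not come from the answer.

```shell
#!/bin/sh
# build_head_request is a hypothetical helper: it prints the raw text of an
# HTTP/1.1 HEAD request for the given host and path (the path defaults to "/").
build_head_request() {
    host=$1
    path=${2:-/}
    # HTTP header lines end in CRLF; an empty line terminates the request.
    printf 'HEAD %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' "$path" "$host"
}
```

Assuming netcat is installed and the site answers plain HTTP on port 80, something like build_head_request example.com / | nc example.com 80 should return a status line and headers with no message body after them.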

muru · Answer

With wget, you need to use the --spider option to send a HEAD request, as curl does:

$ wget -S --spider https://google.com
Spider mode enabled. Check if remote file exists.
--2017-01-03 00:08:38-- https://google.com/
Resolving google.com (google.com)... 216.58.197.174
Connecting to google.com (google.com)|216.58.197.174|:443... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 302 Found
  Cache-Control: private
  Content-Type: text/html; charset=UTF-8
  Location: https://www.google.co.jp/?gfe_rd=cr&ei=...
  Content-Length: 262
  Date: Mon, 02 Jan 2017 15:08:38 GMT
  Alt-Svc: quic=":443"; ma=2592000; v="35,34"
Location: https://www.google.co.jp/?gfe_rd=cr&ei=... [following]
Spider mode enabled. Check if remote file exists.
--2017-01-03 00:08:38-- https://www.google.co.jp/?gfe_rd=cr&ei=...
Resolving www.google.co.jp (www.google.co.jp)... 210.139.253.109, 210.139.253.93, 210.139.253.123, ...
Connecting to www.google.co.jp (www.google.co.jp)|210.139.253.109|:443... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Mon, 02 Jan 2017 15:08:38 GMT
  Expires: -1
  Cache-Control: private, max-age=0
  Content-Type: text/html; charset=Shift_JIS
  P3P: CP="This is not a P3P policy! See https://www.google.com/support/accounts/answer/151657?hl=en for more info."
  Server: gws
  X-XSS-Protection: 1; mode=block
  X-Frame-Options: SAMEORIGIN
  Set-Cookie: NID=...; expires=Tue, 04-Jul-2017 15:08:38 GMT; path=/; domain=.google.co.jp; HttpOnly
  Alt-Svc: quic=":443"; ma=2592000; v="35,34"
  Transfer-Encoding: chunked
  Accept-Ranges: none
  Vary: Accept-Encoding
Length: unspecified [text/html]
Remote file exists and could contain further links, but recursion is disabled -- not retrieving.
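Since wget prints a status line for every hop, the final code can be pulled out of that output with a small filter. This is only a sketch under the assumption that the output has the shape shown above; last_status is a hypothetical name.

```shell
#!/bin/sh
# last_status is a hypothetical helper: it reads `wget -S --spider` output on
# stdin and prints the status code of the last response seen, that is, the
# one received after any redirects.
last_status() {
    # Status lines are the only lines whose first field starts with "HTTP/";
    # "HTTP request sent, ..." does not match because it lacks the slash.
    awk '$1 ~ /^HTTP\// { code = $2 } END { if (code != "") print code }'
}
```

wget writes this output to stderr, so usage would look like wget -S --spider https://google.com 2>&1 | last_status, which for the run above would give the code of the second response rather than the 302.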