Easy way to test a URL for 404 in PHP?

Question

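I'm teaching myself some basic scraping, and I've found that the URLs I feed into my code sometimes return 404.

So I need a test at the top of my code to check whether the URL returns 404.

This seems like a pretty straightforward task, but Google isn't giving me any answers. I worry I'm searching for the wrong thing.

One blog recommended I use this: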

$valid = @fsockopen($url, 80, $errno, $errstr, 30);

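And then test whether $valid is empty or not.

But I think the URL that's giving me problems has a redirect on it, so $valid comes up empty for all values. Or perhaps I'm doing something else wrong.

I've also looked into a "HEAD request", but I have yet to find any actual code examples I can try out.

Any suggestions? And what's this about cURL?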

strager · Accepted Answer

If you're using PHP's curl bindings, you can check the HTTP status code using curl_getinfo:

$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);
/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode == 404) {
    /* Handle 404 here. */
}
curl_close($handle);
/* Handle $response here. */

Asciant · Answer

If you're running PHP 5, you can use:

$url = 'http://www.example.com';
print_r(get_headers($url, 1));

Alternatively, for PHP 4, a user has contributed the following:

/**
 * This is a modified version of code from "stuart at sixletterwords dot com",
 * at 14-Sep-2005 04:52. This version tries to emulate get_headers() function
 * at PHP4. I think it works fairly well, and is simple. It is not the best
 * emulation available, but it works.
 *
 * Features:
 * - supports (and requires) full URLs.
 * - supports changing of default port in URL.
 * - stops downloading from socket as soon as end-of-headers is detected.
 *
 * Limitations:
 * - only gets the root URL (see line with "GET / HTTP/1.1").
 * - don't support HTTPS (nor the default HTTPS port).
 */
if(!function_exists('get_headers'))
{
    function get_headers($url,$format=0)
    {
        $url=parse_url($url);
        $end = "\r\n\r\n";
        // parse_url() returns lowercase keys, so the host is under 'host'.
        $fp = fsockopen($url['host'], (empty($url['port'])?80:$url['port']), $errno, $errstr, 30);
        if ($fp)
        {
            $out  = "GET / HTTP/1.1\r\n";
            $out .= "Host: ".$url['host']."\r\n";
            $out .= "Connection: Close\r\n\r\n";
            $var  = '';
            fwrite($fp, $out);
            while (!feof($fp))
            {
                $var.=fgets($fp, 1280);
                if(strpos($var,$end))
                    break;
            }
            fclose($fp);
            // Drop everything after the blank line that ends the headers.
            $var=preg_replace("/\r\n\r\n.*\$/",'',$var);
            $var=explode("\r\n",$var);
            if($format)
            {
                foreach($var as $i)
                {
                    if(preg_match('/^([a-zA-Z -]+): +(.*)$/',$i,$parts))
                        $v[$parts[1]]=$parts[2];
                }
                return $v;
            }
            else
                return $var;
        }
    }
}

Both would produce a result similar to:

Array
(
    [0] => HTTP/1.1 200 OK
    [Date] => Sat, 29 May 2004 12:28:14 GMT
    [Server] => Apache/1.3.27 (Unix) (Red-Hat/Linux)
    [Last-Modified] => Wed, 08 Jan 2003 23:11:55 GMT
    [ETag] => "3f80f-1b6-3e1cb03b"
    [Accept-Ranges] => bytes
    [Content-Length] => 438
    [Connection] => close
    [Content-Type] => text/html
)

So you can simply check that the header response was OK, e.g.:

$headers = get_headers($url, 1);
if ($headers[0] == 'HTTP/1.1 200 OK') {
    //valid
}
if ($headers[0] == 'HTTP/1.1 301 Moved Permanently') {
    //moved or redirect page
}

W3C codes and definitions
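Comparing against the exact string 'HTTP/1.1 200 OK' misses servers that answer with HTTP/1.0 or with a different reason phrase. Below is a minimal sketch, assuming PHP 7 or later, that extracts the numeric code instead. The helper names status_codes_from_headers and url_is_404 are made up for illustration. When get_headers() follows redirects it returns one status line per hop, so the last code describes the final response.

```php
<?php
/**
 * Extract every HTTP status code from the array returned by get_headers().
 * When redirects are followed there is one status line per hop.
 */
function status_codes_from_headers(array $headers): array
{
    $codes = [];
    foreach ($headers as $key => $value) {
        // Status lines sit under integer keys in both output formats.
        if (is_int($key) && is_string($value)
            && preg_match('#^HTTP/\S+\s+(\d{3})#', $value, $m)) {
            $codes[] = (int) $m[1];
        }
    }
    return $codes;
}

/** Hypothetical helper: true when the last response in the chain is a 404. */
function url_is_404(string $url): bool
{
    $headers = @get_headers($url, 1); // false on network failure
    if ($headers === false) {
        return false; // unreachable is not the same thing as 404
    }
    $codes = status_codes_from_headers($headers);
    return $codes !== [] && end($codes) === 404;
}
```

Calling url_is_404('http://www.example.com/missing') would then report whether the final hop was a 404. Whether an unreachable host should count as "not 404" is a design choice; here it does.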

Aram Kocharyan · Answer

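With strager's code, you can also check CURLINFO_HTTP_CODE for other codes. Some websites do not report a 404; they simply redirect to a custom 404 page and return 302 (redirect) or something similar. I used this to check whether an actual file (e.g. robots.txt) existed on the server. Clearly this kind of file would not cause a redirect if it existed, but if it didn't, the server would redirect to a 404 page. Note that the function below returns true for any response outside the 2xx range, not only for 404.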

function is_404($url) {
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
    /* Get the HTML or whatever is linked in $url. */
    $response = curl_exec($handle);
    /* Check for 404 (file not found). */
    $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);
    /* If the document has loaded successfully without any redirection or error */
    if ($httpCode >= 200 && $httpCode < 300) {
        return false;
    } else {
        return true;
    }
}
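Since a redirect may or may not hide a missing page, it can help to report it separately instead of folding everything that is not 2xx into one result. The following is only a sketch under that assumption: the function names and the category labels ('ok', 'redirect', 'not_found', and so on) are invented for illustration.

```php
<?php
/**
 * Map an HTTP status code to a coarse category.
 * The category names are arbitrary labels chosen for this sketch.
 */
function classify_status(int $code): string
{
    if ($code === 0) {
        return 'no_response';   // cURL reports 0 when no HTTP response arrived
    }
    if ($code >= 200 && $code < 300) {
        return 'ok';
    }
    if ($code >= 300 && $code < 400) {
        return 'redirect';      // possibly a soft 404; inspect the target
    }
    if ($code === 404 || $code === 410) {
        return 'not_found';
    }
    return 'error';             // any other 1xx, 4xx or 5xx code
}

/** Fetch $url without following redirects and classify the response. */
function fetch_status_category(string $url): string
{
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, false);
    curl_exec($handle);
    $code = (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);
    return classify_status($code);
}
```

A caller could treat 'redirect' for a file such as robots.txt as "probably missing", which matches the reasoning above, while still telling it apart from a real 404.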

Beau Simensen · Answer

As strager suggests, look into using cURL. You may also be interested in setting CURLOPT_NOBODY with curl_setopt to skip downloading the whole page (you only want the headers).
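A minimal sketch of such a header-only check follows. The function names are hypothetical, and the fallback to GET is an added assumption: some servers answer HEAD with 405 or 501, in which case a normal request may still succeed.

```php
<?php
/**
 * Some servers reject HEAD with 405 (Method Not Allowed) or
 * 501 (Not Implemented); in that case a GET is worth retrying.
 */
function should_retry_with_get(int $code): bool
{
    return $code === 405 || $code === 501;
}

/** Return the HTTP status code for $url, or 0 if no response arrived. */
function head_status_code(string $url, int $timeout = 10): int
{
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_NOBODY, true);      // send HEAD, skip the body
    curl_setopt($handle, CURLOPT_TIMEOUT, $timeout);
    curl_exec($handle);
    $code = (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);

    if (should_retry_with_get($code)) {
        curl_setopt($handle, CURLOPT_NOBODY, false);
        curl_setopt($handle, CURLOPT_HTTPGET, true); // switch back to GET
        curl_exec($handle);
        $code = (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);
    }
    curl_close($handle);
    return $code;
}
```

Usage would look like `if (head_status_code($url) === 404) { /* skip this URL */ }`.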

Nasaralla · Answer

If you're looking for the simplest solution, one you can try in one go, do:

// The scheme is required; without it PHP looks for a local file.
file_get_contents('http://www.yoursite.com');
// and check by echoing
echo $http_response_header[0];

Ross · Answer

I found this answer here:

if(($Twitter_XML_raw=file_get_contents($timeline))==false){
    // Retrieve HTTP status code
    list($version,$status_code,$msg) = explode(' ',$http_response_header[0], 3);
    // Check the HTTP Status code
    switch($status_code) {
        case 200:
            $error_status="200: Success";
            break;
        case 401:
            $error_status="401: Login failure. Try logging out and back in. Password are ONLY used when posting.";
            break;
        case 400:
            $error_status="400: Invalid request. You may have exceeded your rate limit.";
            break;
        case 404:
            $error_status="404: Not found. This shouldn't happen. Please let me know what happened using the feedback link above.";
            break;
        case 500:
            $error_status="500: Twitter servers replied with an error. Hopefully they'll be OK soon!";
            break;
        case 502:
            $error_status="502: Twitter servers may be down or being upgraded. Hopefully they'll be OK soon!";
            break;
        case 503:
            $error_status="503: Twitter service unavailable. Hopefully they'll be OK soon!";
            break;
        default:
            $error_status="Undocumented error: " . $status_code;
            break;
    }
}

Essentially, you use file_get_contents to retrieve the URL, which automatically populates the $http_response_header variable with the response headers, including the status line.
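The same idea can be packaged as a function. This is a sketch only, assuming PHP 7.1 or later: last_status_code and status_via_stream are made-up names, and the stream context options ('method', 'ignore_errors', 'timeout') are added here so that a 4xx or 5xx response still yields headers and no body is downloaded.

```php
<?php
/** Return the status code of the last status line in $lines, or null. */
function last_status_code(array $lines): ?int
{
    $code = null;
    foreach ($lines as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];   // later hops overwrite earlier ones
        }
    }
    return $code;
}

/** Hypothetical helper: status code for $url, or null on network failure. */
function status_via_stream(string $url, float $timeout = 10.0): ?int
{
    $context = stream_context_create(['http' => [
        'method'        => 'HEAD',   // ask for headers only
        'ignore_errors' => true,     // do not treat 4xx/5xx as a failure
        'timeout'       => $timeout,
    ]]);
    $body = @file_get_contents($url, false, $context);
    // $http_response_header is created in this scope by the HTTP wrapper.
    if ($body === false || !isset($http_response_header)) {
        return null;
    }
    return last_status_code($http_response_header);
}
```

Because $http_response_header lists the headers of every hop when redirects are followed, the sketch takes the last status line instead of element 0.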

Email · Answer

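Addendum: I tested these three methods with performance in mind.

The result, at least in my testing environment:

cURL wins

This test assumes that only the headers are needed (CURLOPT_NOBODY). Try it yourself: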

$url = "http://de.wikipedia.org/wiki/Pinocchio";
$start_time = microtime(TRUE);
$headers = get_headers($url);
echo $headers[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";
$start_time = microtime(TRUE);
$response = file_get_contents($url);
echo $http_response_header[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";
$start_time = microtime(TRUE);
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($handle, CURLOPT_NOBODY, 1); // and *only* get the header
/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);
/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
// if($httpCode == 404) {
//     /* Handle 404 here. */
// }
echo $httpCode."<br>";
curl_close($handle);
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";

Juergen · Answer

This returns true if the URL does not return 200 OK.

function check_404($url) {
    $headers=get_headers($url, 1);
    if ($headers[0]!='HTTP/1.1 200 OK') return true; else return false;
}

markus · Answer

As an additional hint to the great accepted answer:

When using a variation of the proposed solution, I got errors because of the PHP setting 'max_execution_time'. So I did the following:

set_time_limit(120);
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_NOBODY, true);
$result = curl_exec($curl);
set_time_limit(ini_get('max_execution_time'));
curl_close($curl);

First I set the time limit to a higher number of seconds, and at the end I set it back to the value defined in the PHP settings.
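A possible alternative to raising the script limit is to bound the request itself with cURL's own timeouts. This is a rough sketch and not part of the answer above: safe_timeout and probe_with_timeout are hypothetical names, and the default values of 5, 15 and 30 seconds are arbitrary.

```php
<?php
/**
 * Pick a cURL timeout that stays below PHP's max_execution_time.
 * $reserve seconds are kept back for the rest of the script.
 */
function safe_timeout(int $maxExecution, int $reserve = 5, int $fallback = 30): int
{
    if ($maxExecution <= 0) {
        return $fallback;            // 0 means no limit is configured
    }
    return max(1, $maxExecution - $reserve);
}

/**
 * Probe $url with explicit cURL timeouts.
 * Returns ['code' => int, 'timed_out' => bool]; code is 0 without a response.
 */
function probe_with_timeout(string $url, int $connectSeconds = 5, int $totalSeconds = 15): array
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, $connectSeconds); // connection phase
    curl_setopt($curl, CURLOPT_TIMEOUT, $totalSeconds);          // whole transfer
    curl_exec($curl);
    $result = [
        'code'      => (int) curl_getinfo($curl, CURLINFO_HTTP_CODE),
        // 28 is libcurl's CURLE_OPERATION_TIMEDOUT error number.
        'timed_out' => curl_errno($curl) === 28,
    ];
    curl_close($curl);
    return $result;
}
```

For example, `probe_with_timeout($url, 5, safe_timeout((int) ini_get('max_execution_time')))` gives up on a slow URL before the script limit is reached, so the limit does not need to be changed and restored.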

Melbin Mathew Antony · Answer

<?php
$url= 'www.something.com';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.4");
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,10);
curl_setopt($ch, CURLOPT_ENCODING, "gzip");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$output = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
echo $httpcode;
?>

T.Todua · Answer

You can also use this code to check the status of any link:

<?php
function get_url_status($url, $timeout = 10)
{
    $ch = curl_init();
    // set cURL options
    $opts = array(
        CURLOPT_RETURNTRANSFER => true,     // do not output to browser
        CURLOPT_URL            => $url,     // set URL
        CURLOPT_NOBODY         => true,     // do a HEAD request only
        CURLOPT_TIMEOUT        => $timeout, // set timeout
    );
    curl_setopt_array($ch, $opts);
    curl_exec($ch); // do it!
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // find HTTP status
    curl_close($ch); // close handle
    echo $status; // or return $status;
    // example checking
    if ($status == '302') { echo 'HEY, redirection'; }
}
get_url_status('http://yourpage.comm');
?>

Andreas · Answer

Here is a short solution.

$handle = curl_init($uri);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($handle,CURLOPT_HTTPHEADER,array ("Accept: application/rdf+xml"));
curl_setopt($handle, CURLOPT_NOBODY, true);
curl_exec($handle);
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode == 200||$httpCode == 303) {
    echo "you might get a reply";
}
curl_close($handle);

In your case, you can change application/rdf+xml to whatever content type you use.

gabriel · Answer

This is just a slice of code; I hope it works for you.

$ch = @curl_init();
@curl_setopt($ch, CURLOPT_URL, 'http://example.com');
@curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
@curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
@curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
@curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$response = @curl_exec($ch);
$errno = @curl_errno($ch);
$error = @curl_error($ch);
$response = $response;
$info = @curl_getinfo($ch);
return $info['http_code'];

wawan · Answer

To catch all errors, 4XX and 5XX, I use this little script:

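function URLIsValid($URL)
{
    $headers = @get_headers($URL);
    // get_headers() returns false when no response was received at all.
    if ($headers === false) {
        return false;
    }
    preg_match("/ [45][0-9]{2} /", (string)$headers[0], $match);
    return count($match) === 0;
}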