Get a website's title from a link

Question

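Notice that each article excerpt on Google News lists its sources at the bottom.

The Guardian - ABC News - Reuters - Bloomberg

I'm trying to imitate that.

For example, when I submit the URL http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/ I want to get back The Washington Times

How is this possible in PHP?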

Jose Vega · Accepted Answer

My answer expands on @AI W's answer of using the page title. Below is the code to accomplish what he said.

    <?php
    function get_title($url) {
        $str = file_get_contents($url);
        if ($str !== false && strlen($str) > 0) {
            // Collapse whitespace so line breaks inside <title> are supported
            $str = trim(preg_replace('/\s+/', ' ', $str));
            // Case-insensitive, non-greedy match on the title element
            if (preg_match('/<title[^>]*>(.*?)<\/title>/i', $str, $title)) {
                return $title[1];
            }
        }
        return null; // fetch failed or no <title> found
    }

    // Example:
    echo get_title("http://www.washingtontimes.com/");
    ?>

Output:

Washington Times - Politics, Breaking News, US and World News

As you can see, this is not exactly what Google uses, which leads me to believe that they take the URL's hostname and match it against a list of their own.

http://www.washingtontimes.com/ => The Washington Times
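
The hostname-to-name mapping guessed at above could be sketched roughly like this. It is only an illustration of the idea, not what Google actually does: the `SOURCE_NAMES` table and the `source_name()` function are hypothetical names, and the table entries would have to be maintained by hand.

```php
<?php
// Hypothetical lookup table: hostname (without "www.") => display name.
// The entries are illustrative only.
const SOURCE_NAMES = [
    'washingtontimes.com' => 'The Washington Times',
    'reuters.com'         => 'Reuters',
];

/**
 * Return the display name for a URL's host, or the bare host name if the
 * host is not in the table. Returns null when the URL has no host part.
 */
function source_name(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }
    $host = strtolower($host);
    // Strip a leading "www." so both forms share one table entry.
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    return SOURCE_NAMES[$host] ?? $host;
}

echo source_name('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/'), "\n";
```

Falling back to the bare host name keeps unknown sites readable until they are added to the table.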

Matthew · Answer

    $doc = new DOMDocument();
    @$doc->loadHTMLFile('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/');
    $xpath = new DOMXPath($doc);
    echo $xpath->query('//title')->item(0)->nodeValue."\n";

Output:

Debt commission falls short on test vote - Washington Times

Of course, you should also implement basic error handling.
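
As one possible shape for that error handling, here is a minimal sketch. The function names `title_from_html()` and `title_from_url()` are made up for illustration, and it assumes the DOM extension is available. It returns null instead of raising a notice when the fetch fails or the page has no title element.

```php
<?php
/**
 * Extract the <title> text from an HTML string.
 * Returns null when the input is empty or has no <title> element.
 */
function title_from_html(string $html): ?string
{
    if (trim($html) === '') {
        return null;
    }
    // Collect libxml parse warnings instead of printing them
    // (an alternative to silencing loadHTML with the @ operator).
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    $nodes = (new DOMXPath($doc))->query('//title');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // Collapse runs of whitespace, including line breaks inside the title.
    return trim(preg_replace('/\s+/', ' ', $nodes->item(0)->nodeValue));
}

/** Fetch a URL and return its title, or null if the fetch fails. */
function title_from_url(string $url): ?string
{
    $html = @file_get_contents($url); // false on network or HTTP failure
    return $html === false ? null : title_from_html($html);
}
```

Keeping the parsing separate from the fetching means the parsing part can be checked against a plain string without any network access.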

James Sumners · Answer

You can fetch the contents of the URL and run a regular expression search for the contents of the title element.

    <?php
    $urlContents = file_get_contents("http://example.com/");
    // The "/" in the closing tag must be escaped because "/" is the pattern delimiter
    preg_match("/<title>(.*)<\/title>/i", $urlContents, $matches);
    print($matches[1] . "\n"); // "Example Web Page"
    ?>

Or, if you don't want to use a regular expression (to match something very near the top of the document), you can use a DOMDocument object.

    <?php
    $urlContents = file_get_contents("http://example.com/");
    $dom = new DOMDocument();
    @$dom->loadHTML($urlContents);
    $title = $dom->getElementsByTagName('title');
    print($title->item(0)->nodeValue . "\n"); // "Example Web Page"
    ?>

Which method you prefer is up to you.

Cups · Answer

Using get_meta_tags() on the domain's home page returns something that could be useful, although for the NYT it might need truncating.

    $b = "http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/";
    $url = parse_url($b);
    // parse_url() returns lowercase keys: 'scheme', 'host', 'path', ...
    $tags = get_meta_tags($url['scheme'].'://'.$url['host']);
    var_dump($tags);

This includes the description "The Washington Times delivers breaking news and commentary on the issues that affect the future of our nation."

Novikov · Answer

The PHP manual on cURL:

    <?php
    $ch = curl_init("http://www.example.com/");
    $fp = fopen("example_homepage.txt", "w");
    curl_setopt($ch, CURLOPT_FILE, $fp);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_exec($ch);
    curl_close($ch);
    fclose($fp);
    ?>

The PHP manual on Perl-compatible regular expression matching:

    <?php
    $subject = "abcdef";
    $pattern = '/^def/';
    preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, 3);
    print_r($matches);
    ?>

And putting the two together:

    <?php
    // create curl resource
    $ch = curl_init();
    // set url
    curl_setopt($ch, CURLOPT_URL, "example.com");
    // return the transfer as a string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // $output contains the output string
    $output = curl_exec($ch);
    $pattern = '/[<]title[>]([^<]*)[<][\/]titl/i';
    preg_match($pattern, $output, $matches);
    print_r($matches);
    // close curl resource to free up system resources
    curl_close($ch);
    ?>

I don't have PHP here, so I can't promise this example works, but it should help you get started.

Sudhir Jonathan · Answer

If you're willing to use a third-party service for this, try www.runway7.net/radar

It gives you the title, description, and more. For example, try the Radar example: ( http://radar.runway7.net/?url=http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/ )

Kise Xu · Answer

Get the website's title from a link and convert the title to UTF-8 character encoding:

https://gist.github.com/kisexu/b64bc6ab787f302ae838

    function getTitle($url) {
        // get html via url
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        $html = curl_exec($ch);
        curl_close($ch);

        // get title
        preg_match('/(?<=<title>).+(?=<\/title>)/iU', $html, $match);
        $title = empty($match[0]) ? 'Untitled' : $match[0];
        $title = trim($title);

        // convert title to utf-8 character encoding
        if ($title != 'Untitled') {
            preg_match('/(?<=charset\=).+(?=\")/iU', $html, $match);
            if (!empty($match[0])) {
                $charset = str_replace('"', '', $match[0]);
                $charset = str_replace("'", '', $charset);
                $charset = strtolower(trim($charset));
                if ($charset != 'utf-8') {
                    $title = iconv($charset, 'utf-8', $title);
                }
            }
        }
        return $title;
    }

xianyu · Answer

I wrote a function to handle it:

    function getURLTitle($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $content = curl_exec($ch);
        $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
        $charset = '';
        // 1. charset from the HTTP Content-Type header
        if ($contentType && preg_match('/\bcharset=([\w-]+)/i', $contentType, $matches)) {
            $charset = $matches[1];
        }
        curl_close($ch);

        if (is_string($content) && preg_match('/<title\b[^>]*>(.*?)<\/title>/is', $content, $matches)) {
            $title = $matches[1];
            // preg_match_all() puts the full matches in $metas[0]
            if (!$charset && preg_match_all('/<meta\b[^>]*>/i', $content, $metas)) {
                // 2. <meta http-equiv="Content-Type" content="...; charset=...">
                foreach ($metas[0] as $meta) {
                    $meta = strtolower($meta);
                    if (strpos($meta, 'content-type') !== false && preg_match('/\bcharset=([\w-]+)/', $meta, $ms)) {
                        $charset = $ms[1];
                        break;
                    }
                }
                if (!$charset) {
                    // 3. <meta charset=utf-8> or <meta charset='utf-8'>
                    foreach ($metas[0] as $meta) {
                        $meta = strtolower($meta);
                        if (preg_match('/\bcharset=[\'"]?([\w-]+)/', $meta, $ms)) {
                            $charset = $ms[1];
                            break;
                        }
                    }
                }
            }
            return $charset ? iconv($charset, 'utf-8', $title) : $title;
        }
        return $url;
    }

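It fetches the web page's content and tries to determine the document's character encoding from the following sources (highest priority to lowest):

An HTTP "charset" parameter in a "Content-Type" field.
A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
The charset attribute set on an element that designates an external resource.

(See http://www.w3.org/TR/html4/charset.html )

It then uses iconv to convert the title to UTF-8 encoding.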
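
The priority order can also be pulled out into a small helper that works on a header string and an HTML string, which makes it easy to try without a network request. This is only a sketch: `detect_charset()` is a hypothetical name, and the third step follows the comments in the function above (a `<meta charset=...>` tag) rather than the wording of the HTML 4 specification.

```php
<?php
/**
 * Guess a document's charset: HTTP Content-Type header first,
 * then <meta http-equiv="Content-Type">, then <meta charset="...">.
 * Returns a lowercase charset name, or null if none is declared.
 */
function detect_charset(?string $contentTypeHeader, string $html): ?string
{
    // 1. charset parameter of the HTTP Content-Type header
    if ($contentTypeHeader !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentTypeHeader, $m)) {
        return strtolower($m[1]);
    }
    // Steps 2 and 3 look at each <meta ...> tag separately.
    if (preg_match_all('/<meta\b[^>]*>/i', $html, $tags)) {
        // 2. <meta http-equiv="Content-Type" content="text/html; charset=...">
        foreach ($tags[0] as $tag) {
            if (stripos($tag, 'http-equiv') !== false
                && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $tag, $m)) {
                return strtolower($m[1]);
            }
        }
        // 3. <meta charset="...">
        foreach ($tags[0] as $tag) {
            if (preg_match('/\bcharset\s*=\s*["\']?([\w.:-]+)/i', $tag, $m)) {
                return strtolower($m[1]);
            }
        }
    }
    return null;
}
```

A caller would pass the value of `curl_getinfo($ch, CURLINFO_CONTENT_TYPE)` as the first argument and only call iconv when the result is neither null nor 'utf-8'.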

István Ujj-Mészáros · Answer

Alternatively, you can use Simple Html Dom Parser.

    <?php
    require_once('simple_html_dom.php');
    $html = file_get_html('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/');
    echo $html->find('title', 0)->innertext . "<br>\n";
    echo $html->find('div[class=entry-content]', 0)->innertext;

Jake · Answer

I try to avoid regular expressions when they aren't necessary. I wrote the function below to get a website's title with curl and DOMDocument.

    function website_title($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        // some websites like Facebook need a user agent to be set.
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36');
        $html = curl_exec($ch);
        curl_close($ch);

        $dom = new DOMDocument;
        @$dom->loadHTML($html);
        // item() expects an integer index
        $title = $dom->getElementsByTagName('title')->item(0)->nodeValue;
        return $title;
    }

    echo website_title('https://www.facebook.com/');

The above returns: Welcome to Facebook - Log In, Sign Up or Learn More