How do I extract links and titles from an .html page?

Question

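I would like to add a new feature to my website.

I want users to be able to upload a bookmark backup file (from any browser, if possible) so that the bookmarks can be added to their profile without having to enter them all by hand...

The only part I am missing is extracting the titles and URLs from the uploaded file.

I used the search option, and "How to extract data from a raw HTML file?" is the most closely related question I found, but it does not cover this.

I don't really mind whether the solution uses jQuery or PHP.

Thank you very much.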

Toni Michel Caubet · Accepted Answer

Thank you, everyone.

Final code:

    $html = file_get_contents('bookmarks.html');

    // Create a new DOM document
    $dom = new DOMDocument;

    // Parse the HTML. The @ is used to suppress any parsing errors
    // that will be thrown if the $html string isn't valid XHTML.
    @$dom->loadHTML($html);

    // Get all links. You could also use any other tag name here,
    // like 'img' or 'table', to extract other tags.
    $links = $dom->getElementsByTagName('a');

    // Iterate over the extracted links and display their text and URLs
    foreach ($links as $link) {
        // Show the anchor text, then extract and show the "href" attribute.
        echo $link->nodeValue;
        echo $link->getAttribute('href'), '<br>';
    }

This prints the anchor text and the href of every link in the .html file.
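For the upload feature in the question, the pairs would probably be collected rather than echoed. The following is a minimal sketch under stated assumptions: the function name `extractBookmarks` and the sample markup are hypothetical, and `libxml_use_internal_errors()` is used in place of the `@` operator to keep parser warnings quiet.

```php
<?php
// Hypothetical helper: returns a list of ['title' => ..., 'url' => ...] pairs.
function extractBookmarks(string $html): array
{
    // Collect libxml warnings internally instead of emitting them
    // (an alternative to the @ operator).
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bookmarks = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $url = trim($link->getAttribute('href'));
        if ($url === '') {
            continue; // skip anchors without a target
        }
        $bookmarks[] = [
            'title' => trim($link->nodeValue),
            'url'   => $url,
        ];
    }
    return $bookmarks;
}

// Made-up sample in the style of a browser bookmark export.
$sample = '<DL><p>
<DT><A HREF="https://example.com/">Example</A>
<DT><A HREF="https://www.php.net/">PHP</A>
</DL>';

print_r(extractBookmarks($sample));
```

Returning an array keeps the parsing separate from whatever storage the site uses for profile bookmarks.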

Again, thank you very much.

Matthew · Answer

This is probably enough:

    $dom = new DOMDocument;
    $dom->loadHTML($html);
    foreach ($dom->getElementsByTagName('a') as $node) {
        echo $node->nodeValue.': '.$node->getAttribute("href")."\n";
    }

Simon Groenewolt · Answer

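Assuming the stored links are in an HTML file, the best solution is probably to use an HTML parser such as PHP Simple HTML DOM Parser. (The other options are a basic string search or a regular expression. You should probably never use regular expressions to parse HTML.)

After reading the HTML file with the parser, use its functions to find the a tags.

From the tutorial: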

    // Find all links
    foreach($html->find('a') as $element)
        echo $element->href . '<br>';
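The same read-then-find flow can also be sketched with PHP's built-in DOMXPath, swapped in here for Simple HTML DOM Parser so that the example runs without an extra library. The sample markup and variable names are assumptions made for illustration.

```php
<?php
// Made-up input; in practice this would come from file_get_contents().
$html = '<html><body>
<a href="http://clicklink.com">here</a>
<a name="top">no href</a>
<a href="http://foobar.com">Foobar</a>
</body></html>';

$dom = new DOMDocument;
// Keep parser warnings out of the output.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$found = [];
// Select only <a> elements that actually carry an href attribute.
foreach ($xpath->query('//a[@href]') as $a) {
    $found[] = $a->getAttribute('href') . ' => ' . trim($a->textContent);
}

echo implode("\n", $found), "\n";
```

The XPath filter `//a[@href]` drops named anchors that have no target, which a plain tag-name lookup would still return.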

Adrian Cid Almaguer · Answer

This is an example. In your case you can load the file like this:

$content = file_get_contents('bookmarks.html');

Then run this:

    <?php
    $content = '<html>
    <title>Random Website I am Crawling</title>
    <body>
    Click <a href="http://clicklink.com">here</a> for foobar
    Another site is http://foobar.com
    </body>
    </html>';

    $regex = "((https?|ftp)\:\/\/)?"; // SCHEME
    $regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
    $regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
    $regex .= "(\:[0-9]{2,5})?"; // Port
    $regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
    $regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
    $regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor

    $matches = array(); // create array
    $pattern = "/$regex/";

    preg_match_all($pattern, $content, $matches);

    print_r(array_values(array_unique($matches[0])));
    echo "<br><br>";
    echo implode("<br>", array_values(array_unique($matches[0])));

Output:

Array ( [0] => http://clicklink.com [1] => http://foobar.com )

http://clicklink.com

http://foobar.com

Raghavendra · Answer

    $html = file_get_contents('your file path');
    $dom = new DOMDocument;
    @$dom->loadHTML($html);

    $styles = $dom->getElementsByTagName('link');
    $links = $dom->getElementsByTagName('a');
    $scripts = $dom->getElementsByTagName('script');

    foreach ($styles as $style) {
        if ($style->getAttribute('href') != "#") {
            echo $style->getAttribute('href');
            echo '<br>';
        }
    }

    foreach ($links as $link) {
        if ($link->getAttribute('href') != "#") {
            echo $link->getAttribute('href');
            echo '<br>';
        }
    }

    foreach ($scripts as $script) {
        echo $script->getAttribute('src');
        echo '<br>';
    }
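The three loops above follow one pattern, so they could be folded into a single helper. This is only a sketch: `collectAttribute` is a hypothetical name, the sample markup is made up, and unlike the code above it also skips empty values (for example, inline script tags that have no src).

```php
<?php
// Hypothetical helper: gather one attribute from every element with the given tag name.
function collectAttribute(DOMDocument $dom, string $tag, string $attribute): array
{
    $values = [];
    foreach ($dom->getElementsByTagName($tag) as $node) {
        $value = $node->getAttribute($attribute);
        // Skip missing attributes and placeholder "#" links.
        if ($value !== '' && $value !== '#') {
            $values[] = $value;
        }
    }
    return $values;
}

// Made-up page used only for illustration.
$html = '<html><head>
<link rel="stylesheet" href="style.css">
<script src="app.js"></script>
<script>var inline = true;</script>
</head><body>
<a href="#">top</a>
<a href="http://foobar.com">Foobar</a>
</body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$styles  = collectAttribute($dom, 'link', 'href');
$links   = collectAttribute($dom, 'a', 'href');
$scripts = collectAttribute($dom, 'script', 'src');

echo implode("\n", array_merge($styles, $links, $scripts)), "\n";
```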