Can you provide examples of parsing HTML?

Question

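How do you parse HTML with a variety of languages and parsing libraries?

When answering:

Individual comments will be linked to from answers to questions about parsing HTML with regular expressions, as a way of showing the right way to do it.

For the sake of consistency, I ask that the example parse an HTML file for the href in anchor tags. To make this question easy to search, please follow this format:

Language: [language name]

Library: [library name]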

[example code]

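Please make the library name a link to the library's documentation. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]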

Ward Werbrouck · Answer

Language: JavaScript
Library: jQuery

    $.each($('a[href]'), function(){
        console.debug(this.href);
    });

(using Firebug's console.debug for output...)

And loading any HTML page:

    $.get('http://stackoverflow.com/', function(page){
        $(page).find('a[href]').each(function(){
            console.debug(this.href);
        });
    });

I used another each function for this one; I think it's cleaner when chaining methods.

alexn · Answer

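Language: C#
Library: HtmlAgilityPack

    using System;
    using HtmlAgilityPack;

    class Program
    {
        static void Main(string[] args)
        {
            var web = new HtmlWeb();
            var doc = web.Load("http://www.stackoverflow.com");

            // SelectNodes returns null when nothing matches.
            var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
            if (nodes == null) return;

            foreach (var node in nodes)
            {
                // Print the href value of each anchor.
                Console.WriteLine(node.GetAttributeValue("href", ""));
            }
        }
    }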

Paolo Bergantino · Answer

Language: Python
Library: BeautifulSoup

    from bs4 import BeautifulSoup

    html = "<html><body>"
    for link in ("foo", "bar", "baz"):
        html += '<a href="http://%s.com">%s</a>' % (link, link)
    html += "</body></html>"

    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all('a', href=True)  # find <a> with a defined href attribute
    print(links)

Output:

    [<a href="http://foo.com">foo</a>, <a href="http://bar.com">bar</a>, <a href="http://baz.com">baz</a>]

Also possible:

    for link in links:
        print(link['href'])

Output:

    http://foo.com
    http://bar.com
    http://baz.com

draegtun · Answer

Language: Perl
Library: pQuery

    use strict;
    use warnings;
    use pQuery;

    my $html = join '',
        "<html><body>",
        (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
        "</body></html>";

    pQuery( $html )->find( 'a' )->each(
        sub {
            my $at = $_->getAttribute( 'href' );
            print "$at\n" if defined $at;
        }
    );

user80168 · Answer

Language: shell
Library: lynx (well, it's not a library, but in shell every program is kind of a library)

    lynx -dump -listonly http://news.google.com/

Pesto · Answer

Language: Ruby
Library: Hpricot

    #!/usr/bin/ruby

    require 'hpricot'

    html = '<html><body>'
    ['foo', 'bar', 'baz'].each {|link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
    html += '</body></html>'

    doc = Hpricot(html)
    doc.search('//a').each {|elm| puts elm.attributes['href'] }

Chas. Owens · Answer

Language: Python
Library: HTMLParser

    #!/usr/bin/env python3

    from html.parser import HTMLParser

    class FindLinks(HTMLParser):
        def __init__(self):
            super().__init__()

        def handle_starttag(self, tag, attrs):
            at = dict(attrs)
            if tag == 'a' and 'href' in at:
                print(at['href'])


    find = FindLinks()

    html = "<html><body>"
    for link in ("foo", "bar", "baz"):
        html += '<a href="http://%s.com">%s</a>' % (link, link)
    html += "</body></html>"

    find.feed(html)

Chas. Owens · Answer

Language: Perl
Library: HTML::Parser

    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::Parser;

    my $find_links = HTML::Parser->new(
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                if ($tag eq 'a' and exists $attr->{href}) {
                    print "$attr->{href}\n";
                }
            },
            "tag, attr"
        ]
    );

    my $html = join '',
        "<html><body>",
        (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
        "</body></html>";

    $find_links->parse($html);

user80168 · Answer

Language: Perl
Library: HTML::LinkExtor

The beauty of Perl is that it has modules for very specific tasks, such as link extraction.

Whole program:

    #!/usr/bin/perl -w
    use strict;

    use HTML::LinkExtor;
    use LWP::Simple;

    my $url     = 'http://www.google.com/';
    my $content = get( $url );

    my $p = HTML::LinkExtor->new( \&process_link, $url, );
    $p->parse( $content );

    exit;

    sub process_link {
        my ( $tag, %attr ) = @_;

        return unless $tag eq 'a';
        return unless defined $attr{ 'href' };

        print "- $attr{'href'}\n";
        return;
    }

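Explanation:

use strict - turns on "strict" mode; makes debugging easier, not fully relevant to the example
use HTML::LinkExtor - loads the interesting module
use LWP::Simple - just a simple way to get some HTML for tests
my $url = 'http://www.google.com/' - the page we will be extracting URLs from
my $content = get( $url ) - fetches the page HTML
my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates a LinkExtor object, giving it a reference to the function that will be used as a callback on every URL, and $url to use as the base URL for relative URLs
$p->parse( $content ) - pretty obvious, I guess
exit - end of program
sub process_link - beginning of the function process_link
my ( $tag, %attr ) - gets the arguments, which are the tag name and its attributes
return unless $tag eq 'a' - skips processing if the tag is not <a>
return unless defined $attr{'href'} - skips processing if the <a> tag doesn't have an href attribute
print "- $attr{'href'}\n"; - pretty obvious, I guess :)
return; - finishes the function

That's all.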
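The same callback-plus-base-URL idea can be sketched with Python's standard library. This is only an illustration, not a port of HTML::LinkExtor: the class name LinkCollector, the sample markup, and the base URL http://www.example.com/ are assumptions made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: calls `callback` for every <a href=...> it sees."""

    def __init__(self, callback, base_url):
        super().__init__()
        self.callback = callback
        self.base_url = base_url

    def handle_starttag(self, tag, attrs):
        # Skip anything that is not an anchor.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href attribute.
        if href is None:
            return
        # Resolve relative links against the base URL before reporting them.
        self.callback(tag, urljoin(self.base_url, href))


found = []
collector = LinkCollector(lambda tag, href: found.append(href),
                          "http://www.example.com/")
collector.feed('<a href="/about">About</a><a name="x">no href</a>'
               '<a href="http://foo.com/">foo</a>')
collector.close()

for href in found:
    print("- " + href)
```

Passing the callback into the constructor mirrors the shape of the Perl program: the parser owns the traversal, and the caller only decides what to do with each link.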

Jules Glegg · Answer

Language: Ruby
Library: Nokogiri

    #!/usr/bin/env ruby
    require 'nokogiri'
    require 'open-uri'

    document = Nokogiri::HTML(URI.open("http://google.com"))
    document.css("html head title").first.content
    # => "Google"
    document.xpath("//title").first.content
    # => "Google"

dmitry_vk · Answer

Language: Common Lisp
Library: Closure Html, Closure Xml, CL-WHO

(shown using the DOM API, without using the XPath or STP API)

    (defvar *html*
      (who:with-html-output-to-string (stream)
        (:html
         (:body (loop
                   for site in (list "foo" "bar" "baz")
                   do (who:htm (:a :href (format nil "http://~A.com/" site))))))))

    (defvar *dom*
      (chtml:parse *html* (cxml-dom:make-dom-builder)))

    (loop
       for tag across (dom:get-elements-by-tag-name *dom* "a")
       collect (dom:get-attribute tag "href"))
    =>
    ("http://foo.com/" "http://bar.com/" "http://baz.com/")
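For comparison, the same DOM-style calls (get elements by tag name, then get an attribute) exist in Python's standard library. The sketch below assumes well-formed markup, because xml.dom.minidom is an XML parser and will reject the tag soup that real HTML pages often contain.

```python
from xml.dom import minidom

# Build a three-link document similar to the one generated above.
html = "<html><body>"
for site in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com/">%s</a>' % (site, site)
html += "</body></html>"

# minidom only accepts well-formed XML; this input happens to qualify.
dom = minidom.parseString(html)

# Same two DOM calls as the Lisp version: by tag name, then by attribute.
hrefs = [a.getAttribute("href") for a in dom.getElementsByTagName("a")]
print(hrefs)
```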

Michał Marczyk · Answer

Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)

Selector expression:

    (def test-select
         (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))

Now we can do the following at the REPL (I've added line breaks in test-select):

    user> test-select
    ({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
     {:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
     {:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
    user> (map #(get-in % [:attrs :href]) test-select)
    ("http://foo.com/" "http://bar.com/" "http://baz.com/")

You'll need the following to try it out:

Preamble:

    (require '[net.cgrand.enlive-html :as html])

Test HTML:

    (def test-html
         (apply str (concat ["<html><body>"]
                            (for [link ["foo" "bar" "baz"]]
                              (str "<a href=\"http://" link ".com/\">" link "</a>"))
                            ["</body></html>"])))

laz · Answer

Language: Java
Library: XOM, TagSoup

I've included intentionally malformed and inconsistent XML in this sample.

    import java.io.IOException;

    import nu.xom.Builder;
    import nu.xom.Document;
    import nu.xom.Element;
    import nu.xom.Node;
    import nu.xom.Nodes;
    import nu.xom.ParsingException;
    import nu.xom.ValidityException;

    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.SAXException;

    public class HtmlTest {
        public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
            final Parser parser = new Parser();
            parser.setFeature(Parser.namespacesFeature, false);
            final Builder builder = new Builder(parser);
            final Document document = builder.build("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
            final Element root = document.getRootElement();
            final Nodes links = root.query("//a[@href]");
            for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
                final Node node = links.get(linkNumber);
                System.out.println(((Element) node).getAttributeValue("href"));
            }
        }
    }

By default, TagSoup adds an XML namespace referencing XHTML to the document. I've chosen to suppress that in this sample. Using the default behavior would require the call to root.query to include a namespace, like so:

    root.query("//xhtml:a[@href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI()))
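The effect of the namespace on the query can be seen in isolation with a short Python sketch. It uses xml.etree.ElementTree rather than XOM or TagSoup, and the sample document is a hypothetical, well-formed stand-in for what a namespace-adding parser would produce.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical well-formed document carrying the XHTML default namespace.
doc = (
    '<html xmlns="%s"><body>'
    '<a href="http://google.com">google</a>'
    '<a name="nothing">nothing</a>'
    '</body></html>' % XHTML_NS
)
root = ET.fromstring(doc)

# Without a prefix mapping, "a" only matches elements in no namespace.
unqualified = root.findall(".//a[@href]")

# Binding the "xhtml" prefix to the namespace makes the query match.
qualified = root.findall(".//xhtml:a[@href]", {"xhtml": XHTML_NS})

print(len(unqualified))
print([a.get("href") for a in qualified])
```

The prefix itself is arbitrary; what matters is that it is bound to the same namespace URI the document uses.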

Tanktalus · Answer

Language: Perl
Library: XML::Twig

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode ':all';
    use LWP::Simple;
    use XML::Twig;

    #my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
    my $url = 'http://www.google.com';
    my $content = get($url);
    die "Couldn't fetch!" unless defined $content;

    my $twig = XML::Twig->new();
    $twig->parse_html($content);

    my @hrefs = map {
        $_->att('href');
    } $twig->get_xpath('//*[@href]');

    print "$_\n" for @hrefs;

Caveat: you can get wide-character errors with pages like this one (changing the URL to the commented-out one triggers the error), but the HTML::Parser solution above doesn't share this problem.

runrig · Answer

Language: Perl
Library: HTML::Parser
Purpose: How can I remove unused, nested HTML span tags with a Perl regex?

Ward Werbrouck · Answer

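Language: JavaScript
Library: DOM

    var links = document.links;
    // Index loop: for...in would also visit non-element keys such as "length".
    for (var i = 0; i < links.length; i++) {
        var href = links[i].href;
        if (href != null) console.debug(href);
    }

(using Firebug's console.debug for output...)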

zigzag · Answer

Language: C#
Library: System.XML (standard .NET)

    using System.Collections.Generic;
    using System.Xml;

    public class Program
    {
        public static void Main(string[] args)
        {
            List<string> matches = new List<string>();

            XmlDocument xd = new XmlDocument();
            xd.LoadXml("<html>...</html>");

            FindHrefs(xd.FirstChild, matches);
        }

        // Recursively collect the href attribute of every node that has one.
        static void FindHrefs(XmlNode xn, List<string> matches)
        {
            if (xn.Attributes != null && xn.Attributes["href"] != null)
                matches.Add(xn.Attributes["href"].InnerXml);

            foreach (XmlNode child in xn.ChildNodes)
                FindHrefs(child, matches);
        }
    }

Ryan Culpepper · Answer

Language: Racket

Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)

    (require net/url
             (planet ashinn/html-parser:1)
             (planet clements/sxml2:1))

    (define the-url (string->url "http://stackoverflow.com/"))
    (define doc (call/input-url the-url get-pure-port html->sxml))
    (define links ((sxpath "//a/@href/text()") doc))

The above example using packages from the new package system: html-parsing and sxml

    (require net/url
             html-parsing
             sxml)

    (define the-url (string->url "http://stackoverflow.com/"))
    (define doc (call/input-url the-url get-pure-port html->xexp))
    (define links ((sxpath "//a/@href/text()") doc))

Note: install the required packages with raco from the command line:

    raco pkg install html-parsing

and:

    raco pkg install sxml

Alex Reynolds · Answer

Language: Objective-C
Library: libxml2 + Matt Gallagher's libxml2 wrappers + Ben Copsey's ASIHTTPRequest

    ASIHTTPRequest *request = [[ASIHTTPRequest alloc] initWithURL:[NSURL URLWithString:@"http://stackoverflow.com/questions/773340"]];
    [request start];
    NSError *error = [request error];
    if (!error) {
        NSData *response = [request responseData];
        NSLog(@"Data: %@", [[self query:@"//a[@href]" withResponse:response] description]);
        [request release];
    }
    else
        @throw [NSException exceptionWithName:@"kMyHTTPRequestFailed" reason:@"Request failed!" userInfo:nil];

    ...

    - (id) query:(NSString *)xpathQuery withResponse:(NSData *)resp {
        NSArray *nodes = PerformHTMLXPathQuery(resp, xpathQuery);
        if (nodes != nil)
            return nodes;
        return nil;
    }

dfa · Answer

Language: Perl
Library: HTML::TreeBuilder

    use strict;
    use HTML::TreeBuilder;
    use LWP::Simple;

    my $content = get 'http://www.stackoverflow.com';
    my $document = HTML::TreeBuilder->new->parse($content)->eof;

    for my $a ($document->find('a')) {
        print $a->attr('href'), "\n" if $a->attr('href');
    }

Adam · Answer

Language: Python
Library: lxml.html

    import lxml.html

    html = "<html><body>"
    for link in ("foo", "bar", "baz"):
        html += '<a href="http://%s.com">%s</a>' % (link, link)
    html += "</body></html>"

    tree = lxml.html.document_fromstring(html)
    for element, attribute, link, pos in tree.iterlinks():
        if attribute == "href":
            print(link)

lxml also has a CSS selector class for traversing the DOM, which makes using it very similar to using jQuery:

    for a in tree.cssselect('a[href]'):
        print(a.get('href'))

Ward Werbrouck · Answer

Language: PHP
Library: SimpleXML (and DOM)

    <?php
    $page = new DOMDocument();
    $page->strictErrorChecking = false;
    $page->loadHTMLFile('http://stackoverflow.com/questions/773340');
    $xml = simplexml_import_dom($page);

    $links = $xml->xpath('//a[@href]');
    foreach($links as $link)
        echo $link['href']."\n";

seagulf · Answer

Language: Python
Library: HTQL

    import htql;

    page="<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>";
    query="<a>:href,tx";

    for url, text in htql.HTQL(page, query):
        print url, text;

Simple and intuitive.

laz · Answer

Language: Java
Library: jsoup

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class HtmlTest {
        public static void main(final String[] args) {
            final Document document = Jsoup.parse("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>");
            final Elements links = document.select("a[href]");
            for (final Element element : links) {
                System.out.println(element.attr("href"));
            }
        }
    }

the Tin Man · Answer

Language: Ruby
Library: Nokogiri

    #!/usr/bin/env ruby

    require "nokogiri"
    require "open-uri"

    doc = Nokogiri::HTML(URI.open('http://www.example.com'))
    hrefs = doc.search('a').map{ |n| n['href'] }

    puts hrefs

Which outputs:

    /
    /domains/
    /numbers/
    /protocols/
    /about/
    /go/rfc2606
    /about/
    /about/presentations/
    /about/performance/
    /reports/
    /domains/
    /domains/root/
    /domains/int/
    /domains/arpa/
    /domains/idn-tables/
    /protocols/
    /numbers/
    /abuse/
    http://www.icann.org/
    mailto:iana@iana.org?subject=General%20website%20feedback

This is a minor spin on the one above, producing output that is usable for a report. It returns only the first and last elements in the list of hrefs:

    #!/usr/bin/env ruby

    require "nokogiri"
    require "open-uri"

    doc = Nokogiri::HTML(URI.open('http://nokogiri.org'))
    hrefs = doc.search('a[href]').map{ |n| n['href'] }

    puts hrefs
      .each_with_index                     # add an array index
      .minmax{ |a,b| a.last <=> b.last }   # find the first and last element
      .map{ |h,i| '%3d %s' % [1 + i, h ] } # format the output

Which outputs:

      1 http://github.com/tenderlove/nokogiri
    100 http://yokolet.blogspot.com

jGc · Answer

Using PhantomJS, save this file as extract-links.js:

    var page = new WebPage(),
        url = 'http://www.udacity.com';

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            var results = page.evaluate(function() {
                var list = document.querySelectorAll('a'), links = [], i;
                for (i = 0; i < list.length; i++) {
                    links.push(list[i].href);
                }
                return links;
            });
            console.log(results.join('\n'));
        }
        phantom.exit();
    });

Run:

    $ ../path/to/bin/phantomjs extract-links.js

GabaGabaDev · Answer

Language: JavaScript/Node.js

Library: Request and Cheerio

    var request = require('request');
    var cheerio = require('cheerio');

    var url = "https://news.ycombinator.com/";
    request(url, function (error, response, html) {
        if (!error && response.statusCode == 200) {
            var $ = cheerio.load(html);
            var anchorTags = $('a');

            anchorTags.each(function(i,element){
                console.log(element["attribs"]["href"]);
            });
        }
    });

The Request library downloads the HTML document, and Cheerio lets you use jQuery-style CSS selectors to target elements in it.
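The same two-step split, one piece that downloads and another that selects, can be sketched in Python using only the standard library. Treat it as an illustration rather than an equivalent of Request or Cheerio: fetch and select_hrefs are hypothetical names, the sample markup is made up, and the selection step is a plain tag filter, not a CSS selector engine.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Download a page and decode it (assumes UTF-8 for simplicity)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class _HrefParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href are recorded as None.
            self.hrefs.append(dict(attrs).get("href"))


def select_hrefs(html):
    """Return the href of every <a> element, in document order."""
    parser = _HrefParser()
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample = ('<a href="item?id=1">one</a><a>no href</a>'
          '<a href="https://example.com/">two</a>')
print(select_hrefs(sample))
# Against a live page: select_hrefs(fetch("https://news.ycombinator.com/"))
```

Keeping the download separate from the parsing means the selection step can be exercised on a literal string, with no network access.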

Entea · Answer

Language: PHP
Library: DOM

    <?php
    $doc = new DOMDocument();
    $doc->strictErrorChecking = false;
    $doc->loadHTMLFile('http://stackoverflow.com/questions/773340');
    $xpath = new DOMXpath($doc);

    $links = $xpath->query('//a[@href]');
    for ($i = 0; $i < $links->length; $i++)
        echo $links->item($i)->getAttribute('href'), "\n";

Sometimes it is useful to put the @ symbol before $doc->loadHTMLFile to suppress warnings about invalid HTML.

chewymole · Answer

Language: ColdFusion 9.0.1+

Library: jSoup

    <cfscript>
    function parseURL(required string url){
        var res = [];
        var javaLoader = createObject("javaloader.JavaLoader").init([expandPath("./jsoup-1.7.3.jar")]);
        var jSoupClass = javaLoader.create("org.jsoup.Jsoup");
        //var dom = jSoupClass.parse(html); // if you already have some html to parse.
        var dom = jSoupClass.connect( arguments.url ).get();
        var links = dom.select("a");
        // LTE so that the last link is included as well.
        for(var a=1;a LTE arrayLen(links);a++){
            var s={};s.href= links[a].attr('href'); s.text= links[a].text();
            if(s.href contains "http://" || s.href contains "https://") arrayAppend(res,s);
        }
        return res;
    }

    //writeoutput(writedump(parseURL(url)));
    </cfscript>
    <cfdump var="#parseURL("http://stackoverflow.com/questions/773340/can-you-provide-examples-of-parsing-html")#">

Returns an array of structs; each struct contains an HREF and a TEXT value.
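The shape of that result, a list of records with an href and a text field, restricted to absolute http(s) links, can be sketched in Python with the standard library. This is not a jSoup equivalent: AnchorCollector is a hypothetical helper, the sample markup is made up, and the filter is the same simple substring test.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical helper: gathers {'href', 'text'} dicts for absolute links."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._current = None  # the anchor currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"href": dict(attrs).get("href") or "", "text": ""}

    def handle_data(self, data):
        # Accumulate text, including text inside nested tags such as <b>.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            href = self._current["href"]
            # Keep only links that contain an absolute http(s) URL.
            if "http://" in href or "https://" in href:
                self.results.append(self._current)
            self._current = None


collector = AnchorCollector()
collector.feed('<a href="http://foo.com">foo</a>'
               '<a href="/local">local</a>'
               '<a href="https://bar.com">bar <b>bold</b></a>')
collector.close()
print(collector.results)
```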