Get text between multiple HTML tags using a regular expression

Question

Using a regular expression, I want to be able to get the text between multiple DIV tags. For example, the following:

<div>first html tag</div> <div>another tag</div>

Would output:

first html tag another tag

The regex pattern I am using matches only the last div tag and misses the first one. Code:

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
            // Verbatim string so that the regex escape \/ compiles as written.
            string pattern = @"(<div.*>)(.*)(<\/div>)";

            MatchCollection matches = Regex.Matches(input, pattern);
            Console.WriteLine("Matches found: {0}", matches.Count);

            if (matches.Count > 0)
                foreach (Match m in matches)
                    Console.WriteLine("Inner DIV: {0}", m.Groups[2]);

            Console.ReadLine();
        }
    }

Output:

Matches found: 1

Inner DIV: This is ANOTHER test

coolmine · Answer

Replace your pattern with a non-greedy match:

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
            // .*? is lazy: it stops at the first ">" and the first "</div>" it can.
            string pattern = @"<div.*?>(.*?)<\/div>";

            MatchCollection matches = Regex.Matches(input, pattern);
            Console.WriteLine("Matches found: {0}", matches.Count);

            if (matches.Count > 0)
                foreach (Match m in matches)
                    Console.WriteLine("Inner DIV: {0}", m.Groups[1]);

            Console.ReadLine();
        }
    }

Mehdi Dehghani · Answer

Since nobody else mentioned HTML tags with attributes, here is my solution for handling them:

    // <TAG(.*?)>(.*?)</TAG>
    // Example
    var regex = new System.Text.RegularExpressions.Regex("<h1(.*?)>(.*?)</h1>");
    var m = regex.Match("Hello <h1 style='color: red;'>World</h1> !!");
    Console.Write(m.Groups[2].Value); // will print -> World
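The `<TAG(.*?)>(.*?)</TAG>` template can be turned into a small reusable helper. The following is only a sketch: the class name `TagText`, the method `InnerTexts`, and the sample input are hypothetical and not part of the answer above. It also swaps `(.*?)` for `\b[^>]*` in the opening tag, on the assumption that no attribute value contains a `>`, so that asking for `b` does not also match `<br>` or `<body>`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagText
{
    // Hypothetical helper: returns the inner text of every <tag ...>...</tag> pair.
    public static List<string> InnerTexts(string html, string tag)
    {
        string name = Regex.Escape(tag);
        // \b ends the tag name; [^>]* skips any attributes; (.*?) is the lazy capture.
        string pattern = $@"<{name}\b[^>]*>(.*?)</{name}>";

        var results = new List<string>();
        RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        foreach (Match m in Regex.Matches(html, pattern, options))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    public static void Main()
    {
        string html = "Hello <h1 style='color: red;'>World</h1> <br> <b>bold</b>";
        Console.WriteLine(string.Join(", ", InnerTexts(html, "h1"))); // prints World
        Console.WriteLine(string.Join(", ", InnerTexts(html, "b")));  // prints bold
    }
}
```

Like every regex in this thread, this only handles tags that are not nested inside a tag of the same name.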

Mayman · Answer

First of all, keep in mind that an HTML file will contain newline characters ("\n"), which the string you are using to test your regex does not include.

Second, take your regex:

    ((<div.*>)(.*)(<\/div>))+
    // This regex will look for any number of div tags, but it must see at least one div tag.

    ((<div.*>)(.*)(<\/div>))*
    // This regex will look for any number of div tags, and it will not complain if there are no results at all.
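To illustrate the newline point, here is a minimal sketch. The class `NewlineDemo`, the helper `CountDivs`, and the input string are made up for this example. In .NET, `.` does not match `\n` unless `RegexOptions.Singleline` is set, so a div whose content spans several lines is silently skipped without that option.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewlineDemo
{
    // Hypothetical helper: counts <div> matches under the given regex options.
    public static int CountDivs(string html, RegexOptions options)
    {
        return Regex.Matches(html, @"<div[^>]*>(.*?)</div>", options).Count;
    }

    public static void Main()
    {
        // The content of the second div spans two lines.
        string html = "<div>one</div>\n<div>two\nlines</div>";

        Console.WriteLine(CountDivs(html, RegexOptions.None));       // prints 1
        Console.WriteLine(CountDivs(html, RegexOptions.Singleline)); // prints 2
    }
}
```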

This is also a good place to look for this kind of information:

http://www.regular-expressions.info/reference.html

http://www.regular-expressions.info/refadv.html

Mayman

Craig · Answer

Have you looked at the Html Agility Pack (see https://stackoverflow.com/a/857926/618649)?

CsQuery also looks very useful (it basically uses CSS-selector-style syntax to get elements). See https://stackoverflow.com/a/11090816/618649.

CsQuery is basically meant to be "jQuery for C#", which is pretty much the exact search phrase I used to find it.

If you can do this in a web browser, you can easily use jQuery, with syntax similar to $("div").each(function(idx){ alert(idx + ": " + $(this).text()); }); (though you would obviously write the result to a log or the screen, make a web service call with it, or do whatever else you need with it).

Tri Nguyen Dung · Answer

I think this code should work:

    using System.Collections;
    using System.Text.RegularExpressions;

    string htmlSource = "<div>first html tag</div><div>another tag</div>";
    // [^>]*? skips any attributes; (.*?) lazily captures the text inside the div.
    string pattern = @"<div[^>]*?>(.*?)</div>";

    MatchCollection matches = Regex.Matches(htmlSource, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);

    ArrayList l = new ArrayList();
    foreach (Match match in matches)
    {
        l.Add(match.Groups[1].Value);
    }

Tom Jacques · Answer

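The short version is that you cannot do this correctly in all situations. There will always be cases of valid HTML for which a regular expression fails to extract the information you want.

The reason is that HTML is defined by a context-free grammar, which belongs to a more complex class than the regular languages that regular expressions describe.

Here is an example: what if you have multiple divs nested inside one another?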

<div><div>stuff</div><div>stuff2</div></div>

The regexes listed in the other answers could grab any of the following:

<div><div>stuff</div>
<div>stuff</div>
<div>stuff</div><div>stuff2</div>
<div>stuff</div><div>stuff2</div></div>
<div>stuff2</div>
<div>stuff2</div></div>

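That is what regular expressions do when they try to parse HTML.

You cannot write a regular expression that understands how to interpret all of these cases, because regular expressions are not capable of it. If you are dealing with a very specific, constrained set of HTML, it may be possible, but you should keep this fact in mind.

More information: https://stackoverflow.com/a/1732454/2022565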
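The nested case can be checked with a short sketch. The class `NestedDemo` and the method `Extract` are hypothetical names, and the pattern is the lazy `<div[^>]*>(.*?)</div>` form suggested in the other answers. On the nested input the first capture comes back as `<div>stuff`, because the lazy quantifier stops at the first `</div>` it sees regardless of which opening tag that closes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedDemo
{
    // Hypothetical helper: applies the lazy div pattern and returns each capture.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<div[^>]*>(.*?)</div>", RegexOptions.Singleline))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    public static void Main()
    {
        string html = "<div><div>stuff</div><div>stuff2</div></div>";
        foreach (string text in Extract(html))
        {
            Console.WriteLine(text);
        }
        // prints:
        // <div>stuff
        // stuff2
    }
}
```

The outer div is never reported as a unit and the first result still contains markup, which is the kind of failure described above.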