HTML Agility Pack: remove unwanted tags without removing their content?

Question

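I've seen a few related questions here, but none of them address exactly the problem I'm facing.

I want to use the HTML Agility Pack to remove unwanted tags from my HTML without losing the content inside those tags.

So, for instance, in my scenario I would like to preserve the tags "b", "i", and "u".

And for an input like: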

<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>

The resulting HTML should be:

my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>

I tried using HtmlNode's Remove method, but it removes the content too. Any suggestions?

Mathias Lykkegaard Lorenzen · Accepted Answer

I wrote an algorithm based on Oded's suggestion. Here it is. It works like a charm.

It removes all tags except strong, em, and u, and leaves raw text nodes in place.

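// Requires: using System; using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
internal static string RemoveUnwantedTags(string data)
{
    if (string.IsNullOrEmpty(data)) return string.Empty;

    var document = new HtmlDocument();
    document.LoadHtml(data);

    var acceptableTags = new String[] { "strong", "em", "u" };

    // Start from the top-level elements and text nodes.
    // SelectNodes returns null when nothing matches, so guard before building the queue.
    var topLevelNodes = document.DocumentNode.SelectNodes("./*|./text()");
    if (topLevelNodes == null) return document.DocumentNode.InnerHtml;

    var nodes = new Queue<HtmlNode>(topLevelNodes);
    while (nodes.Count > 0)
    {
        var node = nodes.Dequeue();
        var parentNode = node.ParentNode;

        if (!acceptableTags.Contains(node.Name) && node.Name != "#text")
        {
            var childNodes = node.SelectNodes("./*|./text()");
            if (childNodes != null)
            {
                foreach (var child in childNodes)
                {
                    // Queue the child for inspection and move it up in front of the unwanted node.
                    nodes.Enqueue(child);
                    parentNode.InsertBefore(child, node);
                }
            }

            // Drop the now redundant wrapper element.
            parentNode.RemoveChild(node);
        }
    }

    return document.DocumentNode.InnerHtml;
}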
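The queue above only descends into elements it unwraps, so an unwanted tag nested inside a kept tag (for example a span inside a strong) is never visited. A minimal sketch of a whitelist variant that inspects every descendant follows. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `HtmlWhitelist` and the method name `StripAllExcept` are hypothetical and do not come from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class HtmlWhitelist
{
    // Hypothetical helper: unwraps every element whose name is not in keepTags,
    // leaving its children (text and kept elements) in place.
    public static string StripAllExcept(string html, params string[] keepTags)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        var keep = new HashSet<string>(keepTags, StringComparer.OrdinalIgnoreCase);
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Snapshot the unwanted elements first, because the tree is mutated below.
        var unwanted = document.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element && !keep.Contains(n.Name))
            .ToList();

        foreach (var node in unwanted)
        {
            var parent = node.ParentNode;

            // Copy the child list before moving the children up one level.
            foreach (var child in node.ChildNodes.ToList())
                parent.InsertBefore(child, node);

            parent.RemoveChild(node);
        }

        return document.DocumentNode.InnerHtml;
    }
}
```

A call such as `HtmlWhitelist.StripAllExcept(html, "b", "i", "u")` would match the whitelist from the question. Comments and text nodes are left untouched because only nodes of type `HtmlNodeType.Element` are considered.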

theyetiman · Answer

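How to recursively remove a given list of unwanted HTML tags from an HTML string

I took @mathias's answer and improved his extension method so that you can supply the list of tags to exclude as a List<string> (e.g. {"a","p","hr"}). I also fixed the logic so that it works recursively.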

// Requires: using System; using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
public static string RemoveUnwantedHtmlTags(this string html, List<string> unwantedTags)
{
    if (String.IsNullOrEmpty(html))
    {
        return html;
    }

    var document = new HtmlDocument();
    document.LoadHtml(html);

    // SelectNodes returns null when nothing matches.
    HtmlNodeCollection tryGetNodes = document.DocumentNode.SelectNodes("./*|./text()");
    if (tryGetNodes == null || !tryGetNodes.Any())
    {
        return html;
    }

    var nodes = new Queue<HtmlNode>(tryGetNodes);
    while (nodes.Count > 0)
    {
        var node = nodes.Dequeue();
        var parentNode = node.ParentNode;

        // Always queue the children, so tags nested inside kept tags are inspected too.
        var childNodes = node.SelectNodes("./*|./text()");
        if (childNodes != null)
        {
            foreach (var child in childNodes)
            {
                nodes.Enqueue(child);
            }
        }

        if (unwantedTags.Any(tag => tag == node.Name))
        {
            // Move the children up in front of the unwanted node, then remove the node itself.
            if (childNodes != null)
            {
                foreach (var child in childNodes)
                {
                    parentNode.InsertBefore(child, node);
                }
            }

            parentNode.RemoveChild(node);
        }
    }

    return document.DocumentNode.InnerHtml;
}

Nathan Phillips · Answer

Try the following. You might find it a bit neater than the other proposed solutions.

// Removes every node matched by xPath but keeps its children; returns the number of nodes removed.
public static int RemoveNodesButKeepChildren(this HtmlNode rootNode, string xPath)
{
    HtmlNodeCollection nodes = rootNode.SelectNodes(xPath);
    if (nodes == null)
        return 0;
    foreach (HtmlNode node in nodes)
        node.RemoveButKeepChildren();
    return nodes.Count;
}

// Moves the children of node up in front of it, then removes node.
public static void RemoveButKeepChildren(this HtmlNode node)
{
    foreach (HtmlNode child in node.ChildNodes)
        node.ParentNode.InsertBefore(child, node);
    node.Remove();
}

public static bool TestYourSpecificExample()
{
    string html = "<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>";
    HtmlDocument document = new HtmlDocument();
    document.LoadHtml(html);
    document.DocumentNode.RemoveNodesButKeepChildren("//div");
    document.DocumentNode.RemoveNodesButKeepChildren("//p");
    return document.DocumentNode.InnerHtml == "my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>";
}

Oded · Answer

Before removing a node, get its parent and its InnerText, then remove the node and re-assign the InnerText to the parent.

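// doc is the HtmlDocument that owns node.
var parent = node.ParentNode;
// Capture the text of the node being removed (taking the parent's InnerText would duplicate the sibling text).
var innerText = node.InnerText;
node.Remove();
// Note: the text is appended at the end of the parent, not at the node's original position.
parent.AppendChild(doc.CreateTextNode(innerText));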
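Appending the text puts it at the end of the parent, which changes the order when the removed node had following siblings. A small sketch of the same idea that keeps the text where the node was, using `ReplaceChild` instead of `Remove` plus `AppendChild`, is shown below. The helper name `ReplaceWithText` is hypothetical, and the sketch assumes the HtmlAgilityPack NuGet package. Because it relies on `InnerText`, any nested markup such as `<b>` inside the node is flattened to plain text, so it only fits cases where that loss is acceptable.

```csharp
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class HtmlFlatten
{
    // Hypothetical helper: swaps a node for a text node holding its InnerText,
    // so the text stays at the node's original position in the parent.
    public static void ReplaceWithText(HtmlNode node)
    {
        var text = node.OwnerDocument.CreateTextNode(node.InnerText);
        node.ParentNode.ReplaceChild(text, node);
    }
}
```

It could be called on a node obtained from `SelectSingleNode` or from a snapshot of `SelectNodes` results.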

Dilip0165 · Answer

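If you don't want to use the HTML Agility Pack but still want to remove unwanted HTML tags, you can do it as shown below. Note that this strips every tag rather than a chosen subset.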

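// Requires: using System; using System.Text.RegularExpressions; using System.Web;
public static string RemoveHtmlTags(string strHtml)
{
    // Strip every tag, including tags that span several lines.
    string strText = Regex.Replace(strHtml, "<(.|\n)*?>", String.Empty);
    // Decode HTML entities such as &amp;.
    strText = HttpUtility.HtmlDecode(strText);
    // Collapse runs of whitespace into a single space.
    strText = Regex.Replace(strText, @"\s+", " ");
    return strText;
}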
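The regex above removes all tags, so it does not keep "b", "i", and "u" as the question asks. One possible adaptation, sketched here as an illustration and not as part of the answer, adds a negative lookahead so that those three tag names are skipped. The class and method names are hypothetical. Like any regex-based approach, it can misfire on input such as attribute values that contain `>`, so it is only a rough fit for simple, trusted markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexTagFilter
{
    // Matches an opening or closing tag whose name is not exactly b, i, or u.
    // The lookahead (?!(?:b|i|u)\b) rejects those names but still matches br, img, and so on.
    private static readonly Regex UnwantedTag = new Regex(
        @"</?(?!(?:b|i|u)\b)[a-zA-Z][^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Hypothetical helper: removes every tag except b, i, and u, keeping all text.
    public static string StripExceptBIU(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;
        return UnwantedTag.Replace(html, string.Empty);
    }
}
```

Applied to the sample string from Nathan Phillips' test, `StripExceptBIU` leaves `my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>`.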