htmlagilitypack-スクリプトとスタイルを削除しますか？

Question

次の方法を使用してテキストフォームhtmlを抽出します。

 public string getAllText(string _html) { string _allText = ""; try { HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument(); document.LoadHtml(_html); var root = document.DocumentNode; var sb = new StringBuilder(); foreach (var node in root.DescendantNodesAndSelf()) { if (!node.HasChildNodes) { string text = node.InnerText; if (!string.IsNullOrEmpty(text)) sb.AppendLine(text.Trim()); } } _allText = sb.ToString(); } catch (Exception) { } _allText = System.Web.HttpUtility.HtmlDecode(_allText); return _allText; }

問題は、スクリプトタグとスタイルタグも取得することです。

どうすればそれらを除外できますか？

L.B · Accepted Answer

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument(); doc.LoadHtml(html); doc.DocumentNode.Descendants() .Where(n => n.Name == "script" || n.Name == "style") .ToList() .ForEach(n => n.Remove());

johnw86 · Answer

HtmlDocumentクラスを使用してこれを行うことができます。

HtmlDocument doc = new HtmlDocument(); doc.LoadHtml(input); doc.DocumentNode.SelectNodes("//style|//script").ToList().ForEach(n => n.Remove());

Rusty Nail · Answer

いくつかの優れた答え、System.Linqは便利です！

Linqベースではないアプローチの場合：

private HtmlAgilityPack.HtmlDocument RemoveScripts(HtmlAgilityPack.HtmlDocument webDocument) { // Get all Nodes: script HtmlAgilityPack.HtmlNodeCollection Nodes = webDocument.DocumentNode.SelectNodes("//script"); // Make sure not Null: if (Nodes == null) return webDocument; // Remove all Nodes: foreach (HtmlNode node in Nodes) node.Remove(); return webDocument; }