How do you convert HTML to plain text?

Question

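I have snippets of HTML stored in a table. Not entire pages, no tags or the like, just basic formatting.

I would like to be able to display that HTML as text only, with no formatting, on a given page (actually just the first 30 to 50 characters, but that's the easy bit).

How do I place the "text" within that HTML into a string as straight text?

So this piece of code:

<b>Hello World.</b><br/><p><i>Is there anyone out there?</i><p>

becomes:

Hello World. Is there anyone out there?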
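
The truncation that the question calls "the easy bit" could look something like the following. This is only a minimal sketch: `PreviewText.Truncate` and its `maxLength` parameter are hypothetical names, not part of any library, and it assumes the HTML has already been reduced to plain text by one of the approaches in the answers.

```csharp
using System;

public static class PreviewText
{
    // Hypothetical helper: returns at most maxLength characters of already-plain text.
    public static string Truncate(string text, int maxLength)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (maxLength < 0) throw new ArgumentOutOfRangeException(nameof(maxLength));

        // Substring throws when the text is shorter than the requested length, so check first.
        return text.Length <= maxLength ? text : text.Substring(0, maxLength);
    }
}
```

Strip the markup first and truncate second; cutting the raw HTML at a fixed index could slice a tag in half.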

vfilby · Accepted Answer

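If you are talking about tag stripping, it is relatively straightforward as long as you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags, you can accomplish that with a regular expression:

<[^>]*>

If you do have to worry about things like <script> tags, you will need something a bit more powerful than regular expressions, because you have to track state, which calls for something more like a context-free grammar (CFG). That said, you might be able to accomplish it with "left to right" or non-greedy matching.

If you can use a regular expression, there are many web pages out there with good information.

If you need the more complex behaviour of a CFG, I would suggest using a third-party tool; unfortunately, I don't know of a good one to recommend.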
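
As a rough illustration of the non-greedy matching mentioned above, the sketch below removes whole <script> and <style> elements before stripping the remaining tags. `TagStripper` and `Strip` are hypothetical names, and this is still a regex heuristic rather than a parser: it can be fooled by malformed markup, for example a literal "</script>" inside a JavaScript string.

```csharp
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Non-greedy match of a whole <script> or <style> element, including its content.
    // Singleline lets '.' cross line breaks; \1 refers back to the opening tag name.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Any remaining tag.
    private static readonly Regex AnyTag = new Regex(@"<[^>]*>");

    public static string Strip(string html)
    {
        // Remove script/style blocks first so their content does not survive as text.
        string withoutBlocks = ScriptOrStyle.Replace(html, string.Empty);
        return AnyTag.Replace(withoutBlocks, string.Empty);
    }
}
```

The order matters: if the generic tag pattern ran first, the script body would be left behind as visible text.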

Judah Gabriel Himango · Answer

The free and open source HtmlAgilityPack has, in one of its samples, a method that converts from HTML to plain text.

var plainText = HtmlUtilities.ConvertToPlainText(html);

Feed it an HTML string like

<b>hello, <i>world!</i></b>

and you'll get a plain text result like:

hello world!

Ben Anderson · Answer

I could not use HtmlAgilityPack, so I wrote a second-best solution for myself:

private static string HtmlToPlainText(string html)
{
    const string tagWhiteSpace = @"(>|$)(\W|\n|\r)+<"; // matches one or more (white space or line breaks) between '>' and '<'
    const string stripFormatting = @"<[^>]*(>|$)";     // match any character between '<' and '>', even when end tag is missing
    const string lineBreak = @"<(br|BR)\s{0,1}/{0,1}>"; // matches: <br>,<br/>,<br />,<BR>,<BR/>,<BR />
    var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

    var text = html;
    // Decode html specific characters
    text = System.Net.WebUtility.HtmlDecode(text);
    // Remove tag whitespace/line breaks
    text = tagWhiteSpaceRegex.Replace(text, "><");
    // Replace <br /> with line breaks
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    // Strip formatting
    text = stripFormattingRegex.Replace(text, string.Empty);

    return text;
}

George Stocker · Answer

HttpUtility.HtmlEncode() is meant to handle encoding HTML tags as strings. It takes care of all the heavy lifting for you. From the MSDN documentation:

If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and > are encoded as &lt; and &gt; for HTTP transmission.

The HttpUtility.HtmlEncode() method, detailed here:

public static void HtmlEncode(string s, TextWriter output)

Usage:

String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();
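
To make the difference between encoding and stripping concrete, here is a small sketch. It swaps in `System.Net.WebUtility` (rather than `Server.HtmlEncode`) purely so that it runs outside an ASP.NET request; `EncodingDemo` is a hypothetical name. Note that encoding keeps the tags and makes them display literally; it does not remove them.

```csharp
using System.Net;

public static class EncodingDemo
{
    // Encoding keeps the markup but turns special characters into entities,
    // so a browser shows the tags instead of interpreting them.
    public static string Encode(string text)
    {
        return WebUtility.HtmlEncode(text);
    }

    // Decoding reverses the process and turns entities back into characters.
    public static string Decode(string text)
    {
        return WebUtility.HtmlDecode(text);
    }
}
```

With the usage string above, `EncodingDemo.Encode("This is a <Test String>.")` gives `This is a &lt;Test String&gt;.`.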

WEFX · Answer

To add to vfilby's answer, you can just perform a regex replace within your code; no new classes are necessary. In case other newbies like myself stumble upon this question.

using System.Text.RegularExpressions;

Then...

private string StripHtml(string source)
{
    string output;

    // get rid of HTML tags
    output = Regex.Replace(source, "<[^>]*>", string.Empty);

    // get rid of multiple blank lines
    output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline);

    return output;
}

Abdulqadir_WDDN · Answer

A three-step process for converting HTML into plain text

First, install the HtmlAgilityPack NuGet package. Second, create this class:

public class HtmlToText
{
    public HtmlToText()
    {
    }

    public string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    private void ConvertContentTo(HtmlNode node, TextWriter outText)
    {
        foreach (HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText);
        }
    }

    public void ConvertTo(HtmlNode node, TextWriter outText)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlNodeType.Document:
                ConvertContentTo(node, outText);
                break;

            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;

                // get text
                html = ((HtmlTextNode)node).Text;

                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;

                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html));
                }
                break;

            case HtmlNodeType.Element:
                switch (node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }

                if (node.HasChildNodes)
                {
                    ConvertContentTo(node, outText);
                }
                break;
        }
    }
}

The class above is taken from the sample referenced in Judah Himango's answer.

Third, create an object of the above class and use the ConvertHtml(HTMLContent) method, rather than ConvertToPlainText(string html), to convert the HTML into plain text:

HtmlToText htt = new HtmlToText();
var plainText = htt.ConvertHtml(HTMLContent);

Roman O · Answer

The easiest way I found:

HtmlFilter.ConvertToPlainText(html);

The HtmlFilter class is located in Microsoft.TeamFoundation.WorkItemTracking.Controls.dll.

The DLL can be found in a folder like this: %ProgramFiles%\Common Files\Microsoft shared\Team Foundation Server\14.0\

In VS 2015, the DLL also requires a reference to Microsoft.TeamFoundation.WorkItemTracking.Common.dll, located in the same folder.

Amine · Answer

There is no method named "ConvertToPlainText" in HtmlAgilityPack, but you can convert an HTML string to a clean string with:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
var textString = doc.DocumentNode.InnerText;
// Regex.Replace and string.Replace return new strings, so keep the result
textString = Regex.Replace(textString, @"<(.|\n)*?>", string.Empty).Replace("&nbsp;", "");

That works for me. But, as I said, I could not find a method named "ConvertToPlainText" in HtmlAgilityPack.

jeiea · Answer

It has the limitation of not collapsing long inline whitespace, but it is definitely portable and respects layout the way a web browser does.

static string HtmlToPlainText(string html)
{
    string buf;
    string block = "address|article|aside|blockquote|canvas|dd|div|dl|dt|" +
        "fieldset|figcaption|figure|footer|form|h\\d|header|hr|li|main|nav|" +
        "noscript|ol|output|p|pre|section|table|tfoot|ul|video";

    string patNestedBlock = $"(\\s*?</?({block})[^>]*?>)+\\s*";
    buf = Regex.Replace(html, patNestedBlock, "\n", RegexOptions.IgnoreCase);

    // Replace br tag to newline.
    buf = Regex.Replace(buf, @"<(br)[^>]*>", "\n", RegexOptions.IgnoreCase);

    // (Optional) remove styles and scripts.
    buf = Regex.Replace(buf, @"<(script|style)[^>]*?>.*?</\1>", "", RegexOptions.Singleline);

    // Remove all tags.
    buf = Regex.Replace(buf, @"<[^>]*(>|$)", "", RegexOptions.Multiline);

    // Replace HTML entities.
    buf = WebUtility.HtmlDecode(buf);
    return buf;
}

mik-t · Answer

I think the easiest way is to make a "string" extension method (based on what user Richard suggested):

using System;
using System.Text.RegularExpressions;

public static class StringHelpers
{
    public static string StripHTML(this string HTMLText)
    {
        var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        return reg.Replace(HTMLText, "");
    }
}

Then just use this extension method on any "string" variable in your program:

var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></div>";
var yourTextString = yourHtmlString.StripHTML();

I use this extension method to convert HTML-formatted comments to plain text so that they display correctly in a Crystal Report.

Corey Trager · Answer

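If you have data that has HTML tags and you want to display it so that a person can see the tags, use HttpServerUtility::HtmlEncode.

If you have data that has HTML tags in it and you want the user to see the tags rendered, then display the text as is. If the text represents an entire web page, use an IFRAME for it.

If you have data that has HTML tags and you want to strip out the tags and display just the unformatted text, use a regular expression.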

LakshmiSarada · Answer

I faced a similar problem and found the best solution. The code below works perfectly for me.

// Assumes: using System.Text.RegularExpressions;
private string ConvertHtml_Totext(string source)
{
    try
    {
        string result;

        // Remove HTML Development formatting
        // Replace line breaks with space
        // because browsers inserts space
        result = source.Replace("\r", " ");
        // Replace line breaks with space
        // because browsers inserts space
        result = result.Replace("\n", " ");
        // Remove step-formatting
        result = result.Replace("\t", string.Empty);
        // Remove repeating spaces because browsers ignore them
        result = Regex.Replace(result, @"( )+", " ");

        // Remove the header (prepare first by clearing attributes)
        result = Regex.Replace(result, @"<( )*head([^>])*>", "<head>", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"(<( )*(/)( )*head( )*>)", "</head>", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, "(<head>).*(</head>)", string.Empty, RegexOptions.IgnoreCase);

        // remove all scripts (prepare first by clearing attributes)
        result = Regex.Replace(result, @"<( )*script([^>])*>", "<script>", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"(<( )*(/)( )*script( )*>)", "</script>", RegexOptions.IgnoreCase);
        //result = Regex.Replace(result,
        //         @"(<script>)([^(<script>\.</script>)])*(</script>)",
        //         string.Empty,
        //         RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"(<script>).*(</script>)", string.Empty, RegexOptions.IgnoreCase);

        // remove all styles (prepare first by clearing attributes)
        result = Regex.Replace(result, @"<( )*style([^>])*>", "<style>", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"(<( )*(/)( )*style( )*>)", "</style>", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, "(<style>).*(</style>)", string.Empty, RegexOptions.IgnoreCase);

        // insert tabs in spaces of <td> tags
        result = Regex.Replace(result, @"<( )*td([^>])*>", "\t", RegexOptions.IgnoreCase);

        // insert line breaks in places of <BR> and <LI> tags
        result = Regex.Replace(result, @"<( )*br( )*>", "\r", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"<( )*li( )*>", "\r", RegexOptions.IgnoreCase);

        // insert line paragraphs (double line breaks) in place
        // if <P>, <DIV> and <TR> tags
        result = Regex.Replace(result, @"<( )*div([^>])*>", "\r\r", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"<( )*tr([^>])*>", "\r\r", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"<( )*p([^>])*>", "\r\r", RegexOptions.IgnoreCase);

        // Remove remaining tags like <a>, links, images,
        // comments etc - anything that's enclosed inside < >
        result = Regex.Replace(result, @"<[^>]*>", string.Empty, RegexOptions.IgnoreCase);

        // replace special characters:
        result = Regex.Replace(result, @"&nbsp;", " ", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&bull;", " * ", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&lsaquo;", "<", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&rsaquo;", ">", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&trade;", "(tm)", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&frasl;", "/", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&lt;", "<", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&gt;", ">", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&copy;", "(c)", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&reg;", "(r)", RegexOptions.IgnoreCase);
        // Remove all others. More can be added, see
        // http://hotwired.lycos.com/webmonkey/reference/special_characters/
        result = Regex.Replace(result, @"&(.{2,6});", string.Empty, RegexOptions.IgnoreCase);

        // for testing
        //Regex.Replace(result,
        //       this.txtRegex.Text, string.Empty,
        //       RegexOptions.IgnoreCase);

        // make line breaking consistent
        result = result.Replace("\n", "\r");

        // Remove extra line breaks and tabs:
        // replace over 2 breaks with 2 and over 4 tabs with 4.
        // Prepare first to remove any whitespaces in between
        // the escaped characters and remove redundant tabs in between line breaks
        result = Regex.Replace(result, "(\r)( )+(\r)", "\r\r", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, "(\t)( )+(\t)", "\t\t", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, "(\t)( )+(\r)", "\t\r", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, "(\r)( )+(\t)", "\r\t", RegexOptions.IgnoreCase);
        // Remove redundant tabs
        result = Regex.Replace(result, "(\r)(\t)+(\r)", "\r\r", RegexOptions.IgnoreCase);
        // Remove multiple tabs following a line break with just one tab
        result = Regex.Replace(result, "(\r)(\t)+", "\r\t", RegexOptions.IgnoreCase);

        // Initial replacement target string for line breaks
        string breaks = "\r\r\r";
        // Initial replacement target string for tabs
        string tabs = "\t\t\t\t\t";
        for (int index = 0; index < result.Length; index++)
        {
            result = result.Replace(breaks, "\r\r");
            result = result.Replace(tabs, "\t\t\t\t");
            breaks = breaks + "\r";
            tabs = tabs + "\t";
        }

        // That's it.
        return result;
    }
    catch
    {
        MessageBox.Show("Error");
        return source;
    }
}

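Escape characters such as \n and \r had to be removed first, because they cause the regular expressions to stop working as expected.

Moreover, to make the result string display correctly in a text box, you may need to split it up and set the text box's Lines property instead of assigning to its Text property.

this.txtResult.Lines = ConvertHtml_Totext(this.txtSource.Text).Split("\r".ToCharArray());

Source: https://www.codeproject.com/Articles/11902/Convert-HTML-to-Plain-Text-2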
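
If the converted text might contain a mix of line-break styles, a slightly more defensive split may help. The sketch below is an assumption on my part rather than part of the original article; `LineSplitter.ToLines` is a hypothetical helper name.

```csharp
using System;

public static class LineSplitter
{
    // Splits on \r\n, \r, or \n so the result works whichever break style the text uses.
    // "\r\n" is listed first so a Windows line break is treated as one separator, not two.
    public static string[] ToLines(string text)
    {
        return text.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);
    }
}
```

The returned array can then be assigned to the text box's Lines property in the same way as above.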

Karlas · Answer

I had the same question; my HTML just had a simple, known layout, like:

<DIV><P>abc</P><P>def</P></DIV>

So I ended up using this simple code:

string.Join(Environment.NewLine, XDocument.Parse(html).Root.Elements().Select(el => el.Value))

Which outputs:

abc
def
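
A self-contained version of this idea might look like the sketch below. `SimpleLayout.TryExtract` is a hypothetical name, and returning null on failure is my own choice; the point it illustrates is that XDocument.Parse only accepts well-formed XML, so typical loose HTML makes it throw.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class SimpleLayout
{
    // Returns the text of each top-level child element on its own line,
    // or null when the input is not well-formed XML.
    public static string TryExtract(string html)
    {
        try
        {
            XElement root = XDocument.Parse(html).Root;
            return string.Join(Environment.NewLine, root.Elements().Select(el => el.Value));
        }
        catch (XmlException)
        {
            // Thrown for markup such as an unclosed <br> or an undeclared entity like &nbsp;
            return null;
        }
    }
}
```

This only suits snippets whose layout you control; for arbitrary HTML, one of the regex or HtmlAgilityPack approaches is a safer fit.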

Mehdi Dehghani · Answer

Here is my solution:

public string StripHTML(string html)
{
    var regex = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    return System.Web.HttpUtility.HtmlDecode(regex.Replace(html, ""));
}

Example:

StripHTML("<p class='test' style='color:red;'>Here is my solution:</p>");
// output -> Here is my solution:

mpez0 · Answer

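It depends on what you mean by "html". The most complex case would be complete web pages. That is also the easiest to handle, since you can use a text-mode web browser. See the Wikipedia article listing web browsers, including text-mode browsers. Lynx is probably the best known, but one of the others may suit your needs better.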

sobelito · Answer

I did not write this, but I am using it:

using HtmlAgilityPack;
using System;
using System.IO;
using System.Text.RegularExpressions;

namespace foo {
    // small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
    public static class HtmlToText {

        public static string Convert(string path) {
            HtmlDocument doc = new HtmlDocument();
            doc.Load(path);
            return ConvertDoc(doc);
        }

        public static string ConvertHtml(string html) {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);
            return ConvertDoc(doc);
        }

        public static string ConvertDoc(HtmlDocument doc) {
            using (StringWriter sw = new StringWriter()) {
                ConvertTo(doc.DocumentNode, sw);
                sw.Flush();
                return sw.ToString();
            }
        }

        internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
            foreach (HtmlNode subnode in node.ChildNodes) {
                ConvertTo(subnode, outText, textInfo);
            }
        }

        public static void ConvertTo(HtmlNode node, TextWriter outText) {
            ConvertTo(node, outText, new PreceedingDomTextInfo(false));
        }

        internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
            string html;
            switch (node.NodeType) {
                case HtmlNodeType.Comment:
                    // don't output comments
                    break;
                case HtmlNodeType.Document:
                    ConvertContentTo(node, outText, textInfo);
                    break;
                case HtmlNodeType.Text:
                    // script and style must not be output
                    string parentName = node.ParentNode.Name;
                    if ((parentName == "script") || (parentName == "style")) {
                        break;
                    }
                    // get text
                    html = ((HtmlTextNode)node).Text;
                    // is it in fact a special closing node output as text?
                    if (HtmlNode.IsOverlappedClosingElement(html)) {
                        break;
                    }
                    // check the text is meaningful and not a bunch of whitespaces
                    if (html.Length == 0) {
                        break;
                    }
                    if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace) {
                        html = html.TrimStart();
                        if (html.Length == 0) { break; }
                        textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
                    }
                    outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
                    // intentional assignment: remember whether the text ended in white space
                    if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1])) {
                        outText.Write(' ');
                    }
                    break;
                case HtmlNodeType.Element:
                    string endElementString = null;
                    bool isInline;
                    bool skip = false;
                    int listIndex = 0;
                    switch (node.Name) {
                        case "nav":
                            skip = true;
                            isInline = false;
                            break;
                        case "body":
                        case "section":
                        case "article":
                        case "aside":
                        case "h1":
                        case "h2":
                        case "header":
                        case "footer":
                        case "address":
                        case "main":
                        case "div":
                        case "p": // stylistic - adjust as you tend to use
                            if (textInfo.IsFirstTextOfDocWritten) {
                                outText.Write("\r\n");
                            }
                            endElementString = "\r\n";
                            isInline = false;
                            break;
                        case "br":
                            outText.Write("\r\n");
                            skip = true;
                            textInfo.WritePrecedingWhiteSpace = false;
                            isInline = true;
                            break;
                        case "a":
                            if (node.Attributes.Contains("href")) {
                                string href = node.Attributes["href"].Value.Trim();
                                if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1) {
                                    endElementString = "<" + href + ">";
                                }
                            }
                            isInline = true;
                            break;
                        case "li":
                            if (textInfo.ListIndex > 0) {
                                outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
                            } else {
                                outText.Write("\r\n*\t"); // using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
                            }
                            isInline = false;
                            break;
                        case "ol":
                            listIndex = 1;
                            goto case "ul";
                        case "ul": // not handling nested lists any differently at this stage - that is getting close to rendering problems
                            endElementString = "\r\n";
                            isInline = false;
                            break;
                        case "img": // inline-block in reality
                            if (node.Attributes.Contains("alt")) {
                                outText.Write('[' + node.Attributes["alt"].Value);
                                endElementString = "]";
                            }
                            if (node.Attributes.Contains("src")) {
                                outText.Write('<' + node.Attributes["src"].Value + '>');
                            }
                            isInline = true;
                            break;
                        default:
                            isInline = true;
                            break;
                    }
                    if (!skip && node.HasChildNodes) {
                        ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex });
                    }
                    if (endElementString != null) {
                        outText.Write(endElementString);
                    }
                    break;
            }
        }
    }

    internal class PreceedingDomTextInfo {
        public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten) {
            IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
        }
        public bool WritePrecedingWhiteSpace { get; set; }
        public bool LastCharWasSpace { get; set; }
        public readonly BoolWrapper IsFirstTextOfDocWritten;
        public int ListIndex { get; set; }
    }

    internal class BoolWrapper {
        public BoolWrapper() { }
        public bool Value { get; set; }
        public static implicit operator bool(BoolWrapper boolWrapper) {
            return boolWrapper.Value;
        }
        public static implicit operator BoolWrapper(bool boolWrapper) {
            return new BoolWrapper { Value = boolWrapper };
        }
    }
}

user3077654 · Answer

I think it has an easy answer:

public string RemoveHTMLTags(string HTMLCode)
{
    string str = System.Text.RegularExpressions.Regex.Replace(HTMLCode, "<[^>]*>", "");
    return str;
}
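
One thing to watch with plain tag removal like this: adjacent blocks run together, so the sample from the question comes out as "Hello World.Is there anyone out there?" with no space after the full stop. A possible variation, sketched below under the assumption that a single space between fragments is acceptable, replaces each tag with a space and then collapses whitespace. `HtmlText.StripHtmlWithSpaces` is a hypothetical name.

```csharp
using System.Text.RegularExpressions;

public static class HtmlText
{
    public static string StripHtmlWithSpaces(string source)
    {
        // Replace every tag with a space so neighbouring fragments stay separated.
        string output = Regex.Replace(source, "<[^>]*>", " ");

        // Collapse runs of whitespace into a single space and trim both ends.
        return Regex.Replace(output, @"\s+", " ").Trim();
    }
}
```

The trade-off is that inline tags inside a word, such as wor<b>ld</b>, also gain a space, so this suits block-level snippets better than heavily styled text.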