Remove HTML tags, including &nbsp;, from a string in C#

Question

How can I remove all HTML tags, including &nbsp;, using a regular expression in C#? My string looks like this:

 "<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"

Ravi Thapliyal · Accepted Answer

If you can't use an HTML parser based solution to filter out the tags, here is a simple regex for it:

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

Ideally, you should make a second pass with a regex filter that collapses runs of multiple spaces:

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
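As a minimal sketch, the two passes above can be wrapped in a single helper to see what they do to a concrete input. The names HtmlStripSketch and Strip are illustrative only and do not come from the answer.

```csharp
using System.Text.RegularExpressions;

public static class HtmlStripSketch
{
    public static string Strip(string inputHTML)
    {
        // Pass 1: delete anything that looks like a tag, plus the literal &nbsp; entity.
        string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

        // Pass 2: collapse each run of two or more whitespace characters into one space.
        return Regex.Replace(noHTML, @"\s{2,}", " ");
    }
}
```

For the string in the question, HtmlStripSketch.Strip returns hello. Because tags are replaced with an empty string, text from adjacent elements is joined: <div>one</div><div>two</div> becomes onetwo.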

Don Rolling · Answer

I took @Ravi Thapliyal's code and made a method out of it. It is simple and might not clean everything, but so far it does what I need it to do.

public static string ScrubHtml(string value)
{
    var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
    var step2 = Regex.Replace(step1, @"\s{2,}", " ");
    return step2;
}

David S. · Answer

I have been using this function for a while. It removes pretty much any messy HTML you can throw at it and leaves your text intact.

using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Web; // HttpUtility

// The members below belong inside a class.
private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);

// Add characters that should not be removed to this regex.
private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\?=|%!() -]", RegexOptions.Compiled);

public static String UnHtml(String html)
{
    html = HttpUtility.UrlDecode(html);
    html = HttpUtility.HtmlDecode(html);

    html = RemoveTag(html, "<!--", "-->");
    html = RemoveTag(html, "<script", "</script>");
    html = RemoveTag(html, "<style", "</style>");

    // Replace matches of these regexes with a space.
    html = _tags_.Replace(html, " ");
    html = _notOkCharacter_.Replace(html, " ");
    html = SingleSpacedTrim(html);

    return html;
}

// Removes every span from startTag through endTag, including its content.
private static String RemoveTag(String html, String startTag, String endTag)
{
    Boolean bAgain;
    do
    {
        bAgain = false;
        Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
        if (startTagPos < 0)
            continue; // bAgain is still false, so the loop ends
        Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
        if (endTagPos <= startTagPos)
            continue;
        html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
        bAgain = true;
    } while (bAgain);
    return html;
}

// Collapses each run of whitespace into a single space and trims the result.
private static String SingleSpacedTrim(String inString)
{
    StringBuilder sb = new StringBuilder();
    Boolean inBlanks = false;
    foreach (Char c in inString)
    {
        switch (c)
        {
            case '\r':
            case '\n':
            case '\t':
            case ' ':
                if (!inBlanks)
                {
                    inBlanks = true;
                    sb.Append(' ');
                }
                continue;
            default:
                inBlanks = false;
                sb.Append(c);
                break;
        }
    }
    return sb.ToString().Trim();
}

MRP · Answer

var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();

Sabique A Khan · Answer

I used @RaviThapliyal's and @Don Rolling's code with a small modification. Their version replaces &nbsp with an empty string, but &nbsp should be replaced with a space, so I added an extra step. It worked like a charm for me.

public static string FormatString(string value)
{
    var step1 = Regex.Replace(value, @"<[^>]+>", "").Trim();
    var step2 = Regex.Replace(step1, @"&nbsp;", " ");
    var step3 = Regex.Replace(step2, @"\s{2,}", " ");
    return step3;
}

I wrote &nbsp without the semicolon in the text above because Stack Overflow was rendering the entity.
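The same idea extends to entities other than &nbsp;. The sketch below is an assumption-laden variant rather than the answer's method: it replaces tags with a space instead of an empty string, decodes entities with System.Net.WebUtility.HtmlDecode, and relies on the fact that \s in .NET also matches the non-breaking space (U+00A0) that &nbsp; decodes to. The names EntityAwareStripSketch and Strip are hypothetical.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class EntityAwareStripSketch
{
    public static string Strip(string html)
    {
        // Replace tags with a space so text from adjacent elements is not glued together.
        string noTags = Regex.Replace(html, @"<[^>]+>", " ");

        // Decode entities: &nbsp; becomes U+00A0, &amp; becomes &, and so on.
        string decoded = WebUtility.HtmlDecode(noTags);

        // \s also matches U+00A0, so non-breaking spaces fold into plain spaces here.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

With this sketch, <div>Tom&nbsp;&amp;&nbsp;Jerry</div><div><br></div> yields Tom & Jerry. Since decoding happens after the tags are stripped, encoded markup such as &lt;b&gt; reappears as literal text, so the result should be treated as plain text rather than as safe HTML.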

nivs1978 · Answer

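Well-formed HTML (XHTML) is essentially just XML. You can parse your text into an XmlDocument object and call InnerText on the root element to extract the text. This strips all HTML tags in any form and also decodes special characters such as &lt; in one go. It only works when the input is well-formed XML: an unclosed tag such as <br>, or an entity that XML does not predefine such as &nbsp;, makes XmlDocument.LoadXml throw an XmlException.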
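A minimal sketch of the XmlDocument approach, assuming the input is a well-formed XML fragment. The wrapper element named root and the names XmlInnerTextSketch and ExtractText are illustrative choices, not part of the answer.

```csharp
using System.Xml;

public static class XmlInnerTextSketch
{
    public static string ExtractText(string xhtmlFragment)
    {
        var doc = new XmlDocument();

        // Wrap the fragment in a synthetic root so that several top-level
        // elements still form a single XML document.
        doc.LoadXml("<root>" + xhtmlFragment + "</root>");

        // InnerText concatenates the text of all descendant nodes.
        return doc.DocumentElement?.InnerText ?? string.Empty;
    }
}
```

For example, <div>hello</div><div><br/></div><div>1 &lt; 2</div> yields hello1 < 2. The string from the question would throw instead, because it contains unclosed <br> tags and &nbsp; entities.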

Jonesopolis · Answer

This:

(<.+?>|&nbsp;)

will match any tag or &nbsp;

string regex = @"(<.+?>|&nbsp;)";
var x = Regex.Replace(originalString, regex, "").Trim();

Then x = hello
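One difference from the <[^>]+> pattern used in the other answers is worth a sketch: in .NET, . does not match a newline unless RegexOptions.Singleline is set, so <.+?> skips tags that span several lines. The class and method names below are hypothetical and only serve the comparison.

```csharp
using System.Text.RegularExpressions;

public static class TagPatternComparison
{
    // Lazy dot, default options: '.' stops at '\n'.
    public static string WithLazyDot(string html) =>
        Regex.Replace(html, @"(<.+?>|&nbsp;)", "").Trim();

    // Lazy dot with Singleline: '.' also matches '\n'.
    public static string WithLazyDotSingleline(string html) =>
        Regex.Replace(html, @"(<.+?>|&nbsp;)", "", RegexOptions.Singleline).Trim();

    // Negated character class: matches newlines inside a tag by construction.
    public static string WithNegatedClass(string html) =>
        Regex.Replace(html, @"<[^>]+>|&nbsp;", "").Trim();
}
```

Given an anchor tag whose attributes start on a second line, such as "<a\nhref='x'>link</a>", WithLazyDot removes only the closing </a> and leaves the opening tag in place, while the other two variants return link. On the single-line string from the question, all three return hello.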

Ehsan88 · Answer

Sanitizing an HTML document involves a lot of tricky details. This package may be of help: https://github.com/mganss/HtmlSanitizer