Converting HTML to plain text in JS without a browser environment

Question

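I have a CouchDB view map function that generates an abstract of stored HTML documents (the first x characters of the text). Unfortunately, I have no browser environment available to convert the HTML to plain text.

Currently I use this multi-stage regex: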

html.replace(/<style([\s\S]*?)<\/style>/gi, ' ')
    .replace(/<script([\s\S]*?)<\/script>/gi, ' ')
    .replace(/(<(?:.|\n)*?>)/gm, ' ')
    .replace(/\s+/gm, ' ');

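It is a fairly good filter, but it is obviously not perfect, and some leftovers occasionally slip through. Is there a better way to convert HTML to plain text without a browser environment?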

EpokK · Accepted Answer

This converts HTML to plain text the way Gmail does:

// Drop style and script elements together with their contents.
html = html.replace(/<style([\s\S]*?)<\/style>/gi, '');
html = html.replace(/<script([\s\S]*?)<\/script>/gi, '');
// Turn block-level endings and <br> into line breaks; mark list items.
html = html.replace(/<\/div>/ig, '\n');
html = html.replace(/<\/li>/ig, '\n');
html = html.replace(/<li>/ig, ' * ');
html = html.replace(/<\/ul>/ig, '\n');
html = html.replace(/<\/p>/ig, '\n');
html = html.replace(/<br\s*[\/]?>/gi, "\n");
// Remove every remaining tag.
html = html.replace(/<[^>]+>/ig, '');
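The replacements above can be wrapped in a small helper and combined with truncation to produce the abstract the question asks for. This is only a sketch: htmlToPlainText and makeAbstract are hypothetical names, the entity decoding is my own addition covering just a handful of common entities, and the ES5-only syntax assumes an older JavaScript engine such as the one a CouchDB view server may run.

```javascript
// Sketch only: both function names are hypothetical, not part of any library.
function htmlToPlainText(html) {
  return String(html)
    // Drop style and script elements together with their contents.
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    // Turn common block-level endings and <br> into line breaks.
    .replace(/<\/(?:div|li|ul|p)>/gi, '\n')
    .replace(/<br\s*\/?>/gi, '\n')
    // Mark list items with a bullet (also when <li> carries attributes).
    .replace(/<li(?:\s[^>]*)?>/gi, ' * ')
    // Remove every remaining tag.
    .replace(/<[^>]+>/g, '')
    // Assumption: decode a few common entities only; this list is not exhaustive.
    // &amp; is decoded last so that "&amp;lt;" does not become "<".
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&');
}

// Collapse whitespace and keep the first maxLength characters.
function makeAbstract(html, maxLength) {
  var text = htmlToPlainText(html)
    .replace(/\s+/g, ' ')
    .replace(/^\s+|\s+$/g, '');
  return text.slice(0, maxLength);
}

var sample = '<style>p{color:red}</style><div>Hello <b>world</b></div>' +
             '<ul><li>one</li><li>two</li></ul><p>a &amp; b</p>';
console.log(makeAbstract(sample, 100)); // Hello world * one * two a & b
```

Inside a map function, the result of makeAbstract(doc.html, x) would be the emitted value, assuming the HTML is stored in a field such as doc.html. Regex-based stripping remains a heuristic, so malformed markup or comments containing ">" can still leave residue.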

If you can use jQuery:

var html = jQuery('<div>').html(html).text();

Gael · Answer

This regex works:

text.replace(/<[^>]*>/g, '');
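A quick check of what this single pattern does and does not handle may be useful. In the sketch below (the input string is made up for illustration), the tags are removed, but the script body and the HTML entity are left in the output:

```javascript
// Hypothetical sample input, used only to illustrate the behaviour.
var input = '<p>Tom &amp; Jerry</p><script>alert(1)</script>';

// The same replacement as above: strips anything that looks like a tag.
var stripped = input.replace(/<[^>]*>/g, '');

console.log(stripped); // Tom &amp; Jerryalert(1)
```

So this pattern is probably best used as the last step, after script and style elements have been removed along with their contents.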

gyula.nemeth · Answer

You can use TextVersionJS ( http://textversionjs.com ) to convert HTML to plain text. It is pure JavaScript (with tons of RegExps), so you can use it in the browser and in node.js as well.

In Node.js it looks like this:

var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
var textVersion = createTextVersion(yourHtml);

(I copied the example from the page. You have to npm install the module first.)

Dostonbek Oripjonov · Answer

You can try it this way. Neither textContent nor innerText is supported by every browser, so the code falls back from one to the other:

var temp = document.createElement("div");
temp.innerHTML = html;
return temp.textContent || temp.innerText || "";