Extract text from a PDF file using JavaScript

Question

I want to extract text from a PDF file on the client side, using only JavaScript and no server. I have already found JavaScript code through the following link: extract text from pdf in Javascript

and after that at

http://hublog.hubmed.org/archives/001948.html

and:

https://github.com/hubgit/hubgit.github.com/tree/master/2011/11/pdftotext

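1) I would like to know which of the files from the links above are needed for this extraction. 2) I do not know exactly how to adapt this code for use in an application rather than on the web.

Any answer is welcome. Thank you.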

Allanon · Accepted Answer

Here is a good example of using pdf.js to extract text: http://git.macropus.org/2011/11/pdftotext/example/

Of course, you will have to remove a lot of code for your purpose, but it should do the job.

Carlos Delgado · Answer

I wrote a simpler approach that uses the same library, pdf.js (latest version), and does not need to post messages between iframes.

The following example extracts all the text from the first page of the PDF only.

    /**
     * Retrieves the text of a specific page within a PDF document obtained through pdf.js
     *
     * @param {Integer} pageNum Specifies the number of the page
     * @param {PDFDocument} PDFDocumentInstance The PDF document obtained
     **/
    function getPageText(pageNum, PDFDocumentInstance) {
        // Return a Promise that is resolved once the text of the page is retrieved.
        // Returning the chain also passes any failure on to the caller.
        return PDFDocumentInstance.getPage(pageNum).then(function (pdfPage) {
            // The main trick to obtain the text of the PDF page: use the getTextContent method
            return pdfPage.getTextContent();
        }).then(function (textContent) {
            var textItems = textContent.items;
            var finalString = "";

            // Concatenate the string of each item to the final string
            for (var i = 0; i < textItems.length; i++) {
                var item = textItems[i];
                finalString += item.str + " ";
            }

            // The chain resolves with the text retrieved from the page
            return finalString;
        });
    }

    /**
     * Extract the text from the PDF
     */
    var PDF_URL = '/path/to/example.pdf';

    PDFJS.getDocument(PDF_URL).then(function (PDFDocumentInstance) {
        var totalPages = PDFDocumentInstance.pdfInfo.numPages;
        var pageNumber = 1;

        // Extract the text
        getPageText(pageNumber, PDFDocumentInstance).then(function (textPage) {
            // Show the text of the page in the console
            console.log(textPage);
        });
    }, function (reason) {
        // PDF loading error
        console.error(reason);
    });

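You can read an article about this solution here. As @xarxziux mentioned, the library has changed since the first solution was posted (that one no longer works with the latest version of pdf.js). This one should work in most cases.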
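Since the library keeps changing, here is a rough sketch of the same idea extended to every page. It assumes a more recent pdf.js build that exposes a `pdfjsLib` global and whose `getDocument()` returns a loading task with a `.promise` property. The helper names `joinTextItems` and `extractAllText` are mine, not part of pdf.js, and I have not checked this against every release, so treat it as a starting point rather than a drop-in replacement.

```javascript
// Sketch only. Assumes a pdf.js document object that offers numPages,
// getPage(n) and page.getTextContent(), as recent pdf.js builds do.

// Join the text items of one page into a single string.
// Items without a string `str` (for example marked-content entries) are skipped.
function joinTextItems(textContent) {
  return textContent.items
    .filter(function (item) { return typeof item.str === "string"; })
    .map(function (item) { return item.str; })
    .join(" ");
}

// Collect the text of every page, in page order.
// Resolves with an array holding one string per page.
async function extractAllText(pdfDocument) {
  const pages = [];
  for (let pageNum = 1; pageNum <= pdfDocument.numPages; pageNum++) {
    const page = await pdfDocument.getPage(pageNum);
    const textContent = await page.getTextContent();
    pages.push(joinTextItems(textContent));
  }
  return pages;
}

// Hypothetical usage in a page that has already loaded pdf.js:
// pdfjsLib.getDocument("/path/to/example.pdf").promise
//   .then(extractAllText)
//   .then(function (pages) { console.log(pages.join("\n\n")); })
//   .catch(function (reason) { console.error(reason); });
```

Pages are read one after another to keep the order obvious. Unlike the single-page example, the items are joined without a trailing space.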