Extract text from doc and docx

Question

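I would like to know how to read the contents of a doc or docx file. I am using a Linux VPS and PHP, but if there is a simpler solution in another language, please let me know, as long as it works on a Linux web server.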

no_freedom · Answer

This is a .DOCX-only solution. For .DOC or .PDF you will need something else, such as pdf2text.php for PDF.

    function docx2text($filename) {
        // ZIP entry names are case-sensitive: the main part is "word/document.xml"
        return readZippedXML($filename, "word/document.xml");
    }

    function readZippedXML($archiveFile, $dataFile) {
        // Create new ZIP archive
        $zip = new ZipArchive;

        // Open received archive file
        if (true === $zip->open($archiveFile)) {
            // If done, search for the data file in the archive
            if (($index = $zip->locateName($dataFile)) !== false) {
                // If found, read it to the string
                $data = $zip->getFromIndex($index);
                // Close archive file
                $zip->close();
                // Load XML from a string
                // Skip errors and warnings
                $xml = new DOMDocument();
                $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
                // Return data without XML formatting tags
                return strip_tags($xml->saveXML());
            }
            $zip->close();
        }

        // In case of failure return empty string
        return "";
    }

    echo docx2text("test.docx"); // Save this contents to file

M Khalid Junaid · Answer

Here I have added a solution for getting the text out of .doc and .docx Word files.

How to extract text from Word files (.doc, .docx) in PHP

For .doc

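    private function read_doc() {
        $fileHandle = fopen($this->filename, "r");
        $line = @fread($fileHandle, filesize($this->filename));
        fclose($fileHandle);

        // Split on carriage returns, which end paragraphs in the text stream
        $lines = explode(chr(0x0D), $line);
        $outtext = "";
        foreach ($lines as $thisline) {
            // Keep only non-empty chunks without NUL bytes (i.e. not binary data)
            $pos = strpos($thisline, chr(0x00));
            if ($pos === false && strlen($thisline) > 0) {
                $outtext .= $thisline . " ";
            }
        }

        // Drop everything except basic ASCII text; the "/" must be escaped
        // because it is also the pattern delimiter
        $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_]/", "", $outtext);
        return $outtext;
    }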

For .docx

    private function read_docx() {
        $striped_content = '';
        $content = '';

        // Note: the procedural zip_* functions are deprecated as of PHP 8.0
        $zip = zip_open($this->filename);
        if (!$zip || is_numeric($zip)) return false;

        while ($zip_entry = zip_read($zip)) {
            if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
            // ZIP entry names are case-sensitive
            if (zip_entry_name($zip_entry) != "word/document.xml") continue;
            $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
            zip_entry_close($zip_entry);
        } // end while

        zip_close($zip);

        // Table cell boundaries become spaces, paragraph ends become line breaks
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);
        $striped_content = strip_tags($content);

        return $striped_content;
    }

Luke Madhanga · Answer

Parsing .docx, .odt, .doc and .rtf documents

Based on the answers here and elsewhere, I wrote a library that parses docx, odt and rtf documents.

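The main improvement I made to parsing .docx and .odt is that the library processes the XML describing the document and tries to map it to HTML tags, i.e. em and strong tags. This means that if you use the library for a CMS, text formatting is not lost.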

You can get it here.

chiptuned · Answer

My solution is Antiword for .doc and docx2txt for .docx.

Assuming a Linux server that you control, download each one, extract it, then install it. I installed each of them system-wide:

antiword: make global_install
docx2txt: make install

Then use these tools to extract the text into a string in PHP:

    // for .doc
    $text = shell_exec('/usr/local/bin/antiword -w 0 ' . escapeshellarg($docFilePath));

    // for .docx
    $text = shell_exec('/usr/local/bin/docx2txt.pl ' . escapeshellarg($docxFilePath) . ' -');

docx2txt requires Perl.

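no_freedom's solution does extract the text from docx files, but it can strip whitespace. Most of the files I tested contained places where words that should have been separated had no space between them. That is not good if you want to run full-text search on the documents you process.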
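The lost whitespace described above can be reduced by turning paragraph ends, tabs and line breaks into whitespace before the tags are stripped. The following is only a minimal sketch, assuming the PHP zip extension is installed; the function names `docxXmlToText` and `docxToText` are made up for this illustration and do not come from any of the answers here.

```php
<?php
/**
 * Convert the raw XML of word/document.xml to plain text, keeping
 * paragraph breaks so that adjacent paragraphs do not run together.
 * (Hypothetical helper, not part of any library.)
 */
function docxXmlToText(string $xml): string
{
    // A tab character is an empty <w:tab/> element inside a run.
    // Tab stop definitions also use <w:tab ...>, but carry attributes,
    // so only the attribute-free form is replaced.
    $xml = preg_replace('/<w:tab\s*\/>/', "\t", $xml);
    // Explicit line or page breaks.
    $xml = preg_replace('/<w:br\b[^>]*\/>/', "\n", $xml);
    // The end of a paragraph becomes a newline.
    $xml = str_replace('</w:p>', "\n", $xml);
    // Drop the remaining markup, then decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    return rtrim($text, "\n");
}

/**
 * Read word/document.xml from a .docx file and return its text,
 * or an empty string if the archive or the entry cannot be read.
 */
function docxToText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return '';
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? '' : docxXmlToText($xml);
}
```

Calling `docxToText("test.docx")` would then return one line per paragraph. Table cells are not separated by this sketch; that would need an extra replacement for the cell end tag.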

Lalaka · Answer

Try Apache POI. It works well with Java. I don't think you will have any trouble installing Java on Linux.

Mohini · Answer

I used docxtotxt to extract the contents of a docx file. My code is as follows:

    if ($extention == "docx") {
        $docxFilePath = "/var/www/vhosts/abc.com/httpdocs/writers/filename.docx";
        $content = shell_exec('/var/www/vhosts/abc.com/httpdocs/docx2txt/docx2txt.pl ' . escapeshellarg($docxFilePath) . ' -');
    }
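The extension check above could be extended so that one function picks the tool for both .doc and .docx. This is a rough sketch rather than a tested recipe: the function names `buildExtractCommand` and `extractText` are hypothetical, and the default binary paths are assumptions borrowed from the antiword/docx2txt answer, so adjust them to wherever the tools are installed.

```php
<?php
/**
 * Build the shell command for a given file, or return null when the
 * extension is not handled. The default binary paths are assumptions.
 */
function buildExtractCommand(
    string $path,
    string $antiword = '/usr/local/bin/antiword',
    string $docx2txt = '/usr/local/bin/docx2txt.pl'
): ?string {
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    switch ($extension) {
        case 'doc':
            return $antiword . ' -w 0 ' . escapeshellarg($path);
        case 'docx':
            // Same invocation as in the answers above, with a trailing "-".
            return $docx2txt . ' ' . escapeshellarg($path) . ' -';
        default:
            return null;
    }
}

/** Run the matching tool and return its output, or null on failure. */
function extractText(string $path): ?string
{
    $command = buildExtractCommand($path);
    if ($command === null) {
        return null;
    }
    $output = shell_exec($command);
    return is_string($output) ? $output : null;
}
```

Keeping the command construction separate from `shell_exec` makes it easy to check that the file path always passes through `escapeshellarg` before it reaches the shell.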

Ilya P · Answer

You can use Apache Tika as a complete solution; it provides a REST API.
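As an illustration of the REST approach, the sketch below sends a file to a Tika server with an HTTP PUT to its `/tika` endpoint and asks for plain text. It assumes a Tika server is already running and reachable at `http://localhost:9998` (the usual default, but an assumption here) and that `allow_url_fopen` is enabled; the function name `tikaExtractText` is made up for this example.

```php
<?php
/**
 * Send a file to a running Tika server and return the plain text,
 * or null if the file cannot be read or the request fails.
 * The base URL is an assumption; change it to match your server.
 */
function tikaExtractText(string $path, string $baseUrl = 'http://localhost:9998'): ?string
{
    $bytes = @file_get_contents($path);
    if ($bytes === false) {
        return null;
    }

    $context = stream_context_create([
        'http' => [
            'method'  => 'PUT',
            // Ask for plain text; send the body as opaque bytes so the
            // server detects the document type from the content.
            'header'  => "Accept: text/plain\r\n"
                       . "Content-Type: application/octet-stream\r\n",
            'content' => $bytes,
            'timeout' => 30,
        ],
    ]);

    $text = @file_get_contents(rtrim($baseUrl, '/') . '/tika', false, $context);
    return $text === false ? null : $text;
}
```

A call such as `tikaExtractText('/path/to/file.docx')` would return the extracted text, and the same call should work for .doc files since the server detects the format itself.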

Another good library is RawText, since it can run OCR on images and extract text from any document. It is not free, and it works through a REST API.

Sample code for extracting a file with RawText:

    $result = $rawText->extract($your_file);

kadutskyi · Answer

I have made a small improvement to the doc-to-txt converter function:

    private function read_doc() {
        $line_array = array();
        $fileHandle = fopen($this->filename, "r");
        $line = @fread($fileHandle, filesize($this->filename));
        fclose($fileHandle);

        $lines = explode(chr(0x0D), $line);
        foreach ($lines as $thisline) {
            // Skip chunks containing NUL bytes (binary data); keep empty lines
            $pos = strpos($thisline, chr(0x00));
            if ($pos === false) {
                // The "/" must be escaped because it is also the pattern delimiter
                $line_array[] = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_]/", "", $thisline);
            }
        }

        return implode("\n", $line_array);
    }

Empty lines are now preserved, and the resulting txt file is laid out line by line.