Using PDFBox to determine the coordinates of words in a document

Question

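I am using PDFBox to extract the coordinates of words/strings in a PDF document. So far I have managed to determine the position of individual characters. This is my code so far, taken from the PDFBox documentation: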

package printtextlocations;

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

// Written against the PDFBox 1.8 API (org.apache.pdfbox.util package).
public class PrintTextLocations extends PDFTextStripper {

    public PrintTextLocations() throws IOException {
        super.setSortByPosition(true);
    }

    public static void main(String[] args) throws Exception {
        PDDocument document = null;
        try {
            // Backslashes must be escaped inside a Java string literal.
            File input = new File("C:\\path\\to\\PDF.pdf");
            document = PDDocument.load(input);
            if (document.isEncrypted()) {
                try {
                    document.decrypt("");
                } catch (InvalidPasswordException e) {
                    System.err.println("Error: Document is encrypted with a password.");
                    System.exit(1);
                }
            }
            PrintTextLocations printer = new PrintTextLocations();
            List allPages = document.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = (PDPage) allPages.get(i);
                System.out.println("Processing page: " + i);
                PDStream contents = page.getContents();
                if (contents != null) {
                    printer.processStream(page, page.findResources(), page.getContents().getStream());
                }
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    /**
     * Called once per character; prints its position and metrics.
     *
     * @param text The text to be processed
     */
    @Override /* this is questionable, not sure if needed... */
    protected void processTextPosition(TextPosition text) {
        System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj()
                + " fs=" + text.getFontSize()
                + " xscale=" + text.getXScale()
                + " height=" + text.getHeightDir()
                + " space=" + text.getWidthOfSpace()
                + " width=" + text.getWidthDirAdj() + "]"
                + text.getCharacter());
    }
}

This produces a series of lines like the following, one for each character (including spaces), with its position:

String[202.5604,41.880127 fs=1.0 xscale=13.98 height=9.68814 space=3.8864403 width=9.324661]P

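Here "P" is the character. I have not been able to find a function in PDFBox that finds words, and I am not familiar enough with Java to reliably concatenate these characters back into words that I can search, even though the spaces are included in the output. Has anyone else been in a similar situation, and if so, how did you approach it? To keep things simple, I only really need the coordinates of the first character of the word, but how to match a string against this kind of output is beyond me.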

Nicolas W. · Answer

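PDFBox has no function that extracts words automatically. I am currently working on extracting data and gathering it into blocks, and this is my process:

I extract all the characters of the document (called glyphs) and store them in a list.
I loop over the list and analyze the coordinates of each glyph. If two glyphs overlap vertically (the top of the current glyph lies between the top and bottom of the previous one, or the bottom of the current glyph lies between the top and bottom of the previous one), I add the current glyph to the same line.
At this point I have extracted the different lines of the document. Be careful: if your document has multiple columns, "line" here means all the glyphs that overlap vertically, that is, the text of every column at the same vertical coordinates.
Then you can compare the left coordinate of the current glyph with the right coordinate of the previous one to decide whether they belong to the same word. The PDFTextStripper class provides a getSpacingTolerance() method that gives you, based on trial and error, the value of a "normal" space. If the difference between the right and left coordinates is smaller than this value, both glyphs belong to the same word.

I applied this method in my own work and it works well.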
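The steps above can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions, not the answerer's actual code: the Glyph class is a hypothetical stand-in for whatever position object you collect (for example PDFBox's TextPosition), y is assumed to grow downward so that top < bottom, the glyphs are assumed to arrive in reading order, and the spacing tolerance is passed in as a plain number rather than read from PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

class GlyphGrouper {

    /** Hypothetical stand-in for one extracted glyph: its text and bounding box. */
    static final class Glyph {
        final String text;
        final float left, right, top, bottom;

        Glyph(String text, float left, float right, float top, float bottom) {
            this.text = text;
            this.left = left;
            this.right = right;
            this.top = top;
            this.bottom = bottom;
        }
    }

    /** Same line: the current glyph's top or bottom lies within the previous glyph's vertical extent. */
    static boolean sameLine(Glyph prev, Glyph cur) {
        boolean topInside = cur.top >= prev.top && cur.top <= prev.bottom;
        boolean bottomInside = cur.bottom >= prev.top && cur.bottom <= prev.bottom;
        return topInside || bottomInside;
    }

    /** Same word: the horizontal gap between the two glyphs is smaller than the tolerance. */
    static boolean sameWord(Glyph prev, Glyph cur, float spaceTolerance) {
        return cur.left - prev.right < spaceTolerance;
    }

    /** Concatenates glyphs into words, breaking on whitespace, line changes and wide gaps. */
    static List<String> words(List<Glyph> glyphs, float spaceTolerance) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            boolean blank = g.text.trim().isEmpty();
            boolean breaks = prev != null
                    && (!sameLine(prev, g) || !sameWord(prev, g, spaceTolerance));
            if ((blank || breaks) && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            if (!blank) {
                current.append(g.text);
            }
            prev = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

To get the coordinates of a word rather than just its text, you could keep the first Glyph of each word alongside the StringBuilder; that is left out here to keep the sketch short.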

Dainesch · Answer

Based on the original idea here, this is a version of the text search for PDFBox 2. The code itself is rough, but simple. It should get you started fairly quickly.

import java.io.IOException;
import java.io.Writer;
import java.util.List;
import java.util.Set;

import lu.abac.pdfclient.data.PDFTextLocation;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class PrintTextLocator extends PDFTextStripper {

    private final Set<PDFTextLocation> locations;

    public PrintTextLocator(PDDocument document, Set<PDFTextLocation> locations) throws IOException {
        super.setSortByPosition(true);
        this.document = document;
        this.locations = locations;
        // The extracted text itself is not needed, so write it to a no-op Writer.
        this.output = new Writer() {
            @Override
            public void write(char[] cbuf, int off, int len) throws IOException {
            }

            @Override
            public void flush() throws IOException {
            }

            @Override
            public void close() throws IOException {
            }
        };
    }

    public Set<PDFTextLocation> doSearch() throws IOException {
        processPages(document.getDocumentCatalog().getPages());
        return locations;
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        super.writeString(text);
        // Case-insensitive search within the current chunk of text.
        String searchText = text.toLowerCase();
        for (PDFTextLocation textLoc : locations) {
            int start = searchText.indexOf(textLoc.getText().toLowerCase());
            if (start != -1) {
                // Found: record the position of the first matching character.
                TextPosition pos = textPositions.get(start);
                textLoc.setFound(true);
                textLoc.setPage(getCurrentPageNo());
                textLoc.setX(pos.getXDirAdj());
                textLoc.setY(pos.getYDirAdj());
            }
        }
    }
}
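The PDFTextLocation class imported from lu.abac.pdfclient.data is not shown in the answer. A minimal sketch of what it might look like, inferred only from the getters and setters the locator calls, is given below; the constructor, the equals/hashCode choice (by search text, since instances are kept in a Set) and the toString format are assumptions.

```java
import java.util.Objects;

/** Hypothetical stand-in for lu.abac.pdfclient.data.PDFTextLocation (not shown in the answer). */
class PDFTextLocation {
    private final String text; // the text to search for
    private boolean found;     // set once the text has been located
    private int page;          // page number as reported by the stripper
    private float x;           // x of the first matching character
    private float y;           // y of the first matching character

    PDFTextLocation(String text) {
        this.text = Objects.requireNonNull(text, "text");
    }

    String getText() { return text; }
    boolean isFound() { return found; }
    void setFound(boolean found) { this.found = found; }
    int getPage() { return page; }
    void setPage(int page) { this.page = page; }
    float getX() { return x; }
    void setX(float x) { this.x = x; }
    float getY() { return y; }
    void setY(float y) { this.y = y; }

    // Assumption: two locations are the same entry if they search for the same text.
    @Override
    public boolean equals(Object o) {
        return o instanceof PDFTextLocation && text.equals(((PDFTextLocation) o).text);
    }

    @Override
    public int hashCode() {
        return text.hashCode();
    }

    @Override
    public String toString() {
        return text + (found ? " @ page " + page + " [" + x + ", " + y + "]" : " (not found)");
    }
}
```

With a class like this, a search would presumably fill a Set with one PDFTextLocation per term, pass it to the PrintTextLocator constructor together with the loaded PDDocument, and read the results back from doSearch().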

Marouita · Answer

Take a look at this; I think it is what you need.

https://jackson-brain.com/using-pdfbox-to-locate-text-coordinates-within-a-pdf-in-Java/

Here is the code:

import java.io.File;
import java.io.IOException;
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

// Written against the PDFBox 1.8 API.
public class PrintTextLocations extends PDFTextStripper {

    public static StringBuilder tWord = new StringBuilder();
    public static String seek;
    public static String[] seekA;
    public static List<String> wordList = new ArrayList<String>();
    public static boolean is1stChar = true;
    public static boolean lineMatch;
    public static int pageNo = 1;
    public static double lastYVal;

    public PrintTextLocations() throws IOException {
        super.setSortByPosition(true);
    }

    public static void main(String[] args) throws Exception {
        PDDocument document = null;
        // args[0] is the PDF file, args[1] a comma-separated list of words to look for.
        seekA = args[1].split(",");
        seek = args[1];
        try {
            File input = new File(args[0]);
            document = PDDocument.load(input);
            if (document.isEncrypted()) {
                try {
                    document.decrypt("");
                } catch (InvalidPasswordException e) {
                    System.err.println("Error: Document is encrypted with a password.");
                    System.exit(1);
                }
            }
            PrintTextLocations printer = new PrintTextLocations();
            List allPages = document.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = (PDPage) allPages.get(i);
                PDStream contents = page.getContents();
                if (contents != null) {
                    printer.processStream(page, page.findResources(), page.getContents().getStream());
                }
                pageNo += 1;
            }
        } finally {
            if (document != null) {
                System.out.println(wordList);
                document.close();
            }
        }
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        String tChar = text.getCharacter();
        System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj()
                + " fs=" + text.getFontSize()
                + " xscale=" + text.getXScale()
                + " height=" + text.getHeightDir()
                + " space=" + text.getWidthOfSpace()
                + " width=" + text.getWidthDirAdj() + "]"
                + text.getCharacter());
        // Punctuation and whitespace end the current word.
        String REGEX = "[,.(:;!?)/]";
        char c = tChar.charAt(0);
        lineMatch = matchCharLine(text);
        if ((!tChar.matches(REGEX)) && (!Character.isWhitespace(c))) {
            if ((!is1stChar) && (lineMatch == true)) {
                appendChar(tChar);
            } else if (is1stChar == true) {
                setWordCoord(text, tChar);
            }
        } else {
            endWord();
        }
    }

    protected void appendChar(String tChar) {
        tWord.append(tChar);
        is1stChar = false;
    }

    // Starts a new word, prefixed with "(page)[x : y] " of its first character.
    protected void setWordCoord(TextPosition text, String tChar) {
        tWord.append("(").append(pageNo).append(")[")
                .append(roundVal(Float.valueOf(text.getXDirAdj()))).append(" : ")
                .append(roundVal(Float.valueOf(text.getYDirAdj()))).append("] ")
                .append(tChar);
        is1stChar = false;
    }

    protected void endWord() {
        // Strip non-ASCII characters; the backslashes must be doubled in a Java string.
        String newWord = tWord.toString().replaceAll("[^\\x00-\\x7F]", "");
        String sWord = newWord.substring(newWord.lastIndexOf(' ') + 1);
        if (!"".equals(sWord)) {
            if (Arrays.asList(seekA).contains(sWord)) {
                wordList.add(newWord);
            } else if ("SHOWMETHEMONEY".equals(seek)) {
                // Special keyword: list every word found.
                wordList.add(newWord);
            }
        }
        tWord.delete(0, tWord.length());
        is1stChar = true;
    }

    // Returns true if this character has the same rounded y value as the previous one.
    protected boolean matchCharLine(TextPosition text) {
        Double yVal = roundVal(Float.valueOf(text.getYDirAdj()));
        if (yVal.doubleValue() == lastYVal) {
            return true;
        }
        lastYVal = yVal.doubleValue();
        endWord();
        return false;
    }

    protected Double roundVal(Float yVal) {
        DecimalFormat rounded = new DecimalFormat("0.0'0'");
        Double yValDub = Double.valueOf(rounded.format(yVal));
        return yValDub;
    }
}

Dependencies:

PDFBox, FontBox, Apache Commons Logging.

You can run it from the command line as follows:

javac PrintTextLocations.java
sudo java PrintTextLocations file.pdf Word1,Word2,....

The output looks like this:

[(1)[190.3 : 286.8] Word1, (1)[283.3 : 286.8] Word2, ...]
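The word-splitting rule used in processTextPosition above (a word ends at whitespace or at one of the characters in the REGEX constant) can be tried out in isolation on a plain string. The sketch below is an illustration only: WordSplitter and Word are hypothetical names, and it records the index of each word's first character, which is the character whose coordinates the question asks for.

```java
import java.util.ArrayList;
import java.util.List;

class WordSplitter {

    // Same delimiter characters as the REGEX constant "[,.(:;!?)/]" in the answer.
    private static final String DELIMITERS = ",.(:;!?)/";

    /** A word together with the index of its first character in the source string. */
    static final class Word {
        final String text;
        final int start;

        Word(String text, int start) {
            this.text = text;
            this.start = start;
        }
    }

    /** Splits a line into words at whitespace and delimiter characters. */
    static List<Word> split(String line) {
        List<Word> words = new ArrayList<>();
        int start = -1; // index where the current word began, or -1 if not inside a word
        for (int i = 0; i <= line.length(); i++) {
            boolean boundary = i == line.length()
                    || Character.isWhitespace(line.charAt(i))
                    || DELIMITERS.indexOf(line.charAt(i)) >= 0;
            if (!boundary && start < 0) {
                start = i;
            } else if (boundary && start >= 0) {
                words.add(new Word(line.substring(start, i), start));
                start = -1;
            }
        }
        return words;
    }
}
```

If the characters of a line and their position objects are kept in parallel lists, the start index of a word can be used to look up the position of its first character.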

mike · Answer

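I got this working with the IKVM-converted PDFBox.NET 1.8.9, in C# and .NET.

I eventually found out that the character (glyph) coordinates are private to the .NET assembly, but they can be accessed using System.Reflection.

I posted a complete example that gets the coordinates of WORDS and draws them on images of the PDF using SVG and HTML here: https://github.com/tsamop/PDF_Interpreter

For the example below you need PDFBox.NET: http://www.squarepdf.net/pdfbox-in-net, and you need to add references to it in your project.

It took me quite a while to figure this out, so I really hope it saves someone else some time!!

If you just need to know where to look for the characters and their coordinates, a very stripped-down version looks like this: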

using System;
using System.Reflection;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// To test, run: pdfTest.RunTest(@"C:\temp\test_2.pdf");
class pdfTest
{
    // Simple example for getting character (glyph) coordinates out of a PDF document.
    // A more complete example is here: https://github.com/tsamop/PDF_Interpreter
    public static void RunTest(string sFilename)
    {
        // There is probably a better way to get the page count, but I cut this out of a bigger project.
        PDDocument oDoc = PDDocument.load(sFilename);
        object[] oPages = oDoc.getDocumentCatalog().getAllPages().toArray();
        int iPageNo = 0; // 1-based!
        foreach (object oPage in oPages)
        {
            iPageNo++;

            // Feed the stripper a single page.
            PDFTextStripper tStripper = new PDFTextStripper();
            tStripper.setStartPage(iPageNo);
            tStripper.setEndPage(iPageNo);
            tStripper.getText(oDoc);

            // This gets the non-public "charactersByArticle" field in PDFBox.
            FieldInfo charactersByArticleInfo = typeof(PDFTextStripper).GetField(
                "charactersByArticle", BindingFlags.NonPublic | BindingFlags.Instance);
            object charactersByArticle = charactersByArticleInfo.GetValue(tStripper);
            object[] aoArticles = (object[])charactersByArticle.GetField("elementData");
            foreach (object oArticle in aoArticles)
            {
                if (oArticle != null)
                {
                    // The characters within the article.
                    object[] aoCharacters = (object[])oArticle.GetField("elementData");
                    foreach (object oChar in aoCharacters)
                    {
                        /* Properties I found using reflection:
                         * endX, endY, font, fontSize, fontSizePt, maxTextHeight, pageHeight, pageWidth,
                         * rot, str, textPos, unicodCP, widthOfSpace, widths, wordSpacing, x, y
                         */
                        if (oChar != null)
                        {
                            // This is a really quick test. For a more complete solution that pulls the
                            // characters into words and displays the word positions on the page, try:
                            // https://github.com/tsamop/PDF_Interpreter

                            // The y values appear to be the bottom of the character?
                            double mfMaxTextHeight = Convert.ToDouble(oChar.GetField("maxTextHeight")); // I think this is the height of the character/word
                            char mcThisChar = oChar.GetField("str").ToString().ToCharArray()[0];
                            double mfX = Convert.ToDouble(oChar.GetField("x"));
                            double mfY = Convert.ToDouble(oChar.GetField("y")) - mfMaxTextHeight;

                            // Calculate the other side of the glyph.
                            double mfWidth0 = ((Single[])oChar.GetField("widths"))[0];
                            double mfXend = mfX + mfWidth0; // Convert.ToDouble(oChar.GetField("endX"));

                            // Calculate the bottom of the glyph.
                            double mfYend = mfY + mfMaxTextHeight; // Convert.ToDouble(oChar.GetField("endY"));

                            double mfPageHeight = Convert.ToDouble(oChar.GetField("pageHeight"));
                            double mfPageWidth = Convert.ToDouble(oChar.GetField("pageWidth"));

                            System.Diagnostics.Debug.Print(@"add some stuff to test {0}, {1}, {2}", mcThisChar, mfX, mfY);
                        }
                    }
                }
            }
        }
    }
}

/// <summary>
/// To deal with the Java interface hiding necessary properties! ~mwr
/// </summary>
public static class GetField_Extension
{
    public static object GetField(this object randomPDFboxObject, string sFieldName)
    {
        FieldInfo itemInfo = randomPDFboxObject.GetType().GetField(
            sFieldName, BindingFlags.NonPublic | BindingFlags.Instance);
        return itemInfo.GetValue(randomPDFboxObject);
    }
}

GingerMattRogers · Answer

For anyone still needing help, this is what I used in my code, and it should be a useful start. It uses PDFBox 2.0.16.

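import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class PDFTextLocator extends PDFTextStripper {

    private static String key_string;
    private static float x;
    private static float y;

    public PDFTextLocator() throws IOException {
        // -1 marks "not found yet".
        x = -1;
        y = -1;
    }

    /**
     * Takes in a PDF document, a phrase to find, and a page to search, and returns the x,y in a float array.
     *
     * @param document the loaded PDF document
     * @param phrase   the phrase to look for
     * @param page     the page to search (1-based, as used by setStartPage/setEndPage)
     * @return a two-element array {x, y}, with y measured from the bottom of the page
     * @throws IOException if the document cannot be processed
     */
    public static float[] getCoordinates(PDDocument document, String phrase, int page) throws IOException {
        key_string = phrase;
        PDFTextStripper stripper = new PDFTextLocator();
        stripper.setSortByPosition(true);
        stripper.setStartPage(page);
        stripper.setEndPage(page);
        stripper.writeText(document, new OutputStreamWriter(new ByteArrayOutputStream()));
        // PDDocument.getPage takes a 0-based index, while the stripper's page numbers are 1-based.
        y = document.getPage(page - 1).getMediaBox().getHeight() - y;
        return new float[]{x, y};
    }

    /**
     * Override the default functionality of PDFTextStripper.writeString()
     */
    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        if (string.contains(key_string)) {
            // Position of the first character of the chunk that contains the phrase.
            TextPosition text = textPositions.get(0);
            if (x == -1) {
                x = text.getXDirAdj();
                y = text.getYDirAdj();
            }
        }
    }
}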
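Two small conversions are easy to get wrong in code like the above: the y value reported by the text stripper is measured down from the top of the page, whereas PDF user space is measured up from the bottom, and the stripper's page numbers are 1-based whereas PDDocument.getPage takes a 0-based index. The following is a minimal sketch of both, assuming an unrotated page; PageCoordinates and its method names are hypothetical helpers, not part of PDFBox.

```java
class PageCoordinates {

    /** Converts a y measured down from the top edge into a y measured up from the bottom edge. */
    static float toBottomLeftOrigin(float yFromTop, float pageHeight) {
        return pageHeight - yFromTop;
    }

    /** Converts a 1-based page number into a 0-based page index. */
    static int toPageIndex(int pageNumber) {
        if (pageNumber < 1) {
            throw new IllegalArgumentException("page numbers start at 1: " + pageNumber);
        }
        return pageNumber - 1;
    }
}
```

For a US Letter page (792 points high), a glyph reported at y = 41.88 from the top would sit at roughly 750.12 points from the bottom.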

In a Maven project, it depends on:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.16</version>
</dependency>