Reading a specific page from a PDF document using PDFBox

Question

How can I read a specific page (given its page number) from a PDF document using PDFBox?

Nicolas Modrzyk · Accepted Answer

This should work:

PDPage firstPage = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);

as seen in the Bookmarks section of the tutorial.

Update 2015, version 2.0.0 SNAPSHOT

This seems to have been removed and then put back (?). getPage is in the 2.0.0 javadoc. To use it:

PDDocument document = PDDocument.load(new File(filename));
PDPage firstPage = document.getPage(0);

The getAllPages method has been renamed to getPages:

PDPage page = (PDPage) doc.getPages().get(0);

Raymond C Borges Hink · Answer

// Using the PDFBox library available from http://pdfbox.apache.org/
// Writes specific pages of a PDF document to a new PDF file

// Read in the source PDF document
PDDocument pdDoc = PDDocument.load(file);
// The new PDF document
PDDocument document = null;
// Add page "i", where "i" is the zero-based page index, then save the new PDF document
try {
    document = new PDDocument();
    document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(i));
    document.save("file path" + "new document title" + ".pdf");
    document.close();
    pdDoc.close();
} catch (Exception e) {
    // Report the failure instead of silently swallowing it
    e.printStackTrace();
}

sam9046 · Answer

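The answers above were helpful, but they did not exactly match what I needed, so I am adding my own answer here.

In my scenario, I scanned each page individually, looked for a keyword, and if that keyword appeared, did something with that page (that is, copied it or ignored it).

I have simply tried to replace the variables and so on in my answer with generic ones: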

public void extractImages() throws Exception {
    try {
        String destinationDir = "OUTPUT DIR GOES HERE";
        // Load the PDF
        String inputPdf = "INPUT PDF DIR GOES HERE";
        PDDocument document = PDDocument.load(inputPdf);
        List<PDPage> list = document.getDocumentCatalog().getAllPages();
        // Declare output file name
        String fileName = "output.pdf";
        // Create output document
        PDDocument newDocument = new PDDocument();
        // Create PDFTextStripper, used for searching the page text
        PDFTextStripper textStripper = new PDFTextStripper();
        // Loop through each page and search for "SEARCH STRING". If it is absent,
        // i.e. this is the image page, copy the page into the new output.pdf.
        // PDFTextStripper page numbers start at 1; list indices start at 0.
        for (int i = 1; i <= list.size(); i++) {
            // Restrict textStripper to one page at a time
            textStripper.setStartPage(i);
            textStripper.setEndPage(i);
            // Fetch the page text
            String pageText = textStripper.getText(document);
            boolean found = pageText.contains("SEARCH STRING");
            // If nothing is found, copy the page across to the new output PDF
            if (!found) {
                PDPage returnPage = list.get(i - 1);
                System.out.println("page returned is: " + returnPage);
                System.out.println("Copy page");
                newDocument.importPage(returnPage);
            }
        }
        newDocument.save(destinationDir + fileName);
        System.out.println(fileName + " saved");
        newDocument.close();
        document.close();
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("catch extract image");
    }
}
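
The page-selection logic in this answer can be separated from PDFBox itself. The following is a minimal sketch under that assumption, not the author's code: PageFilter and pagesWithout are hypothetical names, and the per-page texts are assumed to have been extracted already, for example with a PDFTextStripper restricted to one page at a time. It returns zero-based indices, which is what the page list's get(...) expects, whereas setStartPage and setEndPage count from 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: decides which pages to copy, independent of PDFBox. */
public class PageFilter {

    /**
     * Returns the zero-based indices of the pages whose text does not
     * contain the keyword. pageTexts.get(0) holds the text of page 1.
     */
    public static List<Integer> pagesWithout(List<String> pageTexts, String keyword) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (!pageTexts.get(i).contains(keyword)) {
                indices.add(i);
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        // Illustrative page texts; in practice these would come from the PDF.
        List<String> texts = Arrays.asList(
                "cover", "SEARCH STRING here", "scanned image page");
        System.out.println(pagesWithout(texts, "SEARCH STRING"));
    }
}
```

Keeping the decision in a plain method like this makes the off-by-one between page numbers and list indices easy to check without a PDF at hand.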

Paulpro · Answer

Add this to the command-line call:

ExtractText -startPage 1 -endPage 1 filename.pdf

Change 1 to the page number you need.

Mowazzem Hosen · Answer

Here is the solution. I hope it solves your problem.

String fileName = "C:\\mypdf.pdf";
PDDocument doc = PDDocument.load(fileName);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(2);
// Pages 1 to 2 will be parsed. To parse only one page, set both values
// the same (e.g. setStartPage(1); setEndPage(1);)
String result = stripper.getText(doc);
doc.close();
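
Note that the page numbers passed to setStartPage and setEndPage count from 1, while getPage and the page list's get(...) used in other answers count from 0. A minimal sketch of a conversion helper follows; PageNumbers and toIndex are hypothetical names, not part of PDFBox.

```java
/** Hypothetical helper, not part of PDFBox. */
public class PageNumbers {

    /**
     * Converts a one-based page number (as used by setStartPage/setEndPage)
     * to a zero-based index (as used by getPage), checking the range first.
     */
    public static int toIndex(int pageNumber, int pageCount) {
        if (pageNumber < 1 || pageNumber > pageCount) {
            throw new IllegalArgumentException(
                    "Page " + pageNumber + " is outside 1.." + pageCount);
        }
        return pageNumber - 1;
    }

    public static void main(String[] args) {
        // For a 5-page document, page 1 maps to index 0 and page 5 to index 4.
        System.out.println(toIndex(1, 5));
        System.out.println(toIndex(5, 5));
    }
}
```

With PDFBox 2.x, the page count could come from the document's getNumberOfPages() and the result could be passed to getPage(...).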

Bilal Shahid · Answer

You can use the getPage method on the PDDocument instance:

PDDocument pdDocument = PDDocument.load(inputStream);
PDPage pdPage = pdDocument.getPage(0);