<p>How to extract the text between tags

Question

I want to extract the text inside the p and li tags of HTML pages so that I can tokenize each page and build an inverted index per page for answering search queries.

How do I get the p tags using Jsoup?

Elements e = doc.select("");

What string should be passed as that parameter?

MaVRoSCy · Accepted Answer

This will do the job:

Elements e=doc.select("p");

The Jsoup selector syntax documentation lists all the selectors you can use.
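Since the question asks for both p and li text, a selector group may be useful. The following is a minimal sketch, assuming Jsoup is on the classpath; the class name `ParagraphAndListText` and the sample HTML are made up for illustration.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParagraphAndListText {
    // Returns the text of every <p> and <li> element, in document order.
    public static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        // "p, li" is a selector group: it matches elements that are either <p> or <li>.
        // eachText() skips matched elements that contain no text.
        return doc.select("p, li").eachText();
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from the question.
        String html = "<p>intro <strong>bold</strong></p><ul><li>first</li><li>second</li></ul>";
        for (String text : extract(html)) {
            System.out.println(text);
        }
    }
}
```

Note that if a p is nested inside an li, its text would appear twice (once for each matched element), so the selector may need narrowing for such pages.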

Suppose you have this HTML:

String html="<p>some <strong>bold</strong> text</p>";

To get "some bold text" as the result, you need to use:

Document doc = Jsoup.parse(html);
Element p = doc.select("p").first();
String text = doc.body().text(); // some bold text

or

String text = p.text(); //some bold text

Suppose you have the following more complex HTML:

String html="<div id=someid><p>some text</p><span>some other text</span><p> another p tag</p></div>";

To get the values from the two p tags, you need to do the following:

Document doc = Jsoup.parse(html);
Element content = doc.getElementById("someid");
Elements p = content.getElementsByTag("p");
String pConcatenated = "";
for (Element x : p) {
    // text() returns trimmed text, and no separator is added between paragraphs
    pConcatenated += x.text();
}
System.out.println(pConcatenated); // some textanother p tag
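Because the loop appends each paragraph's text without a separator, the last word of one paragraph runs into the first word of the next, which is awkward for tokenizing. One possible alternative, sketched here under the assumption that a single space is an acceptable separator, joins the texts explicitly. The class and method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class JoinParagraphs {
    // Joins the text of every <p> under the element with the given id, separated by single spaces.
    public static String joinParagraphs(String html, String id) {
        Element content = Jsoup.parse(html).getElementById(id);
        if (content == null) {
            return ""; // no element with that id in the document
        }
        // eachText() yields one trimmed string per matched element that has text.
        return String.join(" ", content.select("p").eachText());
    }

    public static void main(String[] args) {
        String html = "<div id=someid><p>some text</p><span>some other text</span><p> another p tag</p></div>";
        System.out.println(joinParagraphs(html, "someid"));
    }
}
```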

You can also find more information in the Jsoup documentation.

Hope this helps.

PANKAJ MALI · Answer

Try this:

File input = new File("/home/s5/Downloads/PDFCopy/PDs.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.Cisco.com/c/en/us/products/collateral/wireless/aironet-1815-series-access-points/datasheet-c78-738481.pdf");
Elements link = doc.select("p");
String linkText = link.text();
// System.out.println(linkText);
// Split on runs of non-word characters; the backslash must be escaped in a Java string literal.
String[] words = linkText.split("\\W+");
for (String str : words) {
    System.out.println(str);
}
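The question's end goal is an inverted index, so the tokens produced by a split like the one above could be fed into a map from word to page ids. The following is a rough sketch using only the standard library, not a definitive design: the class `InvertedIndexSketch`, the page ids, and the choice to lowercase tokens are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // Maps each lowercased token to the sorted set of page ids whose text contains it.
    private final Map<String, Set<String>> index = new HashMap<>();

    // Tokenizes already-extracted page text and records the page id under each token.
    public void addPage(String pageId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split yields a leading empty string when the text starts with a non-word character
            }
            index.computeIfAbsent(token, k -> new TreeSet<>()).add(pageId);
        }
    }

    // Returns the ids of the pages containing the word, or an empty set if there are none.
    public Set<String> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        // Hypothetical page ids; the text would normally come from Jsoup's text() calls.
        idx.addPage("page1.html", "some bold text");
        idx.addPage("page2.html", "some other text, another p tag");
        System.out.println(idx.lookup("some"));
        System.out.println(idx.lookup("bold"));
    }
}
```

By default `\W` treats only ASCII letters, digits, and underscore as word characters, so pages with non-ASCII text would likely need a different tokenizer.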

NomanJaved · Answer

String testText1 = d.select("body").text();
System.out.println(testText1);

or

String testText2 = d.select("body p").text();
System.out.println(testText2);

You can use this to get the text from the tags.