I'm completely new to R and the tm package, so please excuse my stupid question ;-) How can I show the text of a plain-text corpus in the R tm package?
I loaded a corpus of 323 plain-text files:
src <- DirSource("Korpora/technologie")
corpus <- Corpus(src)
But when I call a document from the corpus:
corpus[[1]]
I always get output like the following instead of the corpus text itself:
<<PlainTextDocument>>
Metadata: 7
Content: chars: 144
Content: chars: 141
Content: chars: 224
Content: chars: 75
Content: chars: 105
How can I display the text of the corpus?
Thanks!
UPDATE: Reproducible sample. I tried it with the built-in sample texts:
> data("crude")
> crude
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 20
> crude[1]
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 1
> crude[[1]]
<<PlainTextDocument>>
Metadata: 15
Content: chars: 527
How can I print the text of a document?
UPDATE 2: Session info:
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tm_0.6-1 NLP_0.1-7
loaded via a namespace (and not attached):
[1] parallel_3.1.3 slam_0.1-32 tools_3.1.3
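You can try converting the corpus text to a data frame and then accessing the text you need from the data frame itself. As an example, I used the built-in sample data "crude" (from the tm package).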
data("crude")
dataframe <- data.frame(text = unlist(sapply(crude, `[`, "content")), stringsAsFactors = FALSE)
dataframe[1,]
[1] "Diamond Shamrock Corp said that\neffective today it had cut its contract prices for crude oil by\n1.50 dlrs a barrel.\n The reduction brings its posted price for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n \"The price reduction today was made in the light of falling\noil product prices and a weak crude oil market,\" a company\nspokeswoman said.\n Diamond is the latest in a line of U.S. oil companies that\nhave cut its contract, or posted, prices over the last two days\nciting weak oil markets.\n Reuter"
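A variation on the same idea, offered only as a sketch: it keeps each document's id next to its text so that rows can be looked up by id rather than by position. The column names `id` and `text` and the variable `docs` are illustrative choices, and the sketch assumes a tm version (0.6 or later) in which `meta()` and `as.character()` work on a `PlainTextDocument`.

```r
library(tm)

data("crude")

# One row per document: the id from the document metadata plus the full text.
# as.character() on a PlainTextDocument returns its content.
docs <- data.frame(
  id   = sapply(crude, function(d) meta(d, "id")),
  text = sapply(crude, function(d) paste(as.character(d), collapse = "\n")),
  stringsAsFactors = FALSE
)

# Text of the first document, printed with real line breaks
cat(docs$text[1], "\n")
```

Because `stringsAsFactors = FALSE` is set, `docs$text` stays a plain character vector and can be passed directly to `cat()`, `nchar()`, or `grepl()`.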
This works for me and prints the content text with the latest version of tm:
corpus[[1]]$content
Note: this is more or less what Ricky suggested in an earlier comment. Sorry, I wanted to post this as a comment, but I only have 25 rep (you need at least 50 rep to comment).
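For comparison, here is a small sketch of the accessor functions that reach the same text as the `$content` field. It uses the built-in `crude` data instead of the asker's corpus and assumes tm 0.6 or later, where `content()` and `as.character()` are defined for a `PlainTextDocument`; the variable names are illustrative.

```r
library(tm)

data("crude")
doc <- crude[[1]]

# Three routes to the same character vector
txt_field    <- doc$content        # direct field access
txt_accessor <- content(doc)       # accessor generic
txt_coerced  <- as.character(doc)  # coercion

# Both comparisons should be TRUE for a PlainTextDocument
identical(txt_field, txt_accessor)
identical(txt_field, txt_coerced)

# writeLines() renders the "\n" characters as real line breaks
writeLines(txt_coerced)
```

The accessor forms may be preferable in scripts, since they do not depend on how the document object is laid out internally.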
Here is a simple and direct way to display the text of the corpus:
strwrap(corpus[[1]])
For the crude data, this prints:
[1] "Diamond Shamrock Corp said that effective today it had cut its contract"
[2] "prices for crude oil by 1.50 dlrs a barrel. The reduction brings its posted"
[3] "price for West Texas Intermediate to 16.00 dlrs a barrel, the copany said."
[4] "\"The price reduction today was made in the light of falling oil product"
[5] "prices and a weak crude oil market,\" a company spokeswoman said. Diamond is"
[6] "the latest in a line of U.S. oil companies that have cut its contract, or"
[7] "posted, prices over the last two days citing weak oil markets. Reuter"
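If the default wrapping is too wide, the `width` argument of base R's `strwrap()` can be set explicitly. The sketch below wraps at 60 columns (an arbitrary choice) and passes the result to `writeLines()` so the text prints without the `[1]`, `[2]`, ... indices. It converts the document with `as.character()` first, which assumes tm 0.6 or later.

```r
library(tm)

data("crude")

# Wrap the first document at 60 columns instead of the default,
# which is 0.9 * getOption("width")
wrapped <- strwrap(as.character(crude[[1]]), width = 60)

# One wrapped line per output row, without vector indices
writeLines(wrapped)
```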
I can confirm that, as of tm 0.6-1, inspect no longer prints the text nicely. You can combine tm with the qdap package to convert the corpus to a data.frame easily:
library(qdap)
as.data.frame(crude)
To get something closer to the old inspect behavior, you can use:
as.data.frame(crude) %>%
with(., invisible(sapply(text, function(x) {strWrap(x); cat("\n\n")})))
This gives output like the following:
Diamond Shamrock Corp said that effective today it had cut its
contract prices for crude oil by 1.50 dlrs a barrel. The reduction
brings its posted price for West Texas Intermediate to 16.00 dlrs a
barrel, the copany said. "The price reduction today was made in the
light of falling oil product prices and a weak crude oil market," a
company spokeswoman said. Diamond is the latest in a line of U.S. oil
companies that have cut its contract, or posted, prices over the last
two days citing weak oil markets. Reuter
OPEC may be forced to meet before a scheduled June session to
readdress its production cutting agreement if the organization wants
to halt the current slide in oil prices, oil industry analysts said.
"The movement to higher oil prices was never to be as easy as OPEC
thought. They may need an emergency meeting to sort out the
problems," said Daniel Yergin, director of Cambridge Energy Research
Associates, CERA. Analysts and oil industry sources said the problem
OPEC faces is excess oil supply in world oil markets. "OPEC's problem
is not a price problem but a production issue and must be addressed
in that way," said Paul Mlotok, oil analyst with Salomon Brothers
Inc. He said the market's earlier optimism about OPE
.
.
.
From the tm vignette, this works:
writeLines(as.character(doc.corpus[[8]]))
where "8" is an arbitrary element number.
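The same call can be put in a loop to print several documents in a row. This is a sketch on the built-in `crude` data rather than the vignette's `doc.corpus`; the range `1:3` and the separator line are arbitrary, and `meta(doc, "id")` assumes tm 0.6 or later.

```r
library(tm)

data("crude")

# Print the first three documents, each preceded by its id
for (i in 1:3) {
  doc <- crude[[i]]
  cat("----", meta(doc, "id"), "----\n")
  writeLines(as.character(doc))
  cat("\n")
}
```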
You can get the content of every item in the corpus:
data("crude")
out <- sapply(crude, function(x){x$content})
out
# optionally export all documents to a single file (the "outputdir" directory must already exist)
writeLines(out, file.path("outputdir", "corpus.txt"))
> inspect(crude[1])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
$`reut-00001.xml`
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
"The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
Reuter