Finding 2- and 3-word phrases using the R tm package

Question

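I am trying to find code that actually works for finding the most frequently used two- and three-word phrases with the R text mining package (maybe there is another package for this that I do not know of). I have been trying to use the tokenizer, but seem to have no luck.

If you have worked on a similar problem in the past, could you post code that is tested and actually works? Thank you so much!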

Timothy P. Jurka · Answer

You can pass a custom tokenizing function to tm's DocumentTermMatrix function, so if you have the package tau installed it is fairly straightforward.

    library(tm); library(tau)

    # Return the n-word phrases found in x (n = words per phrase)
    tokenize_ngrams <- function(x, n = 3) return(rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n)))))

    texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
    corpus <- Corpus(VectorSource(texts))
    matrix <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams))

Here, n in the tokenize_ngrams function is the number of words per phrase. This feature is also implemented in the package RTextTools, which simplifies things further.

    library(RTextTools)

    texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
    matrix <- create_matrix(texts, ngramLength = 3)

This returns an object of class DocumentTermMatrix for use with the package tm.
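To make the meaning of n concrete, here is a minimal base R sketch that builds word n-grams by hand. The helper name word_ngrams is hypothetical (it is not part of tm, tau, or RTextTools), and its punctuation and case handling is an assumption that may differ from what those packages do.

```r
# Hypothetical helper (not from tm/tau/RTextTools): word n-grams in base R.
word_ngrams <- function(text, n = 2) {
  # Assumption: strip punctuation and lower-case before splitting on whitespace
  cleaned <- tolower(gsub("[[:punct:]]", "", text))
  words <- strsplit(cleaned, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]            # drop empty strings
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1) # start position of each phrase
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

trigrams <- word_ngrams("This is the first document.", n = 3)
print(trigrams)
# "this is the" "is the first" "the first document"
```

A text with fewer than n words yields an empty character vector here rather than an error.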

Ben · Answer

This is part 5 of the FAQ of the tm package:

5. Can I use bigrams instead of single tokens in a term-document matrix?

Yes. RWeka provides a tokenizer for arbitrary n-grams which can be directly passed on to the term-document matrix constructor. E.g.:

    library("RWeka")
    library("tm")

    data("crude")

    # min = max = 2 restricts the tokenizer to bigrams
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

    inspect(tdm[340:345, 1:10])

Tyler Rinker · Answer

This is my own made-up creation for different purposes, but I think it may be applicable to your needs too:

    #User Defined Functions
    Trim <- function (x) gsub("^\\s+|\\s+$", "", x)

    breaker <- function(x) unlist(strsplit(x, "[[:space:]]|(?=[.!?*-])", perl = TRUE))

    strip <- function(x, digit.remove = TRUE, apostrophe.remove = FALSE){
        strp <- function(x, digit.remove, apostrophe.remove){
            x2 <- Trim(tolower(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", as.character(x))))
            x2 <- if(apostrophe.remove) gsub("'", "", x2) else x2
            ifelse(digit.remove == TRUE, gsub("[[:digit:]]", "", x2), x2)
        }
        unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove, apostrophe.remove = apostrophe.remove))))
    }

    unblanker <- function(x) subset(x, nchar(x) > 0)

    #Fake Text Data
    x <- "I like green eggs and ham. They are delicious. They taste so yummy. I'm talking about ham and eggs of course"

    #The code using Base R to Do what you want
    breaker(x)
    strip(x)
    words <- unblanker(breaker(strip(x)))
    textDF <- as.data.frame(table(words))
    textDF$characters <- sapply(as.character(textDF$words), nchar)
    textDF2 <- textDF[order(-textDF$characters, textDF$Freq), ]
    rownames(textDF2) <- 1:nrow(textDF2)
    textDF2
    subset(textDF2, characters %in% 2:3)  # keep words that are 2 or 3 characters long

Patrick Perry · Answer

The corpus library has a function called term_stats that does what you want:

    library(corpus)
    corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
    text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
    term_stats(corpus, ngrams = 2:3)
    ##    term             count support
    ## 1  of the             336       1
    ## 2  the scarecrow      208       1
    ## 3  to the             185       1
    ## 4  and the            166       1
    ## 5  said the           152       1
    ## 6  in the             147       1
    ## 7  the lion           141       1
    ## 8  the tin            123       1
    ## 9  the tin woodman    114       1
    ## 10 tin woodman        114       1
    ## 11 i am                84       1
    ## 12 it was              69       1
    ## 13 in a                64       1
    ## 14 the great           63       1
    ## 15 the wicked          61       1
    ## 16 wicked witch        60       1
    ## 17 at the              59       1
    ## 18 the little          59       1
    ## 19 the wicked witch    58       1
    ## 20 back to             57       1
    ## ⋮  (52511 rows total)

Here, count is the number of appearances, and support is the number of documents containing the term.
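The difference between the two columns is easiest to see with more than one document. The following is a base R sketch on made-up toy documents; it illustrates the definitions only and does not use or reproduce the internals of the corpus package.

```r
# Toy documents (made up for illustration)
docs <- c("the cat and the cat", "the cat ran")

# Bigrams per document: pair each word with the word that follows it
bigrams_per_doc <- lapply(docs, function(d) {
  w <- strsplit(d, " ")[[1]]
  paste(head(w, -1), tail(w, -1))
})

# count: total occurrences across all documents
count <- table(unlist(bigrams_per_doc))

# support: number of documents containing the bigram at least once
support <- table(unlist(lapply(bigrams_per_doc, unique)))

count[["the cat"]]    # 3 (twice in the first document, once in the second)
support[["the cat"]]  # 2 (appears in both documents)
```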

Géraud · Answer

I ran into a similar problem using the tm and ngram packages. After debugging mclapply, I saw that documents with fewer than 2 words caused problems, raising the following error:

    input 'x' has nwords=1 and n=2; must have nwords >= n

So I added a filter to remove documents with a low word count:

    # Keep only documents that contain more than one word
    myCorpus.3 <- tm_filter(myCorpus.2, function (x) {
      length(unlist(strsplit(stringr::str_trim(x$content), '[[:blank:]]+'))) > 1
    })
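The word-count test inside that filter can be tried on a plain character vector first. This is a base R sketch on made-up strings, using trimws in place of stringr::str_trim; it only illustrates the counting logic and does not touch a tm corpus.

```r
# Made-up example texts, including a one-word and an empty document
docs <- c("hello world", "single", "  two words  ", "")

# Trim outer whitespace, split on runs of blanks, count the pieces
n_words <- vapply(strsplit(trimws(docs), "[[:blank:]]+"), length, integer(1))

# Keep only documents with more than one word, so a bigram exists
kept <- docs[n_words > 1]
print(n_words)  # 2 1 2 0
print(kept)     # "hello world" "  two words  "
```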

Then my tokenize function looks like:

    bigramTokenizer <- function(x) {
      x <- as.character(x)

      # Find words
      one.list <- c()
      tryCatch({
        one.gram <- ngram::ngram(x, n = 1)
        one.list <- ngram::get.ngrams(one.gram)
      },
      error = function(cond) { warning(cond) })

      # Find 2-grams
      two.list <- c()
      tryCatch({
        two.gram <- ngram::ngram(x, n = 2)
        two.list <- ngram::get.ngrams(two.gram)
      },
      error = function(cond) { warning(cond) })

      # Combine unigrams and bigrams, dropping empty strings
      res <- unlist(c(one.list, two.list))
      res[res != '']
    }

Then you can test the function with:

    dtmTest <- lapply(myCorpus.3, bigramTokenizer)

And finally:

    dtm <- DocumentTermMatrix(myCorpus.3, control = list(tokenize = bigramTokenizer))

Monika Singh · Answer

Try the tidytext package

    library(dplyr)
    library(tidytext)
    library(janeaustenr)
    library(tidyr)

Suppose you have a data frame CommentData that contains a Comment column, and you want to find occurrences of two words together. Then try

    bigram_filtered <- CommentData %>%
      unnest_tokens(bigram, Comment, token = "ngrams", n = 2) %>%
      separate(bigram, c("Word1", "Word2"), sep = " ") %>%
      # the stop_words data frame from tidytext stores its terms in the lowercase column `word`
      filter(!Word1 %in% stop_words$word,
             !Word2 %in% stop_words$word) %>%
      count(Word1, Word2, sort = TRUE)

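The above code creates the tokens and then removes stop words that do not help the analysis (e.g. the, an, to), and then counts the occurrences of the remaining word pairs. You can then use the unite function to combine the individual words back into one bigram column alongside their counts.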

    bigrams_united <- bigram_filtered %>%
      unite(bigram, Word1, Word2, sep = " ")
    bigrams_united
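For readers who want to see the split, filter, count, and unite steps without the tidyverse, here is a rough base R sketch of the same idea. The comments and the tiny stop list are made up for illustration; the real stop_words table in tidytext is much larger.

```r
# Made-up comments and a tiny illustrative stop list (assumptions, not tidytext data)
comments <- c("the green eggs and ham", "green eggs are delicious")
stop_list <- c("the", "and", "are", "an", "to")

# Split each comment into adjacent word pairs (Word1, Word2)
bigram_parts <- do.call(rbind, lapply(comments, function(txt) {
  w <- strsplit(tolower(txt), "[[:space:]]+")[[1]]
  if (length(w) < 2) return(NULL)
  data.frame(Word1 = head(w, -1), Word2 = tail(w, -1), stringsAsFactors = FALSE)
}))

# Drop pairs in which either word is a stop word
keep <- !bigram_parts$Word1 %in% stop_list & !bigram_parts$Word2 %in% stop_list
kept <- bigram_parts[keep, ]

# Re-unite the two words and count each bigram, most frequent first
counts <- as.data.frame(table(bigram = paste(kept$Word1, kept$Word2)),
                        stringsAsFactors = FALSE)
counts <- counts[order(-counts$Freq), ]
print(counts)  # one row: "green eggs" with Freq 2
```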

Renato Lyke · Answer

Try this code.

    library(tm)
    library(SnowballC)
    library(class)
    library(wordcloud)

    keywords <- read.csv(file.choose(), header = TRUE, na.strings = c("NA", "-", "?"))
    keywords_doc <- Corpus(VectorSource(keywords$"use your column that you need"))
    keywords_doc <- tm_map(keywords_doc, removeNumbers)
    keywords_doc <- tm_map(keywords_doc, tolower)
    keywords_doc <- tm_map(keywords_doc, stripWhitespace)
    keywords_doc <- tm_map(keywords_doc, removePunctuation)
    keywords_doc <- tm_map(keywords_doc, PlainTextDocument)
    keywords_doc <- tm_map(keywords_doc, stemDocument)

This is the bigram or trigram section that you could use

    BigramTokenizer <- function(x)
      unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

    # creating of document matrix
    keywords_matrix <- TermDocumentMatrix(keywords_doc, control = list(tokenize = BigramTokenizer))

    # remove sparse terms
    keywords_naremoval <- removeSparseTerms(keywords_matrix, 0.95)

    # Frequency of the words appearing
    keyword.freq <- rowSums(as.matrix(keywords_naremoval))
    subsetkeyword.freq <- subset(keyword.freq, keyword.freq >= 20)
    frequentKeywordSubsetDF <- data.frame(term = names(subsetkeyword.freq), freq = subsetkeyword.freq)

    # Sorting of the words
    frequentKeywordDF <- data.frame(term = names(keyword.freq), freq = keyword.freq)
    frequentKeywordSubsetDF <- frequentKeywordSubsetDF[with(frequentKeywordSubsetDF, order(-frequentKeywordSubsetDF$freq)), ]
    frequentKeywordDF <- frequentKeywordDF[with(frequentKeywordDF, order(-frequentKeywordDF$freq)), ]

    # Printing of the words
    wordcloud(frequentKeywordDF$term, freq = frequentKeywordDF$freq, random.order = FALSE, rot.per = 0.35, scale = c(5, 0.5), min.freq = 30, colors = brewer.pal(8, "Dark2"))

Hope this helps. This is the entire code that you could use.