How to get Google search results

Question

I used the following code:

    library(XML)
    library(RCurl)

    getGoogleURL <- function(search.term, domain = '.co.uk', quotes = TRUE) {
      search.term <- gsub(' ', '%20', search.term)
      if (quotes) search.term <- paste('%22', search.term, '%22', sep = '')
      getGoogleURL <- paste('http://www.google', domain, '/search?q=', search.term, sep = '')
    }

    getGoogleLinks <- function(google.url) {
      doc <- getURL(google.url, httpheader = c("User-Agent" = "R(2.10.0)"))
      html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...){})
      nodes <- getNodeSet(html, "//a[@href][@class='l']")
      return(sapply(nodes, function(x) x <- xmlAttrs(x)[[1]]))
    }

    search.term <- "cran"
    quotes <- "FALSE"
    search.url <- getGoogleURL(search.term = search.term, quotes = quotes)
    links <- getGoogleLinks(search.url)

I want to find all the links returned by my search, but I get the following result:

    > links
    list()

How can I get the links? In addition, how can I get the headlines and summaries of the Google results? And finally, is there a way to get the links that appear in the ChillingEffects.org results?

user3794498 · Accepted Answer

If you look at the html variable, you can see that all the search result links are nested inside <h3 class="r"> tags.

Try changing your getGoogleLinks function to:

    getGoogleLinks <- function(google.url) {
      doc <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
      html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...){})
      nodes <- getNodeSet(html, "//h3[@class='r']//a")
      return(sapply(nodes, function(x) x <- xmlAttrs(x)[["href"]]))
    }
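
The difference between the two XPath expressions can be checked offline on a small hand-written HTML fragment. The fragment below is an assumption that only imitates the structure described in this answer; it is not real Google output, and Google's markup changes over time. Under that assumption, the same node set also yields the headline text through xmlValue, which covers part of the original question:

```r
library(XML)

# Hypothetical fragment imitating the structure described above;
# it is not real Google output.
sample_html <- '<html><body>
  <h3 class="r"><a href="https://cran.r-project.org/">The Comprehensive R Archive Network</a></h3>
  <h3 class="r"><a href="https://www.r-project.org/">The R Project</a></h3>
  <a href="/preferences">Settings</a>
</body></html>'

doc <- htmlParse(sample_html, asText = TRUE)

# The selector from the question matches nothing here: no <a> has class "l".
old_nodes <- getNodeSet(doc, "//a[@href][@class='l']")
print(length(old_nodes))

# The revised selector matches only anchors nested inside <h3 class="r">.
new_nodes <- getNodeSet(doc, "//h3[@class='r']//a")
links  <- sapply(new_nodes, xmlGetAttr, "href")  # link targets
titles <- sapply(new_nodes, xmlValue)            # headline text
print(data.frame(title = titles, link = links))
```

The "Settings" anchor is skipped because it is not inside an h3 element, which is the point of anchoring the XPath on the heading rather than on a class of the link itself.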

Bryce Chamberlain · Answer

I wrote this function to read in a list of company names and get the top website result for each one. You can adjust it as needed.

    # Libraries. URLencode() comes from the utils package, which R attaches by default.
    library(rvest)

    # Load data.
    d <- read.csv("P:\\needWebsites.csv")
    companies <- as.character(d$Company.Name)

    # Function for getting the top website for a company name.
    getWebsite <- function(name) {
      url <- URLencode(paste0("https://www.google.com/search?q=", name))
      page <- read_html(url)
      results <- page %>%
        html_nodes("cite") %>%  # Get all nodes of type cite. You can change this to grab other node types.
        html_text()
      result <- results[1]
      return(as.character(result))  # Return results instead if you want to see them all.
    }

    # Apply the function to a list of company names.
    websites <- data.frame(Website = sapply(companies, getWebsite))
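
One caveat when building the URL this way: URLencode() with its default reserved = FALSE leaves reserved characters such as & untouched, so a company name containing & would be split into separate query parameters. A minimal sketch of encoding only the search term instead; the helper name build_search_url is hypothetical and not part of the answer above:

```r
# URLencode() ships with R in the utils package; no extra library is needed.
# build_search_url is a hypothetical helper name used only for illustration.
build_search_url <- function(name) {
  # reserved = TRUE also escapes characters such as "&", "?" and "=".
  query <- utils::URLencode(name, reserved = TRUE)
  paste0("https://www.google.com/search?q=", query)
}

encoded_url <- build_search_url("AT&T Inc")
print(encoded_url)
```

Encoding the term separately keeps the fixed part of the URL readable while making sure nothing in the name is interpreted as URL syntax.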

Moody_Mudskipper · Answer

The other solutions don't work for me. Here is my take on @Bryce-Chamberlain's answer, which works for me as of August 2019. It also answers another, closed question: company name to URL in R

    # install.packages("rvest")
    get_first_google_link <- function(name, root = TRUE) {
      url <- URLencode(paste0("https://www.google.com/search?q=", name))
      page <- xml2::read_html(url)
      # extract all links
      nodes <- rvest::html_nodes(page, "a")
      links <- rvest::html_attr(nodes, "href")
      # extract first link of the search results
      link <- links[startsWith(links, "/url?q=")][1]
      # clean it
      link <- sub("^/url\\?q\\=(.*?)\\&sa.*$", "\\1", link)
      # get root if relevant
      if (root) link <- sub("^(https?://.*?/).*$", "\\1", link)
      link
    }

    companies <- data.frame(company = c("Apple acres llc", "abbvie inc", "Apple inc"))
    companies <- transform(companies, url = sapply(company, get_first_google_link))
    companies
    #>           company                            url
    #> 1 Apple acres llc https://www.appleacresllc.com/
    #> 2      abbvie inc        https://www.abbvie.com/
    #> 3       Apple inc         https://www.Apple.com/

Created on 2019-08-10 by the reprex package (v0.2.1)
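
The two cleaning steps in this answer can be checked without network access by running them on a sample redirect link. The href below is made up for illustration, in the /url?q=...&sa=... shape that the function filters for, and the patterns are written with doubled backslashes as R string literals require:

```r
# Made-up href in the "/url?q=<target>&sa=..." shape; not real Google output.
href <- "/url?q=https://www.abbvie.com/our-company.html&sa=U&ved=abc123"

# Step 1: strip the "/url?q=" prefix and everything from "&sa" onward.
target <- sub("^/url\\?q=(.*?)&sa.*$", "\\1", href)

# Step 2: keep only the scheme and host, up to the first "/" after "://".
root <- sub("^(https?://.*?/).*$", "\\1", target)

print(target)
print(root)
```

Both patterns rely on the lazy quantifier .*? so that the match stops at the first &sa and at the first slash after the host, rather than at the last one.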