HTMLDocumentの代わりにHTMLElementでgetElementByIdを使用します

Question

私は、VBS/VBAを使用してWebページからデータをスクレイピングして遊んでいます。

それがJavascriptだったら、私はその簡単さから離れていただろうが、VBS/VBAではそれほど単純ではないようだ。

これは私が答えのために作った例であり、動作しますが、getElementByTagNameを使用して子ノードにアクセスすることを計画していましたが、それらを使用する方法を理解できませんでした！ HTMLElementオブジェクトにはこれらのメソッドはありません。

_Sub Scrape() Dim Browser As InternetExplorer Dim Document As HTMLDocument Dim Elements As IHTMLElementCollection Dim Element As IHTMLElement Set Browser = New InternetExplorer Browser.navigate "http://www.hsbc.com/about-hsbc/leadership" Do While Browser.Busy And Not Browser.readyState = READYSTATE_COMPLETE DoEvents Loop Set Document = Browser.Document Set Elements = Document.getElementsByClassName("profile-col1") For Each Element in Elements Debug.Print "[ name] " & Trim(Element.Children(1).Children(0).innerText) Debug.Print "[ title] " & Trim(Element.Children(1).Children(1).innerText) Next Element Set Document = Nothing Set Browser = Nothing End Sub _

私は_HTMLElement.document_プロパティを見て、それがドキュメントの断片に似ているが、それを扱うのが難しいか、私が思うことだけではないかを見てきた

_Dim Fragment As HTMLDocument Set Element = Document.getElementById("example") ' This works Set Fragment = Element.document ' This doesn't _

これは、それを行うための長い道のりのようにも見えます（通常、それはvba imoの方法です）。関数を連鎖するより簡単な方法があるかどうか誰でも知っていますか？

Document.getElementById("target").getElementsByTagName("tr")は素晴らしいでしょう...

mkingston · Accepted Answer

私もそれが好きではありません。

したがって、javascriptを使用します。

Public Function GetJavaScriptResult(doc as HTMLDocument, jsString As String) As String Dim el As IHTMLElement Dim nd As HTMLDOMTextNode Set el = doc.createElement("INPUT") Do el.ID = GenerateRandomAlphaString(100) Loop Until Document.getElementById(el.ID) Is Nothing el.Style.display = "none" Set nd = Document.appendChild(el) doc.parentWindow.ExecScript "document.getElementById('" & el.ID & "').value = " & jsString GetJavaScriptResult = Document.getElementById(el.ID).Value Document.removeChild nd End Function Function GenerateRandomAlphaString(Length As Long) As String Dim i As Long Dim Result As String Randomize Timer For i = 1 To Length Result = Result & Chr(Int(Rnd(Timer) * 26 + 65 + Round(Rnd(Timer)) * 32)) Next i GenerateRandomAlphaString = Result End Function

これに関して何か問題があれば教えてください。コンテキストをメソッドから関数に変更しました。

ところで、IEのどのバージョンを使用していますか？IE8を使用していると思われます。IE8にアップグレードすると、shdocvw.dllがieframe.dllに更新され、 document.querySelector/Allを使用できるようになります。

編集

実際にはコメントではないコメント応答：基本的にVBAでこれを行う方法は、子ノードをトラバースすることです。問題は、正しい戻り値の型を取得できないことです。これを修正するには、IHTMLElementとIHTMLElementCollectionを（別々に）実装する独自のクラスを作成します。しかし、それは私が支払いを受けずにそれを行うにはあまりにも苦痛です:)。決定したら、VB6/VBAのImplementsキーワードを読んで読んでください。

Public Function getSubElementsByTagName(el As IHTMLElement, tagname As String) As Collection Dim descendants As New Collection Dim results As New Collection Dim i As Long getDescendants el, descendants For i = 1 To descendants.Count If descendants(i).tagname = tagname Then results.Add descendants(i) End If Next i getSubElementsByTagName = results End Function Public Function getDescendants(nd As IHTMLElement, ByRef descendants As Collection) Dim i As Long descendants.Add nd For i = 1 To nd.Children.Length getDescendants nd.Children.Item(i), descendants Next i End Function

dee · Answer

Sub Scrape() Dim Browser As InternetExplorer Dim Document As htmlDocument Dim Elements As IHTMLElementCollection Dim Element As IHTMLElement Set Browser = New InternetExplorer Browser.Visible = True Browser.navigate "http://www.stackoverflow.com" Do While Browser.Busy And Not Browser.readyState = READYSTATE_COMPLETE DoEvents Loop Set Document = Browser.Document Set Elements = Document.getElementById("hmenus").getElementsByTagName("li") For Each Element In Elements Debug.Print Element.innerText 'Questions 'Tags 'Users 'Badges 'Unanswered 'Ask Question Next Element Set Document = Nothing Set Browser = Nothing End Sub

QHarr · Answer

XMLHTTPリクエストを使用して、ページコンテンツをより高速に取得します。その後、querySelectorAllを使用してCSSクラスセレクターを適用し、クラス名で取得するのは簡単です。次に、タグ名とインデックスで子要素にアクセスします。

Option Explicit Public Sub GetInfo() Dim sResponse As String, html As HTMLDocument, elements As Object, i As Long With CreateObject("MSXML2.XMLHTTP") .Open "GET", "https://www.hsbc.com/about-hsbc/leadership", False .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT" .send sResponse = StrConv(.responseBody, vbUnicode) End With Set html = New HTMLDocument With html .body.innerHTML = sResponse Set elements = .querySelectorAll(".profile-col1") For i = 0 To elements.Length - 1 Debug.Print String(20, Chr$(61)) Debug.Print elements.item(i).getElementsByTagName("a")(0).innerText Debug.Print elements.item(i).getElementsByTagName("p")(0).innerText Debug.Print elements.item(i).getElementsByTagName("p")(1).innerText Next End With End Sub

参照：

VBE>ツール>参照> Microsoft HTML Object Library

WizzleWuzzle · Answer

Scrape（）サブルーチンを使用して上記の答えをしてくれたディーに感謝します。コードは書かれたとおりに完全に機能し、コードを変換して、スクレイピングしようとしている特定のWebサイトで動作するようになりました。

私は賛成票を投じたりコメントしたりするほどの評判はありませんが、実際にはディーの答えに追加するいくつかの小さな改善があります：

コードをコンパイルするには、「Tools\References」から「Microsoft HTML Object Library」にVBAリファレンスを追加する必要があります。
Browser.Visible行をコメントアウトし、次のようにコメントを追加しました
```
'if you need to debug the browser page, uncomment this line: 'Browser.Visible = True 
```
Set Browser = Nothingの前にブラウザを閉じる行を追加しました：
```
Browser.Quit 
```

再びありがとう！

ETA：これはIE9を搭載したマシンでは機能しますが、IE8を搭載したマシンでは機能しません。誰でも修正がありますか？

自分で修正を見つけたので、ここに戻って投稿しました。 ClassName関数はIE9で使用可能です。これをIE8で機能させるには、querySelectorAllを使用し、探しているオブジェクトのクラス名の前にドットを付けます。

'Set repList = doc.getElementsByClassName("reportList") 'only works in IE9, not in IE8 Set repList = doc.querySelectorAll(".reportList") 'this works in IE8+