Save and render a webpage with PhantomJS and node.js

Question

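I'm looking for an example of requesting a webpage, waiting for the JavaScript to render (the JavaScript modifies the DOM), and then grabbing the HTML of the page.

This should be a simple example with an obvious use case for PhantomJS. I can't find a decent example; the documentation seems to be all about command-line use.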

Declan Cook · Accepted Answer

From your comments, I'd guess you have two options:

Try to find a phantomjs node module - https://github.com/amir20/phantomjs-node
Run phantomjs as a child process inside node - http://nodejs.org/api/child_process.html
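The second option can be sketched in a few lines of Node. This is only a minimal sketch: it assumes a `phantomjs` binary is on the PATH and that a PhantomJS script prints the markup to stdout. The helper name `runAndCapture` and the file name `get-html.js` are made up for illustration.

```javascript
var execFileSync = require('child_process').execFileSync;

// Hypothetical helper: runs a command synchronously and returns whatever it
// wrote to stdout as a string. Throws if the command exits with a non-zero code.
function runAndCapture(command, args) {
  return execFileSync(command, args, {
    encoding: 'utf8',
    maxBuffer: 10 * 1024 * 1024 // assumed cap; a rendered page can be large
  });
}

// Usage (assumes phantomjs is installed and get-html.js exists):
// var html = runAndCapture('phantomjs', ['get-html.js']);
// console.log(html.length);
```

The synchronous call keeps the sketch short but blocks the event loop while PhantomJS runs; inside a server, the asynchronous `spawn` or `execFile` would be the safer choice.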

Edit:

It seems the child process is what phantomjs suggests as a way of interacting with node; see the FAQ - http://code.google.com/p/phantomjs/wiki/FAQ

Edit:

Example Phantomjs script for getting the page's HTML markup:

    var page = require('webpage').create();
    page.open('http://www.google.com', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            var p = page.evaluate(function () {
                return document.getElementsByTagName('html')[0].innerHTML;
            });
            console.log(p);
        }
        phantom.exit();
    });
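The script above reads the DOM as soon as the load callback fires. If the page's own JavaScript changes the DOM later, one possible approach is to poll for a condition before reading the markup. The following is a sketch of such a polling helper, not part of the PhantomJS API; `waitFor` and the `#result` selector are assumptions for illustration. Note that `phantom.exit()` has to move into the callback so the script does not exit before the wait finishes.

```javascript
// Hypothetical helper: polls `test` until it returns true or `timeoutMs` elapses.
// Calls onReady(true) on success and onReady(false) on timeout.
function waitFor(test, onReady, timeoutMs, intervalMs) {
  var start = Date.now();
  // Check once right away so an already-satisfied condition does not wait.
  if (test()) {
    onReady(true);
    return;
  }
  var timer = setInterval(function () {
    if (test()) {
      clearInterval(timer);
      onReady(true);
    } else if (Date.now() - start >= timeoutMs) {
      clearInterval(timer);
      onReady(false);
    }
  }, intervalMs || 100);
}

// Usage inside the page.open callback of a PhantomJS script
// ('#result' is an assumed selector for an element the page's JavaScript adds):
// waitFor(function () {
//   return page.evaluate(function () {
//     return document.querySelector('#result') !== null;
//   });
// }, function (ready) {
//   console.log(ready ? page.content : 'Timed out waiting for the page');
//   phantom.exit();
// }, 5000);
```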

Amir Raminfar · Answer

With v2 of phantomjs-node, it's pretty easy to print the HTML after it has been processed.

    var phantom = require('phantom');

    phantom.create().then(function(ph) {
        ph.createPage().then(function(page) {
            page.open('https://stackoverflow.com/').then(function(status) {
                console.log(status);
                page.property('content').then(function(content) {
                    console.log(content);
                    page.close();
                    ph.exit();
                });
            });
        });
    });

This will show the output as it would have been rendered in the browser.

Edit 2019:

You can use async/await:

    const phantom = require('phantom');

    (async function() {
        const instance = await phantom.create();
        const page = await instance.createPage();
        await page.on('onResourceRequested', function(requestData) {
            console.info('Requesting', requestData.url);
        });

        const status = await page.open('https://stackoverflow.com/');
        const content = await page.property('content');
        console.log(content);

        await instance.exit();
    })();

Or, if you just want to test, you can use npx:

npx phantom@latest https://stackoverflow.com/

ultrageek · Answer

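I've used two different ways in the past, including the page.evaluate() method that queries the DOM, which Declan mentioned. The other way I've passed info from the web page is to spit it out to console.log() from there, and in the phantomjs script use: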

    page.onConsoleMessage = function (msg, line, source) {
        console.log('console [' + source + ':' + line + ']> ' + msg);
    };

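I might also trap the variable msg in onConsoleMessage and search it for some encapsulated data. It depends on how you want to use the output.

Then, in the Nodejs script, you would have to scan the output of the Phantomjs script: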
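As one way of searching msg for encapsulated data, the page could prefix its log line with a marker and the handler could parse whatever follows. This is a sketch under that assumption: the `PHANTOM_DATA:` marker and the `extractPayload` helper are invented here and are not part of PhantomJS.

```javascript
// Assumed convention: the page logs data as
//   console.log('PHANTOM_DATA:' + JSON.stringify(obj))
var MARKER = 'PHANTOM_DATA:';

// Returns the parsed object if msg starts with the marker, otherwise null.
function extractPayload(msg) {
  if (typeof msg !== 'string' || msg.indexOf(MARKER) !== 0) {
    return null;
  }
  try {
    return JSON.parse(msg.slice(MARKER.length));
  } catch (e) {
    return null; // marker present but the body is not valid JSON
  }
}

// Usage in the PhantomJS script:
// page.onConsoleMessage = function (msg) {
//   var data = extractPayload(msg);
//   if (data !== null) {
//     console.log(JSON.stringify(data));
//   }
// };
```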

    var spawn = require('child_process').spawn;

    // args: array of command-line arguments for phantomjs (script path first)
    var yourfunc = function (args) {
        var phantom = spawn('phantomjs', args);
        phantom.stdout.setEncoding('utf8');
        phantom.stdout.on('data', function (data) {
            // parse or echo data
            var str_phantom_output = data.toString();
            // The above will get triggered one or more times, so you'll need to
            // add code to parse for whatever info you're expecting from the browser
        });
        phantom.stderr.on('data', function (data) {
            // do something with error data
        });
        phantom.on('exit', function (code) {
            if (code !== 0) {
                // console.log('phantomjs exited with code ' + code);
            } else {
                // clean exit: do something else such as a passed-in callback
            }
        });
    };
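Because the 'data' handler in the spawn example can fire several times with arbitrary chunk boundaries, one option is to buffer the chunks and only act on complete lines. The helper below is a sketch of that idea; `createLineBuffer` is a made-up name, and it assumes the Phantomjs script writes one message per line.

```javascript
// Hypothetical helper: buffers stdout chunks and calls onLine once per complete line.
function createLineBuffer(onLine) {
  var pending = '';
  return {
    // Append a chunk and emit every line that is now complete.
    push: function (chunk) {
      pending += chunk;
      var lines = pending.split('\n');
      pending = lines.pop(); // the last piece may be an incomplete line
      lines.forEach(function (line) {
        onLine(line.replace(/\r$/, ''));
      });
    },
    // Emit whatever is left once the stream has ended.
    flush: function () {
      if (pending !== '') {
        onLine(pending);
        pending = '';
      }
    }
  };
}

// Usage with the spawn example:
// var buffer = createLineBuffer(function (line) { /* inspect line */ });
// phantom.stdout.on('data', function (data) { buffer.push(data); });
// phantom.on('close', function () { buffer.flush(); });
```

Flushing on 'close' rather than 'exit' is deliberate: 'close' fires after the child's stdio streams have ended, so no trailing output is missed.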

Hope that helps some.

yossi · Answer

Why not just use this?

    var page = require('webpage').create();
    page.open("http://example.com", function (status) {
        if (status !== 'success') {
            console.log('FAIL to load the address');
        } else {
            console.log('Success in fetching the page');
            console.log(page.content);
        }
        phantom.exit();
    });

Stilltorik · Answer

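A late update in case anyone stumbles on this question:

A project on GitHub developed by a colleague of mine aims to help with exactly that: https://github.com/vmeurisse/phantomCrawl .

It's still a bit young and is certainly missing some documentation, but the example provided should help with basic crawling.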

user2950147 · Answer

This is an old version that I use, running node, express, and phantomjs, which saves the page as a .png. You could tweak it fairly quickly to get the HTML.

https://github.com/wehrhaus/sitescrape.git