Get the contents of a resource in CasperJS or PhantomJS

Question

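I see that CasperJS has a "download" function and an "on resource received" callback, but the callback does not expose the contents of the resource, and I don't want to download the resource to the filesystem.

I want to get the contents of the resource so that I can do something with it in my script. Is this possible with CasperJS or PhantomJS?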

iwek · Accepted Answer

I hadn't realized that I could get the source from the document object like this:

casper.start(url, function() {
    var js = this.evaluate(function() {
        return document;
    });
    this.echo(js.all[0].outerHTML);
});

More info here.

brandon · Answer

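This problem has been in my way for the last few days. The proxy solution wasn't very clean in my environment, so I tracked down where phantomjs's QTNetworking core puts resources when it caches them.

Long story short, here is my gist. You need the cache.js and mimetype.js files: https://Gist.github.com/bshamric/471758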

//for this to work, you have to call phantomjs with the cache enabled:
//usage: phantomjs --disk-cache=true test.js

var page = require('webpage').create();
var fs = require('fs');
var cache = require('./cache');
var mimetype = require('./mimetype');

//this is the path that QTNetwork classes uses for caching files for its http client
//the path should be the one that has 16 folders labeled 0,1,2,3,...,F
cache.cachePath = '/Users/brandon/Library/Caches/Ofi Labs/PhantomJS/data7/';

var url = 'http://google.com';
page.viewportSize = { width: 1300, height: 768 };

//when the resource is received, go ahead and include a reference to it in the cache object
page.onResourceReceived = function(response) {
    //I only cache images, but you can change this
    if(response.contentType.indexOf('image') >= 0) {
        cache.includeResource(response);
    }
};

//when the page is done loading, go through each cachedResource and do something with it,
//I'm just saving them to a file
page.onLoadFinished = function(status) {
    for(var index in cache.cachedResources) {
        var file = cache.cachedResources[index].cacheFileNoPath;
        var ext = mimetype.ext[cache.cachedResources[index].mimetype];
        var finalFile = file.replace("."+cache.cacheExtension,"."+ext);
        fs.write('saved/'+finalFile,cache.cachedResources[index].getContents(),'b');
    }
};

page.open(url, function () {
    page.render('saved/google.pdf');
    phantom.exit();
});

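Then, when you call phantomjs, make sure the cache is enabled:

phantomjs --disk-cache=true test.js

Some notes: I wrote this to get the images on a page without using a proxy or taking a low-res snapshot. QT compresses certain text file resources, so if you use this for text files you will have to deal with decompression. I also ran a quick test to pull in HTML resources, and it did not parse the http headers out of the result. Still, this is useful to me, and hopefully someone else will find it useful too. Modify it if you run into problems with a particular content type.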

Xedecimal · Answer

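According to issue 158 (http://code.google.com/p/phantomjs/issues/detail?id=158), you may have to wait until phantomjs matures a bit; this is a bit of a headache for them.

Want to do it anyway? I opted to go a bit higher-level to accomplish this: I grabbed PyMiProxy from https://github.com/allfro/pymiproxy, downloaded, installed, and set it up, took their example code, and made this in proxy.py: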

# Python 2 (uses mimetools, StringIO and the print statement)
from miproxy.proxy import RequestInterceptorPlugin, ResponseInterceptorPlugin, AsyncMitmProxy
from mimetools import Message
from StringIO import StringIO

class DebugInterceptor(RequestInterceptorPlugin, ResponseInterceptorPlugin):
    def do_request(self, data):
        # Ask the server not to gzip the response
        data = data.replace('Accept-Encoding: gzip\r\n', 'Accept-Encoding:\r\n', 1)
        return data

    def do_response(self, data):
        #print '<< %s' % repr(data[:100])
        # Separate the status line from the header block
        request_line, headers_alone = data.split('\r\n', 1)
        headers = Message(StringIO(headers_alone))
        print "Content type: %s" % (headers['content-type'])
        if headers['content-type'] == 'text/x-comma-separated-values':
            # Saves the whole response, headers included
            with open('data.csv', 'w') as f:
                f.write(data)
        print ''
        return data

if __name__ == '__main__':
    proxy = AsyncMitmProxy()
    proxy.register_interceptor(DebugInterceptor)
    try:
        proxy.serve_forever()
    except KeyboardInterrupt:
        proxy.server_close()

Then I fire it up:

python proxy.py

Next, I run phantomjs with the proxy specified...

phantomjs --ignore-ssl-errors=yes --cookies-file=cookies.txt --proxy=127.0.0.1:8080 --web-security=no myfile.js

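You may want to turn security on or something, but that was needless for me at the moment since I'm only scraping one source. You should now see a bunch of text flowing through the proxy console, and when it hits something with the mime type "text/x-comma-separated-values" it gets saved as data.csv. This also saves all the headers and everything, but if you've come this far I'm sure you can figure out how to pop those off.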
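
For the "pop those off" step, here is a minimal sketch of splitting a raw response into its status line, headers, and body. It is not part of the proxy.py above: `split_http_response` is a hypothetical helper, it is written for Python 3 (proxy.py itself is Python 2), and it assumes the whole response arrives as one block of bytes without chunked transfer encoding.

```python
def split_http_response(raw):
    """Split a raw HTTP response (bytes) into (status_line, headers, body).

    Hypothetical helper: assumes the complete response is in `raw` and
    that the body is not sent with chunked transfer encoding.
    """
    # Headers end at the first blank line (CRLF CRLF).
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line separating headers from body")

    lines = head.split(b"\r\n")
    status_line = lines[0].decode("iso-8859-1")

    # Header names are case-insensitive, so store them lowercased.
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        key = name.strip().lower().decode("iso-8859-1")
        headers[key] = value.strip().decode("iso-8859-1")
    return status_line, headers, body


# Made-up sample response for illustration only.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/x-comma-separated-values\r\n"
       b"\r\n"
       b"date,kwh\r\n2013-01-01,42\r\n")
status, headers, body = split_http_response(raw)
print(status)
print(headers["content-type"])
print(body.decode("ascii"))
```

Writing only `body` to data.csv would leave the headers out of the saved file.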

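One other detail: I found that I had to disable gzip encoding. I could use zlib to decompress gzipped data from my own Apache web server, but when it comes out of IIS or the like the decompression throws errors, and I'm not sure about that part.

So my power company won't offer me an API? Fine! We'll do it the hard way!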
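
I never pinned down why the decompression failed, so the following is only a guess at one possible cause: zlib needs to be told which container format to expect, and a server may send gzip, zlib-wrapped, or raw deflate data. A minimal Python 3 sketch that tries the variants in turn, where `decompress_body` is a hypothetical helper and the input is assumed to be the response body only, with the HTTP headers already removed:

```python
import gzip
import zlib


def decompress_body(body):
    """Decompress a response body that may be gzip, zlib, or raw deflate.

    Hypothetical helper: `body` must be the body bytes only.
    """
    # MAX_WBITS | 32 lets zlib auto-detect a gzip or zlib header;
    # -MAX_WBITS means a raw deflate stream with no header at all.
    for wbits in (zlib.MAX_WBITS | 32, -zlib.MAX_WBITS):
        try:
            return zlib.decompress(body, wbits)
        except zlib.error:
            continue
    raise ValueError("body is not gzip, zlib, or raw deflate data")


# Made-up sample data in each of the three formats.
payload = b"date,kwh\n2013-01-01,42\n"
as_gzip = gzip.compress(payload)
as_zlib = zlib.compress(payload)
compressor = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)
as_raw_deflate = compressor.compress(payload) + compressor.flush()

for sample in (as_gzip, as_zlib, as_raw_deflate):
    print(decompress_body(sample) == payload)
```

If this still fails on a real response, the body may be chunked or truncated rather than in a different compression format.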

NiKo · Answer

You can use Casper.debugHTML() to print out the contents of an HTML resource:

var casper = require('casper').create();
casper.start('http://google.com/', function() {
    this.debugHTML();
});
casper.run();

You can also store the HTML contents in a variable using casper.getPageContent(): http://casperjs.org/api.html#casper.getPageContent (available in latest master)