wget: how to download recursively and only specific MIME types/extensions (i.e. text only)

Question

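How can I download a whole website while ignoring all binary files?

wget can do this with the -r flag, but it downloads everything. Some websites are simply too much for a low-resource machine, and the binary files are of no use for the specific reason I'm downloading the site.

Here is the command line I use: wget -P 20 -r -l 0 http://www.omardo.com/blog (my own blog)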

Omar Al-Ithawi · Accepted Answer

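I tried a totally different approach, using Scrapy, but it has the same problem! Here is how I solved it: SO: Python Scrapy - mimetype based filter to avoid non-text file downloads?

The solution is to set up a Node.js proxy and configure Scrapy to use it through the http_proxy environment variable.

What the proxy should do is:

Take HTTP requests from Scrapy and send them to the server being crawled. Then it passes the server's response back to Scrapy, i.e. it intercepts all HTTP traffic.

For binary files (based on a heuristic you implement), it sends a 403 Forbidden error to Scrapy and immediately closes the request/response. This saves time and traffic, and Scrapy won't crash.

Sample proxy code that actually works!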

    var http = require('http');

    http.createServer(function(clientReq, clientRes) {
        // Node.js lowercases incoming header names, so read 'host', not 'Host'.
        var options = {
            host: clientReq.headers['host'],
            port: 80,
            path: clientReq.url,
            method: clientReq.method,
            headers: clientReq.headers
        };

        var proxyReq = http.request(options, function(proxyRes) {
            var contentType = proxyRes.headers['content-type'] || '';
            if (!contentType.startsWith('text/')) {
                // Heuristic: anything that is not text/* counts as binary.
                // Drop the upstream response and answer the client with 403.
                proxyRes.destroy();
                var httpForbidden = 403;
                clientRes.writeHead(httpForbidden);
                clientRes.write('Binary download is disabled.');
                clientRes.end();
                return; // do not fall through and write the headers a second time
            }

            // Text response: relay status, headers and body unchanged.
            clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
            proxyRes.pipe(clientRes);
        });

        proxyReq.on('error', function(e) {
            console.log('problem with clientReq: ' + e.message);
        });

        proxyReq.end();
    }).listen(8080);
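The proxy above treats every response whose Content-Type does not start with text/ as binary, which also rejects textual formats such as JSON or XHTML. A slightly broader heuristic could look like the sketch below. The helper name isTextResponse and the allow-list of application types are assumptions made for illustration; they are not part of the original answer.

```javascript
// Hypothetical helper: decide from a Content-Type header value whether a
// response should be treated as text. The allow-list below is an assumption;
// adjust it to whatever your crawl actually needs.
const TEXTUAL_APPLICATION_TYPES = new Set([
  'application/json',
  'application/xml',
  'application/xhtml+xml',
  'application/javascript'
]);

function isTextResponse(contentTypeHeader) {
  // Drop parameters such as "; charset=utf-8" and normalise the case.
  const mimeType = String(contentTypeHeader || '')
    .split(';')[0]
    .trim()
    .toLowerCase();

  if (mimeType === '') return false;             // no header: treat as binary
  if (mimeType.startsWith('text/')) return true; // text/html, text/plain, ...
  if (TEXTUAL_APPLICATION_TYPES.has(mimeType)) return true;

  // Structured-syntax suffixes, e.g. application/atom+xml, application/ld+json.
  return mimeType.endsWith('+xml') || mimeType.endsWith('+json');
}

console.log(isTextResponse('text/html; charset=utf-8'));
console.log(isTextResponse('image/png'));
```

In the proxy, the condition `!contentType.startsWith('text/')` could then be replaced by `!isTextResponse(contentType)`.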

unor · Answer

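You can specify a list of allowed or disallowed filename patterns:

Allowed:

-A LIST --accept LIST

Disallowed:

-R LIST --reject LIST

LIST is a comma-separated list of filename patterns/extensions.

You can use the following reserved characters to specify patterns: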

*
?
[
]

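Examples:

only download PNG files: -A png
don't download CSS files: -R css
don't download PNG files whose names start with "avatar": -R avatar*.png

If a file has no extension, or its name has no pattern you could make use of, you'd need MIME type parsing, I guess (see Lars Kotthoff's answer).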
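To make the suffix-versus-pattern distinction concrete, here is a simplified JavaScript model of how such a LIST is applied to a file name. It is a sketch, not wget's actual implementation: the function names are hypothetical, and it assumes that an element without any of the reserved characters is compared against the end of the file name, while an element containing one is matched as a shell-style pattern against the whole name. Negated bracket expressions are not handled.

```javascript
// Convert a shell-style pattern (*, ?, [...]) into an anchored RegExp.
// Simplified: bracket expressions are passed through as-is.
function globToRegExp(glob) {
  let source = '';
  for (const ch of glob) {
    if (ch === '*') source += '.*';        // any run of characters
    else if (ch === '?') source += '.';    // exactly one character
    else if (ch === '[' || ch === ']') source += ch;
    else source += ch.replace(/[.+^${}()|\\]/g, '\\$&'); // escape regex metacharacters
  }
  return new RegExp('^' + source + '$');
}

// True if fileName matches at least one element of the comma-separated list.
function matchesList(fileName, list) {
  return list
    .split(',')
    .filter(function (element) { return element !== ''; })
    .some(function (element) {
      const isPattern = /[*?\[\]]/.test(element);
      return isPattern
        ? globToRegExp(element).test(fileName) // pattern: match the whole name
        : fileName.endsWith(element);          // plain element: match the suffix
    });
}

console.log(matchesList('logo.png', 'png'));
console.log(matchesList('avatar_1.png', 'avatar*.png'));
console.log(matchesList('style.css', 'png,jpg'));
```

Under this model, -A png keeps logo.png, and -R avatar*.png rejects avatar_1.png but not logo.png.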

Tim Ruehsen rockdaboot · Answer

The new Wget (Wget2) already has this feature:

--filter-mime-type    Specify a list of mime types to be saved or ignored

--filter-mime-type=list

Specify a comma-separated list of MIME types that will be downloaded. Elements of list may contain wildcards. If a MIME type starts with the character '!' it won't be downloaded, this is useful when trying to download something with exceptions. For example, download everything except images:

    wget2 -r https://<site>/<document> --filter-mime-type=*,\!image/*

It is also useful to download files that are compatible with an application of your system. For instance, download every file that is compatible with LibreOffice Writer from a website using the recursive mode:

    wget2 -r https://<site>/<document> --filter-mime-type=$(sed -r '/^MimeType=/!d;s/^MimeType=//;s/;/,/g' /usr/share/applications/libreoffice-writer.desktop)

Wget2 has not been released as of today, but it will be soon. Debian unstable already ships an alpha version.

See https://gitlab.com/gnuwget/wget2 for more information. You can post questions/comments directly to bug-wget@gnu.org.

Lars Kotthoff · Answer

You could also try patching wget with this (or here) to filter by MIME type. The patch is quite old, though, so it may no longer work.